At present, the implementation of PyBuffer::get uses PyBUF_FULL_RO as the flags, which requests the full information from the exporter so that we can verify type and layout information.
This is great for safety. However, it was pointed out to me recently that in pure Python it is possible to cast a non-u8 array to u8, which is potentially helpful for writing arbitrary buffers to a bytes slice:
>>> from numpy import array
# observing the underlying bytes in float32 / float64 arrays
>>> print(bytes(array([1.0, 2.0, 3.0], dtype="float32")))
b'\x00\x00\x80?\x00\x00\x00@\x00\x00@@'
>>> print(bytes(array([1.0, 2.0, 3.0], dtype="float64")))
b'\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
# even object arrays support this
>>> print(bytes(array([object(), object(), object()])))
b'\xc0\xa6w\xfe[{\x00\x00\x80\xa6w\xfe[{\x00\x00\xe0\xa6w\xfe[{\x00\x00'
Reinterpreting an array of one type as another is probably, in general, as bad as a transmute, so I think this is fundamentally an unsafe operation. In particular, object pointers viewed as bytes are unlikely to be helpful, because they won't be stable from one execution of a program to another. Even worse, modifying such an array of pointers at the u8 level is very likely to cause segfaults. This can even be done in pure Python:
x = array(["foo"], dtype="object")
# corrupt the first byte of the address of "foo" object
memoryview(x).cast("B")[0] = 123
# segfaults
print(x[0])
That said, the floating-point cast above can probably be seen as equivalent to to_ne_bytes if you squint, so this isn't always unsafe.
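To make the to_ne_bytes comparison concrete, here is a minimal sketch in pure Python. It uses the standard-library array module as a stand-in for numpy (an assumption made only to keep the example dependency-free) and compares the byte view of a float buffer with each value packed in native byte order:

```python
import struct
from array import array  # stdlib array, standing in for a numpy array

values = [1.0, 2.0, 3.0]
floats = array("f", values)  # C floats stored in native byte order

# View the same memory as unsigned bytes, without copying.
raw = memoryview(floats).cast("B")

# Pack each value individually in native byte order, which is roughly
# what calling to_ne_bytes on every element would produce in Rust.
packed = b"".join(struct.pack("=f", v) for v in values)

same = bytes(raw) == packed
print(same)  # True
```

The byte view carries no more information than the per-element native-endian encoding, which is why reading floats this way looks harmless, unlike reading object pointers.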
I think there's a reasonable case that we should expose some ways to do this carefully:
- It's IMO a valid use-case to want to be able to read an arbitrary buffer as a u8 slice so that it can be sent to some writer
- Arguably the safety requirements / documentation we can put on these methods can help users navigate the footguns better than in pure Python, where it seems easy to trip oneself up when reinterpreting buffers in this way. Indeed we could make the case that not having any way to do this encourages users to do workarounds in (dynamic) Python, at safety and performance cost.
That said, I'm not entirely sure what API we need. Some ideas:
- We could maybe have PyAnyBuffer, which doesn't check content format and defers this to runtime
- We could have a special case for an (unsafe) PyBuffer<u8>::get_as_bytes, which could do this reinterpretation of other buffers as bytes at runtime.
  - Not sure how this interacts with assumptions currently in the PyBuffer<u8> implementation for things like strides and offsets with itemsize != 1 (elements wider than one byte).
- Maybe we want a more general (unsafe) way to allow any buffer to be interpreted as PyBuffer<T> as long as the layout is sufficiently compatible.
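For reference on the strides question, this is roughly how pure Python's memoryview handles the same reinterpretation. The sketch below is only an illustration of Python's behaviour, not of anything in PyO3. It shows that itemsize and strides are counted in bytes, that casting to "B" rewrites both to 1, and that the cast is refused for a non-contiguous view:

```python
from array import array

doubles = memoryview(array("d", [1.0, 2.0, 3.0, 4.0]))

# itemsize and strides are both measured in bytes.
info = (doubles.itemsize, doubles.strides, doubles.nbytes)

# After the cast every element is one byte wide and the length is nbytes.
as_bytes = doubles.cast("B")
bytes_info = (as_bytes.itemsize, as_bytes.strides, len(as_bytes))

# A view that skips every other element is no longer contiguous,
# and Python refuses to reinterpret it.
every_other = doubles[::2]
try:
    every_other.cast("B")
    cast_failed = False
except TypeError:
    cast_failed = True

print(info, bytes_info, every_other.contiguous, cast_failed)
```

Restricting a bytes view to C-contiguous buffers, as Python does, would be one way to sidestep the stride and offset concerns, at the cost of rejecting sliced arrays.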
... it also feels like there could be some safe subset of types which can be viewed as byte buffers, such as floats, which might be a variant of the proposed PyBuffer<u8>::get_as_bytes.
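As a rough sketch of what such a safe subset might look like, here is the idea expressed in pure Python. The allowlist PLAIN_FORMATS and the helper view_as_bytes are hypothetical names invented for this illustration, and the check is deliberately conservative: it only accepts a bare single-character format code, so anything with a byte-order prefix or a compound format is rejected.

```python
import struct
from array import array

# Hypothetical allowlist: integer and float format codes whose bytes
# are plain data. Pointer-like formats such as "P" or "O" are left out.
PLAIN_FORMATS = frozenset("bBhHiIlLqQfd")

def view_as_bytes(exporter):
    """Return a byte view of `exporter` if its format is on the allowlist."""
    view = memoryview(exporter)
    if view.format not in PLAIN_FORMATS:
        raise TypeError(f"refusing to view format {view.format!r} as bytes")
    return view.cast("B")

# Floats are accepted.
float_bytes = view_as_bytes(array("f", [1.0, 2.0]))

# A buffer of pointer-sized items (format "P") stands in for an object array.
pointer_like = memoryview(bytes(struct.calcsize("P"))).cast("P")
try:
    view_as_bytes(pointer_like)
    rejected = False
except TypeError:
    rejected = True

print(len(float_bytes), rejected)
```

The point is only that the decision can be made from the exporter's format string at runtime. Which formats belong on the list is an open question.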
I'm going to continue reflecting here to see which of these options feel good to me, all input welcome.