Controlling `PyBuffer` requests #5245

At present, the implementation of [`PyBuffer::get`](https://docs.rs/pyo3/latest/pyo3/buffer/struct.PyBuffer.html#method.get) uses [`PyBUF_FULL_RO`](https://docs.python.org/3/c-api/buffer.html#c.PyBUF_FULL_RO) as the flags, which requests the full information from the exporter so we can verify type and layout information.
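For readers less familiar with the buffer protocol, here is a rough pure-Python sketch of the kind of metadata such a request makes available. It uses the stdlib `array` module (imported as `std_array` to avoid confusion with numpy's `array` below), and `is_compatible` is a hypothetical helper that only loosely imitates a type check; it is not how `PyBuffer::get` is actually implemented.

```python
import struct
from array import array as std_array  # stdlib array, not numpy's

floats = std_array("f", [1.0, 2.0, 3.0])
view = memoryview(floats)

# Metadata the exporter reports about its buffer.
info = {
    "format": view.format,      # struct-style type code of each element
    "itemsize": view.itemsize,  # size of one element in bytes
    "ndim": view.ndim,
    "shape": view.shape,
    "strides": view.strides,
    "readonly": view.readonly,
}
print(info)


def is_compatible(view, fmt):
    """Hypothetical check: does the buffer hold elements of struct format `fmt`?"""
    return view.format.lstrip("@") == fmt and view.itemsize == struct.calcsize(fmt)


print(is_compatible(view, "f"))  # the element type matches
print(is_compatible(view, "B"))  # a float buffer is not a u8 buffer
```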

This is great for safety. However, it was pointed out to me recently that in pure Python it's possible to cast a non-u8 array to u8, which is potentially helpful for writing arbitrary buffers to a bytes slice:

```python
>>> from numpy import array

# observing the underlying bytes in float32 / float64 arrays
>>> print(bytes(array([1.0, 2.0, 3.0], dtype="float32")))
b'\x00\x00\x80?\x00\x00\x00@\x00\x00@@'
>>> print(bytes(array([1.0, 2.0, 3.0], dtype="float64")))
b'\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'

# even object arrays support this
>>> print(bytes(array([object(), object(), object()])))
b'\xc0\xa6w\xfe[{\x00\x00\x80\xa6w\xfe[{\x00\x00\xe0\xa6w\xfe[{\x00\x00'
```

Reinterpreting an array of one type as another is probably in general as bad as a transmute, so I think this is fundamentally an `unsafe` operation. In particular, viewing object pointers as bytes is unlikely to be helpful, because the addresses won't be stable from one execution of a program to the next. Even worse, modifying such an array of pointers at the u8 level is very likely to cause segfaults. This can even be done in pure Python:

```python
from numpy import array

x = array(["foo"], dtype="object")

# corrupt the first byte of the address of "foo" object
memoryview(x).cast("B")[0] = 123
# segfaults
print(x[0])
```

That said, the floating-point cast above can probably be seen as equivalent to `to_ne_bytes` if you squint, so this isn't always unsafe.
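As a small illustration of that equivalence, the sketch below compares the byte view of a float buffer with packing each value in native byte order, which is roughly what `f32::to_ne_bytes` produces in Rust. It uses the stdlib `array` module (as `std_array`) rather than numpy so that it runs without extra dependencies.

```python
import struct
from array import array as std_array  # stdlib array, not numpy's

values = [1.0, 2.0, 3.0]
floats = std_array("f", values)

# Reinterpret the float buffer as unsigned bytes without copying the elements.
raw = bytes(memoryview(floats).cast("B"))

# Pack each value individually as a 4-byte float in native byte order.
expected = b"".join(struct.pack("=f", v) for v in values)

print(raw == expected)
```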

----

I think there's a reasonable case that we should expose some ways to do this carefully:
- It's IMO a valid use-case to want to be able to read an arbitrary buffer as a `u8` slice so that it can be sent to some writer
- Arguably the safety requirements / documentation we can put on these methods can help users navigate the footguns better than in pure Python, where it seems easy to trip oneself up when reinterpreting buffers in this way. Indeed, we could make the case that not having any way to do this encourages users to resort to workarounds in (dynamic) Python, at a cost to both safety and performance.

That said, I'm not entirely sure what API we need. Some ideas:
- We could maybe have `PyAnyBuffer` which doesn't check content format and defers this to runtime
- We could have a special case for an (`unsafe`) `PyBuffer<u8>::get_as_bytes` which could do this reinterpretation of other buffers as bytes at runtime.
  - Not sure how this interacts with assumptions currently in the `PyBuffer<u8>` implementation for things like strides and offsets when the exporter's itemsize != 1 (i.e. elements wider than one byte).
- Maybe we want a more general (`unsafe`) way to allow any buffer to be interpreted as `PyBuffer<T>` as long as the layout is sufficiently compatible.
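To make the strides / itemsize concern concrete, here is a pure-Python sketch of how the layout metadata changes when a typed buffer is viewed as bytes. It only shows what `memoryview.cast` does with a stdlib `array` (imported as `std_array`); how a PyO3 API should handle the same cases is exactly the open question above.

```python
from array import array as std_array  # stdlib array, not numpy's

doubles = std_array("d", [1.0, 2.0, 3.0])
typed = memoryview(doubles)
as_bytes = typed.cast("B")

# Three 8-byte elements become twenty-four 1-byte elements.
print(typed.itemsize, typed.shape, typed.strides)           # 8 (3,) (8,)
print(as_bytes.itemsize, as_bytes.shape, as_bytes.strides)  # 1 (24,) (1,)

# A strided (non-contiguous) view cannot be flattened into bytes this way:
# memoryview only permits casts of C-contiguous views.
strided = typed[::2]
cast_rejected = False
try:
    strided.cast("B")
except TypeError:
    cast_rejected = True
print(strided.c_contiguous, cast_rejected)
```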

... it also feels like there could be some safe subset of which types can be viewed as bytes buffers, such as floats, which might be a variant of the proposed `PyBuffer<u8>::get_as_bytes`.
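One way to picture such a safe subset, sketched in pure Python: `view_as_bytes` and `SAFE_FORMATS` are hypothetical names, and the allowlist is only an assumption about which element types are harmless to read as bytes. The helper rejects anything else (object pointers, structs) and returns a read-only view, so the bytes cannot be used to corrupt the original elements.

```python
# Assumed allowlist: native single-character formats for plain integers and floats.
SAFE_FORMATS = frozenset("bBhHiIlLqQnNefd")


def view_as_bytes(obj):
    """Hypothetical helper: read-only byte view of a plain numeric buffer."""
    view = memoryview(obj)
    fmt = view.format
    if fmt.startswith("@"):  # "@" marks native layout and is optional
        fmt = fmt[1:]
    if fmt not in SAFE_FORMATS:
        raise TypeError(f"refusing to view format {view.format!r} as bytes")
    if not view.c_contiguous:
        raise BufferError("only C-contiguous buffers can be viewed as bytes")
    return view.cast("B").toreadonly()
```

Usage might look like `writer.write(view_as_bytes(some_float_array))`, where a buffer of objects or a strided view would be rejected up front instead of silently exposing raw memory.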

I'm going to continue reflecting here to see which of these options feel good to me, all input welcome.

