-
Notifications
You must be signed in to change notification settings - Fork 116
Open
Description
Problem
Texera processes data in relational tables, where values are currently inlined within tuples. Consider the case where a user uploads a file and wants to process it through multiple operators in Texera. The straightforward way to represent the file is as a binary attribute (i.e., a cell in a tuple).
User A's use case:
User A uploads an 8 GB dataset and wants to process it with an R UDF operator in Texera. One way to do so is use a file scan operator to embed the file in a tuple as a binary attribute and passes it downstream (e.g., scan →R UDF). However, this fails for at least four reasons:
- Java byte[] constraint - JVM arrays are indexed by int, so a single array is limited to ~2 GB. An 8 GB payload cannot be stored in a single tuple field.
- Operator messaging - Texera uses Akka for inter-operator communication, which ultimately serializes data into byte arrays or equivalent formats (Java serialization, Protobuf, Kryo). These inherit the same <2 GB ceiling.
- Parquet cell size constraint (via Apache Iceberg) - Parquet requires each cell value to fit within a single byte[], which is limited to <2 GB in the Java implementation.
- Arrow limitation - Apache Arrow also enforces a 2 GB maximum per array buffer.
Design
- We’ll externalize large objects using pointers. The main motivation is the R UDF operator. It relies on Arrow for serialization/deserialization, and Arrow can only handle objects smaller than 2GB.
- Assumptions: no fault tolerance for now, read-only usage, and each pointer represents a single cell.
- We’ll introduce a BigObjectManager module to manage the object lifecycle. The details need to be refined. The tentative APIs are:
- create(bigObjectFile) -> ptr
- open(ptr) -> bigObjectFileStream
- delete(ptr)
- The lifecycle of a big object follows the workflow execution lifecycle:
- Created when an operator outputs a value larger than 2GB (or above a threshold).
- Deleted when the workflow execution is cleaned up.
- We don't need to consider reference count in the current design if we bundle the lifecycle of a big object with the workflow execution.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels