Design of Handling Big Object (Version 0)

# Problem

Texera processes data in relational tables, where values are currently inlined within tuples. Consider the case where a user uploads a file and wants to process it through multiple operators in Texera. The straightforward way to represent the file is as a binary attribute (i.e., a cell in a tuple).

User A's use case:
User A uploads an 8 GB dataset and wants to process it with an R UDF operator in Texera. One approach is to use a file scan operator that embeds the file in a tuple as a binary attribute and passes it downstream (e.g., scan → R UDF). However, this fails for at least four reasons:

1. Java byte[] constraint - JVM arrays are indexed by int, so a single byte[] holds at most 2^31 - 1 bytes (just under 2 GB). An 8 GB payload cannot be stored in a single tuple field.
2. Operator messaging - Texera uses Akka for inter-operator communication, which ultimately serializes data into byte arrays or equivalent formats (Java serialization, Protobuf, Kryo). These inherit the same <2 GB ceiling.
3. Parquet cell size constraint (via Apache Iceberg) - Parquet requires each cell value to fit within a single byte[], which is limited to <2 GB in the Java implementation.
4. Arrow limitation - Apache Arrow also enforces a 2 GB maximum per array buffer.
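
The first constraint can be checked with simple arithmetic. The snippet below is a minimal illustration, not Texera code; the class and method names are made up for this example. It uses the language-level limit of `Integer.MAX_VALUE` elements per array (the practical limit on a given JVM can be slightly lower).

```java
// Illustration only: why an 8 GB payload cannot live in one Java byte[].
// ByteArrayLimit and its methods are hypothetical names, not part of Texera.
class ByteArrayLimit {
    // Array lengths and indices are int, so no array can exceed
    // Integer.MAX_VALUE (2^31 - 1) elements.
    static final long MAX_ARRAY_BYTES = Integer.MAX_VALUE;

    // True if a payload of this size could be held in a single byte[].
    static boolean fitsInSingleArray(long payloadBytes) {
        return payloadBytes >= 0 && payloadBytes <= MAX_ARRAY_BYTES;
    }

    // Minimum number of byte[] chunks needed to hold the payload.
    static long minChunks(long payloadBytes) {
        return (payloadBytes + MAX_ARRAY_BYTES - 1) / MAX_ARRAY_BYTES;
    }

    public static void main(String[] args) {
        long eightGb = 8L * 1024 * 1024 * 1024; // User A's dataset size
        System.out.println("fits in one byte[]: " + fitsInSingleArray(eightGb));
        System.out.println("minimum chunks:     " + minChunks(eightGb));
    }
}
```

Splitting the payload into chunks would get around the array limit alone, but each chunk would still have to pass through the messaging, Parquet, and Arrow layers listed above.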

# Design

- We’ll externalize large objects: the object is stored outside the tuple, and the tuple carries only a pointer to it. The main motivation is the R UDF operator. It relies on Arrow for serialization/deserialization, and Arrow can only handle objects smaller than 2 GB.
- Assumptions: no fault tolerance for now, read-only usage, and each pointer represents a single cell.
- We’ll introduce a BigObjectManager module to manage the object lifecycle. The details need to be refined. The tentative APIs are:
  - create(bigObjectFile) -> ptr
  - open(ptr) -> bigObjectFileStream
  - delete(ptr)
- The lifecycle of a big object follows the workflow execution lifecycle:
  - Created when an operator outputs a value larger than 2 GB (or above a threshold).
  - Deleted when the workflow execution is cleaned up.
- Because the lifecycle of a big object is tied to the workflow execution, the current design does not need reference counting.
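
To make the tentative API concrete, here is a minimal sketch under several assumptions that are not part of the design: objects are stored in a local directory, a pointer is an opaque string, and there is one manager instance per workflow execution. The names `LocalBigObjectManager`, `shouldExternalize`, and `deleteAll` are hypothetical. In line with the stated assumptions, it offers read-only access and no fault tolerance.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one instance per workflow execution, backed by a local directory.
class LocalBigObjectManager {
    private final Path root; // storage directory for this execution
    private final Set<String> live = ConcurrentHashMap.newKeySet(); // pointers issued so far

    LocalBigObjectManager(Path root) throws IOException {
        this.root = Files.createDirectories(root);
    }

    // Assumed policy: externalize any value whose size exceeds the threshold.
    static boolean shouldExternalize(long sizeBytes, long thresholdBytes) {
        return sizeBytes > thresholdBytes;
    }

    // create(bigObjectFile) -> ptr: copy the file into managed storage.
    String create(Path bigObjectFile) throws IOException {
        String ptr = UUID.randomUUID().toString();
        Files.copy(bigObjectFile, root.resolve(ptr), StandardCopyOption.REPLACE_EXISTING);
        live.add(ptr);
        return ptr;
    }

    // open(ptr) -> bigObjectFileStream: read-only; the caller closes the stream.
    InputStream open(String ptr) throws IOException {
        if (!live.contains(ptr)) {
            throw new IOException("unknown pointer: " + ptr);
        }
        return Files.newInputStream(root.resolve(ptr));
    }

    // delete(ptr): remove a single object; unknown pointers are ignored.
    void delete(String ptr) throws IOException {
        if (live.remove(ptr)) {
            Files.deleteIfExists(root.resolve(ptr));
        }
    }

    // Called once when the workflow execution is cleaned up.
    // Everything is removed together, so no reference counting is needed.
    void deleteAll() throws IOException {
        for (String ptr : new ArrayList<>(live)) {
            delete(ptr);
        }
        Files.deleteIfExists(root);
    }
}
```

With this shape, an operator would call `create` when it produces a value above the threshold and emit the returned pointer in the tuple. A downstream operator such as the R UDF would call `open` and read the stream, so the full payload never has to fit in a single byte[] or Arrow buffer.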



