Skip to content

Design of Handling Big Object (Version 0) #3787

@kunwp1

Description

@kunwp1

Problem

Texera processes data in relational tables, where values are currently inlined within tuples. Consider the case where a user uploads a file and wants to process it through multiple operators in Texera. The straightforward way to represent the file is as a binary attribute (i.e., a cell in a tuple).

User A's use case:
User A uploads an 8 GB dataset and wants to process it with an R UDF operator in Texera. One way to do so is use a file scan operator to embed the file in a tuple as a binary attribute and passes it downstream (e.g., scan →R UDF). However, this fails for at least four reasons:

  1. Java byte[] constraint - JVM arrays are indexed by int, so a single array is limited to ~2 GB. An 8 GB payload cannot be stored in a single tuple field.
  2. Operator messaging - Texera uses Akka for inter-operator communication, which ultimately serializes data into byte arrays or equivalent formats (Java serialization, Protobuf, Kryo). These inherit the same <2 GB ceiling.
  3. Parquet cell size constraint (via Apache Iceberg) - Parquet requires each cell value to fit within a single byte[], which is limited to <2 GB in the Java implementation.
  4. Arrow limitation - Apache Arrow also enforces a 2 GB maximum per array buffer.

Design

  • We’ll externalize large objects using pointers. The main motivation is the R UDF operator. It relies on Arrow for serialization/deserialization, and Arrow can only handle objects smaller than 2GB.
  • Assumptions: no fault tolerance for now, read-only usage, and each pointer represents a single cell.
  • We’ll introduce a BigObjectManager module to manage the object lifecycle. The details need to be refined. The tentative APIs are:
    • create(bigObjectFile) -> ptr
    • open(ptr) -> bigObjectFileStream
    • delete(ptr)
  • The lifecycle of a big object follows the workflow execution lifecycle:
    • Created when an operator outputs a value larger than 2GB (or above a threshold).
    • Deleted when the workflow execution is cleaned up.
  • We don't need to consider reference count in the current design if we bundle the lifecycle of a big object with the workflow execution.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions