@carloea2 carloea2 commented Jan 31, 2026

What changes were proposed in this PR?

This PR introduces a PythonTemplateBuilder mechanism for creating Texera’s Python native operators. It refactors how Python code is generated around a new template concept, addressing prior issues with string formatting. Previously, Python-based operators were generated via raw string formatting, which is fragile: user text can contain {}, %, quotes, or newlines that break formatting. This PR makes codegen deterministic and safer by treating interpolated values as data segments.
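
To make the fragility concrete, here is a minimal Python sketch, not the PR's implementation. It compares pasting user text directly into generated source with shipping it as a base64 data segment. The helper names (`render_raw`, `render_encoded`, `compiles`) and the runtime function name `decode` that appears inside the generated source are assumptions made for illustration only.

```python
import base64


def render_raw(column: str) -> str:
    # Old style: user text is pasted straight into the Python source.
    return 'y_train = self.dataset["%s"]' % column


def render_encoded(column: str) -> str:
    # Sketch of the new style: user text travels as base64 data and is
    # decoded at runtime. "decode" is a hypothetical runtime helper name.
    payload = base64.b64encode(column.encode("utf-8")).decode("ascii")
    return 'y_train = self.dataset[decode("%s")]' % payload


def compiles(source: str) -> bool:
    # Syntax check only; names such as "self" or "decode" are not resolved.
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


# User text with quotes, braces, a percent sign, and a newline.
hostile = 'it\'s "quoted" {} %\nsecond line'

raw_ok = compiles(render_raw(hostile))          # quotes/newline break the literal
encoded_ok = compiles(render_encoded(hostile))  # base64 stays inside the quotes
```

The base64 alphabet contains no quotes, braces, or newlines, so the encoded segment cannot terminate the surrounding string literal regardless of what the user typed.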

Design

Diagram 1 (compile-time pyb expansion and validation)

This diagram describes the Scala compile-time flow when a developer writes a pyb"..." template: the pyb macro receives the literal parts and argument trees, verifies that literal segments are safe, classifies each interpolated argument (plain text vs. encodable vs. nested builder), and applies boundary validation to ensure encodable content cannot “break out” of its intended Python context. Each argument is evaluated once, runtime guards are injected when a nested builder is spliced in, and the pieces are concatenated into a PythonTemplateBuilder, which compacts adjacent text chunks and renders an encode() output where encodable values become decode-at-runtime segments before the generated Python is embedded into the operator payload.

sequenceDiagram
    participant Dev as Scala code
    participant SC as StringContext
    participant M as pyb macro
    participant EI as EncodableInspector
    participant BV as BoundaryValidator
    participant PTB as PythonTemplateBuilder

    Dev->>SC: pyb"t0 $a0 t1 $a1 t2"
    SC->>M: parts + arg trees
    M->>M: verify literal parts
    M->>EI: classify args
    loop each direct encodable arg
        M->>BV: validateCompileTime(left,right,prefixLine)
        BV-->>M: ok / abort
    end
    M->>M: eval each arg once into __pyb_argN
    loop each nested builder arg
        M->>BV: runtimeChecksForNestedBuilder(ctx,__pyb_argN)
        BV-->>M: injected guard if unsafe
    end
    M->>PTB: concat parts + __pyb_argN
    PTB-->>Dev: returns PythonTemplateBuilder
    PTB->>PTB: compact adjacent Text chunks
    PTB->>PTB: render Encode (encodable -> decode(base64))
    PTB-->>Dev: encode() returns python source string
    Dev->>Dev: embed generated python into operator payload

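
The last two builder steps in the diagram (compacting adjacent text chunks, then rendering in encode mode) can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: the real builder is Scala, and the class names `Text` and `Value`, the functions `compact` and `render_encode`, and the emitted `decode("...")` call shape are hypothetical stand-ins rather than the PR's API.

```python
import base64
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Text:
    """Literal template text written by the developer."""
    value: str


@dataclass(frozen=True)
class Value:
    """Encodable, user-controlled data."""
    value: str


Chunk = Union[Text, Value]


def compact(chunks: List[Chunk]) -> List[Chunk]:
    # Merge neighbouring Text chunks; Value chunks are never merged.
    out: List[Chunk] = []
    for chunk in chunks:
        if isinstance(chunk, Text) and out and isinstance(out[-1], Text):
            out[-1] = Text(out[-1].value + chunk.value)
        else:
            out.append(chunk)
    return out


def render_encode(chunks: List[Chunk]) -> str:
    # Text is emitted verbatim; each Value becomes a decode-at-runtime call.
    parts = []
    for chunk in compact(chunks):
        if isinstance(chunk, Text):
            parts.append(chunk.value)
        else:
            payload = base64.b64encode(chunk.value.encode("utf-8")).decode("ascii")
            parts.append('decode("%s")' % payload)
    return "".join(parts)


template = [Text("y_train = "), Text("self.dataset["), Value('my "col"'), Text("]")]
source = render_encode(template)
```

Keeping text and data as separate chunk types until the final render is what lets the output stay deterministic: the same chunks always produce the same source string.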

Diagram 2 (end-to-end runtime flow: UI → descriptor → worker decoding with cache)

This diagram illustrates the end-to-end pipeline from UI input to execution: the UI submits parameters (including user-controlled strings) to the Scala descriptor, where pyb expansion and PythonTemplateBuilder assembly produce a deterministic Python source string in “encode mode.” The encoded Python is embedded into the workflow plan payload, dispatched by the workflow service to the Python worker, and executed by the operator; during execution, the operator uses PythonTemplateDecoder to recover user text by decoding each encoded segment. An LRU cache (size 256) backs the decoder so repeated encoded strings decode once and subsequently reuse cached UTF-8 strings, reducing overhead while preserving strict decoding semantics.

sequenceDiagram
    autonumber
    participant UI as UI Web
    participant DESC as Descriptor (Scala)
    participant MAC as pyb macro (compile time)
    participant PTB as PythonTemplateBuilder
    participant PLAN as Plan payload
    participant SVC as Workflow service
    participant WK as Python worker
    participant OP as Python Operator
    participant DEC as PythonTemplateDecoder
    participant CACHE as lru_cache 256

    note over DESC,PTB: PyB related (Scala compile time codegen)
    UI->>DESC: submit params + code strings
    DESC->>MAC: pyb interpolation expands
    MAC-->>DESC: expanded builder + validation logic
    DESC->>PTB: assemble chunks (Text + Value)
    PTB-->>DESC: rendered python source (encode mode)

    note over DESC,WK: Plan + dispatch
    DESC->>PLAN: embed python source into payload
    PLAN->>SVC: submit workflow plan
    SVC->>WK: dispatch operator payload

    note over WK,DEC: Python runtime (worker executes generated source)
    WK->>OP: start operator with python source

    loop each encoded segment
        OP->>DEC: decode(base64)

        DEC->>CACHE: lookup(base64)
        alt cache hit
            CACHE-->>DEC: cached str
        else cache miss
            CACHE-->>DEC: miss
            DEC->>DEC: base64 decode + utf8 strict
            DEC->>CACHE: store(base64,str)
        end

        DEC-->>OP: recovered user text
    end

    OP-->>WK: execution continued
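
The decode loop in the diagram can be sketched with the standard library alone. This is an illustration, not the actual PythonTemplateDecoder: the function name `decode` is an assumption, while the cache size of 256 and the strict UTF-8 decoding come from the description above.

```python
import base64
from functools import lru_cache


@lru_cache(maxsize=256)
def decode(encoded: str) -> str:
    # Cache miss path: reject non-base64 input, then decode as strict UTF-8.
    raw = base64.b64decode(encoded, validate=True)
    return raw.decode("utf-8", errors="strict")


# The same encoded segment is decoded once and served from the cache afterwards.
segment = base64.b64encode("price (USD)".encode("utf-8")).decode("ascii")
first = decode(segment)
second = decode(segment)
stats = decode.cache_info()
```

Because `lru_cache` keys on the encoded string, an operator that reads the same column name for every tuple pays the base64 and UTF-8 cost only on the first call.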

Diagram 3 (test harness: generate code, reject raw-invalid, py_compile)

This diagram shows the automated verification path for Python native operators: ScalaTest uses ClassGraph to discover every PythonOperatorDescriptor, instantiates each descriptor, injects invalid raw strings into class fields annotated as JSON properties, and calls generatePythonCode() to produce the final Python source string. The test asserts that no “RawInvalid” marker appears in the generated output (indicating unsafe raw text did not leak), writes the source to a temporary source.py, and runs python -m py_compile to ensure the code is syntactically valid and compilable. Any raw-invalid leakage, compile error, or timeout causes the test to fail, enforcing consistent template-based code generation across operators.

sequenceDiagram
  autonumber
  participant TS as ScalaTest
  participant CG as ClassGraph scanner
  participant DESC as PythonOperatorDescriptor
  participant GEN as generatePythonCode
  participant SPEC as PythonCodeRawInvalidTextSpec
  participant PY as python -m py_compile
  participant FS as temp file (source.py)

  TS->>CG: scan descriptors in packages
  CG-->>TS: list of PythonOperatorDescriptor classes

  loop each descriptor class
    TS->>DESC: instantiate descriptor
    TS->>GEN: call generatePythonCode(descriptor)
    GEN-->>TS: python source string

    TS->>SPEC: assert RawInvalid marker not present
    alt marker leaked
      SPEC-->>TS: FAIL (invalid raw text leaked)
    else marker clean
      SPEC-->>TS: OK
      TS->>FS: write source to temp file
      TS->>PY: py_compile(temp file)
      alt compile error or timeout
        PY-->>TS: FAIL (compile/timeout)
      else compile ok
        PY-->>TS: PASS
      end
    end
  end
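
The per-descriptor check in the diagram (marker scan, temp file, `python -m py_compile` with a timeout) can be approximated as follows. The real harness is a ScalaTest spec; this Python sketch only mirrors its decision logic, and the function name `check_generated_source` and the 30-second default timeout are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

RAW_INVALID_MARKER = "RawInvalid"


def check_generated_source(source: str, timeout_seconds: float = 30.0) -> str:
    # Step 1: injected raw text must not appear verbatim in the output.
    if RAW_INVALID_MARKER in source:
        return "FAIL (invalid raw text leaked)"

    # Step 2: write the source to a temp file and byte-compile it
    # in a separate interpreter process.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "source.py"
        path.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-m", "py_compile", str(path)],
                capture_output=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            return "FAIL (compile/timeout)"

    # py_compile exits with a non-zero status when compilation fails.
    return "PASS" if result.returncode == 0 else "FAIL (compile/timeout)"


verdict = check_generated_source("from math import pi\narea = pi * 2 ** 2\n")
```

Note that `py_compile` checks syntax only; it does not import modules or execute the operator, so a passing result says nothing about runtime behavior.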

As a developer, how to use pyb to create your python-based operators

  1. Use EncodableString for any UI/user-controlled text

Before (raw String)

@JsonSchemaTitle("Ground Truth Attribute Column")
@AutofillAttributeName
var groundTruthAttribute: String = ""

@JsonSchemaTitle("Selected Features")
@AutofillAttributeNameList
var selectedFeatures: List[String] = _

After (EncodableString)

import org.apache.texera.amber.pybuilder.PyStringTypes.EncodableString

@JsonSchemaTitle("Ground Truth Attribute Column")
@AutofillAttributeName
var groundTruthAttribute: EncodableString = ""

@JsonSchemaTitle("Selected Features")
@AutofillAttributeNameList
var selectedFeatures: List[EncodableString] = _

  2. Write Python using pyb"""...""" and interpolate values with $param

Before (string interpolation with manual quoting)

val code =
  s"""
     |y_train = self.dataset[\"$groundTruthAttribute\"]
     |""".stripMargin

After (template + data: no manual quoting)

import org.apache.texera.amber.pybuilder.PythonTemplateBuilder.PythonTemplateBuilderStringContext

val code = pyb"""
  |y_train = self.dataset[$groundTruthAttribute]
  |""".encode //Automatic stripMargin applied inside the builder

  3. For optional arguments, represent them as small pyb fragments, then put them in the code template

Before (manual string concatenation + quote juggling)

val colorArg   = if (color.nonEmpty) s", color='$color'" else ""
val patternArg = if (pattern.nonEmpty) s", pattern_shape='$pattern'" else ""

val fig = s"fig = px.timeline(table, x_start='start', x_end='finish', y='task'$colorArg$patternArg)"

After (optional fragments are builders too)

val colorArg   = if (color.nonEmpty) pyb", color=$color" else pyb""
val patternArg = if (pattern.nonEmpty) pyb", pattern_shape=$pattern" else pyb""

val fig = pyb"""fig = px.timeline(table, x_start=$start, x_end=$finish, y=$task$colorArg$patternArg)"""

  4. Return .encode from generatePythonCode()

Before (returns raw string)

override def generatePythonCode(): String = {
  val finalCode =
    s"""
       |from pytexera import *
       |y_train = self.dataset[\"$groundTruthAttribute\"]
       |""".stripMargin
  finalCode
}

After (returns encoded output from builder)

override def generatePythonCode(): String = {
  val finalCode = pyb"""
    |from pytexera import *
    |y_train = self.dataset[$groundTruthAttribute]
    |"""
  finalCode.encode
}

  5. Avoid s"...", .format, and % formatting for Python codegen

Before (s / String.format / .format patterns)

// s"..."
return s"""table[\"${ele.attribute}\"].values.shape[0]"""

// String.format / "{}" placeholders
workflowParam = workflowParam + String.format("%s = {},", ele.parameter.getName)
portParam = portParam + String.format("%s(table['%s'].values[i]),", ele.parameter.getType, ele.attribute)

After (pyb templates end-to-end)

return pyb"""table[${ele.attribute}].values.shape[0]"""

workflowParam = pyb"$workflowParam${ele.parameter.getName} = {},"
portParam = pyb"$portParam${ele.parameter.getType}(table[${ele.attribute}].values[i]),"

  6. Write unit tests against the new template output

Before (expects quoted literals like 'start')

assert(
  opDesc.createPlotlyFigure().plain.contains(
    "fig = px.timeline(table, x_start='start', x_end='finish', y='task' , color='color' )"
  )
)

After (expects template output using variables, no embedded quotes)

assert(
  opDesc.createPlotlyFigure().plain.contains(
    "fig = px.timeline(table, x_start=start, x_end=finish, y=task , color=color )"
  )
)

Any related issues, documentation, discussions?

No

How was this PR tested?

The PR includes a comprehensive set of tests to ensure the new functionality works and that it doesn’t break existing workflows:

Unit Tests for PythonTemplateBuilder: New unit tests were added to verify that PythonTemplateBuilder correctly classifies and encodes segments. For example, tests likely feed in code strings with various edge cases (braces, percentage signs, quotes, etc.) and assert that the builder produces the expected spec output.

Unit Tests for PythonCodeRawInvalidTextSpec: Two new unit tests instantiate each Python native operator, call its generatePythonCode method, and check that the generated Python code compiles and that the string format is consistent.

Was this PR authored or co-authored using generative AI tooling?

Reviewed by ChatGPT 5.2

@github-actions github-actions bot added the engine, dependencies, fix, python, ci, and common labels Jan 31, 2026
@carloea2 carloea2 changed the title from "refactor(backend) Introducing python template builder" to "refactor(backend): Introducing python template builder" Jan 31, 2026
@chenlica chenlica requested a review from bobbai00 February 1, 2026 02:00
@bobbai00 bobbai00 changed the title from "refactor(backend): Introducing python template builder" to "feat(backend): introduce python UDF code template builder for Python based operators" Feb 1, 2026

@bobbai00 bobbai00 left a comment

Please add more details to the PR description. Specifically, briefly describe the motivation for this PR and how developers should build new Python-based operators using the Python template builder.

@bobbai00 bobbai00 changed the title from "feat(backend): introduce python UDF code template builder for Python based operators" to "feat(backend): introduce python code template builder for creating Python based operators" Feb 5, 2026

bobbai00 commented Feb 5, 2026

@carloea2 In the PR description:

  1. In the section "How to develop new Python-based operators", please add a simple example for each item (i.e., 1, 2, 3, 4, 5).

  2. Move "Design" to the front, under "What changes were proposed in this PR?", and add a short description for each diagram.


carloea2 commented Feb 6, 2026

Done, thanks.

@bobbai00 bobbai00 merged commit cfdad43 into apache:main Feb 9, 2026
20 of 21 checks passed