feat(backend): introduce python code template builder for creating Python based operators #4189
+4,268
−795
What changes were proposed in this PR?
This PR introduces a PythonTemplateBuilder mechanism for creating Texera’s Python native operators. It refactors how Python code is generated around a new template concept, addressing prior issues with string formatting. Previously, Python-based operators were created via raw string formatting, which is fragile: user text can contain {}, %, quotes, or newlines that break formatting. This PR makes codegen deterministic and safer by treating interpolated values as data segments.
Design
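To make the idea of treating interpolated values as data segments concrete, here is a minimal Python sketch. It is not Texera's implementation: the helper names render_naive, render_encoded, decode_segment, compiles, and run_generated are hypothetical, and base64 is assumed as the encoding only because the diagrams in this description mention decode(base64). The sketch contrasts naive string formatting, where a quote or newline in user text yields invalid Python, with an encoded segment that tolerates any content.

```python
import base64


def decode_segment(b64: str) -> str:
    """Recover user text from a base64 data segment (hypothetical helper)."""
    return base64.b64decode(b64).decode("utf-8")


def render_naive(user_text: str) -> str:
    """Fragile: splices user text directly into a Python string literal."""
    return "title = '%s'" % user_text


def render_encoded(user_text: str) -> str:
    """Safer: embeds only base64 characters; decoding happens at runtime."""
    b64 = base64.b64encode(user_text.encode("utf-8")).decode("ascii")
    return "title = decode_segment('%s')" % b64


def compiles(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


def run_generated(source: str) -> str:
    """Execute generated source and return the value it binds to `title`."""
    scope = {"decode_segment": decode_segment}
    exec(source, scope)
    return scope["title"]


tricky = "it's {50%}\nsecond line"
naive_ok = compiles(render_naive(tricky))      # quote and newline break the literal
encoded_ok = compiles(render_encoded(tricky))  # only base64 characters are embedded
recovered = run_generated(render_encoded(tricky))
```

Because the base64 alphabet contains no quotes, braces, percent signs, or newlines, the generated line has the same shape regardless of what the user typed, which is the property the template builder is meant to guarantee.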
Diagram 1 (compile-time pyb expansion and validation)
This diagram describes the Scala compile-time flow when a developer writes a pyb"..." template: the pyb macro receives the literal parts and argument trees, verifies that literal segments are safe, classifies each interpolated argument (plain text vs. encodable vs. nested builder), and applies boundary validation to ensure encodable content cannot “break out” of its intended Python context. Each argument is evaluated once, runtime guards are injected when a nested builder is spliced in, and the pieces are concatenated into a PythonTemplateBuilder, which compacts adjacent text chunks and renders an encode() output where encodable values become decode-at-runtime segments before the generated Python is embedded into the operator payload.
sequenceDiagram
    participant Dev as Scala code
    participant SC as StringContext
    participant M as pyb macro
    participant EI as EncodableInspector
    participant BV as BoundaryValidator
    participant PTB as PythonTemplateBuilder
    Dev->>SC: pyb"t0 $a0 t1 $a1 t2"
    SC->>M: parts + arg trees
    M->>M: verify literal parts
    M->>EI: classify args
    loop each direct encodable arg
        M->>BV: validateCompileTime(left,right,prefixLine)
        BV-->>M: ok / abort
    end
    M->>M: eval each arg once into __pyb_argN
    loop each nested builder arg
        M->>BV: runtimeChecksForNestedBuilder(ctx,__pyb_argN)
        BV-->>M: injected guard if unsafe
    end
    M->>PTB: concat parts + __pyb_argN
    PTB-->>Dev: returns PythonTemplateBuilder
    PTB->>PTB: compact adjacent Text chunks
    PTB->>PTB: render Encode (encodable -> decode(base64))
    PTB-->>Dev: encode() returns python source string
    Dev->>Dev: embed generated python into operator payload
Diagram 2 (end-to-end runtime flow: UI → descriptor → worker decoding with cache)
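The worker-side decoding step in this flow can be sketched in Python as follows. This is an illustration under assumptions, not the PR's PythonTemplateDecoder: the function name decode_segment is hypothetical, while the cache size of 256 and the strict UTF-8 decoding are taken from this PR description. functools.lru_cache keys the cache on the base64 string, so a segment that appears repeatedly is decoded only once.

```python
import base64
from functools import lru_cache


@lru_cache(maxsize=256)
def decode_segment(b64: str) -> str:
    """Decode one base64 data segment into text, failing on invalid input."""
    raw = base64.b64decode(b64, validate=True)   # reject non-base64 characters
    return raw.decode("utf-8", errors="strict")  # reject malformed UTF-8


original = "x_start='start' # 100%"
encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")

first = decode_segment(encoded)   # cache miss: decodes and stores the result
second = decode_segment(encoded)  # cache hit: returns the cached string
stats = decode_segment.cache_info()
```

Here stats.hits and stats.misses expose how often the cache was used, which is a cheap way to confirm that repeated segments are not decoded twice.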
This diagram illustrates the end-to-end pipeline from UI input to execution: the UI submits parameters (including user-controlled strings) to the Scala descriptor, where pyb expansion and PythonTemplateBuilder assembly produce a deterministic Python source string in “encode mode.” The encoded Python is embedded into the workflow plan payload, dispatched by the workflow service to the Python worker, and executed by the operator; during execution, the operator uses PythonTemplateDecoder to recover user text by decoding each encoded segment. An LRU cache (size 256) backs the decoder so repeated encoded strings decode once and subsequently reuse cached UTF-8 strings, reducing overhead while preserving strict decoding semantics.
sequenceDiagram
    autonumber
    participant UI as UI Web
    participant DESC as Descriptor (Scala)
    participant MAC as pyb macro (compile time)
    participant PTB as PythonTemplateBuilder
    participant PLAN as Plan payload
    participant SVC as Workflow service
    participant WK as Python worker
    participant OP as Python Operator
    participant DEC as PythonTemplateDecoder
    participant CACHE as lru_cache 256
    note over DESC,PTB: PyB related (Scala compile time codegen)
    UI->>DESC: submit params + code strings
    DESC->>MAC: pyb interpolation expands
    MAC-->>DESC: expanded builder + validation logic
    DESC->>PTB: assemble chunks (Text + Value)
    PTB-->>DESC: rendered python source (encode mode)
    note over DESC,WK: Plan + dispatch
    DESC->>PLAN: embed python source into payload
    PLAN->>SVC: submit workflow plan
    SVC->>WK: dispatch operator payload
    note over WK,DEC: Python runtime (worker executes generated source)
    WK->>OP: start operator with python source
    loop each encoded segment
        OP->>DEC: decode(base64)
        DEC->>CACHE: lookup(base64)
        alt cache hit
            CACHE-->>DEC: cached str
        else cache miss
            CACHE-->>DEC: miss
            DEC->>DEC: base64 decode + utf8 strict
            DEC->>CACHE: store(base64,str)
        end
        DEC-->>OP: recovered user text
    end
    OP-->>WK: execution continued
Diagram 3 (test harness: generate code, reject raw-invalid, py_compile)
This diagram shows the automated verification path for Python native operators: ScalaTest uses ClassGraph to discover every PythonOperatorDescriptor, instantiates each descriptor, injects invalid raw strings into class fields marked with Json properties, and calls generatePythonCode() to produce the final Python source string. The test asserts that no “RawInvalid” marker appears in the generated output (indicating unsafe raw text did not leak), writes the source to a temporary source.py, and runs python -m py_compile to ensure the code is syntactically valid and compilable. Any raw-invalid leakage, compile error, or timeout causes the test to fail, enforcing consistent template-based code generation across operators.
sequenceDiagram
    autonumber
    participant TS as ScalaTest
    participant CG as ClassGraph scanner
    participant DESC as PythonOperatorDescriptor
    participant GEN as generatePythonCode
    participant SPEC as PythonCodeRawInvalidTextSpec
    participant PY as python -m py_compile
    participant FS as temp file (source.py)
    TS->>CG: scan descriptors in packages
    CG-->>TS: list of PythonOperatorDescriptor classes
    loop each descriptor class
        TS->>DESC: instantiate descriptor
        TS->>GEN: call generatePythonCode(descriptor)
        GEN-->>TS: python source string
        TS->>SPEC: assert RawInvalid marker not present
        alt marker leaked
            SPEC-->>TS: FAIL (invalid raw text leaked)
        else marker clean
            SPEC-->>TS: OK
            TS->>FS: write source to temp file
            TS->>PY: py_compile(temp file)
            alt compile error or timeout
                PY-->>TS: FAIL (compile/timeout)
            else compile ok
                PY-->>TS: PASS
            end
        end
    end
As a developer, how to use pyb to create your Python-based operators
Use EncodableString for any UI/user-controlled text
Before (raw String)
After (EncodableString)
Use pyb"""...""" and interpolate values with $param
Before (string interpolation with manual quoting)
After (template + data: no manual quoting)
Build optional pieces as pyb fragments, then put them in the code template
Before (manual string concatenation + quote juggling)
After (optional fragments are builders too)
Return .encode from generatePythonCode()
Before (returns raw string)
After (returns encoded output from builder)
Avoid s"...", .format, or % formatting for Python codegen
Before (s / String.format / .format patterns)
After (pyb templates end-to-end)
Before (expects quoted literals like 'start')
assert(
  opDesc.createPlotlyFigure().plain.contains(
    "fig = px.timeline(table, x_start='start', x_end='finish', y='task' , color='color' )"
  )
)
After (expects template output using variables, no embedded quotes)
assert(
  opDesc.createPlotlyFigure().plain.contains(
    "fig = px.timeline(table, x_start=start, x_end=finish, y=task , color=color )"
  )
)
Any related issues, documentation, discussions?
No
How was this PR tested?
The PR includes a comprehensive set of tests to ensure the new functionality works and that it doesn’t break existing workflows:
Unit Tests for PythonTemplateBuilder: New unit tests were added to verify that PythonTemplateBuilder correctly classifies and encodes segments. For example, tests likely feed in code strings with various edge cases (braces, percentage signs, quotes, etc.) and assert that the builder produces the expected spec output.
Unit Tests for PythonCodeRawInvalidTextSpec: two new unit tests instantiate each Python native operator, call its generatePythonCode method, and check that the generated Python code compiles and that the string format is consistent.
Was this PR authored or co-authored using generative AI tooling?
Reviewed by ChatGPT 5.2