
Qwen3-TTS: end-to-end XNNPACK text-to-speech bring-up#18508

Draft
seyeong-han wants to merge 5 commits into pytorch:main from seyeong-han:qwen3-tts-xnnpack-bringup

Conversation

@seyeong-han
Contributor

Summary

Add a complete Qwen3-TTS text-to-speech pipeline for ExecuTorch with XNNPACK backend, from model export through C++ inference to WAV output.

  • Single-PTE multi-method export (export_unified.py): packages encode_text, talker, code_predictor, codec_embed, cp_head, cp_generate, and decode_audio into one model.pte with 8da4w quantization and optional embedding quantization (4w/8w)
  • Fused sampling-aware cp_generate v2: collapses the 15-step host-side sub-code loop into a single XNNPACK graph call with inverse-CDF top-k(50) sampling, reducing per-step codegen cost by ~15-20%
  • Persistent SynthesisSession API: keeps the runner warm across sequential prompts with per-session RNG, detailed SynthesisTiming breakdowns (prompt prep, talker prefill, codegen, decode-audio), and automatic fast-path/legacy-fallback gating based on exported contract metadata
  • Multi-prompt warm benchmarking CLI: --prompts_path, --repeat, --seed, --disable_fused_cp_generate flags for honest generation-only latency measurement without startup tax
  • Contract tests: metadata, quality, runner, and prompt-flow test suites validating the export/runner API surface

Architecture

Text → encode_text → talker (prefill + autoregressive) → cp_generate (fused) → decode_audio → WAV
                                                          ↑ fallback: code_predictor + cp_head loop
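The stage flow above can be sketched as host-side dispatch logic (illustrative Python pseudocode of the C++ runner; the stage names come from the exported methods listed in this PR, but the function, its signature, and the stub callables are assumptions):

```python
def synthesize(text, stages, use_fused=True):
    """Run the Text -> WAV pipeline; fall back to the host-side
    code_predictor + cp_head loop when the fused method is absent."""
    hidden = stages["encode_text"](text)
    talker_out = stages["talker"](hidden)          # prefill + autoregressive
    if use_fused and "cp_generate" in stages:
        # fused fast path: one XNNPACK graph call per step
        codes = stages["cp_generate"](talker_out)
    else:
        # legacy fallback: 15 host-side sub-code round trips per step
        codes = []
        state = talker_out
        for _ in range(15):
            state = stages["code_predictor"](state)
            codes.append(stages["cp_head"](state))
    return stages["decode_audio"](codes)
```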

Benchmark (Apple M-series, XNNPACK 8da4w, warm single process)

| Path | Generation time | Codegen | Audio |
| --- | --- | --- | --- |
| Legacy (host loop) | 23.9 s | 21.1 s | 11.5 s |
| Fused cp_generate v2 | 19.6 s | 17.0 s | 10.8 s |

Commits (review order)

  1. 53ab54c — Initial XNNPACK bring-up: model wrappers, weight conversion, decode-only runner
  2. aa37d0f — Multi-bucket decoder, talker export, streaming decode scaffolding
  3. 510c0ff — Unified single-PTE export and C++ runner with text synthesis
  4. e3ddd29 — Align text synthesis with MLX reference semantics (English prefix, sampling, EOS)
  5. 498b6d2 — Fused cp_generate v2, SynthesisSession API, warm benchmark tooling

Test plan

  • python -m unittest passes all 4 test modules (28 tests)
  • C++ runner builds with cmake --build cmake-out/examples/models/qwen3-tts
  • Fresh model.pte exported from updated export_unified.py (no missing ops)
  • Legacy path produces clear speech: --text "..." --max_new_tokens 247
  • Fused fast path produces clear speech: --text "..." --top_k 50
  • Multi-prompt warm benchmark runs end-to-end with --prompts_path

Generated with assistance from Claude.

Made with Cursor

Commit messages

Commit 1 (53ab54c): initial XNNPACK bring-up

Implement a conversion/export/runtime path for the Qwen3-TTS speech
tokenizer decoder with XNNPACK on CPU: weight conversion from HF
snapshots, static-shape export, codec generation helper, and a C++
runner that decodes codec ids to WAV output.

Made-with: Cursor

Commit 2 (aa37d0f): multi-bucket decoder, talker export, streaming decode

Decoder performance: export at multiple fixed codes_len buckets
(75/150/300/600/1200) instead of a single 1200. The runner selects the
smallest bucket that fits the input, reducing vocoder padding waste from
13x to 1.6x for typical inputs. Measured 10.5x decode speedup (32.4s →
3.1s for 91 codes, 8da4w XNNPACK CPU).
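The bucket selection described above can be sketched in a few lines (illustrative Python; the real selection lives in the C++ runner, and only the bucket sizes and the 91-code example are taken from this commit):

```python
BUCKETS = [75, 150, 300, 600, 1200]  # fixed codes_len export buckets

def pick_bucket(codes_len, buckets=BUCKETS):
    """Smallest exported bucket that fits the input; inputs longer than
    the largest bucket would need chunking, which this sketch omits."""
    for b in buckets:
        if codes_len <= b:
            return b
    return buckets[-1]

# the 91-code example from this commit: bucket 150 pads ~1.6x,
# versus ~13.2x with a single fixed 1200-length export
assert pick_bucket(91) == 150
```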

Talker export: reuse the existing Llama/Qwen3 infrastructure to export
the talker backbone (28-layer transformer) and code predictor (5-layer)
as .pte models with static KV cache and 8da4w quantization. Weight
conversion maps HF talker checkpoint to Meta/Llama format. Talker runs
at 64ms/step, code predictor at 7.2ms/step on CPU.

Streaming decode: interleave code generation with incremental vocoder
decoding in 25-code chunks, yielding first audio at 2.15s instead of
waiting for all codes (3.97s non-streaming, 32.4s old baseline).
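A generator-style sketch of the chunked interleaving (Python; `decode_chunk` is a stand-in for the vocoder call, and only the 25-code chunk size comes from this commit):

```python
def stream_decode(codes, decode_chunk, chunk_size=25):
    """Yield audio as soon as each chunk_size codes are available,
    instead of decoding once after all codes have been generated."""
    buf = []
    for code in codes:           # codes may itself be a live generator
        buf.append(code)
        if len(buf) == chunk_size:
            yield decode_chunk(buf)
            buf = []
    if buf:                      # flush the final partial chunk
        yield decode_chunk(buf)
```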

This PR was authored with Claude.

Commit 3 (510c0ff): unified single-PTE export and C++ runner

Replaces the multi-bucket decoder-only pipeline with a single .pte file
containing all 6 pipeline stages (encode_text, talker, code_predictor,
codec_embed, cp_head, decode_audio), following the Parakeet multi-method
export pattern.

Key changes:
- export_unified.py: multi-method export with per-component quantization,
  dynamic-shape decoder (patched CausalConvNet for SymInt compat), and
  embedding quantization support (--qembedding 4w/8w)
- qwen3_tts_unified_runner: C++ runner with lazy method loading, XNNPACK
  warmup, automatic silence trimming, and decode-only backward compat
- generate_codes.py: added --trim-silence to strip conditioning prefix
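Silence trimming amounts to dropping leading near-zero samples from the decoded waveform; a minimal sketch (the function name and the amplitude threshold here are illustrative, not the runner's actual values):

```python
def trim_leading_silence(samples, threshold=1e-3):
    """Strip the silent conditioning prefix from a decoded waveform by
    skipping samples until the first one above the amplitude cutoff."""
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            return samples[i:]
    return []  # the whole clip was silence
```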

Model sizes: 1.0 GB (4w emb) / 1.2 GB (8w emb) / 2.1 GB (no emb quant)
Decode perf: 2.0s for 91 codes (3.6x realtime) after XNNPACK warmup

Authored with Claude.

Commit 4 (e3ddd29): align text synthesis with MLX reference semantics

Teach the unified runner and export path to mirror the MLX reference for dynamic text prompts, sampling behavior, and English codec prefix handling, so XNNPACK text synthesis stays coherent end to end. Add contract tests, checked-in manifests, and small export compatibility shims so the single-PTE workflow remains reproducible.

Made-with: Cursor

Commit 5 (498b6d2): fused cp_generate v2, SynthesisSession, warm benchmarks

Replace the greedy-only unrolled cp_generate export with a sampling-aware
v2 contract that performs inverse-CDF top-k(50) sampling inside the fused
XNNPACK graph, collapsing 15 host-side sub-code round trips into one call.
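For readers unfamiliar with the technique, here is a minimal host-side sketch of inverse-CDF top-k sampling (plain Python with an illustrative function name; the exported graph performs the equivalent math with tensor ops so all 15 sub-codes stay on the backend):

```python
import math
import random

def sample_top_k(logits, k=50, temperature=1.0, rng=None):
    """Draw one token id from the top-k softmax distribution by walking
    the cumulative distribution until it crosses a uniform draw
    (inverse-CDF sampling)."""
    rng = rng or random.Random()
    # keep the k largest logits as (index, value) pairs
    top = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    # temperature-scaled softmax over the kept logits (max-subtracted for stability)
    m = top[0][1] / temperature
    exps = [math.exp(v / temperature - m) for _, v in top]
    total = sum(exps)
    # inverse CDF: return the first index whose cumulative probability reaches u
    u = rng.random()
    cdf = 0.0
    for (idx, _), e in zip(top, exps):
        cdf += e / total
        if u <= cdf:
            return idx
    return top[-1][0]  # guard against floating-point undershoot
```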

Add a persistent SynthesisSession with per-session RNG so the runner stays
loaded/warmed across sequential prompts. Extend main_unified.cpp with
--prompts_path, --repeat, --seed, and --disable_fused_cp_generate flags for
multi-prompt warm benchmarking with generation-only timing breakdowns.

The runner gates the fast path on exported metadata (contract version,
top_k match, temperature threshold) and falls back to the legacy host-side
sub-code loop for older .pte artifacts or unsupported sampler modes.
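This gating reduces to a few metadata checks; a sketch with illustrative key names (the real keys are whatever export_unified.py writes into the .pte contract metadata):

```python
def use_fused_path(metadata, requested_top_k, temperature,
                   disable_fused=False):
    """Decide between the fused cp_generate fast path and the legacy
    host-side sub-code loop, based on exported contract metadata."""
    if disable_fused:                          # --disable_fused_cp_generate
        return False
    if metadata.get("cp_generate_contract") != 2:
        return False                           # older .pte artifact
    if metadata.get("cp_generate_top_k") != requested_top_k:
        return False                           # sampler is baked into the graph
    if temperature <= 0.0:
        return False                           # greedy request: unsupported mode
    return True
```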

Warm benchmark results show the fused path reduces per-step codegen cost
by ~15-20% compared to the legacy loop on the same XNNPACK artifact.

Generated with assistance from Claude.

Made-with: Cursor
@pytorch-bot

pytorch-bot bot commented Mar 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18508

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 498b6d2 with merge base 518daa8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed) on Mar 25, 2026
@seyeong-han seyeong-han marked this pull request as draft March 25, 2026 22:27
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

