Qwen3-TTS: end-to-end XNNPACK text-to-speech bring-up #18508
seyeong-han wants to merge 5 commits into pytorch:main
Conversation
Implement a conversion/export/runtime path for the Qwen3-TTS speech tokenizer decoder with XNNPACK on CPU: weight conversion from HF snapshots, static-shape export, a codec generation helper, and a C++ runner that decodes codec ids to WAV output. Made-with: Cursor
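The HF-to-Meta/Llama key renaming that a weight converter of this kind performs can be sketched as below. This is a hypothetical illustration following the usual Llama conversion convention (`embed_tokens` → `tok_embeddings`, `self_attn.q_proj` → `attention.wq`, and so on); the actual mapping used by the PR's conversion script is not shown here.

```python
import re

# Hypothetical HF -> Meta/Llama key mapping, following the common
# Llama checkpoint-conversion convention. Not the PR's exact table.
_PATTERNS = [
    (r"^model\.embed_tokens\.weight$", "tok_embeddings.weight"),
    (r"^model\.norm\.weight$", "norm.weight"),
    (r"^lm_head\.weight$", "output.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.q_proj\.weight$", r"layers.\1.attention.wq.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.k_proj\.weight$", r"layers.\1.attention.wk.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.v_proj\.weight$", r"layers.\1.attention.wv.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.o_proj\.weight$", r"layers.\1.attention.wo.weight"),
]

def remap_hf_to_meta(hf_key: str) -> str:
    """Rename one HF state-dict key to Meta/Llama naming; pass through
    keys that match no pattern."""
    for pattern, replacement in _PATTERNS:
        new_key, count = re.subn(pattern, replacement, hf_key)
        if count:
            return new_key
    return hf_key
```

Applied over a whole state dict, this rewrites per-layer attention projections while leaving unrecognized keys untouched, which makes missing-mapping bugs easy to spot.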
Decoder performance: export at multiple fixed `codes_len` buckets (75/150/300/600/1200) instead of a single 1200. The runner selects the smallest bucket that fits the input, reducing vocoder padding waste from 13x to 1.6x for typical inputs. Measured a 10.5x decode speedup (32.4s → 3.1s for 91 codes, 8da4w XNNPACK CPU).

Talker export: reuse the existing Llama/Qwen3 infrastructure to export the talker backbone (28-layer transformer) and code predictor (5-layer) as `.pte` models with static KV cache and 8da4w quantization. Weight conversion maps the HF talker checkpoint to Meta/Llama format. The talker runs at 64 ms/step and the code predictor at 7.2 ms/step on CPU.

Streaming decode: interleave code generation with incremental vocoder decoding in 25-code chunks, yielding first audio at 2.15s instead of waiting for all codes (3.97s non-streaming, 32.4s old baseline).

This PR was authored with Claude.
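The bucket-selection rule described above can be sketched as follows. `select_bucket` is a hypothetical name for illustration; the bucket sizes are the ones listed in the description.

```python
# Fixed codes_len buckets the decoder was exported at (from the PR text).
BUCKETS = [75, 150, 300, 600, 1200]

def select_bucket(codes_len: int, buckets=BUCKETS) -> int:
    """Return the smallest exported bucket that fits the input,
    falling back to the largest bucket if none does."""
    for bucket in buckets:
        if codes_len <= bucket:
            return bucket
    return buckets[-1]
```

For the 91-code example, the 150 bucket is chosen, so padding waste is 150/91 ≈ 1.6x rather than 1200/91 ≈ 13x with a single fixed-size export.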
Replaces the multi-bucket decoder-only pipeline with a single `.pte` file containing all 6 pipeline stages (`encode_text`, `talker`, `code_predictor`, `codec_embed`, `cp_head`, `decode_audio`), following the Parakeet multi-method export pattern.

Key changes:
- `export_unified.py`: multi-method export with per-component quantization, dynamic-shape decoder (patched CausalConvNet for SymInt compatibility), and embedding quantization support (`--qembedding 4w/8w`)
- `qwen3_tts_unified_runner`: C++ runner with lazy method loading, XNNPACK warmup, automatic silence trimming, and decode-only backward compatibility
- `generate_codes.py`: added `--trim-silence` to strip the conditioning prefix

Model sizes: 1.0 GB (4w emb) / 1.2 GB (8w emb) / 2.1 GB (no emb quant)
Decode perf: 2.0s for 91 codes (3.6x realtime) after XNNPACK warmup

Authored with Claude.
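The runner's lazy method loading can be illustrated with a minimal sketch. This is a pure-Python stand-in, not the ExecuTorch C++ API; the class and callback names are hypothetical.

```python
class LazyMethods:
    """Load each pipeline-stage method on first use, so startup only
    pays for the methods a given run actually calls (e.g. a decode-only
    run never loads the talker)."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # stand-in for something like program.load_method(name)
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._load_fn(name)   # load exactly once
        return self._cache[name]
```

Usage: wrapping the expensive load in `get` means repeated calls for the same stage (`decode_audio`, `talker`, ...) hit the cache, and unused stages are never materialized.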
Teach the unified runner and export path to mirror the MLX reference for dynamic text prompts, sampling behavior, and English codec prefix handling so XNNPACK text synthesis stays coherent end to end. Add contract tests, checked-in manifests, and small export compatibility shims so the single-PTE workflow remains reproducible. Made-with: Cursor
Replace the greedy-only unrolled `cp_generate` export with a sampling-aware v2 contract that performs inverse-CDF top-k(50) sampling inside the fused XNNPACK graph, collapsing 15 host-side sub-code round trips into one call.

Add a persistent `SynthesisSession` with per-session RNG so the runner stays loaded/warmed across sequential prompts. Extend `main_unified.cpp` with `--prompts_path`, `--repeat`, `--seed`, and `--disable_fused_cp_generate` flags for multi-prompt warm benchmarking with generation-only timing breakdowns.

The runner gates the fast path on exported metadata (contract version, top_k match, temperature threshold) and falls back to the legacy host-side sub-code loop for older `.pte` artifacts or unsupported sampler modes. Warm benchmark results show the fused path reduces per-step codegen cost by ~15-20% compared to the legacy loop on the same XNNPACK artifact.

Generated with assistance from Claude. Made-with: Cursor
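A host-side sketch of inverse-CDF top-k sampling, the technique named above: keep the k largest logits, softmax them at the given temperature, then invert the cumulative distribution with a single uniform draw. This is plain Python for illustration; the exported graph performs the equivalent with tensor ops, and the function name is hypothetical.

```python
import math
import random

def sample_topk_inverse_cdf(logits, k=50, temperature=1.0, rng=None):
    """Return the index of one token sampled from the top-k softmax
    distribution via a single inverse-CDF draw."""
    rng = rng or random.Random()
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]   # unnormalized softmax mass
    u = rng.random() * sum(weights)               # one uniform draw over the CDF
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w
        if u <= acc:
            return idx
    return top[-1]
```

Because only one uniform number is consumed per sub-code, this style of sampler is easy to fuse into a graph: the top-k, softmax, cumulative sum, and comparison are all expressible as tensor ops with no host round trip.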
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18508
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 498b6d2 with merge base 518daa8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary
Add a complete Qwen3-TTS text-to-speech pipeline for ExecuTorch with XNNPACK backend, from model export through C++ inference to WAV output.
- Unified single-PTE export (`export_unified.py`): packages `encode_text`, `talker`, `code_predictor`, `codec_embed`, `cp_head`, `cp_generate`, and `decode_audio` into one `model.pte` with 8da4w quantization and optional embedding quantization (4w/8w)
- Fused `cp_generate` v2: collapses the 15-step host-side sub-code loop into a single XNNPACK graph call with inverse-CDF top-k(50) sampling, reducing per-step codegen cost by ~15-20%
- `SynthesisSession` API: keeps the runner warm across sequential prompts with per-session RNG, detailed `SynthesisTiming` breakdowns (prompt prep, talker prefill, codegen, decode-audio), and automatic fast-path/legacy-fallback gating based on exported contract metadata
- Benchmark flags: `--prompts_path`, `--repeat`, `--seed`, `--disable_fused_cp_generate` for honest generation-only latency measurement without startup tax

Architecture
Benchmark (Apple M-series, XNNPACK 8da4w, warm single process)
Commits (review order)
- `53ab54c` — Initial XNNPACK bring-up: model wrappers, weight conversion, decode-only runner
- `aa37d0f` — Multi-bucket decoder, talker export, streaming decode scaffolding
- `510c0ff` — Unified single-PTE export and C++ runner with text synthesis
- `e3ddd29` — Align text synthesis with MLX reference semantics (English prefix, sampling, EOS)
- `498b6d2` — Fused cp_generate v2, SynthesisSession API, warm benchmark tooling

Test plan
- `python -m unittest` passes all 4 test modules (28 tests)
- `cmake --build cmake-out/examples/models/qwen3-tts` builds the runner
- `model.pte` exported from updated `export_unified.py` (no missing ops)
- Synthesis with `--text "..." --max_new_tokens 247`
- Sampling with `--text "..." --top_k 50`
- Multi-prompt run via `--prompts_path`

Generated with assistance from Claude.
Made with Cursor