Add streaming Silero VAD runner for real-time speech detection #18507
seyeong-han wants to merge 1 commit into pytorch:main
Add a new `silero_vad_stream_runner` CLI that reads 16kHz mono float32 PCM from stdin and outputs per-frame speech probabilities via a simple line protocol (`PROB <time> <probability>`). This enables real-time VAD as a subprocess for apps like the Voxtral Realtime macOS dictation app.

Changes:
- Add `reset_stream()` and `process_frame()` to SileroVadRunner for stateful frame-by-frame inference with persistent LSTM state
- Add `stream_main.cpp` as the streaming CLI entry point
- Update CMakeLists.txt to build both `silero_vad_runner` (offline) and `silero_vad_stream_runner` (streaming) targets
- Remove unnecessary `extension_llm_runner` dependency that caused build conflicts with sentencepiece headers
- Update Makefile `silero-vad-cpu` target to build both runners with `-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=OFF`
- Update README with streaming usage and architecture docs

Authored with assistance from Claude.
Made-with: Cursor
Summary
Add a streaming CLI entry point (`silero_vad_stream_runner`) for the Silero VAD model that enables real-time, frame-by-frame voice activity detection from stdin. This powers the "hey torch" wake-up feature in the Voxtral Realtime macOS app.

Changes
New: `silero_vad_stream_runner`
A CLI that reads 16kHz mono float32 PCM from stdin and outputs per-frame speech probabilities via a line protocol (`PROB <time> <probability>`).
This enables any app to run Silero VAD as a subprocess — pipe audio in, parse probabilities out. The Voxtral macOS app uses this for hands-free wake-up detection.
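On the consuming side, each line of runner output has to be split back into a time and a probability. The sketch below shows one way a C++ host might do that. The struct `ProbEvent` and the function `parse_prob_line` are hypothetical names introduced for illustration, and treating `<time>` as seconds is an assumption, since the description does not state the unit.

```cpp
#include <sstream>
#include <string>

// One parsed `PROB <time> <probability>` line.
// Hypothetical type; the unit of `time` is assumed to be seconds.
struct ProbEvent {
    double time = 0.0;
    double probability = 0.0;
};

// Parses a single line of runner output into `out`.
// Returns false (and leaves `out` untouched) if the line does not start
// with the token "PROB" followed by two numbers. Trailing tokens are ignored.
bool parse_prob_line(const std::string& line, ProbEvent& out) {
    std::istringstream in(line);
    std::string tag;
    ProbEvent ev;
    if (!(in >> tag) || tag != "PROB") {
        return false;
    }
    if (!(in >> ev.time >> ev.probability)) {
        return false;
    }
    out = ev;
    return true;
}
```

A host application would typically call this on every line read from the subprocess's stdout and skip any line for which it returns false.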
New: Streaming API on `SileroVadRunner`
- `reset_stream()`: re-initialize LSTM state and context buffers
- `process_frame(audio_data, num_samples)`: process a single 512-sample chunk, return speech probability, carry LSTM state forward

The existing `detect()` method now uses `process_frame()` internally, so offline and streaming paths share the same inference code.

Build changes
- `CMakeLists.txt`: add `silero_vad_stream_runner` target alongside `silero_vad_runner`
- Remove the `extension_llm_runner` link dependency that caused `string_view` ambiguity with sentencepiece headers
- `Makefile` `silero-vad-cpu` target: build both runners, configure with `-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=OFF`
- `README.md`: document streaming usage, architecture, and line protocol

Usage
Test plan
- `make silero-vad-cpu` builds both `silero_vad_runner` and `silero_vad_stream_runner`

Authored with assistance from Claude.
Made with Cursor