cohere-transcribe/docs/superpowers/specs/2026-05-29-live-streaming-transcription-design.md
2026-05-29 02:38:05 +08:00

Live Streaming Microphone Transcription

Summary

Add a --stream mode to transcribe.py that continuously captures audio from the microphone, detects speech segments using energy-based VAD, and transcribes each segment in near-real-time using the Cohere ASR model. Output scrolls as timestamped lines in the terminal. Ctrl+C stops the session.

Context

  • Model: CohereLabs/cohere-transcribe-03-2026; accepts audio clips of up to 35s, with a 5s overlap when auto-chunking longer audio
  • Inference speed: ~0.4s for 5-10s of audio on GPU (real-time factor 0.04-0.08, i.e. processing time divided by audio duration)
  • Microphone: PD200X Podcast Microphone via PipeWire, 16kHz mono
  • Existing code: transcribe.py has --mic (fixed duration) and demo file modes

Architecture

Audio Capture

sounddevice.InputStream with a callback streams 16kHz mono float32 audio into a thread-safe buffer. The callback appends raw samples; a separate consumer reads them.
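
A minimal sketch of this capture path, assuming queue.Queue as the thread-safe buffer and a block size of 800 samples (50 ms at 16 kHz). The names audio_queue, audio_callback, and open_stream are illustrative and not taken from transcribe.py. The sounddevice import is deferred so the callback can be exercised without audio hardware.

```python
import queue

SAMPLE_RATE = 16_000   # Hz, required by the model
BLOCK_SAMPLES = 800    # 50 ms per callback block (assumed block size)

# Thread-safe buffer between the audio callback and its consumer.
audio_queue = queue.Queue()


def audio_callback(indata, frames, time_info, status):
    """Called by sounddevice on its own thread for every captured block."""
    if status:
        # Overflow/underflow flags reported by PortAudio; log and keep going.
        print(f"audio status: {status}")
    # indata has shape (frames, channels) and its memory is reused after
    # the callback returns, so copy the single mono channel before queueing.
    audio_queue.put(indata[:, 0].copy())


def open_stream():
    """Create (but do not start) a 16 kHz mono float32 input stream."""
    import sounddevice as sd  # deferred: needs PortAudio and an input device

    return sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="float32",
        blocksize=BLOCK_SAMPLES,
        callback=audio_callback,
    )
```

A consumer would typically enter `with open_stream():` (which starts the stream) and call audio_queue.get() in a loop. Keeping the callback this small matters because it runs on the audio thread, where slow work can cause dropped input.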

Voice Activity Detection

Energy-based VAD using RMS amplitude over 50ms frames (800 samples at 16kHz):

  • Threshold: calibrated from 0.5s of ambient audio captured at startup, with a fixed fallback of -40 dBFS (an RMS of 0.01 when full scale is 1.0)
  • State machine: SILENCE -> SPEAKING -> SILENCE
    • SILENCE -> SPEAKING: RMS exceeds threshold for >= 3 consecutive frames (~150ms)
    • SPEAKING -> SILENCE: RMS stays below threshold for >= 0.8s
  • Pre-roll: ~0.3s of audio before speech onset is included to avoid clipping word beginnings
  • Safety cap: If speech exceeds 30s without a pause, force a chunk boundary (model max is 35s)
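
The rules above can be sketched as a small state machine. This is an illustration under assumptions, not the planned implementation: it uses plain Python lists for readability (real code would more likely compute RMS with numpy), and EnergyVad, frame_rms, and the constant names are hypothetical. Frame counts are derived from the 50 ms frame size: 0.8s is 16 frames, 0.3s is 6 frames, and 30s is 600 frames.

```python
import math
from collections import deque

ONSET_FRAMES = 3       # ~150 ms above threshold starts a segment
HANGOVER_FRAMES = 16   # 0.8 s below threshold ends it (16 x 50 ms)
PREROLL_FRAMES = 6     # ~0.3 s of audio kept from before the onset
MAX_FRAMES = 600       # 30 s safety cap (600 x 50 ms)


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of float samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


class EnergyVad:
    """SILENCE -> SPEAKING -> SILENCE state machine over 50 ms frames."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.speaking = False
        self.loud_run = 0    # consecutive loud frames while in SILENCE
        self.quiet_run = 0   # consecutive quiet frames while in SPEAKING
        self.segment = []
        # Holds the pre-roll plus the frames that confirm the onset.
        self.recent = deque(maxlen=PREROLL_FRAMES + ONSET_FRAMES)

    def push(self, frame):
        """Feed one frame; return a finished segment (list of frames) or None."""
        loud = frame_rms(frame) > self.threshold
        if not self.speaking:
            self.recent.append(frame)
            self.loud_run = self.loud_run + 1 if loud else 0
            if self.loud_run >= ONSET_FRAMES:
                # Onset confirmed: seed the segment with pre-roll + onset frames.
                self.speaking = True
                self.segment = list(self.recent)
                self.recent.clear()
                self.loud_run = 0
                self.quiet_run = 0
            return None
        self.segment.append(frame)
        self.quiet_run = 0 if loud else self.quiet_run + 1
        if self.quiet_run >= HANGOVER_FRAMES or len(self.segment) >= MAX_FRAMES:
            return self.flush()
        return None

    def flush(self):
        """Close the current segment (also usable at shutdown); None if empty."""
        segment, self.segment = self.segment, []
        self.speaking = False
        self.quiet_run = 0
        return segment or None
```

In this sketch a finished segment still contains the 0.8s of trailing silence that ended it, and a forced boundary at the safety cap drops back to SILENCE, so continued speech is picked up again after three loud frames without losing any audio.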

Threading Model

Two threads communicating via queue.Queue:

  1. Audio thread (sounddevice callback + VAD logic): captures audio, runs VAD state machine, pushes completed speech segments onto the queue
  2. Transcription thread: pulls segments from the queue, runs processor() -> model.generate() -> processor.decode(), prints results

No state is carried between segments; each one is transcribed independently.

Output

Timestamped lines printed to stdout as each segment is transcribed:

[00:03] Good morning, this is a test of the live captioning system.
[00:08] The model seems to be picking up my voice pretty well.
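
The line format shown above could be produced as follows. This sketch assumes the timestamp is MM:SS elapsed since listening began and that it marks the start of the segment; the spec does not state either point, and format_line is a hypothetical helper name.

```python
def format_line(elapsed_seconds, text):
    """Render one transcript line as '[MM:SS] text'."""
    minutes, seconds = divmod(int(elapsed_seconds), 60)
    return f"[{minutes:02d}:{seconds:02d}] {text}"
```

For example, format_line(3.7, "Good morning") gives "[00:03] Good morning". Printing with flush=True would keep lines from lingering in the stdout buffer when output is piped.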

Shutdown

Ctrl+C sets a stop flag via a SIGINT signal handler. The audio stream then stops, any buffered speech is flushed and transcribed, and the program exits cleanly.
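
One way this sequence might look, assuming a threading.Event as the stop flag. The stream, vad, segments, and worker arguments are duck-typed stand-ins: stream needs stop() and close(), vad needs a hypothetical flush() that returns pending speech or None, and done is whatever sentinel the transcription thread treats as end-of-input.

```python
import signal
import threading

stop_event = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only set the flag; cleanup happens in the main thread."""
    stop_event.set()


def install_stop_handler():
    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGINT, request_stop)


def shutdown(stream, vad, segments, worker, done):
    """Stop capture, flush buffered speech, then wait for the worker to finish."""
    stream.stop()
    stream.close()
    tail = vad.flush()   # speech still being accumulated when Ctrl+C arrived
    if tail:
        segments.put(tail)
    segments.put(done)   # sentinel so the worker exits after draining the queue
    worker.join()
```

The main thread would call install_stop_handler(), block on stop_event.wait(), and then call shutdown(). Doing the cleanup outside the handler keeps the handler itself trivial and avoids running queue or audio calls in signal context.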

CLI Interface

uv run python transcribe.py --stream            # stream, default language (en)
uv run python transcribe.py --stream --lang ja   # stream in Japanese
uv run python transcribe.py --mic [duration]     # existing fixed-duration mode
uv run python transcribe.py                      # existing demo file mode
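
These invocations could be parsed with argparse roughly as below. Several details are assumptions rather than facts about transcribe.py: that the script uses argparse at all, that --stream and --mic are mutually exclusive, and the 5-second default used when --mic is given without a duration.

```python
import argparse


def build_parser():
    """Argument parser covering the stream, mic, and demo-file modes."""
    parser = argparse.ArgumentParser(prog="transcribe.py")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--stream", action="store_true",
                      help="live microphone transcription until Ctrl+C")
    # nargs="?" makes the duration optional; const is the value used when
    # --mic appears without one (5.0 is a placeholder, not the real default).
    mode.add_argument("--mic", nargs="?", type=float, const=5.0, default=None,
                      metavar="duration",
                      help="record for a fixed number of seconds")
    parser.add_argument("--lang", default="en",
                        help="language code (default: en)")
    return parser
```

With neither flag set (stream is False and mic is None) the script would fall through to the existing demo file mode.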

Startup Sequence

  1. Print "Loading model..." and load model
  2. Record ~0.5s of ambient audio, compute silence threshold
  3. Print threshold info and "Listening... (Ctrl+C to stop)"
  4. Begin streaming
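
Step 2 might be implemented along these lines. The spec does not say how the threshold is derived from the ambient recording or when the fallback applies, so both are assumptions here: the threshold is the mean ambient RMS times a hypothetical margin of 3, and the -40 dBFS fallback is used when no usable ambient level is measured. dBFS is taken as 20 * log10(rms) with full scale at 1.0.

```python
import math

FALLBACK_DBFS = -40.0
MARGIN = 3.0  # hypothetical headroom above the measured noise floor


def dbfs_to_amplitude(dbfs):
    """Convert dBFS to linear amplitude (full scale = 1.0)."""
    return 10 ** (dbfs / 20)


def amplitude_to_dbfs(amplitude):
    """Convert linear amplitude to dBFS; digital silence maps to -inf."""
    return 20 * math.log10(amplitude) if amplitude > 0 else float("-inf")


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of float samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def calibrate_threshold(ambient_frames):
    """Derive an RMS threshold from ~0.5 s of ambient frames."""
    levels = [frame_rms(f) for f in ambient_frames if len(f)]
    noise_floor = sum(levels) / len(levels) if levels else 0.0
    if noise_floor <= 0:
        # Nothing usable was captured (or pure digital silence): fall back.
        return dbfs_to_amplitude(FALLBACK_DBFS)
    return noise_floor * MARGIN
```

The threshold info in step 3 could then be printed as, for instance, f"threshold: {amplitude_to_dbfs(threshold):.1f} dBFS".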

Dependencies

No new dependencies. Uses sounddevice and numpy (already available in the project) plus the standard-library modules threading, queue, signal, and time.

Code Organization

All new logic goes in transcribe.py, which grows from ~50 to ~150-180 lines. No new files.

Constraints

  • Model max input: 35s per chunk (safety cap at 30s)
  • Sampling rate must be 16kHz
  • Single-channel (mono) audio only