docs/superpowers/specs/2026-05-29-live-streaming-transcription-design.md

# Live Streaming Microphone Transcription

## Summary

Add a `--stream` mode to `transcribe.py` that continuously captures audio from the microphone, detects speech segments using energy-based VAD, and transcribes each segment in near-real-time using the Cohere ASR model. Output scrolls as timestamped lines in the terminal. Ctrl+C stops the session.

## Context

- **Model**: CohereLabs/cohere-transcribe-03-2026, max 35s audio clips, 5s overlap for auto-chunking
- **Inference speed**: ~0.4s for 5-10s of audio on GPU (a real-time factor of roughly 0.04-0.08)
- **Microphone**: PD200X Podcast Microphone via PipeWire, 16kHz mono
- **Existing code**: `transcribe.py` has `--mic` (fixed duration) and demo file modes

## Architecture

### Audio Capture

`sounddevice.InputStream` with a callback streams 16kHz mono float32 audio into a thread-safe buffer. The callback appends raw samples; a separate consumer reads them.
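
A minimal sketch of this capture path, assuming a `queue.Queue` as the thread-safe buffer and a 50ms block size (800 samples). The names `audio_q`, `audio_callback`, and `open_stream` are illustrative, not taken from `transcribe.py`.

```python
import queue
import sys

SAMPLE_RATE = 16_000   # Hz, required by the model
BLOCK_SIZE = 800       # assumed: one 50 ms block per callback

audio_q = queue.Queue()  # thread-safe buffer between callback and consumer


def audio_callback(indata, frames, time_info, status):
    """Called by sounddevice on its audio thread for every captured block."""
    if status:
        # Overflows and similar conditions are reported here, not raised.
        print(status, file=sys.stderr)
    # indata has shape (frames, channels); keep channel 0 and copy it,
    # because the underlying buffer is reused after the callback returns.
    audio_q.put(indata[:, 0].copy())


def open_stream():
    """Create (but do not start) the input stream."""
    import sounddevice as sd  # imported lazily so the module loads without audio hardware

    return sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="float32",
        blocksize=BLOCK_SIZE,
        callback=audio_callback,
    )
```

Keeping the callback down to a copy and a `put` leaves the heavier work to the consumer, which is the usual advice for audio callbacks.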

### Voice Activity Detection

Energy-based VAD using RMS amplitude over 50ms frames (800 samples at 16kHz):

- **Threshold**: Calibrated from ~0.5s of ambient silence at startup, with a fixed fallback of about -40 dBFS (an RMS amplitude of roughly 0.01 for float32 samples in [-1, 1])
- **State machine**: `SILENCE -> SPEAKING -> SILENCE`
  - SILENCE -> SPEAKING: RMS exceeds threshold for >= 3 consecutive frames (~150ms)
  - SPEAKING -> SILENCE: RMS stays below threshold for >= 0.8s
- **Pre-roll**: ~0.3s of audio before speech onset is included to avoid clipping word beginnings
- **Safety cap**: If speech exceeds 30s without a pause, force a chunk boundary (model max is 35s)
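
The rules above can be sketched as a small state machine that consumes one 50ms frame at a time. This is an illustration under assumptions rather than the final implementation: the class name `EnergyVAD`, the frame counts derived from the durations above, and the choice to stay in SPEAKING after a forced boundary are all hypothetical. It uses only the standard library so it can run anywhere; with real audio, NumPy would be the natural way to compute RMS.

```python
import math
from collections import deque

ONSET_FRAMES = 3       # ~150 ms above threshold starts a segment
HANGOVER_FRAMES = 16   # 0.8 s below threshold ends it
PREROLL_FRAMES = 6     # ~0.3 s kept from before speech onset
MAX_FRAMES = 600       # 30 s safety cap


def rms(frame):
    """Root-mean-square amplitude of one frame of float samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


class EnergyVAD:
    def __init__(self, threshold):
        self.threshold = threshold
        self.speaking = False
        # While silent, remember the pre-roll plus the frames that trigger onset.
        self.recent = deque(maxlen=PREROLL_FRAMES + ONSET_FRAMES)
        self.segment = []
        self.loud_run = 0
        self.quiet_run = 0

    def feed(self, frame):
        """Feed one frame; return a finished segment (list of frames) or None."""
        loud = rms(frame) > self.threshold
        if not self.speaking:
            self.recent.append(frame)
            self.loud_run = self.loud_run + 1 if loud else 0
            if self.loud_run >= ONSET_FRAMES:
                # SILENCE -> SPEAKING: seed the segment with the pre-roll.
                self.speaking = True
                self.segment = list(self.recent)
                self.recent.clear()
                self.loud_run = 0
                self.quiet_run = 0
            return None
        self.segment.append(frame)
        self.quiet_run = 0 if loud else self.quiet_run + 1
        if self.quiet_run >= HANGOVER_FRAMES:
            # SPEAKING -> SILENCE: the pause was long enough.
            self.speaking = False
            self.quiet_run = 0
            return self._take()
        if len(self.segment) >= MAX_FRAMES:
            # Safety cap: emit a chunk but keep listening as SPEAKING.
            return self._take()
        return None

    def _take(self):
        done, self.segment = self.segment, []
        return done
```

Sizing the `recent` buffer as pre-roll plus onset frames means the segment still starts about 0.3s before the first loud frame, even though detection only fires three frames later.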

### Threading Model

Two threads communicating via `queue.Queue`:

1. **Audio thread** (sounddevice callback + VAD logic): captures audio, runs VAD state machine, pushes completed speech segments onto the queue
2. **Transcription thread**: pulls segments from the queue, runs `processor() -> model.generate() -> processor.decode()`, prints results

No state is carried between segments; each one is transcribed independently.
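
A sketch of the consumer side of this hand-off, assuming each queue item is a `(start_seconds, audio)` tuple and that `None` serves as a stop sentinel. The `transcribe` and `emit` callables are stand-ins: in the real script `transcribe` would wrap the `processor() -> model.generate() -> processor.decode()` chain.

```python
import queue
import threading

STOP = None  # assumed sentinel: tells the worker to exit


def transcription_worker(segments, transcribe, emit):
    """Pull segments off the queue until the sentinel arrives."""
    while True:
        item = segments.get()
        if item is STOP:
            break
        start_s, audio = item
        # Each segment is handled on its own; nothing is carried over.
        emit(start_s, transcribe(audio))


def run_demo():
    """Drive the worker with two fake segments and a fake transcriber."""
    segments = queue.Queue()
    results = []
    worker = threading.Thread(
        target=transcription_worker,
        args=(
            segments,
            lambda audio: f"{len(audio)} samples",        # stand-in for the model
            lambda start_s, text: results.append((start_s, text)),
        ),
    )
    worker.start()
    segments.put((3.0, [0.0] * 16_000))
    segments.put((8.0, [0.0] * 32_000))
    segments.put(STOP)
    worker.join()
    return results
```

Because a single worker drains a FIFO queue, results come out in the order the segments were spoken, even when one segment takes longer to transcribe than the next.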

### Output

Timestamped lines printed to stdout as each segment is transcribed:

```
[00:03] Good morning, this is a test of the live captioning system.
[00:08] The model seems to be picking up my voice pretty well.
```
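
One way to produce lines in this shape, assuming the timestamp is elapsed time since listening began, shown as `MM:SS`. The spec does not say whether it marks the start of the segment or the moment of printing, and `format_line` is a hypothetical helper name.

```python
def format_line(elapsed_s, text):
    """Render one transcript line as '[MM:SS] text'."""
    minutes, seconds = divmod(int(elapsed_s), 60)
    return f"[{minutes:02d}:{seconds:02d}] {text}"


if __name__ == "__main__":
    print(format_line(3.4, "Good morning, this is a test of the live captioning system."))
```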

### Shutdown

Ctrl+C sets a stop flag via signal handler. The audio stream stops, any buffered speech is flushed and transcribed, then the program exits cleanly.
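
The shutdown order described here might look like the following sketch. It assumes a `threading.Event` as the stop flag and a `None` sentinel to end the transcription thread. `install_sigint_handler` and `drain_and_exit` are hypothetical names, and `stream` is any object with `stop()` and `close()`, as `sounddevice.InputStream` provides.

```python
import signal
import threading

stop_event = threading.Event()  # the "stop flag" checked by the main loop


def _on_sigint(signum, frame):
    # Keep the handler trivial: just record that Ctrl+C was pressed.
    stop_event.set()


def install_sigint_handler():
    """Route Ctrl+C to the stop flag (must be called from the main thread)."""
    signal.signal(signal.SIGINT, _on_sigint)


def drain_and_exit(stream, pending_segment, segments, worker):
    """Stop capture, flush buffered speech, and wait for the worker."""
    stream.stop()
    stream.close()
    if pending_segment:
        # Speech that was still being collected when Ctrl+C arrived.
        segments.put(pending_segment)
    segments.put(None)  # assumed sentinel that ends the worker loop
    worker.join()       # returns once the flushed segment is transcribed
```

The main loop would then poll `stop_event.is_set()` between frames and call `drain_and_exit` once it becomes true.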

## CLI Interface

```
uv run python transcribe.py --stream            # stream, default language (en)
uv run python transcribe.py --stream --lang ja   # stream in Japanese
uv run python transcribe.py --mic [duration]     # existing fixed-duration mode
uv run python transcribe.py                      # existing demo file mode
```
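
These invocations could be parsed with `argparse` roughly as follows. Several details are assumptions rather than facts about the existing script: that `--stream` and `--mic` are mutually exclusive, that `--lang` is accepted in every mode, and the 5-second default when `--mic` is given without a duration.

```python
import argparse

DEFAULT_MIC_SECONDS = 5.0  # hypothetical default; the spec does not state one


def build_parser():
    parser = argparse.ArgumentParser(prog="transcribe.py")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--stream",
        action="store_true",
        help="transcribe continuously from the microphone until Ctrl+C",
    )
    mode.add_argument(
        "--mic",
        nargs="?",                   # the duration is optional
        type=float,
        const=DEFAULT_MIC_SECONDS,   # used when --mic has no value
        default=None,                # None means --mic was not given
        metavar="DURATION",
        help="record a fixed number of seconds, then transcribe",
    )
    parser.add_argument("--lang", default="en", help="language code (default: en)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # With neither flag set, fall through to the demo file mode.
    print(args)
```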

### Startup Sequence

1. Print "Loading model..." and load model
2. Record ~0.5s of ambient audio, compute silence threshold
3. Print threshold info and "Listening... (Ctrl+C to stop)"
4. Begin streaming
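
Step 2 could be implemented along these lines. The spec does not say how the threshold is derived from the ambient recording, so the margin of 3x over ambient RMS is a placeholder, and using the -40 dBFS fallback when the recording is empty or digitally silent is likewise an assumption.

```python
import math

FALLBACK_DBFS = -40.0  # fallback level from the spec
MARGIN = 3.0           # hypothetical multiplier over the ambient level


def dbfs_to_amplitude(dbfs):
    """Convert dB relative to full scale into a linear amplitude (1.0 = full scale)."""
    return 10.0 ** (dbfs / 20.0)


def calibrate_threshold(ambient):
    """Derive an RMS threshold from ~0.5 s of ambient samples."""
    fallback = dbfs_to_amplitude(FALLBACK_DBFS)  # about 0.01
    if len(ambient) == 0:
        return fallback
    ambient_rms = math.sqrt(sum(x * x for x in ambient) / len(ambient))
    if ambient_rms <= 0.0:
        # Digital silence (e.g. a muted device) gives nothing to scale from.
        return fallback
    return ambient_rms * MARGIN
```

At 16kHz, 0.5s of ambient audio is 8000 samples, or ten of the 50ms VAD frames.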

## Dependencies

No new dependencies. Uses: `sounddevice`, `numpy`, `threading`, `queue`, `signal`, `time` (all already available).

## Code Organization

All new logic in `transcribe.py`. File grows from ~50 to ~150-180 lines. No new files.

## Constraints

- Model max input: 35s per chunk (safety cap at 30s)
- Sampling rate must be 16kHz
- Single-channel (mono) audio only

Add design spec for live streaming microphone transcription 2026-05-29 02:38:05 +08:00