Add design spec for live streaming microphone transcription

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-29 02:38:05 +08:00
parent c055a8ffb9
commit e0911653fe
3 changed files with 113 additions and 0 deletions
@@ -0,0 +1 @@
+use flake
@@ -0,0 +1,81 @@
+# Live Streaming Microphone Transcription
+
+## Summary
+
+Add a `--stream` mode to `transcribe.py` that continuously captures audio from the microphone, detects speech segments using energy-based VAD, and transcribes each segment in near-real-time using the Cohere ASR model. Output scrolls as timestamped lines in the terminal. Ctrl+C stops the session.
+
+## Context
+
+- **Model**: CohereLabs/cohere-transcribe-03-2026, max 35s audio clips, 5s overlap for auto-chunking
+- **Inference speed**: ~0.4s for 5-10s audio on GPU (0.04-0.08x real-time)
+- **Microphone**: PD200X Podcast Microphone via PipeWire, 16kHz mono
+- **Existing code**: `transcribe.py` has `--mic` (fixed duration) and demo file modes
+
+## Architecture
+
+### Audio Capture
+
+`sounddevice.InputStream` with a callback streams 16kHz mono float32 audio into a thread-safe buffer. The callback appends raw samples; a separate consumer reads them.
+
+### Voice Activity Detection
+
+Energy-based VAD using RMS amplitude over 50ms frames (800 samples at 16kHz):
+
+- **Threshold**: Calibrated from ~0.5s of ambient silence at startup, with a fixed fallback of about -40 dBFS (an RMS of roughly 0.01 relative to full scale)
+- **State machine**: `SILENCE -> SPEAKING -> SILENCE`
+  - SILENCE -> SPEAKING: RMS exceeds threshold for >= 3 consecutive frames (~150ms)
+  - SPEAKING -> SILENCE: RMS stays below threshold for >= 0.8s
+- **Pre-roll**: ~0.3s of audio before speech onset is included to avoid clipping word beginnings
+- **Safety cap**: If speech exceeds 30s without a pause, force a chunk boundary (model max is 35s)
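+
+The state machine above can be sketched roughly as follows. This is a minimal illustration, not the code in `transcribe.py`: the names `EnergyVad`, `frame_rms`, and the constants are hypothetical, frames are plain sequences of floats, and the fallback threshold assumes dBFS is measured against a full-scale amplitude of 1.0.
+
+```python
+import math
+from collections import deque
+
+# Frame counts derived from the spec's timings at 50 ms per frame.
+FRAME_SAMPLES = 800      # 50 ms at 16 kHz
+ONSET_FRAMES = 3         # 150 ms above threshold starts a segment
+HANGOVER_FRAMES = 16     # 0.8 s below threshold ends a segment
+PREROLL_FRAMES = 6       # 0.3 s kept from before speech onset
+MAX_FRAMES = 600         # 30 s safety cap
+
+# Assumption: -40 dBFS relative to full scale 1.0, i.e. 10 ** (-40 / 20).
+FALLBACK_THRESHOLD = 10 ** (-40 / 20)
+
+
+def frame_rms(frame):
+    """Root-mean-square amplitude of one frame of float samples."""
+    return math.sqrt(sum(s * s for s in frame) / len(frame))
+
+
+class EnergyVad:
+    """Hypothetical helper: feed frames in, get completed segments out."""
+
+    def __init__(self, threshold=FALLBACK_THRESHOLD):
+        self.threshold = threshold
+        self.speaking = False
+        # Ring buffer sized to hold the pre-roll plus the onset frames.
+        self.recent = deque(maxlen=PREROLL_FRAMES + ONSET_FRAMES)
+        self.segment = []
+        self.loud_run = 0
+        self.quiet_run = 0
+
+    def push(self, frame):
+        """Process one frame; return a list of frames when a segment ends."""
+        loud = frame_rms(frame) > self.threshold
+        if not self.speaking:
+            self.recent.append(frame)
+            self.loud_run = self.loud_run + 1 if loud else 0
+            if self.loud_run >= ONSET_FRAMES:
+                # SILENCE -> SPEAKING: seed the segment with buffered audio.
+                self.speaking = True
+                self.segment = list(self.recent)
+                self.recent.clear()
+                self.loud_run = 0
+                self.quiet_run = 0
+            return None
+        self.segment.append(frame)
+        self.quiet_run = 0 if loud else self.quiet_run + 1
+        if self.quiet_run >= HANGOVER_FRAMES or len(self.segment) >= MAX_FRAMES:
+            # SPEAKING -> SILENCE, or a forced boundary at the safety cap.
+            done, self.segment = self.segment, []
+            self.speaking = False
+            self.quiet_run = 0
+            return done
+        return None
+```
+
+After a forced boundary the sketch drops back to SILENCE, so continued speech re-triggers onset within three frames; those frames are held in the ring buffer, so no audio is lost across the cut.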
+
+### Threading Model
+
+Two threads communicating via `queue.Queue`:
+
+1. **Audio thread** (sounddevice callback + VAD logic): captures audio, runs VAD state machine, pushes completed speech segments onto the queue
+2. **Transcription thread**: pulls segments from the queue, runs `processor() -> model.generate() -> processor.decode()`, prints results
+
+No state is carried between segments; each is transcribed independently.
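+
+A minimal sketch of this producer/consumer hand-off, using only the standard library. The function names and the `_STOP` sentinel are illustrative assumptions, and `transcribe_segment` is a stand-in for the real `processor() -> model.generate() -> processor.decode()` call.
+
+```python
+import queue
+import threading
+
+_STOP = object()  # sentinel telling the consumer thread to exit
+
+
+def transcribe_segment(segment):
+    """Placeholder for the model call; reports the segment length only."""
+    return f"<{len(segment)} samples>"
+
+
+def transcription_worker(segments, emit):
+    """Pull segments off the queue until the sentinel arrives."""
+    while True:
+        segment = segments.get()
+        if segment is _STOP:
+            break
+        emit(transcribe_segment(segment))
+
+
+def run_demo():
+    segments = queue.Queue()
+    results = []
+    worker = threading.Thread(
+        target=transcription_worker, args=(segments, results.append)
+    )
+    worker.start()
+    # In the real tool the audio thread would push these after VAD.
+    for n in (16000, 32000):
+        segments.put([0.0] * n)
+    segments.put(_STOP)
+    worker.join()
+    return results
+```
+
+Because `queue.Queue` is FIFO and there is a single consumer, segments are emitted in the order they were spoken, even when one segment takes longer to transcribe than the next takes to record.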
+
+### Output
+
+Timestamped lines printed to stdout as each segment is transcribed:
+
+```
+[00:03] Good morning, this is a test of the live captioning system.
+[00:08] The model seems to be picking up my voice pretty well.
+```
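+
+One way to produce the `[MM:SS]` prefix, assuming the timestamp is the time elapsed since listening began (the spec does not say whether it marks the start or the end of a segment). `format_line` is a hypothetical helper name.
+
+```python
+def format_line(elapsed_seconds, text):
+    """Format a transcript line with a zero-padded MM:SS prefix."""
+    minutes, seconds = divmod(int(elapsed_seconds), 60)
+    return f"[{minutes:02d}:{seconds:02d}] {text}"
+```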
+
+### Shutdown
+
+Ctrl+C sets a stop flag via a signal handler. The audio stream then stops, any buffered speech is flushed and transcribed, and the program exits cleanly.
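+
+A rough sketch of the stop-flag pattern, assuming a `threading.Event` as the flag. The names `stop_requested`, `handle_sigint`, `install_handler`, and `main_loop` are hypothetical, and `flush` stands in for whatever drains and transcribes the remaining buffered speech.
+
+```python
+import signal
+import threading
+
+stop_requested = threading.Event()
+
+
+def handle_sigint(signum, frame):
+    """Only set the flag here; the real cleanup happens in the main loop."""
+    stop_requested.set()
+
+
+def install_handler():
+    # signal.signal may only be called from the main thread.
+    signal.signal(signal.SIGINT, handle_sigint)
+
+
+def main_loop(flush):
+    """Wait until the stop flag is set, then flush buffered speech once."""
+    while not stop_requested.wait(timeout=0.1):
+        pass
+    flush()
+```
+
+Keeping the handler down to a single `set()` call avoids doing model inference or I/O inside a signal handler; the flush runs in normal control flow afterwards.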
+
+## CLI Interface
+
+```
+uv run python transcribe.py --stream             # stream, default language (en)
+uv run python transcribe.py --stream --lang ja   # stream in Japanese
+uv run python transcribe.py --mic [duration]     # existing fixed-duration mode
+uv run python transcribe.py                      # existing demo file mode
+```
+
+### Startup Sequence
+
+1. Print "Loading model..." and load model
+2. Record ~0.5s of ambient audio, compute silence threshold
+3. Print threshold info and "Listening... (Ctrl+C to stop)"
+4. Begin streaming
+
+## Dependencies
+
+No new dependencies. Uses `sounddevice` and `numpy` (both already available) plus the standard-library modules `threading`, `queue`, `signal`, and `time`.
+
+## Code Organization
+
+All new logic goes in `transcribe.py`, which grows from ~50 to ~150-180 lines. No new files.
+
+## Constraints
+
+- Model max input: 35s per chunk (safety cap at 30s)
+- Sampling rate must be 16kHz
+- Single-channel (mono) audio only
@@ -0,0 +1,31 @@
+{
+  inputs = {
+    nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable";
+  };
+
+  outputs =
+    { nixpkgs, ... }:
+    let
+      system = "x86_64-linux";
+      pkgs = import nixpkgs {
+        inherit system;
+        config.allowUnfree = true;
+      };
+    in
+    {
+      devShells.${system}.default = pkgs.mkShell {
+        packages = with pkgs; [
+          uv
+          python314
+          portaudio
+          cudaPackages.cudatoolkit
+        ];
+
+        env = {
+          LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [
+            pkgs.cudaPackages.cudatoolkit
+          ];
+        };
+      };
+    };
+}