diff --git a/.envrc b/.envrc
new file mode 100644
index 0000000..3550a30
--- /dev/null
+++ b/.envrc
@@ -0,0 +1 @@
+use flake
diff --git a/docs/superpowers/specs/2026-05-29-live-streaming-transcription-design.md b/docs/superpowers/specs/2026-05-29-live-streaming-transcription-design.md
new file mode 100644
index 0000000..37c9de6
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-29-live-streaming-transcription-design.md
@@ -0,0 +1,81 @@
+# Live Streaming Microphone Transcription
+
+## Summary
+
+Add a `--stream` mode to `transcribe.py` that continuously captures audio from the microphone, detects speech segments using energy-based VAD, and transcribes each segment in near-real-time using the Cohere ASR model. Output scrolls as timestamped lines in the terminal. Ctrl+C stops the session.
+
+## Context
+
+- **Model**: CohereLabs/cohere-transcribe-03-2026, max 35s audio clips, 5s overlap for auto-chunking
+- **Inference speed**: ~0.4s for 5-10s audio on GPU (0.04-0.08x real-time)
+- **Microphone**: PD200X Podcast Microphone via PipeWire, 16kHz mono
+- **Existing code**: `transcribe.py` has `--mic` (fixed duration) and demo file modes
+
+## Architecture
+
+### Audio Capture
+
+`sounddevice.InputStream` with a callback streams 16kHz mono float32 audio into a thread-safe buffer. The callback appends raw samples; a separate consumer reads them.
+
+### Voice Activity Detection
+
+Energy-based VAD using RMS amplitude over 50ms frames (800 samples at 16kHz):
+
+- **Threshold**: Calibrated from ~0.5s of ambient silence at startup, with a sensible fallback (~-40 dBFS)
+- **State machine**: `SILENCE -> SPEAKING -> SILENCE`
+  - SILENCE -> SPEAKING: RMS exceeds threshold for >= 3 consecutive frames (~150ms)
+  - SPEAKING -> SILENCE: RMS stays below threshold for >= 0.8s
+- **Pre-roll**: ~0.3s of audio before speech onset is included to avoid clipping word beginnings
+- **Safety cap**: If speech exceeds 30s without a pause, force a chunk boundary (model max is 35s)
+
+### Threading Model
+
+Two threads communicating via `queue.Queue`:
+
+1. **Audio thread** (sounddevice callback + VAD logic): captures audio, runs VAD state machine, pushes completed speech segments onto the queue
+2. **Transcription thread**: pulls segments from the queue, runs `processor() -> model.generate() -> processor.decode()`, prints results
+
+No state carried between segments. Each is transcribed independently.
+
+### Output
+
+Timestamped lines printed to stdout as each segment is transcribed:
+
+```
+[00:03] Good morning, this is a test of the live captioning system.
+[00:08] The model seems to be picking up my voice pretty well.
+```
+
+### Shutdown
+
+Ctrl+C sets a stop flag via signal handler. The audio stream stops, any buffered speech is flushed and transcribed, then the program exits cleanly.
+
+## CLI Interface
+
+```
+uv run python transcribe.py --stream           # stream, default language (en)
+uv run python transcribe.py --stream --lang ja # stream in Japanese
+uv run python transcribe.py --mic [duration]   # existing fixed-duration mode
+uv run python transcribe.py                    # existing demo file mode
+```
+
+### Startup Sequence
+
+1. Print "Loading model..." and load model
+2. Record ~0.5s of ambient audio, compute silence threshold
+3. Print threshold info and "Listening... (Ctrl+C to stop)"
+4. Begin streaming
+
+## Dependencies
+
+No new dependencies. Uses: `sounddevice`, `numpy`, `threading`, `queue`, `signal`, `time` (all already available).
+
+## Code Organization
+
+All new logic in `transcribe.py`. File grows from ~50 to ~150-180 lines. No new files.
+
+## Constraints
+
+- Model max input: 35s per chunk (safety cap at 30s)
+- Sampling rate must be 16kHz
+- Single-channel (mono) audio only
diff --git a/flake.nix b/flake.nix
new file mode 100644
index 0000000..2b05d43
--- /dev/null
+++ b/flake.nix
@@ -0,0 +1,31 @@
+{
+  inputs = {
+    nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable";
+  };
+
+  outputs =
+    { nixpkgs, ... }:
+    let
+      system = "x86_64-linux";
+      pkgs = import nixpkgs {
+        inherit system;
+        config.allowUnfree = true;
+      };
+    in
+    {
+      devShells.${system}.default = pkgs.mkShell {
+        packages = with pkgs; [
+          uv
+          python314
+          portaudio
+          cudaPackages.cudatoolkit
+        ];
+
+        env = {
+          LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [
+            pkgs.cudaPackages.cudatoolkit
+          ];
+        };
+      };
+    };
+}
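
The VAD state machine described in the design spec (50ms frames, 3-frame onset, 0.8s hangover, ~0.3s pre-roll, 30s safety cap) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code that will land in `transcribe.py`: the class name `EnergyVad`, its `feed` method, and the constant names are hypothetical, and frames are plain Python sequences of floats so the sketch runs without `numpy` (the real implementation would presumably compute RMS on `numpy` arrays from the `sounddevice` callback).

```python
import math
from collections import deque

# Values taken from the spec; the constant names themselves are hypothetical.
SAMPLE_RATE = 16_000
FRAME_SAMPLES = 800        # 50 ms at 16 kHz
ONSET_FRAMES = 3           # ~150 ms above threshold starts a segment
HANGOVER_FRAMES = 16       # 0.8 s at or below threshold ends it
PREROLL_FRAMES = 6         # ~0.3 s kept from before speech onset
MAX_SEGMENT_FRAMES = 600   # 30 s safety cap (model max is 35 s)


def rms(frame):
    """Root-mean-square amplitude of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class EnergyVad:
    """Hypothetical segmenter: feed 50 ms frames, get back finished segments."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.speaking = False
        self.preroll = deque(maxlen=PREROLL_FRAMES)  # recent quiet frames
        self.pending = []  # consecutive loud frames seen while in SILENCE
        self.segment = []  # frames of the segment being built
        self.quiet = 0     # consecutive quiet frames seen while in SPEAKING

    def feed(self, frame):
        """Consume one frame; return a list of frames when a segment closes, else None."""
        loud = rms(frame) > self.threshold

        if not self.speaking:
            if loud:
                self.pending.append(frame)
                if len(self.pending) >= ONSET_FRAMES:
                    # SILENCE -> SPEAKING: pre-roll first, then the onset frames.
                    self.segment = list(self.preroll) + self.pending
                    self.preroll.clear()
                    self.pending = []
                    self.quiet = 0
                    self.speaking = True
            else:
                # A blip shorter than the onset count is folded into the pre-roll.
                self.preroll.extend(self.pending)
                self.pending = []
                self.preroll.append(frame)
            return None

        self.segment.append(frame)
        self.quiet = 0 if loud else self.quiet + 1
        if self.quiet >= HANGOVER_FRAMES or len(self.segment) >= MAX_SEGMENT_FRAMES:
            # SPEAKING -> SILENCE, or a forced boundary at the 30 s cap.
            done, self.segment = self.segment, []
            self.speaking = False
            self.quiet = 0
            return done
        return None
```

In the threading model from the spec, the audio thread would call `feed` once per 50ms frame and put each returned segment on the `queue.Queue` for the transcription thread. After a forced boundary at the cap, continued speech re-enters SPEAKING through the normal onset path, and the onset frames are kept, so no audio is dropped between chunks.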