Add design spec for live streaming microphone transcription
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +1,81 @@
# Live Streaming Microphone Transcription

## Summary

Add a `--stream` mode to `transcribe.py` that continuously captures audio from the microphone, detects speech segments using energy-based VAD, and transcribes each segment in near-real-time using the Cohere ASR model. Output scrolls as timestamped lines in the terminal. Ctrl+C stops the session.
## Context

- **Model**: CohereLabs/cohere-transcribe-03-2026, max 35s audio clips, 5s overlap for auto-chunking
- **Inference speed**: ~0.4s for 5-10s audio on GPU (0.04-0.08x real-time)
- **Microphone**: PD200X Podcast Microphone via PipeWire, 16kHz mono
- **Existing code**: `transcribe.py` has `--mic` (fixed duration) and demo file modes
## Architecture

### Audio Capture

`sounddevice.InputStream` with a callback streams 16kHz mono float32 audio into a thread-safe buffer. The callback appends raw samples; a separate consumer reads them.
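A minimal sketch of the capture path, assuming a `queue.Queue` of sample blocks serves as the thread-safe buffer (the spec does not name the buffer type). `audio_callback`, `open_stream`, and `block_queue` are illustrative names, not part of the existing code. `sounddevice` is imported lazily so the callback can be exercised without audio hardware.

```python
import queue

SAMPLE_RATE = 16_000  # Hz, required by the model per this spec

# Assumption: a Queue of sample blocks acts as the "thread-safe buffer".
block_queue: queue.Queue = queue.Queue()


def audio_callback(indata, frames, time_info, status):
    """Called by sounddevice on its own thread for every captured block."""
    if status:
        # Overflows and similar conditions are reported through `status`.
        print(f"audio status: {status}")
    # The library reuses `indata` after the callback returns, so copy it.
    block_queue.put(indata.copy())


def open_stream():
    """Return an InputStream configured for 16 kHz mono float32 capture."""
    import sounddevice as sd  # imported here so the sketch loads without PortAudio

    return sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="float32",
        callback=audio_callback,
    )
```

The stream would typically be used as a context manager (`with open_stream(): ...`) while a consumer drains `block_queue` and re-slices the blocks into fixed-size frames for the VAD.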
### Voice Activity Detection

Energy-based VAD using RMS amplitude over 50ms frames (800 samples at 16kHz):

- **Threshold**: Calibrated from ~0.5s of ambient silence at startup, with a sensible fallback (~-40 dBFS)
- **State machine**: `SILENCE -> SPEAKING -> SILENCE`
  - SILENCE -> SPEAKING: RMS exceeds threshold for >= 3 consecutive frames (~150ms)
  - SPEAKING -> SILENCE: RMS stays below threshold for >= 0.8s
- **Pre-roll**: ~0.3s of audio before speech onset is included to avoid clipping word beginnings
- **Safety cap**: If speech exceeds 30s without a pause, force a chunk boundary (model max is 35s)
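The rules above can be sketched as a small frame-level state machine. This is one possible reading of the spec, not the final implementation: the class name `EnergyVAD`, its `push` method, and the choice to pass a precomputed per-frame RMS are all assumptions made for illustration. Frame counts are derived from the 50ms frame size.

```python
from collections import deque

FRAME_MS = 50
ONSET_FRAMES = 3                     # ~150 ms above threshold starts a segment
HANGOVER_FRAMES = 800 // FRAME_MS    # 0.8 s below threshold ends it (16 frames)
PREROLL_FRAMES = 300 // FRAME_MS     # ~0.3 s kept before onset (6 frames)
MAX_FRAMES = 30_000 // FRAME_MS      # 30 s safety cap (600 frames)


class EnergyVAD:
    """SILENCE -> SPEAKING -> SILENCE, driven by one RMS value per frame."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.speaking = False
        # History kept while silent: pre-roll plus the frames that trigger onset.
        self.recent = deque(maxlen=PREROLL_FRAMES + ONSET_FRAMES)
        self.segment = []
        self.loud_run = 0
        self.quiet_run = 0

    def push(self, frame, rms):
        """Feed one frame; return a completed segment (list of frames) or None."""
        loud = rms > self.threshold
        if not self.speaking:
            self.recent.append(frame)
            self.loud_run = self.loud_run + 1 if loud else 0
            if self.loud_run >= ONSET_FRAMES:
                self.speaking = True
                self.segment = list(self.recent)  # includes the pre-roll
                self.recent.clear()
                self.loud_run = 0
                self.quiet_run = 0
            return None
        self.segment.append(frame)
        self.quiet_run = 0 if loud else self.quiet_run + 1
        if self.quiet_run >= HANGOVER_FRAMES or len(self.segment) >= MAX_FRAMES:
            done, self.segment = self.segment, []
            self.speaking = False
            self.quiet_run = 0
            return done
        return None
```

With NumPy frames, the RMS passed to `push` could be computed as `np.sqrt(np.mean(frame ** 2))`. In this sketch a forced boundary at the safety cap drops back to SILENCE, so continuing speech re-triggers after three loud frames; those frames stay in the history buffer and open the next segment.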
### Threading Model

Two threads communicating via `queue.Queue`:

1. **Audio thread** (sounddevice callback + VAD logic): captures audio, runs VAD state machine, pushes completed speech segments onto the queue
2. **Transcription thread**: pulls segments from the queue, runs `processor() -> model.generate() -> processor.decode()`, prints results

No state is carried between segments; each is transcribed independently.
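A sketch of the consumer side, under the assumption that queue items are `(start_seconds, audio)` tuples and that a `None` sentinel signals shutdown. `transcribe_fn` is a hypothetical stand-in for the `processor() -> model.generate() -> processor.decode()` chain, and `emit` is a hypothetical output hook; neither name comes from the existing code.

```python
import queue
import threading


def transcription_worker(segments, transcribe_fn, emit=print):
    """Consume (start_seconds, audio) items until a None sentinel arrives."""
    while True:
        item = segments.get()
        if item is None:  # sentinel pushed at shutdown
            break
        start_seconds, audio = item
        # Each segment is handled on its own; nothing is carried over.
        text = transcribe_fn(audio)
        emit(start_seconds, text)


def start_worker(segments, transcribe_fn, emit=print):
    """Start the transcription thread and return it so callers can join()."""
    worker = threading.Thread(
        target=transcription_worker,
        args=(segments, transcribe_fn, emit),
        daemon=True,
    )
    worker.start()
    return worker
```

Because inference takes roughly 0.4s per 5-10s segment, the queue should normally stay near empty; an unbounded `queue.Queue` is assumed here for simplicity.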
### Output

Timestamped lines printed to stdout as each segment is transcribed:

```
[00:03] Good morning, this is a test of the live captioning system.
[00:08] The model seems to be picking up my voice pretty well.
```
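The line format could be produced by a helper along these lines. It assumes the bracketed value is `MM:SS` elapsed since the session began; the spec does not say whether that marks segment start or print time. `format_line` is an illustrative name.

```python
def format_line(elapsed_seconds, text):
    """Render one transcript line as "[MM:SS] text"."""
    minutes, seconds = divmod(int(elapsed_seconds), 60)
    return f"[{minutes:02d}:{seconds:02d}] {text.strip()}"
```

Sessions longer than 99 minutes would simply widen the minutes field; an hours component is left out because the spec's examples do not show one.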
### Shutdown

Ctrl+C sets a stop flag via a signal handler. The audio stream stops, any buffered speech is flushed and transcribed, then the program exits cleanly.
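One way to implement the stop flag, assuming a `threading.Event` is used so that both threads can observe it. `stop_event`, `handle_sigint`, and `install_handler` are illustrative names.

```python
import signal
import threading

# Assumption: an Event is the "stop flag"; any thread can poll or wait on it.
stop_event = threading.Event()


def handle_sigint(signum, frame):
    """Only request shutdown here; cleanup happens in the main loop."""
    stop_event.set()


def install_handler():
    """Route Ctrl+C (SIGINT) to the stop flag. Must run on the main thread."""
    signal.signal(signal.SIGINT, handle_sigint)
```

A plausible shutdown order, given the threading model above: the main loop sees `stop_event`, closes the input stream, pushes any in-progress speech segment onto the queue, enqueues a sentinel, and joins the transcription thread before exiting.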
## CLI Interface

```
uv run python transcribe.py --stream              # stream, default language (en)
uv run python transcribe.py --stream --lang ja    # stream in Japanese
uv run python transcribe.py --mic [duration]      # existing fixed-duration mode
uv run python transcribe.py                       # existing demo file mode
```
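A hedged sketch of how these flags might be parsed with `argparse`. How the existing `transcribe.py` parses arguments is not shown in this spec, so everything here is an assumption: the mutually exclusive grouping, the `float` duration type, and in particular the placeholder default of 5.0 seconds for a bare `--mic`.

```python
import argparse


def build_parser():
    """Hypothetical parser covering the four invocations listed above."""
    parser = argparse.ArgumentParser(prog="transcribe.py")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--stream",
        action="store_true",
        help="transcribe the microphone continuously until Ctrl+C",
    )
    mode.add_argument(
        "--mic",
        nargs="?",
        type=float,
        const=5.0,  # placeholder: the real default duration is not in the spec
        default=None,
        metavar="duration",
        help="record for a fixed duration in seconds, then transcribe",
    )
    parser.add_argument("--lang", default="en", help="language code")
    return parser
```

With no mode flag, both `stream` and `mic` stay unset, which corresponds to the existing demo file mode.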
### Startup Sequence

1. Print "Loading model..." and load model
2. Record ~0.5s of ambient audio, compute silence threshold
3. Print threshold info and "Listening... (Ctrl+C to stop)"
4. Begin streaming
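Step 2 could look roughly like the following. The spec fixes only the ~0.5s window and the ~-40 dBFS fallback; the 3x margin over the ambient level and the rule for when the fallback applies (empty or silent capture) are assumptions, as are the function names. Plain Python is used so the arithmetic is easy to follow; with NumPy the RMS is `np.sqrt(np.mean(x ** 2))`.

```python
import math

FALLBACK_DBFS = -40.0  # from the spec: "sensible fallback (~-40 dBFS)"


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def dbfs_to_amplitude(dbfs):
    """Convert dB relative to full scale (1.0) into linear amplitude."""
    return 10.0 ** (dbfs / 20.0)


def calibrate_threshold(ambient, margin=3.0):
    """Return a speech threshold from ~0.5 s of ambient samples.

    Assumption: threshold = margin * ambient RMS, falling back to the
    fixed level when the capture is empty, silent, or not finite.
    """
    level = rms(ambient)
    if level <= 0.0 or not math.isfinite(level):
        return dbfs_to_amplitude(FALLBACK_DBFS)
    return level * margin
```

For reference, -40 dBFS corresponds to a linear amplitude of 0.01 on float32 audio in the range [-1.0, 1.0].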
## Dependencies

No new dependencies. Uses: `sounddevice`, `numpy`, `threading`, `queue`, `signal`, `time` (all already available).
## Code Organization

All new logic in `transcribe.py`. File grows from ~50 to ~150-180 lines. No new files.
## Constraints

- Model max input: 35s per chunk (safety cap at 30s)
- Sampling rate must be 16kHz
- Single-channel (mono) audio only
@@ -0,0 +1,31 @@
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable";
  };

  outputs =
    { nixpkgs, ... }:
    let
      system = "x86_64-linux";
      pkgs = import nixpkgs {
        inherit system;
        config.allowUnfree = true;
      };
    in
    {
      devShells.${system}.default = pkgs.mkShell {
        packages = with pkgs; [
          uv
          python314
          portaudio
          cudaPackages.cudatoolkit
        ];

        env = {
          LD_LIBRARY_PATH = pkgs.lib.makeLibraryPath [
            pkgs.cudaPackages.cudatoolkit
          ];
        };
      };
    };
}