GPU-accelerated speech-to-text using faster-whisper with the CTranslate2 backend, up to 4x faster than OpenAI's original Whisper implementation. Every segment gets start/end timestamps and is indexed for full-text search.
Audio is extracted from each video clip via FFmpeg, then fed to faster-whisper running on the RTX A6000 in float16 mode. The model produces word-level aligned segments with start/end timestamps. These segments are stored as JSON and indexed for full-text NLP search.
| Parameter | Value |
|---|---|
| Engine | faster-whisper 1.2.1 (CTranslate2) |
| Model Size | large-v3 |
| Compute | NVIDIA RTX A6000 48GB |
| Precision | float16 on GPU (primary) → int8 on GPU → CPU (fallback chain) |
| Audio Extraction | FFmpeg via ffmpeg-python wrapper |
| Output Format | JSON segments with start/end timestamps |
| Search | Full-text NLP across all indexed segments |
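The extraction and transcription steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`extract_audio`, `segment_to_dict`, `transcribe_clip`), the 16 kHz mono WAV intermediate, and the JSON field names are all assumptions. The `ffmpeg-python` and `faster_whisper` calls are imported lazily so the pure conversion helper can be used without either package installed.

```python
import json
from pathlib import Path


def extract_audio(video_path: str, wav_path: str) -> str:
    """Extract a 16 kHz mono WAV track from a video clip (format is an assumption)."""
    import ffmpeg  # ffmpeg-python, imported lazily

    (
        ffmpeg
        .input(video_path)
        .output(wav_path, ac=1, ar=16000, format="wav")
        .overwrite_output()
        .run(quiet=True)
    )
    return wav_path


def segment_to_dict(segment) -> dict:
    """Convert a faster-whisper segment into a JSON-serialisable record."""
    words = getattr(segment, "words", None) or []
    return {
        "start": round(segment.start, 3),
        "end": round(segment.end, 3),
        "text": segment.text.strip(),
        "words": [
            {"start": round(w.start, 3), "end": round(w.end, 3), "word": w.word.strip()}
            for w in words
        ],
    }


def transcribe_clip(video_path: str, out_json: str) -> list[dict]:
    """Run the primary configuration (large-v3, float16 on GPU) on one clip."""
    from faster_whisper import WhisperModel  # imported lazily

    wav_path = extract_audio(video_path, str(Path(out_json).with_suffix(".wav")))
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(wav_path, word_timestamps=True)
    # `segments` is a generator: inference actually runs while it is iterated.
    records = [segment_to_dict(s) for s in segments]
    Path(out_json).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

A call such as `transcribe_clip("clip.mp4", "clip.json")` (hypothetical file names) would write one JSON array of segments per clip.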
Segments are stored per-clip in the library JSON and rolled up into the searchable catalogue. Queries like "find all mentions of Scrooge" return matching segments with timestamps ready for timeline insertion.
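To illustrate how a keyword lookup over the rolled-up catalogue might work, here is a small inverted-index sketch. The source does not describe the actual index or the NLP layer, so everything here is an assumption: the catalogue shape (clip ID mapped to a list of segment records), the function names, and the sample data, which is made up for the example. It matches segments containing every query keyword; turning a natural-language request into keywords is out of scope.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(catalogue: dict) -> dict:
    """Map each token to the set of (clip_id, segment_position) pairs containing it."""
    index = defaultdict(set)
    for clip_id, segments in catalogue.items():
        for pos, seg in enumerate(segments):
            for token in tokenize(seg["text"]):
                index[token].add((clip_id, pos))
    return index


def search(index: dict, catalogue: dict, query: str) -> list[dict]:
    """Return segments that contain every keyword in the query, with their clip ID."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [{"clip": clip_id, **catalogue[clip_id][pos]} for clip_id, pos in sorted(hits)]


# Made-up sample data, for illustration only.
catalogue = {
    "clip_001": [
        {"start": 0.0, "end": 2.4, "text": "Marley was dead, to begin with."},
        {"start": 2.4, "end": 5.1, "text": "Scrooge knew he was dead."},
    ],
    "clip_002": [
        {"start": 10.0, "end": 12.5, "text": "Bah, humbug, said Scrooge."},
    ],
}
index = build_index(catalogue)
results = search(index, catalogue, "Scrooge")
```

Each result carries the clip ID plus the segment's `start` and `end`, which is the information a timeline insertion would need.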
If the primary float16 inference fails (for example, under VRAM pressure from concurrent ComfyUI or Resolve sessions), the engine cascades through progressively lighter configurations, so a clip still gets a transcript instead of crashing the pipeline.
| Priority | Config | VRAM |
|---|---|---|
| 1 (Primary) | large-v3 / float16 | ~4 GB |
| 2 (Fallback) | large-v3 / int8 | ~2.5 GB |
| 3 (Fallback) | medium / int8 | ~1.5 GB |
| 4 (Last resort) | small / CPU | None (CPU only) |
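The cascade in the table could be implemented along these lines. This is a hedged sketch, not the engine's actual code: `FALLBACK_CHAIN`, `run_with_fallback`, and `transcribe_attempt` are hypothetical names, the CPU `compute_type` is an assumption because the table gives no precision for that step, and the sketch assumes load or inference failures surface as `RuntimeError` or `ValueError`.

```python
from typing import Callable

# Configurations in priority order, mirroring the table above.
FALLBACK_CHAIN = [
    {"model": "large-v3", "device": "cuda", "compute_type": "float16"},
    {"model": "large-v3", "device": "cuda", "compute_type": "int8"},
    {"model": "medium", "device": "cuda", "compute_type": "int8"},
    # The source does not state a CPU precision; "default" is an assumption.
    {"model": "small", "device": "cpu", "compute_type": "default"},
]


def run_with_fallback(attempt: Callable[[dict], object], chain=FALLBACK_CHAIN):
    """Try each configuration in order; return (result, config) for the first success."""
    errors = []
    for config in chain:
        try:
            return attempt(config), config
        except (RuntimeError, ValueError) as exc:
            # Record the failure and fall through to the next, lighter configuration.
            errors.append(f"{config['model']}/{config['compute_type']}: {exc}")
    raise RuntimeError("all configurations failed: " + "; ".join(errors))


def transcribe_attempt(wav_path: str) -> Callable[[dict], list]:
    """Build an attempt function that loads the model and transcribes one file."""
    def attempt(config: dict) -> list:
        from faster_whisper import WhisperModel  # imported lazily

        model = WhisperModel(
            config["model"],
            device=config["device"],
            compute_type=config["compute_type"],
        )
        segments, _info = model.transcribe(wav_path, word_timestamps=True)
        # Materialise the generator here so inference errors are raised inside the try block.
        return list(segments)
    return attempt
```

Usage would look like `segments, used = run_with_fallback(transcribe_attempt("clip.wav"))`, with `used` recording which configuration succeeded. Note that the sketch still raises if even the CPU step fails, so a caller would need to handle that case.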