Transcription

Metrics

Metric	Value
Segments Indexed	148
Clips Processed	20
Faster Than Whisper	4×
GPU Precision	float16

How It Works

Processing Flow

Audio is extracted from each video clip with FFmpeg, then passed to faster-whisper, which runs on the RTX A6000 in float16 mode. The model returns segments with start/end timestamps and word-level alignment. These segments are stored as JSON and indexed for full-text NLP search.

Parameter	Value
Engine	faster-whisper 1.2.1 (CTranslate2)
Model Size	large-v3
Compute	NVIDIA RTX A6000 48GB
Precision	float16 (primary) → int8 → CPU (fallback chain, see GPU Fallback Chain)
Audio Extraction	FFmpeg via ffmpeg-python wrapper
Output Format	JSON segments with start/end timestamps
Search	Full-text NLP across all indexed segments
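
The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`build_ffmpeg_cmd`, `segment_to_dict`, `transcribe_clip`) and the mono 16 kHz WAV settings are assumptions, and FFmpeg is called through `subprocess` here rather than through the ffmpeg-python wrapper listed in the table. The faster-whisper import is deferred so the two helper functions can be used on a machine without the library or a GPU.

```python
import subprocess


def build_ffmpeg_cmd(video_path, wav_path):
    """Build an FFmpeg command that extracts mono 16 kHz audio (assumed settings)."""
    return [
        "ffmpeg", "-y",      # overwrite the output file if it already exists
        "-i", video_path,    # input video clip
        "-vn",               # drop the video stream
        "-ac", "1",          # downmix to mono
        "-ar", "16000",      # resample to 16 kHz
        wav_path,
    ]


def segment_to_dict(segment, clip_id):
    """Convert a segment object into the JSON structure stored per clip."""
    return {
        "id": segment.id,
        "start": segment.start,
        "end": segment.end,
        "text": segment.text.strip(),
        "clip_id": clip_id,
    }


def transcribe_clip(video_path, wav_path, clip_id):
    """Extract audio, transcribe it in float16 on the GPU, return segment dicts."""
    from faster_whisper import WhisperModel  # deferred: requires faster-whisper

    subprocess.run(build_ffmpeg_cmd(video_path, wav_path), check=True)
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(wav_path, word_timestamps=True)
    return [segment_to_dict(s, clip_id) for s in segments]
```

Keeping the command builder and the segment conversion as plain functions means they can be checked without running FFmpeg or loading the model.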

Output Format

Segment Structure

    // Each transcribed segment
    {
      "id": 42,
      "start": 125.3,
      "end": 131.8,
      "text": "I wear the chain I forged in life",
      "clip_id": "CC_SHOT_007"
    }

Segments are stored per-clip in the library JSON and rolled up into the searchable catalogue. Queries like "find all mentions of Scrooge" return matching segments with timestamps ready for timeline insertion.
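
A rough sketch of how such a query could be answered over the rolled-up segments. The source does not describe how the search is implemented, so this uses a plain case-insensitive keyword match as a stand-in; `search_segments` and the sample catalogue entries are hypothetical.

```python
def search_segments(segments, keyword):
    """Return segments whose text contains the keyword, ordered by clip and start time."""
    needle = keyword.casefold()
    hits = [s for s in segments if needle in s["text"].casefold()]
    return sorted(hits, key=lambda s: (s["clip_id"], s["start"]))


# Hypothetical catalogue entries using the segment structure shown above.
catalogue = [
    {"id": 42, "start": 125.3, "end": 131.8,
     "text": "I wear the chain I forged in life", "clip_id": "CC_SHOT_007"},
    {"id": 7, "start": 12.0, "end": 15.5,
     "text": "Scrooge never painted out old Marley's name", "clip_id": "CC_SHOT_001"},
]

for hit in search_segments(catalogue, "scrooge"):
    # Each hit carries the timestamps needed to place it on a timeline.
    print(f'{hit["clip_id"]} {hit["start"]:.1f}-{hit["end"]:.1f}s: {hit["text"]}')
```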

GPU Fallback Chain

Reliability Strategy

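If the primary float16 inference fails (for example, under VRAM pressure from ComfyUI or Resolve running concurrently), the engine cascades through progressively lighter configurations, ending with a CPU-only run. The intent is that a transcription job still produces output instead of crashing.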

Priority	Config	VRAM
1 (Primary)	large-v3 / float16	~4 GB
2 (Fallback)	large-v3 / int8	~2.5 GB
3 (Fallback)	medium / int8	~1.5 GB
4 (Last resort)	small / CPU	0 GB (CPU only)
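
The cascade in the table could be implemented along these lines. This is a sketch under stated assumptions, not the engine's actual code: `run_with_fallback` and `transcribe_with` are hypothetical names, and the compute type for the CPU step is not given in the source, so `int8` is assumed there.

```python
# Configurations from the table above, in priority order.
# The CPU compute type is an assumption; the source only says "small / CPU".
FALLBACK_CHAIN = [
    {"model": "large-v3", "device": "cuda", "compute_type": "float16"},
    {"model": "large-v3", "device": "cuda", "compute_type": "int8"},
    {"model": "medium", "device": "cuda", "compute_type": "int8"},
    {"model": "small", "device": "cpu", "compute_type": "int8"},
]


def run_with_fallback(attempt, chain=FALLBACK_CHAIN):
    """Call attempt(config) for each config in order; return (result, config) on first success."""
    errors = []
    for config in chain:
        try:
            return attempt(config), config
        except Exception as exc:  # e.g. a CUDA out-of-memory error
            errors.append((config, exc))
    raise RuntimeError(f"all {len(errors)} configurations failed: {errors}")


def transcribe_with(config, wav_path):
    """Load faster-whisper with one configuration and transcribe a file."""
    from faster_whisper import WhisperModel  # deferred: requires faster-whisper

    model = WhisperModel(
        config["model"],
        device=config["device"],
        compute_type=config["compute_type"],
    )
    segments, _info = model.transcribe(wav_path)
    # transcribe() returns a lazy generator; consume it here so that
    # inference failures are raised inside the try block of the caller.
    return list(segments)
```

A call such as `run_with_fallback(lambda cfg: transcribe_with(cfg, "clip.wav"))` returns both the segments and the configuration that succeeded, which makes it easy to log when a clip was transcribed on a fallback tier.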

Transcription Engine