Metrics

Segments Indexed: 148
Clips Processed: 20
Faster Than Whisper
GPU Precision: float16

How It Works

Processing Flow

Audio is extracted from each video clip with FFmpeg, then passed to faster-whisper running in float16 on the RTX A6000. The model returns segments with start/end timestamps and word-level alignment. The segments are stored as JSON and indexed for full-text NLP search.

Parameter         Value
Engine            faster-whisper 1.2.1 (CTranslate2)
Model Size        large-v3
Compute           NVIDIA RTX A6000 48GB
Precision         float16 (primary) → int8 → cpu (fallback chain)
Audio Extraction  FFmpeg via ffmpeg-python wrapper
Output Format     JSON segments with start/end timestamps
Search            Full-text NLP across all indexed segments
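
The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `transcripts` output directory, the 16 kHz mono WAV settings and the one-decimal rounding are all assumptions, and the FFmpeg call uses plain `subprocess` in place of the ffmpeg-python wrapper to keep the example short.

```python
import json
import subprocess
from pathlib import Path


def ffmpeg_extract_cmd(video: str, wav: str) -> list[str]:
    """Build an FFmpeg command that writes mono 16 kHz PCM audio (assumed settings)."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", video,
        "-vn",                 # drop the video stream
        "-ac", "1",            # mono
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit PCM WAV
        wav,
    ]


def segment_record(seg_id, start, end, text, clip_id):
    """Shape one segment like the JSON records shown in this document."""
    return {
        "id": seg_id,
        "start": round(start, 1),  # one decimal is an assumption based on the sample
        "end": round(end, 1),
        "text": text.strip(),
        "clip_id": clip_id,
    }


def transcribe_clip(video: str, clip_id: str, out_dir: str = "transcripts") -> list[dict]:
    """Extract audio, transcribe it on the GPU, and write the segments as JSON."""
    # Imported lazily so the helpers above work without the GPU stack installed.
    from faster_whisper import WhisperModel

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wav = str(out / f"{clip_id}.wav")
    subprocess.run(ffmpeg_extract_cmd(video, wav), check=True)

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(wav, word_timestamps=True)

    # `segments` is a generator; iterating it is what runs the transcription.
    records = [segment_record(s.id, s.start, s.end, s.text, clip_id) for s in segments]
    (out / f"{clip_id}.json").write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

A real pipeline would likely load the model once and reuse it across clips rather than constructing it per call as this sketch does.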

Output Format

Segment Structure

// Each transcribed segment
{
  "id": 42,
  "start": 125.3,
  "end": 131.8,
  "text": "I wear the chain I forged in life",
  "clip_id": "CC_SHOT_007"
}

Segments are stored per-clip in the library JSON and rolled up into the searchable catalogue. Queries like "find all mentions of Scrooge" return matching segments with timestamps ready for timeline insertion.
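
As a rough illustration of how such a query could work, the sketch below filters segment records by keyword and converts the start time to a timeline timecode. It is a simplified stand-in: a case-insensitive word match replaces whatever NLP search the catalogue actually uses, and the function names and the 24 fps frame rate are assumptions.

```python
def search_segments(segments, query):
    """Return segments whose text contains every query word (case-insensitive)."""
    terms = query.lower().split()
    return [s for s in segments if all(t in s["text"].lower() for t in terms)]


def to_timecode(seconds, fps=24):
    """Convert seconds to HH:MM:SS:FF at an assumed, non-drop-frame rate."""
    total_frames = round(seconds * fps)
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours, rem = divmod(total_seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frames:02d}"


# The sample segment from this document.
segments = [
    {
        "id": 42,
        "start": 125.3,
        "end": 131.8,
        "text": "I wear the chain I forged in life",
        "clip_id": "CC_SHOT_007",
    }
]

for hit in search_segments(segments, "chain forged"):
    print(hit["clip_id"], to_timecode(hit["start"]), hit["text"])
```

Returning the full record rather than only the text keeps `clip_id`, `start` and `end` available to whatever places the clip on the timeline.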


GPU Fallback Chain

Reliability Strategy

If the primary float16 inference fails (for example, under VRAM pressure from concurrent ComfyUI or Resolve sessions), the engine falls back through progressively lighter configurations, so a clip still yields a transcript instead of crashing the run.

Priority         Config              VRAM
1 (Primary)      large-v3 / float16  ~4 GB
2 (Fallback)     large-v3 / int8     ~2.5 GB
3 (Fallback)     medium / int8       ~1.5 GB
4 (Last resort)  small / CPU         0 GPU
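
One way to implement a cascade like the one in the table is to try each configuration in order and keep the first model that loads. The sketch below is hypothetical: `FALLBACK_CHAIN`, `default_loader` and `load_with_fallback` are illustrative names, int8 on the CPU step is an assumption (the table lists only "small / CPU"), and only load-time failures are handled. A fuller version would also retry when an out-of-memory error surfaces during transcription.

```python
# (model size, device, compute type), in priority order from the table above.
FALLBACK_CHAIN = [
    ("large-v3", "cuda", "float16"),
    ("large-v3", "cuda", "int8"),
    ("medium", "cuda", "int8"),
    ("small", "cpu", "int8"),  # compute type on CPU is an assumption
]


def default_loader(model_size, device, compute_type):
    """Load a faster-whisper model; imported lazily so the chain logic is testable."""
    from faster_whisper import WhisperModel

    return WhisperModel(model_size, device=device, compute_type=compute_type)


def load_with_fallback(loader=default_loader, chain=FALLBACK_CHAIN):
    """Return (model, config) for the first configuration that loads."""
    errors = []
    for config in chain:
        try:
            return loader(*config), config
        except Exception as exc:  # broad on purpose: any failure moves to the next step
            errors.append((config, exc))
    raise RuntimeError(f"all configurations failed: {errors}")
```

Passing the loader as a parameter makes it possible to exercise the cascade with a stub that simulates an out-of-memory error, without touching the GPU.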

Related