AllInVault uses a flexible stage-based pipeline architecture for podcast processing. The pipeline consists of the following stages:
- FETCH_METADATA: Retrieves episode metadata from YouTube
- ANALYZE_EPISODES: Analyzes episodes to determine whether they are full episodes or shorts
- DOWNLOAD_AUDIO: Downloads raw audio files in WebM format
- CONVERT_AUDIO: Converts WebM files to MP3 with parallel processing
- TRANSCRIBE_AUDIO: Transcribes MP3 files using the Deepgram API
- IDENTIFY_SPEAKERS: Identifies speakers in transcripts
┌───────────────────┐
│ FETCH_METADATA    │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ ANALYZE_EPISODES  │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ DOWNLOAD_AUDIO    │ ─── WebM files ─── ▶ /data/webm
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ CONVERT_AUDIO     │ ─── MP3 files ──── ▶ /data/audio
│ (Parallel)        │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ TRANSCRIBE_AUDIO  │ ─── Transcripts ── ▶ /data/transcripts
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ IDENTIFY_SPEAKERS │
└───────────────────┘
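The stage ordering can be sketched in a few lines of Python. This is a minimal illustration, not AllInVault's actual code: the `PipelineStage` enum, its integer values, and the `stages_between` and `run_pipeline` helpers are hypothetical names. Only the stage names and their order come from the list above.

```python
from enum import IntEnum
from typing import Callable, Dict, List


class PipelineStage(IntEnum):
    """Pipeline stages in execution order (integer values are an assumption)."""
    FETCH_METADATA = 1
    ANALYZE_EPISODES = 2
    DOWNLOAD_AUDIO = 3
    CONVERT_AUDIO = 4
    TRANSCRIBE_AUDIO = 5
    IDENTIFY_SPEAKERS = 6


def stages_between(start: PipelineStage, end: PipelineStage) -> List[PipelineStage]:
    """Return every stage from start to end, inclusive, in pipeline order."""
    if start > end:
        raise ValueError("start stage must not come after end stage")
    return [stage for stage in PipelineStage if start <= stage <= end]


def run_pipeline(
    handlers: Dict[PipelineStage, Callable[[], None]],
    start: PipelineStage = PipelineStage.FETCH_METADATA,
    end: PipelineStage = PipelineStage.IDENTIFY_SPEAKERS,
) -> List[str]:
    """Call the handler for each selected stage and return the names that ran."""
    executed: List[str] = []
    for stage in stages_between(start, end):
        handler = handlers.get(stage)
        if handler is None:
            continue  # no handler registered for this stage; skip it
        handler()
        executed.append(stage.name)
    return executed
```

Selecting a sub-range, for example `run_pipeline(handlers, PipelineStage.DOWNLOAD_AUDIO, PipelineStage.CONVERT_AUDIO)`, would rerun only the audio stages without fetching metadata again.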
The pipeline stores its data in the following directories:
- /data/json: JSON metadata files
- /data/webm: Raw WebM audio files
- /data/audio: Converted MP3 audio files
- /data/transcripts: Transcriptions in JSON and text formats
The main components are:
- PodcastEpisode: Data model for podcast episodes, including metadata and file references
- PipelineOrchestrator: Manages the entire pipeline execution
- YouTubeService: Retrieves episode metadata from YouTube
- EpisodeAnalyzerService: Analyzes episodes to determine type
- YtDlpDownloader: Downloads and converts audio files
- BatchTranscriberService: Manages batch transcription of audio files
- SpeakerIdentificationService: Identifies speakers in transcripts
- JsonFileRepository: Stores and retrieves podcast episode data
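To make the relationship between the data model and the repository concrete, here is a rough sketch of how a `PodcastEpisode` record might be persisted to a JSON file. The field names (`video_id`, `title`, `audio_filename`, `transcript_filename`) and the methods `load_all`, `save_all`, and `upsert` are assumptions made for illustration; the real classes may look quite different.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class PodcastEpisode:
    # All field names are hypothetical; the real model may differ.
    video_id: str
    title: str
    audio_filename: Optional[str] = None
    transcript_filename: Optional[str] = None


class JsonFileRepository:
    """Illustrative repository that keeps all episodes in one JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def load_all(self) -> List[PodcastEpisode]:
        """Read every stored episode, or an empty list if the file is missing."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [PodcastEpisode(**item) for item in json.load(fh)]

    def save_all(self, episodes: List[PodcastEpisode]) -> None:
        """Overwrite the file with the given episodes."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("w", encoding="utf-8") as fh:
            json.dump([asdict(e) for e in episodes], fh, indent=2)

    def upsert(self, episode: PodcastEpisode) -> None:
        """Insert the episode, or replace the stored one with the same video_id."""
        by_id = {e.video_id: e for e in self.load_all()}
        by_id[episode.video_id] = episode
        self.save_all(list(by_id.values()))
```

In this sketch each stage would load an episode, fill in the file reference it produced (for example `audio_filename` after conversion), and call `upsert` so that later stages can find the file.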
The CONVERT_AUDIO stage uses Python's concurrent.futures.ThreadPoolExecutor to convert multiple audio files in parallel, which shortens the total conversion time when many episodes are queued. The number of worker threads is configured through the conversion_threads setting.
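A minimal sketch of this pattern follows. It assumes that each conversion shells out to `ffmpeg` (the project may instead rely on yt-dlp's post-processing), and the function names `convert_with_ffmpeg` and `convert_all` are hypothetical. Only `ThreadPoolExecutor` and the `conversion_threads` setting come from the description above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def convert_with_ffmpeg(src: Path, dst: Path) -> None:
    """Convert one file to MP3 by calling ffmpeg (assumed to be on PATH)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-vn",
         "-codec:a", "libmp3lame", "-q:a", "2", str(dst)],
        check=True,
        capture_output=True,
    )


def convert_all(
    webm_files: Iterable[Path],
    audio_dir: Path,
    conversion_threads: int = 4,
    convert: Callable[[Path, Path], None] = convert_with_ffmpeg,
) -> Dict[str, bool]:
    """Convert files concurrently; map each source file name to success/failure."""
    audio_dir = Path(audio_dir)
    audio_dir.mkdir(parents=True, exist_ok=True)
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=conversion_threads) as pool:
        # Submit one task per file and remember which source each future belongs to.
        futures = {
            pool.submit(convert, Path(src), audio_dir / (Path(src).stem + ".mp3")): Path(src)
            for src in webm_files
        }
        for future in as_completed(futures):
            src = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                results[src.name] = True
            except Exception:
                results[src.name] = False  # one failed file does not stop the rest
    return results
```

Threads are a reasonable fit here because each worker spends its time waiting on an external process rather than running Python code, so the global interpreter lock is not a bottleneck. Passing a different `convert` callable makes the function easy to test without `ffmpeg` installed.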