This project processes audio and video files to extract, preprocess, and analyze speech/text data. It includes scripts for converting videos to audio, transcribing audio, preprocessing JSON outputs, and handling user queries. The workspace is organized for easy extension and integration with retrieval-augmented generation (RAG) pipelines.
Note: Audio, video, and JSON data files are not included in the repository. You must provide your own files in the correct folders as described below.
git clone https://github.com/sanyamj-081/video-audio-rag-search
cd rag-ai
Ensure you have Python 3.8+ installed. Then run:
pip install -r requirements.txt
- Place your video files (.mp4) in the videos/ directory.
- The pipeline will generate audio files (.mp3) in the audios/ directory.
- Transcription outputs (.json) will be saved in the json/ directory.
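As a quick check before running anything, the folder layout described above can be prepared and inspected with a short helper. This is a minimal sketch, not part of the repository: the function name `prepare_workspace` and the `EXPECTED` mapping are hypothetical, and only the folder names and file extensions come from this README.

```python
from pathlib import Path

# Folder names and extensions as described in this README.
# The mapping itself is a hypothetical helper, not part of the project.
EXPECTED = {"videos": ".mp4", "audios": ".mp3", "json": ".json"}


def prepare_workspace(root="."):
    """Create the data folders if they are missing and count matching files."""
    counts = {}
    for folder, ext in EXPECTED.items():
        path = Path(root) / folder
        path.mkdir(parents=True, exist_ok=True)
        # Count only files whose extension matches the one expected here.
        counts[folder] = sum(
            1 for p in path.iterdir() if p.is_file() and p.suffix.lower() == ext
        )
    return counts
```

Calling `prepare_workspace()` from the project root returns a dictionary such as `{"videos": 2, "audios": 0, "json": 0}`, which makes it easy to see which stage of the pipeline still has to run.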
- Convert videos to audio:
python video_to_mp3.py
- Transcribe audio to JSON:
python mp3_to_json.py
- Preprocess JSON files and generate embeddings:
python preprocess_json.py
- Process user queries:
python process_user_query.py
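The four steps above can also be chained from a single script. The sketch below is an illustration under assumptions: the script names are taken from this README, but `build_commands` and `run_pipeline` are hypothetical helpers, and it assumes each script runs without extra arguments from the project root.

```python
import subprocess
import sys

# Script names come from the steps above; the order matters because each
# step consumes the output of the previous one.
STEPS = [
    "video_to_mp3.py",
    "mp3_to_json.py",
    "preprocess_json.py",
    "process_user_query.py",
]


def build_commands(steps=STEPS, python=sys.executable):
    """Return one command list per step, using the current interpreter."""
    return [[python, step] for step in steps]


def run_pipeline(steps=STEPS):
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands(steps):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the project root runs the whole chain. Because `check=True` stops at the first failing script, a transcription error will not be followed by preprocessing on incomplete data.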
- Embeddings are stored in embeddings.joblib. This file is created once embeddings are built.
- Responses to queries are written to response.txt.
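To inspect these outputs from Python, something like the following may help. It is a sketch under assumptions: it assumes `embeddings.joblib` was written with the `joblib` package (suggested by the file name, not confirmed here), the structure of the stored object is not documented in this README, and `load_embeddings` and `read_response` are hypothetical helper names.

```python
from pathlib import Path


def load_embeddings(path="embeddings.joblib"):
    """Return the object stored in the embeddings file, or None if it is missing."""
    if not Path(path).exists():
        return None
    # Third-party package, assumed to be listed in requirements.txt.
    # Imported lazily so the existence check works even without it installed.
    import joblib

    return joblib.load(path)


def read_response(path="response.txt"):
    """Return the latest response text, or None if no query has been run yet."""
    p = Path(path)
    return p.read_text(encoding="utf-8") if p.exists() else None
```

Both helpers return `None` instead of raising when the file does not exist yet, which is the expected state before `preprocess_json.py` and `process_user_query.py` have been run.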
rag-ai/
├── audios/ # Extracted audio files (.mp3)
├── json/ # Transcription outputs (.json)
├── videos/ # Source video files (.mp4)
├── unused/ # Experimental/unused scripts and outputs
├── embeddings.joblib # Embedding data for RAG
├── mp3_to_json.py # Audio-to-JSON transcription script
├── preprocess_json.py # Preprocesses JSON outputs and builds embeddings
├── process_user_query.py # Handles user queries
├── prompt.txt # Prompt template
├── response.txt # Output responses
├── requirements.txt # Required Python packages
└── Readme.md # Project documentation
- Data files (videos, audios, JSON) are not tracked in git. You must add your own files to the appropriate folders.
- The scripts will create output files as needed.
- For best results, use high-quality audio/video files.
MIT License
Sanyam Jain