RAG-AI Project

Overview

This project processes audio and video files to extract, preprocess, and analyze speech/text data. It includes scripts for converting videos to audio, transcribing audio, preprocessing JSON outputs, and handling user queries. The workspace is organized for easy extension and integration with retrieval-augmented generation (RAG) pipelines.

Quick Start (for Forked Repos)

Note: Audio, video, and JSON data files are not included in the repository. You must provide your own files in the correct folders as described below.

1. Clone the Repository

git clone https://github.com/sanyamj-081/video-audio-rag-search rag-ai
cd rag-ai

2. Install Dependencies

Ensure you have Python 3.8+ installed. Then run:

pip install -r requirements.txt

3. Prepare Your Data

Place your video files (.mp4) in the videos/ directory.
The pipeline will generate audio files (.mp3) in the audios/ directory.
Transcription outputs (.json) will be saved in the json/ directory.
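
As a rough illustration of this layout, the sketch below creates the three folders if they are missing and maps a video file to the audio and JSON paths the pipeline would be expected to write. The helper names (`ensure_dirs`, `expected_outputs`) are hypothetical, and the idea that outputs reuse the video's base name is an assumption; the actual scripts may name files differently.

```python
from pathlib import Path

# Folder names come from this README; everything else is illustrative.
DATA_DIRS = ("videos", "audios", "json")


def ensure_dirs(root="."):
    """Create videos/, audios/, and json/ under root if they do not exist."""
    root = Path(root)
    for name in DATA_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return [root / name for name in DATA_DIRS]


def expected_outputs(video_path, root="."):
    """Map a video file to its assumed .mp3 and .json output paths.

    Assumption: outputs keep the video's base name
    (lecture1.mp4 -> lecture1.mp3 -> lecture1.json).
    """
    root = Path(root)
    stem = Path(video_path).stem
    return root / "audios" / f"{stem}.mp3", root / "json" / f"{stem}.json"
```

Calling `ensure_dirs()` from the project root and then `expected_outputs("videos/lecture1.mp4")` gives a quick way to check which outputs are still missing before running the pipeline.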

4. Run the Processing Pipeline

Convert videos to audio:
```
python video_to_mp3.py
```
Transcribe audio to JSON:
```
python mp3_to_json.py
```
Preprocess JSON files and generate embeddings:
```
python preprocess_json.py
```
Process user queries:
```
python process_user_query.py
```
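
If you prefer to run the four steps with one command, a small wrapper along these lines could work. This is a sketch, not part of the repository: the `run_pipeline` function and its `runner` parameter are hypothetical, and it assumes each script exits with a non-zero code on failure.

```python
import subprocess
import sys

# Script names and order as listed in this README.
PIPELINE = [
    "video_to_mp3.py",
    "mp3_to_json.py",
    "preprocess_json.py",
    "process_user_query.py",
]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order and stop at the first one that fails.

    Returns the list of scripts that completed successfully.
    `runner` is injectable so the function can be exercised without
    actually launching the scripts.
    """
    completed = []
    for script in scripts:
        # sys.executable reuses the interpreter running this wrapper.
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            break
        completed.append(script)
    return completed
```

Stopping at the first failure avoids building embeddings from incomplete transcripts; call `run_pipeline()` from the project root so the relative script names resolve.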

5. Output

Embeddings are stored in embeddings.joblib, which is created the first time the embeddings are built.
Responses to queries are written to response.txt.
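
This README does not describe how stored embeddings are matched against a query, so the following is only a sketch of one common approach: ranking text chunks by cosine similarity to the query vector. The record layout (a dict with "text" and "embedding" keys) and the function names are assumptions, and the toy vectors stand in for whatever embeddings.joblib actually contains.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vec, records, k=3):
    """Return the k records most similar to query_vec, best first.

    Assumption: each record is a dict with "text" and "embedding" keys.
    """
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(query_vec, r["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data standing in for the contents of embeddings.joblib.
records = [
    {"text": "chunk A", "embedding": [1.0, 0.0, 0.0]},
    {"text": "chunk B", "embedding": [0.0, 1.0, 0.0]},
    {"text": "chunk C", "embedding": [0.9, 0.1, 0.0]},
]
best = top_k([1.0, 0.0, 0.0], records, k=2)
print([r["text"] for r in best])  # ['chunk A', 'chunk C']
```

In a RAG setup, the text of the top-ranked chunks would typically be placed into the prompt template before the model generates the answer.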

Directory Structure

rag-ai/
├── audios/                # Extracted audio files (.mp3)
├── json/                  # Transcription outputs (.json)
├── videos/                # Source video files (.mp4)
├── unused/                # Experimental/unused scripts and outputs
├── embeddings.joblib      # Embedding data for RAG (generated)
├── video_to_mp3.py        # Video-to-audio conversion script
├── mp3_to_json.py         # Audio-to-JSON transcription script
├── preprocess_json.py     # Preprocesses JSON outputs and generates embeddings
├── process_user_query.py  # Handles user queries
├── prompt.txt             # Prompt template
├── response.txt           # Output responses
├── requirements.txt       # Required Python packages
└── Readme.md              # Project documentation

Notes

Data files (videos, audios, JSON) are not tracked in git. You must add your own files to the appropriate folders.
The scripts create their output files (.mp3, .json, embeddings.joblib, response.txt) when they run.
For best results, use high-quality audio/video files.

License

MIT License

Author

Sanyam Jain
