This project provides a web-based interface for intelligent research paper search and analysis. The system combines advanced RAG techniques with Google Gemini AI to deliver accurate, contextual answers about ArXiv research papers.
The agentic RAG system follows these key steps:
- Query Classification: Determines whether the query requires arXiv paper retrieval, using Gemini
- Query Rewriting: Generates an optimized query variation for better retrieval, using Gemini
- Document Retrieval: Semantic search using rewritten query with Pinecone vector database
- Response Generation: Final answer generation using Gemini with retrieved abstracts
- Abstract Simplification: AI-powered simplification of complex abstracts for easier comprehension
The system provides real-time progress tracking and outputs both the main response and simplified source documents.
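The steps above can be sketched as a small sequential pipeline. This is a minimal illustration only, not the project's actual implementation: the real node logic lives under `src/rag/nodes/`, and `RAGState`, `classify_query`, `rewrite_query`, `retrieve`, and `generate_response` are hypothetical stand-ins for the Gemini and Pinecone calls.

```python
from dataclasses import dataclass, field

@dataclass
class RAGState:
    """Hypothetical state object passed between workflow nodes."""
    query: str
    needs_retrieval: bool = False
    rewritten_query: str = ""
    documents: list = field(default_factory=list)
    answer: str = ""

def classify_query(state: RAGState) -> RAGState:
    # Stand-in for the Gemini-based classifier: pretend any query
    # that mentions "paper" needs arXiv retrieval.
    state.needs_retrieval = "paper" in state.query.lower()
    return state

def rewrite_query(state: RAGState) -> RAGState:
    # Stand-in for Gemini query rewriting.
    state.rewritten_query = state.query.strip().rstrip("?")
    return state

def retrieve(state: RAGState) -> RAGState:
    # Stand-in for Pinecone semantic search over paper abstracts.
    state.documents = [f"abstract matching: {state.rewritten_query}"]
    return state

def generate_response(state: RAGState) -> RAGState:
    # Stand-in for Gemini answer generation over retrieved abstracts.
    context = " | ".join(state.documents) or "no retrieval needed"
    state.answer = f"Answer for '{state.query}' using [{context}]"
    return state

def run_pipeline(query: str) -> RAGState:
    state = classify_query(RAGState(query=query))
    if state.needs_retrieval:
        state = retrieve(rewrite_query(state))
    return generate_response(state)
```

The real workflow is built as a graph (see `src/rag/rag_graph.py`), so nodes can branch on the classification result rather than run strictly in sequence.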
```
rag_article_search/
├── main_app.py # Web interface (Streamlit)
├── main_cli.py # Command-line interface
├── src/ # Core system modules
│ ├── rag/ # RAG workflow components
│ │ ├── state.py # RAG state definition
│ │ ├── rag_graph.py # Graph workflow builder
│ │ └── nodes/ # Individual workflow nodes
│ │ ├── classification.py
│ │ ├── query_processing.py
│ │ ├── retrieval.py
│ │ ├── response_generation.py
│ │ └── simplify.py
│ ├── models/ # Model management
│ │ └── model_loader.py # Gemini, embeddings setup
│ ├── connections/ # External service connections
│ │ ├── pinecone_db.py # Vector database operations
│ │ └── gemini_query.py # Google Gemini integration
│ └── loaders/ # Data processing utilities
│ ├── data_loader.py # Dataset management
│ └── embedding.py # Embedding operations
├── scripts/ # Background processing
│ └── get_data_pipeline.py # Data download and processing
├── config/ # Configuration
│ └── config.yaml # System settings
├── data/ # Processed data storage
└── output/ # Results and visualizations
```

## Retrieval-Augmented Generation System
An intelligent web application for searching and analyzing ArXiv research papers using advanced Retrieval-Augmented Generation (RAG) with Google Gemini. Features a user-friendly Streamlit interface with smart data management and query processing.
## Quick Start
1. **Install dependencies:**

   ```bash
   poetry install  # or pip install -r requirements.txt
   ```

2. **Set up environment variables:**

   Create a `.env` file with your API keys:

   ```bash
   GEMINI_API_KEY=your_gemini_api_key
   PINECONE_API_KEY=your_pinecone_api_key
   ```

3. **Set up Kaggle authentication:**

   - Download `kaggle.json` from Kaggle Settings → API
   - Place at: `~/.kaggle/kaggle.json` (Linux/Mac) or `C:\Users\{username}\.kaggle\kaggle.json` (Windows)

4. **Launch the web application:**

   ```bash
   streamlit run main_app.py
   ```

5. **First-time setup:** Use the web interface to download and process ArXiv data

6. **Start querying:** Ask questions about research papers!
- 🌐 Web Interface: Clean Streamlit app for searching research papers
- 💻 Multiple Entry Points: Web UI, CLI, and programmatic access
- 🤖 Agentic RAG: Intelligent query processing and response generation
- ☁️ Google Gemini Integration: Powered by Google Gemini 2.0 Flash
- 🧠 Fine-tuned Model: SmolLM2 for enhanced abstract simplification (GPU accelerated)
- 🗃️ Vector Database: Pinecone integration for semantic search
- 📄 Simplified Abstracts: AI-powered simplification of complex research papers
- ⏱️ Real-time Progress: Live progress tracking during processing
```bash
# Clone repository
git clone <repository-url>
cd rag_article_search

# Install dependencies
poetry install
# OR
pip install -r requirements.txt
```

**GPU Support:** A CUDA-compatible GPU is recommended for optimal performance with the fine-tuned model.
Create a `.env` file in the project root:

```bash
GEMINI_API_KEY=your_gemini_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
```

**Get API Keys:**

- Gemini: Google AI Studio
- Pinecone: Pinecone Console
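In practice a library such as python-dotenv typically loads these keys; the sketch below is a minimal hand-rolled equivalent that shows the expected `.env` format. `load_env` is a hypothetical helper for illustration, not part of this codebase.

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv():
    parses KEY=value lines, skips comments and blanks, and
    exports keys without overwriting existing environment values."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env()
if not (os.getenv("GEMINI_API_KEY") and os.getenv("PINECONE_API_KEY")):
    print("Warning: missing GEMINI_API_KEY or PINECONE_API_KEY")
```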
**Kaggle Setup** (for data download):

- Download `kaggle.json` from Kaggle Settings → API
- Place at: `~/.kaggle/kaggle.json` (Linux/Mac) or `C:\Users\{username}\.kaggle\kaggle.json` (Windows)
**Web interface:**

```bash
streamlit run main_app.py
```

**Command-line interface:**

```bash
python main_cli.py "Your research question here"
```

**Programmatic access:**

```python
from main_app import main

result = main("Your research question here")
```

**Data pipeline:**

- Click "Run Data Pipeline" in the web interface to download and process ArXiv data
- Wait for processing to complete (may take several hours)
- Then start asking questions about research papers!
- Model Selection: Set `use_finetuned: true/false` in `config/config.yaml` to switch between the fine-tuned and Gemini models
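A plausible shape for the relevant section of `config/config.yaml` is shown below. Only the `use_finetuned` key is confirmed by the text above; the surrounding structure and other keys are illustrative assumptions.

```yaml
model:
  use_finetuned: true            # true → fine-tuned SmolLM2 (GPU recommended)
  gemini_model: gemini-2.0-flash # Gemini model used when use_finetuned is false
```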
The system shows real-time progress and outputs both the main response and simplified source documents.