This project provides a web-based interface for intelligent research paper search and analysis. The system combines advanced RAG techniques with Google Gemini AI to deliver accurate, contextual answers about ArXiv research papers.
The agentic RAG system follows these key steps:
- Query Classification: Determines whether the query requires arXiv paper retrieval, using Gemini
- Query Rewriting: Generates an optimized query variation for better retrieval, using Gemini
- Document Retrieval: Semantic search using rewritten query with Pinecone vector database
- Response Generation: Final answer generation using Gemini with retrieved abstracts
- Abstract Simplification: AI-powered simplification of complex abstracts for easier comprehension
The system provides real-time progress tracking and outputs both the main response and simplified source documents.
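The steps above can be sketched as a small sequential pipeline. This is a minimal illustration only, not the project's actual implementation: the real node logic lives under `src/rag/nodes/`, and `RAGState`, `classify_query`, `rewrite_query`, `retrieve`, and `generate_response` are hypothetical stand-ins for the Gemini and Pinecone calls.

```python
from dataclasses import dataclass, field

@dataclass
class RAGState:
    """Hypothetical state object passed between workflow nodes."""
    query: str
    needs_retrieval: bool = False
    rewritten_query: str = ""
    documents: list = field(default_factory=list)
    answer: str = ""

def classify_query(state: RAGState) -> RAGState:
    # Stand-in for the Gemini-based classifier: pretend any query
    # that mentions "paper" needs arXiv retrieval.
    state.needs_retrieval = "paper" in state.query.lower()
    return state

def rewrite_query(state: RAGState) -> RAGState:
    # Stand-in for Gemini query rewriting.
    state.rewritten_query = state.query.strip().rstrip("?")
    return state

def retrieve(state: RAGState) -> RAGState:
    # Stand-in for Pinecone semantic search over paper abstracts.
    state.documents = [f"abstract matching: {state.rewritten_query}"]
    return state

def generate_response(state: RAGState) -> RAGState:
    # Stand-in for Gemini answer generation over retrieved abstracts.
    context = " | ".join(state.documents) or "no retrieval needed"
    state.answer = f"Answer for '{state.query}' using [{context}]"
    return state

def run_pipeline(query: str) -> RAGState:
    state = classify_query(RAGState(query=query))
    if state.needs_retrieval:
        state = retrieve(rewrite_query(state))
    return generate_response(state)
```

The real workflow is built as a graph (see `src/rag/rag_graph.py`), so nodes can branch on the classification result rather than run strictly in sequence.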
```
rag_article_search/
├── main_app.py # Web interface (Streamlit)
├── main_cli.py # Command-line interface
├── src/ # Core system modules
│ ├── rag/ # RAG workflow components
│ │ ├── state.py # RAG state definition
│ │ ├── rag_graph.py # Graph workflow builder
│ │ └── nodes/ # Individual workflow nodes
│ │ ├── classification.py
│ │ ├── query_processing.py
│ │ ├── retrieval.py
│ │ ├── response_generation.py
│ │ └── simplify.py
│ ├── models/ # Model management
│ │ └── model_loader.py # Gemini, embeddings setup
│ ├── connections/ # External service connections
│ │ ├── pinecone_db.py # Vector database operations
│ │ └── gemini_query.py # Google Gemini integration
│ └── loaders/ # Data processing utilities
│ ├── data_loader.py # Dataset management
│ └── embedding.py # Embedding operations
├── scripts/ # Background processing
│ └── get_data_pipeline.py # Data download and processing
├── config/ # Configuration
│ └── config.yaml # System settings
├── data/ # Processed data storage
└── output/ # Results and visualizations
```

## Retrieval-Augmented Generation System
An intelligent web application for searching and analyzing ArXiv research papers using advanced Retrieval-Augmented Generation (RAG) with Google Gemini. Features a user-friendly Streamlit interface with smart data management and query processing.
## Quick Start
1. **Install dependencies:**

   ```bash
   poetry install  # or pip install -r requirements.txt
   ```

2. **Set up environment variables:**

   Create a `.env` file with your API keys:

   ```bash
   GEMINI_API_KEY=your_gemini_api_key
   PINECONE_API_KEY=your_pinecone_api_key
   ```

3. **Set up Kaggle authentication:**

   - Download `kaggle.json` from Kaggle Settings → API
   - Place at: `~/.kaggle/kaggle.json` (Linux/Mac) or `C:\Users\{username}\.kaggle\kaggle.json` (Windows)

4. **Launch the web application:**

   ```bash
   streamlit run main_app.py
   ```

5. **First-time setup:** Use the web interface to download and process ArXiv data

6. **Start querying:** Ask questions about research papers!
- 🌐 Web Interface: Clean Streamlit app for searching research papers
- 💻 Multiple Entry Points: Web UI, CLI, and programmatic access
- 🤖 Agentic RAG: Intelligent query processing and response generation
- ☁️ Google Gemini Integration: Powered by Google Gemini 2.0 Flash
- 🧠 Fine-tuned Model: SmolLM2 for enhanced abstract simplification (GPU accelerated)
- 🗃️ Vector Database: Pinecone integration for semantic search
- 📄 Simplified Abstracts: AI-powered simplification of complex research papers
- ⏱️ Real-time Progress: Live progress tracking during processing
```bash
# Clone repository
git clone <repository-url>
cd rag_article_search

# Install dependencies
poetry install
# OR
pip install -r requirements.txt
```

**GPU Support:** A CUDA-compatible GPU is recommended for optimal performance with the fine-tuned model.
Create a `.env` file in the project root:

```bash
GEMINI_API_KEY=your_gemini_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
```

**Get API Keys:**

- Gemini: Google AI Studio
- Pinecone: Pinecone Console
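In practice a library such as python-dotenv typically loads these keys; the sketch below is a minimal hand-rolled equivalent that shows the expected `.env` format. `load_env` is a hypothetical helper for illustration, not part of this codebase.

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv():
    parses KEY=value lines, skips comments and blanks, and
    exports keys without overwriting existing environment values."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env()
if not (os.getenv("GEMINI_API_KEY") and os.getenv("PINECONE_API_KEY")):
    print("Warning: missing GEMINI_API_KEY or PINECONE_API_KEY")
```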
**Kaggle Setup** (for data download):

- Download `kaggle.json` from Kaggle Settings → API
- Place at: `~/.kaggle/kaggle.json` (Linux/Mac) or `C:\Users\{username}\.kaggle\kaggle.json` (Windows)
**Web interface:**

```bash
streamlit run main_app.py
```

**Command-line interface:**

```bash
python main_cli.py "Your research question here"
```

**Programmatic access:**

```python
from main_app import main

result = main("Your research question here")
```

**Data pipeline:**

- Click "Run Data Pipeline" in the web interface to download and process ArXiv data
- Wait for processing to complete (may take several hours)
- Then start asking questions about research papers!
- Model Selection: Set `use_finetuned: true/false` in `config/config.yaml` to switch between the fine-tuned and Gemini models
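A plausible shape for the relevant section of `config/config.yaml` is shown below. Only the `use_finetuned` key is confirmed by the text above; the surrounding structure and other keys are illustrative assumptions.

```yaml
model:
  use_finetuned: true            # true → fine-tuned SmolLM2 (GPU recommended)
  gemini_model: gemini-2.0-flash # Gemini model used when use_finetuned is false
```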
The system shows real-time progress and outputs both the main response and simplified source documents.