A Retrieval-Augmented Generation (RAG) pipeline that analyzes research papers based on custom research frameworks. The application extracts key information from PDFs, retrieves relevant content, and generates structured summaries guided by your research questions and hypotheses.
You can use the app directly by following this link:
🔗 Live Streamlit App: https://ysls2dqiutadl8td5wjnr8.streamlit.app/
- Start the application and define your research framework (field of study, main research question, sub-questions, hypotheses)
- Upload a PDF research paper
- The system will:
  - Extract and chunk the paper
  - Create embeddings and store them in Pinecone
  - Retrieve relevant sections matching your framework
  - Generate a structured summary with page citations
- Download the summary or save it to your collection
This project implements an intelligent research paper analysis system combining:
- PDF Processing – Extracts text and metadata from research papers with page tracking (see the sketch after this list)
- Vector Embeddings – Converts document chunks into semantic embeddings using OpenAI
- Vector Storage – Stores embeddings in Pinecone for efficient similarity search
- RAG Pipeline – Retrieves relevant document sections based on research questions
- LLM Summarization – Generates comprehensive summaries aligned with your research framework
- Web Interface – Streamlit-based UI for uploading PDFs and managing summaries
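The PDF Processing component tracks which page each piece of text came from so that summaries can cite pages. A minimal sketch of page-tracked extraction, assuming the `pdfplumber` and `PyPDF2` dependencies listed below (the function name is illustrative, not the project's exact implementation):

```python
import pdfplumber
from PyPDF2 import PdfReader

def extract_pages(pdf_path: str):
    """Extract document metadata plus per-page text for page tracking."""
    info = PdfReader(pdf_path).metadata          # may be None
    metadata = {
        "title": info.title if info else None,
        "author": info.author if info else None,
    }
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""     # image-only pages yield None
            # Keep the page number next to the text so downstream chunks
            # can carry it into citations.
            pages.append({"page": page_num, "text": text})
    return metadata, pages
```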
Users define their research context (study field, main question, sub-questions, hypotheses), and the system analyzes uploaded papers specifically against that framework — including page citations.
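As an illustration, the framework can be modeled as a small data structure whose parts each become a retrieval query (a hedged sketch; the class and method names are hypothetical, not the project's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class ResearchFramework:
    """User-defined research context that steers retrieval and summarization."""
    field_of_study: str
    main_question: str
    sub_questions: list[str] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)

    def to_queries(self) -> list[str]:
        # Each question and hypothesis becomes its own retrieval query,
        # so every part of the framework pulls its own relevant chunks.
        return [self.main_question, *self.sub_questions, *self.hypotheses]
```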
- Embedding Model: `text-embedding-3-small` (OpenAI)
  - Dimension: 1536
  - Used for converting text chunks into semantic vectors
- Summarization Model: `gpt-4o` (OpenAI)
  - Temperature: 0.3 (for consistent, focused summaries)
  - Used for analyzing papers and generating structured summaries
- Chunk Size: 800 characters
- Chunk Overlap: 100 characters
- Retrieval Count (k): 5–10 chunks per query
- Vector Database: Pinecone (serverless, AWS region: `us-east-1`)
- Similarity Metric: Cosine similarity
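A minimal sketch of how these settings wire together with the LangChain packages listed below (the index and namespace names are hypothetical, and `pages` is the output of the extraction sketch above):

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore

# Wrap page-tracked text so each chunk inherits its page number.
docs = [Document(page_content=p["text"], metadata={"page": p["page"]})
        for p in pages]

# Chunking: 800-character chunks with 100-character overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embeddings: text-embedding-3-small produces 1536-dimensional vectors.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Vector store: one Pinecone namespace per uploaded PDF.
store = PineconeVectorStore.from_documents(
    chunks, embeddings, index_name="research-papers", namespace="paper-1"
)
retriever = store.as_retriever(search_kwargs={"k": 5})  # k in the 5-10 range

# Summarization: gpt-4o at low temperature for consistent output.
llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
```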
- `langchain` — LLM orchestration
- `langchain-openai` — OpenAI API integration
- `langchain-pinecone` — Pinecone vector store
- `pdfplumber` — PDF text extraction with page tracking
- `PyPDF2` — PDF metadata extraction
- `pinecone-client` — Vector database operations
- `streamlit` — Web interface framework
- Page Number Accuracy: Page tracking is based on text matching and may be imprecise for PDFs with complex layouts or graphics
- Content Extraction: Titles, authors, images, tables, and other formatted content in PDFs may not be extracted correctly
- Language Support: Currently optimized for English-language papers
- API Costs: Uses paid OpenAI API calls; large documents or high query volumes will incur costs
- Context Window: Retrieves at most 50 chunks due to LLM token limits; very large papers may not be fully analyzed
- Namespace Isolation: Each PDF creates a separate Pinecone namespace; cross-document retrieval is not supported
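The namespace isolation above falls straight out of how the retriever is scoped; a minimal sketch (index and namespace names hypothetical):

```python
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# A retriever is bound to one namespace, i.e. one uploaded PDF.
store = PineconeVectorStore(
    index_name="research-papers",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    namespace="paper-1",
)
retriever = store.as_retriever(search_kwargs={"k": 5})
# Every query searches only this namespace; chunks from other PDFs in the
# same index are invisible, which is why cross-document retrieval is
# currently unsupported.
```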
- Multi-Language Support: Add language detection and support for papers in multiple languages
- Advanced PDF Handling: Implement table and figure extraction with OCR capabilities
- Cross-Document Analysis: Enable comparison and synthesis across multiple papers
- Annotation Tools: Allow users to highlight, annotate, and refine summaries within the UI
- Citation Management: Integrate with citation managers (BibTeX, Zotero, Mendeley)
- Performance Optimization: Implement caching and indexing strategies to reduce API costs
AI_project_module6/
├── streamlit_app.py # Streamlit web interface
├── preprocessing_data.py # PDF processing and Pinecone upload
├── rag_pipeline.py # Main RAG orchestration
├── summarization_chain.py # LLM summarization logic
├── server.py # FastAPI backend server for REST API
├── requirements.txt # Python package dependencies
├── prompt_engineering_skills.md # Prompt selection and notes
├── presentation_AI_research_paper_summarizer.pptx # Architecture overview + QR code to app
├── ai_env/ # Environment details
├── sample_research_papers/ # Sample PDFs for testing
│ ├── 30-39.pdf
│ └── ...
└── README.md # This file
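The `server.py` entry above exposes the pipeline over a REST API; a minimal sketch of what such a FastAPI endpoint could look like (the route and response shape are hypothetical, not the project's actual API):

```python
from fastapi import FastAPI, File, UploadFile

app = FastAPI(title="Research Paper Summarizer API")

@app.post("/summarize")
async def summarize(file: UploadFile = File(...)) -> dict:
    """Accept a PDF upload and return a structured summary."""
    pdf_bytes = await file.read()
    # In the real server this would run the full RAG pipeline:
    # extract -> chunk -> embed -> retrieve -> summarize.
    return {"filename": file.filename, "bytes_received": len(pdf_bytes)}
```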