A minimal API service that answers natural-language questions about member messages using retrieval-augmented generation (RAG).
URL: https://simple-nlp-answering-system.up.railway.app/
Quick Test:

```bash
# Health check
curl https://simple-nlp-answering-system.up.railway.app/

# Ask a question
curl -X POST https://simple-nlp-answering-system.up.railway.app/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How many cars does Vikram Desai have?"}'
```

Example Questions:
- "When is Layla planning her trip to London?"
- "How many cars does Vikram Desai have?"
- "What are Amina's favorite restaurants?"
- Semantic Search: Uses FastEmbed (BAAI/bge-small-en-v1.5) to embed queries and messages for accurate similarity-based retrieval
- Smart Name Recognition: spaCy NER (en_core_web_md) extracts person names from queries, with fuzzy matching and first-name disambiguation
- Contextual Filtering: Retrieves the top-K most relevant messages, filtered by detected member names, for focused results
- LLM Generation: Groq API (llama-3.3-70b-versatile) synthesizes natural-language answers from retrieved context
- Fast Cold Starts: Models pre-initialized at startup (~20s) for sub-3s response times on the first request
- Rate Limiting: Built-in request throttling (5 req/min per IP) to prevent abuse
- Production-Ready: Docker containerized with persistent model caching and health checks
- CORS Enabled: Cross-origin requests supported for frontend integration
- Structured Prompts: Custom system/user prompts loaded from files for easy tuning
- Data Ingestion: Fetches messages from the November7 API (`GET /messages`) and stores them locally in JSON format with embeddings
- Vector Store: Messages are embedded using FastEmbed (`BAAI/bge-small-en-v1.5`, 384 dims) and indexed in Pinecone with cosine similarity
- Retrieval: Two-stage process: (1) spaCy NER extracts person entities from the query, (2) semantic search in Pinecone is filtered by member name
- Name Resolution: A cached `known_names.json` provides fuzzy matching, first-name disambiguation, and "Did you mean?" suggestions
- Generation: Groq API (llama-3.3-70b-versatile) receives chronologically sorted context with metadata (member name, latest activity, snippet count)
- API: FastAPI service with async lifespan for model pre-initialization, structured logging, and comprehensive error handling
- Creating User Profiles:
- Ideally, answers could be improved by maintaining a user-profile database with nicknames and lists of facts (e.g., favorite foods, hobbies). However, this would require much more setup (e.g., defining a schema for the profile database or an LLM prompt to summarize the messages), so I opted for an approach using NER and RAG-based message retrieval.
- If I had created a user profile database, I'd expose it via an MCP tool (client-initiated call into the profile service) or pipe the question through text-to-SQL so the agent can hit the structured database directly.
- Improving Retrieval:
- I considered building an agentic system where the first step would be to generate multiple alternative questions to improve lookup accuracy in Pinecone. However, for such a small dataset (and such short messages), this would likely add unnecessary complexity and latency without significant benefit.
- Summary timeline database of user trips:
- I thought about creating a structured timeline database summarizing user trips extracted from messages (creating a timeline/history database) to facilitate more accurate date-related queries. However, this would require significant upfront processing and might not cover all edge cases, so I decided to rely on the existing message data with enhanced prompt engineering instead.
- Categorizations of messages and query:
- I considered categorizing messages (e.g., travel plans, purchases, dining) and classifying user queries to route them to specialized retrieval/generation pipelines. However, this would add complexity and quite a bit of upfront processing, so I opted for a single unified approach with improved prompt instructions.
- Keyword-Based Name Matching:
- I considered creating a simple `Set` of all unique member names (e.g., "Vikram Desai") at startup. When a query came in, the service would iterate through this set and check whether any name was a substring of the question.
- Reason for Choosing NER Instead: This simple keyword-matching approach is fast but very brittle. It fails on common and crucial variations, particularly possessives (e.g., it wouldn't match "Layla" in "Layla's trip"). A statistical NER model, while not perfect, is trained to be context-aware and can correctly identify "Layla" as a `PERSON` entity from "Layla's," making it a more robust and flexible solution out of the box.
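The rejected keyword approach can be made concrete. In this sketch the member set stores full names (the surname "Hassan" is hypothetical; only "Layla" appears in the source), so a question that uses only a first name slips straight past the substring check:

```python
# Hypothetical member set built at startup; only full names are stored
MEMBER_NAMES = {"Vikram Desai", "Layla Hassan"}


def keyword_match(question: str) -> set[str]:
    """Return every member whose full name appears verbatim in the question."""
    q = question.lower()
    return {name for name in MEMBER_NAMES if name.lower() in q}


print(keyword_match("How many cars does Vikram Desai have?"))  # {'Vikram Desai'}
print(keyword_match("When is Layla planning her trip?"))       # set() -- first name only, no match
```

A context-aware NER model sidesteps this by tagging "Layla" as a `PERSON` span regardless of surnames or possessive suffixes, which is the robustness argument made above.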
3,349 messages from 10 members over 364 days (Nov 2024 - Nov 2025)
Data Quality:
- No missing timestamps, duplicate IDs, or malformed data
- All timestamps are UTC-aware and within valid range
- Zero duplicate messages (100% unique content)
- 2 extremely short messages (<10 chars) - appear to be incomplete sentences
- "I want to" and "I finally"
- 48 long messages (>88 chars) - mostly from Amina Van Den Berg requesting travel/concierge services
Message Patterns:
- Average message length: 68 characters (median: 68, std: 8.7)
- Message frequency: ~26 hours between messages (median: 18 hours)
- Fastest burst: 1.13 minutes between consecutive messages
- Longest gap: ~7 days between messages
- Uneven distribution: 288-365 messages per user (Lorenzo Cavalli has fewest, Lily O'Sullivan has most)
Notable Observations:
- All users have nearly identical message counts (~334 avg)
- No duplicate message content despite high volume, indicating carefully curated dataset
- From visual inspection, messages generally concern scheduling restaurant visits (Michelin star), travel plans, car service requests, updating profile information/preferences (e.g., favorite cuisines, phone numbers, insurance info), and vacations, all of which align with what a concierge service might build.
- Python 3.10-3.12
- Pinecone account with an index created (`simple-nlp`, cosine similarity, 384 dimensions)
- Groq account with an API key

- Clone the repository

  ```bash
  git clone https://github.com/yourusername/simple_nlp_answering_system.git
  cd simple_nlp_answering_system
  ```

- Create and activate a virtual environment

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set environment variables

  ```bash
  export PINECONE_API_KEY="your-pinecone-key"
  export GROQ_API_KEY="your-groq-key"
  export GROQ_MODEL="llama-3.3-70b-versatile"
  ```

- Prepare data (one-time setup)

  ```bash
  # Fetch messages from the API and store in data/all_messages.json
  python run_one_time/get_messages.py

  # Generate embeddings and upload to Pinecone
  python run_one_time/pinecone_upload.py

  # Cache normalized member names for validation in config/known_names.json
  python run_one_time/get_known_names.py
  ```

- Start the API server

  ```bash
  uvicorn main:app --reload
  ```

  The server will start on `http://localhost:8000`. First startup takes ~20s to download and cache the spaCy model.

- Test the service

  ```bash
  # Health check
  curl -s http://localhost:8000/

  # Ask a question
  curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"question": "When is Layla planning her trip to London?"}'
  ```
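The `run_one_time/pinecone_upload.py` step presumably embeds each message and upserts the vectors in batches. The sketch below illustrates that batching pattern only; the embedder and index are stand-in stubs, not the real FastEmbed model or Pinecone client, and the field names are assumptions.

```python
from typing import Iterable


def batched(items: list, size: int) -> Iterable[list]:
    """Yield fixed-size chunks; vector upserts are typically batched (~100 at a time)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def upload_messages(messages: list[dict], embed, index, batch_size: int = 100) -> int:
    """Embed each message and upsert (id, vector, metadata) tuples in batches.

    `embed` and `index` stand in for a FastEmbed model and a Pinecone index.
    Metadata keeps the member name so retrieval can filter by it later.
    """
    total = 0
    for chunk in batched(messages, batch_size):
        vectors = [
            (str(m["id"]), embed(m["text"]), {"member": m["member"], "text": m["text"]})
            for m in chunk
        ]
        index.upsert(vectors)
        total += len(vectors)
    return total


# Demo with trivial stubs
class FakeIndex:
    def __init__(self):
        self.calls = []

    def upsert(self, vectors):
        self.calls.append(len(vectors))


msgs = [{"id": i, "text": f"msg {i}", "member": "Vikram Desai"} for i in range(250)]
idx = FakeIndex()
n = upload_messages(msgs, embed=lambda t: [0.0] * 384, index=idx, batch_size=100)
print(n, idx.calls)  # 250 [100, 100, 50]
```

Storing the member name in metadata is what enables the two-stage retrieval described earlier: NER picks the member, and the vector query is filtered to that member's messages.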
- Push your code to GitHub
- Visit railway.app and create a new project
- Connect your GitHub repository
- Add environment variables in the Railway dashboard: `PINECONE_API_KEY`, `GROQ_API_KEY`, `GROQ_MODEL` (e.g., `llama-3.3-70b-versatile`)
- Railway will automatically detect the Dockerfile and deploy. The spaCy model is downloaded during Docker build.
- Verify the live service

  Find your domain under Railway Settings → Domains.

  ```bash
  # Health check (replace with your Railway domain)
  curl https://your-app.up.railway.app/

  # Ask a question
  curl -X POST https://your-app.up.railway.app/ask \
    -H "Content-Type: application/json" \
    -d '{"question": "When is Layla planning her trip to London?"}'
  ```
- Push your code to GitHub
- Visit render.com and create a new Web Service
- Connect your repository
- Render will use `render.yaml` for configuration
- Add environment variables in the Render dashboard: `PINECONE_API_KEY`, `GROQ_API_KEY`, `GROQ_MODEL`
- Deploy
```bash
# Build the image
docker build -t nlp-qa-service .

# Run locally
docker run -p 8000:8000 \
  -e PINECONE_API_KEY="your-key" \
  -e GROQ_API_KEY="your-groq-key" \
  -e GROQ_MODEL="llama-3.3-70b-versatile" \
  nlp-qa-service
```

Once deployed, visit https://simple-nlp-answering-system.up.railway.app/docs for interactive Swagger documentation.
POST /ask
- Rate limited to 5 requests/minute per IP
- Request body: `{"question": "your question here"}`
- Response: `{"answer": "the generated answer"}`
GET /
- Health check endpoint
- Returns: `{"status": "ok", "message": "Q&A Service is running."}`
| Variable | Required | Description |
|---|---|---|
| `PINECONE_API_KEY` | Yes | Your Pinecone API key |
| `GROQ_API_KEY` | Yes | Groq API key for generation |
| `GROQ_MODEL` | Yes | Groq model name (default: `llama-3.3-70b-versatile`) |
| `QA_CONFIG_PATH` | No | Path to config file (default: `config/config.yaml`) |
| `SPACY_MODEL_DIR` | No | Override for spaCy model storage (default: `./runtime_models/spacy`) |
The service uses `en_core_web_md` (~90 MB in memory) for name extraction, providing good accuracy while fitting within Railway's free-tier memory limits.
How it works:
- Docker build: The model is downloaded during `docker build` and baked into the image (no runtime download)
- Local development: First startup downloads the model to `./runtime_models/spacy` (~20s), then caches it for instant subsequent runs
- Production: The model is pre-loaded during container startup (FastAPI lifespan) for fast first-request response (~3s)
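The local-development caching behavior can be sketched as "download once, then reuse the directory". The sketch below is illustrative only: `ensure_spacy_model` is a made-up name, and the actual download step is injected as a callable so the cache logic is visible without touching the network.

```python
from pathlib import Path


def ensure_spacy_model(model_dir: str, download) -> Path:
    """Return the cached model directory, downloading it once if absent.

    `download(target)` stands in for the real spaCy model download; on
    subsequent calls the cached directory is found and the slow path skipped.
    """
    target = Path(model_dir) / "en_core_web_md"
    if not target.exists():
        target.mkdir(parents=True)
        download(target)  # slow path: first run only (~20s in this project)
    return target


# Demo with a stub downloader and a temporary cache directory
import tempfile

calls = []
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_spacy_model(tmp, download=calls.append)
    second = ensure_spacy_model(tmp, download=calls.append)
    assert first == second and len(calls) == 1  # downloaded exactly once
```

The same idea explains the Docker case: because the directory already exists in the built image, the startup path never triggers a download at all.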
```
├── main.py                    # FastAPI application
├── config/
│   └── config.yaml            # Configuration (Pinecone index, embedder, etc.)
├── src/
│   ├── rag/
│   │   ├── retriever.py       # Semantic search and NER filtering
│   │   └── service.py         # QA orchestration
│   └── utils.py               # Logging and config utilities
├── run_one_time/
│   ├── get_messages.py        # Fetch data from API
│   ├── pinecone_upload.py     # Index messages in Pinecone
│   └── get_known_names.py     # Cache normalized member names
├── Dockerfile                 # Container definition
└── requirements.txt           # Python dependencies
```