Upload any dataset. Ask questions in plain English. Get instant insights.
A complete backend system that combines a custom ML model, a rule-based engine, and an LLM fallback into a layered "sandwich architecture" for intelligent data analysis.
User Query
│
▼
┌──────────────────┐
│ Interface Layer │ FastAPI endpoints (/upload, /eda, /ask)
└──────┬───────────┘
│
▼
┌──────────────────┐
│ NLP Layer │ TF-IDF + Logistic Regression intent classifier
│ │ + regex-based entity extractor
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Confidence Layer │ confidence > 0.75 ?
└──────┬───────┬───┘
YES│ │NO
▼ ▼
┌────────┐ ┌────────┐
│ Rule │ │ LLM │
│ Engine │ │Fallback │
└───┬────┘ └───┬────┘
│ │
└─────┬─────┘
▼
┌──────────────────┐
│ Execution Engine │ safe Pandas operations
└──────┬───────────┘
│
▼
┌──────────────────┐
│ Response Layer │ structured JSON output
└──────────────────┘
- 13 supported intents:
top_n,mean,sum,count,filter,groupby,min,max,describe,correlation,unique,sort,value_counts - Custom ML model: TF-IDF + Logistic Regression with 280 training examples across 13 intents
- Smart fallback: when model confidence < 0.75, the system falls back to Gemini/OpenAI; if LLM is unavailable, it trusts the NLP model's best-guess intent
- Entity extraction: regex-based extraction of column names, numeric values, conditions, and sort direction
- Multi-format support: CSV, Excel (multi-sheet), JSON, Parquet
- Full EDA suite: summary stats, correlation, charts, outlier detection, ML suggestions
- Preprocessing pipeline: missing values, duplicates, encoding, scaling, outlier removal
backend/
├── app/
│ ├── __init__.py
│ ├── config.py # ← settings + env vars
│ ├── main.py # ← FastAPI app + lifespan
│ ├── models/
│ │ ├── schemas.py # ← Pydantic request/response models
│ ├── nlp/ # ← NLP LAYER
│ │ ├── intent_model.py # ← TF-IDF + LogReg classifier
│ │ ├── entity_extractor.py# ← regex entity extraction
│ │ └── training_data.py # ← 280 labeled query→intent pairs
│ └── logic/ # ← LOGIC LAYER
│ └── query_builder.py # ← rule engine + execution engine
├── routes/
│ ├── upload.py # ← POST /api/upload/file
│ ├── eda.py # ← GET /api/eda/*
│ ├── preprocess.py # ← POST /api/preprocess/*
│ └── ask.py # ← POST /api/ask (AI analyst)
├── services/
│ ├── file_handler.py # ← data layer (in-memory DataFrame)
│ ├── eda.py # ← EDA computations + Plotly charts
│ ├── preprocessing.py # ← data cleaning operations
│ ├── analyst_service.py # ← sandwich architecture orchestrator
│ └── llm_service.py # ← LLM fallback (Gemini / OpenAI)
├── utils/
│ └── helpers.py # ← serialization, memory, filenames
├── models/ # ← persisted ML artefacts
│ ├── intent_model.pkl
│ └── vectorizer.pkl
├── sample_data/
│ └── employees.csv # ← sample dataset for testing
├── run.py # ← uvicorn entry point
├── train_model.py # ← standalone model training script
├── test_api.py # ← API test suite
├── requirements.txt
├── .env.example
└── README.md
- Python 3.9+
- pip
cd backend
# Create virtual environment (recommended)
python -m venv venv
# Activate — Windows:
venv\Scripts\activate
# Activate — macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt# Copy the template
copy .env.example .env # Windows
cp .env.example .env # macOS/Linux
# Edit .env and add your API key for LLM fallback:
# GEMINI_API_KEY=your_key_here
# or
# OPENAI_API_KEY=your_key_hereNote: The LLM fallback is optional. The custom ML model handles most queries on its own. The LLM only activates when model confidence drops below 0.75.
python train_model.pyExpected output:
Training accuracy: 1.0000
Cross-val mean: 0.6607 ± 0.0319
Samples: 280
Classes: 13
✓ Model saved to models/intent_model.pkl
✓ Vectorizer saved to models/vectorizer.pkl
Tip: If you skip this step, the model auto-trains on first server startup.
python run.pyThe API is now live at http://localhost:8000
- Swagger docs: http://localhost:8000/api/docs
- ReDoc: http://localhost:8000/api/redoc
# In a second terminal (server must be running)
pip install httpx # one-time
python test_api.py| Method | Endpoint | Description |
|---|---|---|
| POST | /api/upload/file |
Upload CSV/Excel/JSON/Parquet |
| GET | /api/upload/current |
Get current dataset info |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/eda/summary |
Summary statistics |
| GET | /api/eda/correlation |
Correlation matrix |
| GET | /api/eda/skewness-kurtosis |
Distribution shape |
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/ask |
Ask a natural language question |
| GET | /api/ask/model-info |
Model metadata + supported intents |
{
"query": "what is the average salary by department",
"confidence_threshold": 0.75
}{
"success": true,
"answer": "Mean of salary grouped by department.",
"data": [
{"department": "Engineering", "mean_salary": 105100.0},
{"department": "Sales", "mean_salary": 73125.0},
{"department": "Marketing", "mean_salary": 74833.33},
{"department": "HR", "mean_salary": 64600.0}
],
"source": "model",
"intent": "groupby",
"confidence": 0.9234,
"processing_time_ms": 12.45,
"operation": {
"operation": "groupby",
"params": {
"group_column": "department",
"target_column": "salary",
"agg": "mean"
},
"description": "Group by department, mean of salary"
}
}| Method | Endpoint | Description |
|---|---|---|
| POST | /api/preprocess/missing-values |
Handle missing values |
| POST | /api/preprocess/remove-duplicates |
Remove duplicates |
| POST | /api/preprocess/encode |
Encode categoricals |
| POST | /api/preprocess/scale |
Scale numericals |
| GET | /api/preprocess/export |
Export cleaned data |
Below are copy-paste-ready examples. The server must be running at http://localhost:8000.
curl -X POST http://localhost:8000/api/upload/file \
-F "file=@sample_data/employees.csv"Postman: POST http://localhost:8000/api/upload/file → Body → form-data → key: file (File), value: select employees.csv.
# Average salary
curl -X POST http://localhost:8000/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "what is the average salary"}'
# Top 5 employees by salary
curl -X POST http://localhost:8000/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "show top 5 employees by salary"}'
# Filter with operator
curl -X POST http://localhost:8000/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "show rows where salary > 100000"}'
# Group by
curl -X POST http://localhost:8000/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "salary by department"}'
# Correlation
curl -X POST http://localhost:8000/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "correlation between salary and age"}'
# Value distribution
curl -X POST http://localhost:8000/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "distribution of department column"}'
# Custom confidence threshold (lower = more rule engine, higher = more LLM)
curl -X POST http://localhost:8000/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "sort by salary descending", "confidence_threshold": 0.5}'Postman: POST http://localhost:8000/api/ask → Body → raw → JSON:
{
"query": "show top 5 employees by salary",
"confidence_threshold": 0.75
}curl http://localhost:8000/api/health
curl http://localhost:8000/api/ask/model-infocurl http://localhost:8000/api/eda/summary
curl http://localhost:8000/api/eda/correlationOnce you've uploaded a dataset, try these at POST /api/ask:
| Query | Intent | What it does |
|---|---|---|
"what is the average salary" |
mean | Computes df['salary'].mean() |
"show top 5 employees by salary" |
top_n | df.nlargest(5, 'salary') |
"total salary" |
sum | df['salary'].sum() |
"how many rows are there" |
count | len(df) |
"show rows where salary > 100000" |
filter | df[df['salary'] > 100000] |
"salary by department" |
groupby | df.groupby('department')['salary'].sum() |
"minimum salary" |
min | df['salary'].min() |
"correlation between salary and age" |
correlation | df['salary'].corr(df['age']) |
"unique values in city" |
unique | df['city'].unique() |
"sort by salary descending" |
sort | df.sort_values('salary', ascending=False) |
"distribution of department" |
value_counts | df['department'].value_counts() |
"describe the dataset" |
describe | df.describe() |
All settings can be overridden via environment variables or .env:
| Variable | Default | Description |
|---|---|---|
CONFIDENCE_THRESHOLD |
0.75 |
Min model confidence for rule engine |
LLM_PROVIDER |
gemini |
gemini or openai |
GEMINI_API_KEY |
(empty) | Google Gemini API key |
OPENAI_API_KEY |
(empty) | OpenAI API key |
OPENAI_MODEL |
gpt-4o-mini |
OpenAI model name |
DEBUG |
true |
Enable debug logging |
The system uses a TF-IDF vectorizer with unigram + bigram features fed into a Logistic Regression classifier. The model is trained on 280 labelled examples spanning 13 intents.
A regex-based extractor pulls structured entities from the query:
- Column names — fuzzy-matched against the loaded DataFrame columns
- Numeric values —
"top 5"→n=5 - Conditions —
"greater than",">","above"→> - Filter values —
"salary > 100000"→filter_value=100000 - Sort direction —
"descending","high to low"→descending
If the classifier's top probability is ≥ 0.75, the query is routed to the rule engine. Otherwise, it's forwarded to the LLM fallback for interpretation.
Deterministic mapping from (intent, entities) → Pandas operation
specification. Each intent has a dedicated handler.
When the model is uncertain, the system sends a token-optimized prompt to Gemini or OpenAI, requesting only a structured JSON operation (no free-text responses allowed). If the LLM is also unavailable, the system gracefully falls back to the NLP model's best-guess intent via the rule engine — it's typically correct even at lower confidence.
Both paths produce an operation dict:
{"operation": "mean", "params": {"column": "salary"}, "description": "..."}This is safely executed against the in-memory DataFrame.
MIT