DataPilot — AI-Powered Data Analyst

Upload any dataset. Ask questions in plain English. Get instant insights.

A complete backend system that combines a custom ML model, a rule-based engine, and an LLM fallback into a layered "sandwich architecture" for intelligent data analysis.

Architecture Overview

User Query
    │
    ▼
┌──────────────────┐
│  Interface Layer  │  FastAPI endpoints (/upload, /eda, /ask)
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  NLP Layer        │  TF-IDF + Logistic Regression intent classifier
│                   │  + regex-based entity extractor
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Confidence Layer │  confidence > 0.75 ?
└──────┬───────┬───┘
    YES│       │NO
       ▼       ▼
┌────────┐  ┌────────┐
│  Rule  │  │  LLM   │
│ Engine │  │Fallback │
└───┬────┘  └───┬────┘
    │           │
    └─────┬─────┘
          ▼
┌──────────────────┐
│ Execution Engine  │  safe Pandas operations
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Response Layer    │  structured JSON output
└──────────────────┘

Features

13 supported intents: top_n, mean, sum, count, filter, groupby, min, max, describe, correlation, unique, sort, value_counts
Custom ML model: TF-IDF + Logistic Regression with 280 training examples across 13 intents
Smart fallback: when model confidence < 0.75, the system falls back to Gemini/OpenAI; if LLM is unavailable, it trusts the NLP model's best-guess intent
Entity extraction: regex-based extraction of column names, numeric values, conditions, and sort direction
Multi-format support: CSV, Excel (multi-sheet), JSON, Parquet
Full EDA suite: summary stats, correlation, charts, outlier detection, ML suggestions
Preprocessing pipeline: missing values, duplicates, encoding, scaling, outlier removal

Project Structure

backend/
├── app/
│   ├── __init__.py
│   ├── config.py              # ← settings + env vars
│   ├── main.py                # ← FastAPI app + lifespan
│   ├── models/
│   │   ├── schemas.py         # ← Pydantic request/response models
│   ├── nlp/                   # ← NLP LAYER
│   │   ├── intent_model.py    # ← TF-IDF + LogReg classifier
│   │   ├── entity_extractor.py# ← regex entity extraction
│   │   └── training_data.py   # ← 280 labeled query→intent pairs
│   └── logic/                 # ← LOGIC LAYER
│       └── query_builder.py   # ← rule engine + execution engine
├── routes/
│   ├── upload.py              # ← POST /api/upload/file
│   ├── eda.py                 # ← GET  /api/eda/*
│   ├── preprocess.py          # ← POST /api/preprocess/*
│   └── ask.py                 # ← POST /api/ask  (AI analyst)
├── services/
│   ├── file_handler.py        # ← data layer (in-memory DataFrame)
│   ├── eda.py                 # ← EDA computations + Plotly charts
│   ├── preprocessing.py       # ← data cleaning operations
│   ├── analyst_service.py     # ← sandwich architecture orchestrator
│   └── llm_service.py         # ← LLM fallback (Gemini / OpenAI)
├── utils/
│   └── helpers.py             # ← serialization, memory, filenames
├── models/                    # ← persisted ML artefacts
│   ├── intent_model.pkl
│   └── vectorizer.pkl
├── sample_data/
│   └── employees.csv          # ← sample dataset for testing
├── run.py                     # ← uvicorn entry point
├── train_model.py             # ← standalone model training script
├── test_api.py                # ← API test suite
├── requirements.txt
├── .env.example
└── README.md

Quick Start

Prerequisites

Python 3.9+
pip

1. Clone & install dependencies

cd backend

# Create virtual environment (recommended)
python -m venv venv

# Activate — Windows:
venv\Scripts\activate
# Activate — macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Configure environment (optional — for LLM fallback)

# Copy the template
copy .env.example .env      # Windows
cp .env.example .env        # macOS/Linux

# Edit .env and add your API key for LLM fallback:
# GEMINI_API_KEY=your_key_here
# or
# OPENAI_API_KEY=your_key_here

Note: The LLM fallback is optional. The custom ML model handles most queries on its own. The LLM only activates when model confidence drops below 0.75.

3. Train the intent model

python train_model.py

Expected output:

  Training accuracy:  1.0000
  Cross-val mean:     0.6607 ± 0.0319
  Samples:            280
  Classes:            13

  ✓ Model saved to models/intent_model.pkl
  ✓ Vectorizer saved to models/vectorizer.pkl

Tip: If you skip this step, the model auto-trains on first server startup.

4. Start the server

python run.py

The API is now live at http://localhost:8000

Swagger docs: http://localhost:8000/api/docs
ReDoc: http://localhost:8000/api/redoc

5. Run the test suite

# In a second terminal (server must be running)
pip install httpx   # one-time
python test_api.py

API Reference

Upload

Method	Endpoint	Description
POST	`/api/upload/file`	Upload CSV/Excel/JSON/Parquet
GET	`/api/upload/current`	Get current dataset info

EDA

Method	Endpoint	Description
GET	`/api/eda/summary`	Summary statistics
GET	`/api/eda/correlation`	Correlation matrix
GET	`/api/eda/skewness-kurtosis`	Distribution shape

AI Analyst

Method	Endpoint	Description
POST	`/api/ask`	Ask a natural language question
GET	`/api/ask/model-info`	Model metadata + supported intents

POST /api/ask — Request

{
  "query": "what is the average salary by department",
  "confidence_threshold": 0.75
}

POST /api/ask — Response

{
  "success": true,
  "answer": "Mean of salary grouped by department.",
  "data": [
    {"department": "Engineering", "mean_salary": 105100.0},
    {"department": "Sales", "mean_salary": 73125.0},
    {"department": "Marketing", "mean_salary": 74833.33},
    {"department": "HR", "mean_salary": 64600.0}
  ],
  "source": "model",
  "intent": "groupby",
  "confidence": 0.9234,
  "processing_time_ms": 12.45,
  "operation": {
    "operation": "groupby",
    "params": {
      "group_column": "department",
      "target_column": "salary",
      "agg": "mean"
    },
    "description": "Group by department, mean of salary"
  }
}

Preprocessing

Method	Endpoint	Description
POST	`/api/preprocess/missing-values`	Handle missing values
POST	`/api/preprocess/remove-duplicates`	Remove duplicates
POST	`/api/preprocess/encode`	Encode categoricals
POST	`/api/preprocess/scale`	Scale numericals
GET	`/api/preprocess/export`	Export cleaned data

cURL & Postman Examples

Below are copy-paste-ready examples. The server must be running at http://localhost:8000.

1. Upload a CSV

curl -X POST http://localhost:8000/api/upload/file \
  -F "file=@sample_data/employees.csv"

Postman: POST http://localhost:8000/api/upload/file → Body → form-data → key: file (File), value: select employees.csv.

2. Ask a natural language question

# Average salary
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "what is the average salary"}'

# Top 5 employees by salary
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "show top 5 employees by salary"}'

# Filter with operator
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "show rows where salary > 100000"}'

# Group by
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "salary by department"}'

# Correlation
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "correlation between salary and age"}'

# Value distribution
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "distribution of department column"}'

# Custom confidence threshold (lower = more rule engine, higher = more LLM)
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "sort by salary descending", "confidence_threshold": 0.5}'

Postman: POST http://localhost:8000/api/ask → Body → raw → JSON:

{
  "query": "show top 5 employees by salary",
  "confidence_threshold": 0.75
}

3. Health check & model info

curl http://localhost:8000/api/health
curl http://localhost:8000/api/ask/model-info

4. EDA endpoints

curl http://localhost:8000/api/eda/summary
curl http://localhost:8000/api/eda/correlation

Sample Queries

Once you've uploaded a dataset, try these at POST /api/ask:

Query	Intent	What it does
`"what is the average salary"`	mean	Computes `df['salary'].mean()`
`"show top 5 employees by salary"`	top_n	`df.nlargest(5, 'salary')`
`"total salary"`	sum	`df['salary'].sum()`
`"how many rows are there"`	count	`len(df)`
`"show rows where salary > 100000"`	filter	`df[df['salary'] > 100000]`
`"salary by department"`	groupby	`df.groupby('department')['salary'].sum()`
`"minimum salary"`	min	`df['salary'].min()`
`"correlation between salary and age"`	correlation	`df['salary'].corr(df['age'])`
`"unique values in city"`	unique	`df['city'].unique()`
`"sort by salary descending"`	sort	`df.sort_values('salary', ascending=False)`
`"distribution of department"`	value_counts	`df['department'].value_counts()`
`"describe the dataset"`	describe	`df.describe()`

Configuration

All settings can be overridden via environment variables or .env:

Variable	Default	Description
`CONFIDENCE_THRESHOLD`	`0.75`	Min model confidence for rule engine
`LLM_PROVIDER`	`gemini`	`gemini` or `openai`
`GEMINI_API_KEY`	(empty)	Google Gemini API key
`OPENAI_API_KEY`	(empty)	OpenAI API key
`OPENAI_MODEL`	`gpt-4o-mini`	OpenAI model name
`DEBUG`	`true`	Enable debug logging

How It Works

1. Intent Classification (NLP Layer)

The system uses a TF-IDF vectorizer with unigram + bigram features fed into a Logistic Regression classifier. The model is trained on 280 labelled examples spanning 13 intents.

2. Entity Extraction

A regex-based extractor pulls structured entities from the query:

Column names — fuzzy-matched against the loaded DataFrame columns
Numeric values — "top 5" → n=5
Conditions — "greater than", ">", "above" → >
Filter values — "salary > 100000" → filter_value=100000
Sort direction — "descending", "high to low" → descending

3. Confidence Layer

If the classifier's top probability is ≥ 0.75, the query is routed to the rule engine. Otherwise, it's forwarded to the LLM fallback for interpretation.

4. Rule Engine

Deterministic mapping from (intent, entities) → Pandas operation specification. Each intent has a dedicated handler.

5. LLM Fallback

When the model is uncertain, the system sends a token-optimized prompt to Gemini or OpenAI, requesting only a structured JSON operation (no free-text responses allowed). If the LLM is also unavailable, the system gracefully falls back to the NLP model's best-guess intent via the rule engine — it's typically correct even at lower confidence.

6. Execution Engine

Both paths produce an operation dict:

{"operation": "mean", "params": {"column": "salary"}, "description": "..."}

This is safely executed against the in-memory DataFrame.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
backend		backend
frontend		frontend
.gitignore		.gitignore
README.md		README.md
intent_dataset.csv		intent_dataset.csv

Folders and files

Latest commit

History

Repository files navigation

DataPilot — AI-Powered Data Analyst

Architecture Overview

Features

Project Structure

Quick Start

Prerequisites

1. Clone & install dependencies

2. Configure environment (optional — for LLM fallback)

3. Train the intent model

4. Start the server

5. Run the test suite

API Reference

Upload

EDA

AI Analyst

POST /api/ask — Request

POST /api/ask — Response

Preprocessing

cURL & Postman Examples

1. Upload a CSV

2. Ask a natural language question

3. Health check & model info

4. EDA endpoints

Sample Queries

Configuration

How It Works

1. Intent Classification (NLP Layer)

2. Entity Extraction

3. Confidence Layer

4. Rule Engine

5. LLM Fallback

6. Execution Engine

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages