Skip to content

Kakarot35/DataPilot--AI-Natural-Language-Data-Analyst

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DataPilot — AI-Powered Data Analyst

Upload any dataset. Ask questions in plain English. Get instant insights.

A complete backend system that combines a custom ML model, a rule-based engine, and an LLM fallback into a layered "sandwich architecture" for intelligent data analysis.


Architecture Overview

User Query
    │
    ▼
┌──────────────────┐
│  Interface Layer  │  FastAPI endpoints (/upload, /eda, /ask)
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  NLP Layer        │  TF-IDF + Logistic Regression intent classifier
│                   │  + regex-based entity extractor
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Confidence Layer │  confidence > 0.75 ?
└──────┬───────┬───┘
    YES│       │NO
       ▼       ▼
┌────────┐  ┌────────┐
│  Rule  │  │  LLM   │
│ Engine │  │Fallback │
└───┬────┘  └───┬────┘
    │           │
    └─────┬─────┘
          ▼
┌──────────────────┐
│ Execution Engine  │  safe Pandas operations
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Response Layer    │  structured JSON output
└──────────────────┘

Features

  • 13 supported intents: top_n, mean, sum, count, filter, groupby, min, max, describe, correlation, unique, sort, value_counts
  • Custom ML model: TF-IDF + Logistic Regression with 280 training examples across 13 intents
  • Smart fallback: when model confidence < 0.75, the system falls back to Gemini/OpenAI; if LLM is unavailable, it trusts the NLP model's best-guess intent
  • Entity extraction: regex-based extraction of column names, numeric values, conditions, and sort direction
  • Multi-format support: CSV, Excel (multi-sheet), JSON, Parquet
  • Full EDA suite: summary stats, correlation, charts, outlier detection, ML suggestions
  • Preprocessing pipeline: missing values, duplicates, encoding, scaling, outlier removal

Project Structure

backend/
├── app/
│   ├── __init__.py
│   ├── config.py              # ← settings + env vars
│   ├── main.py                # ← FastAPI app + lifespan
│   ├── models/
│   │   ├── schemas.py         # ← Pydantic request/response models
│   ├── nlp/                   # ← NLP LAYER
│   │   ├── intent_model.py    # ← TF-IDF + LogReg classifier
│   │   ├── entity_extractor.py# ← regex entity extraction
│   │   └── training_data.py   # ← 280 labeled query→intent pairs
│   └── logic/                 # ← LOGIC LAYER
│       └── query_builder.py   # ← rule engine + execution engine
├── routes/
│   ├── upload.py              # ← POST /api/upload/file
│   ├── eda.py                 # ← GET  /api/eda/*
│   ├── preprocess.py          # ← POST /api/preprocess/*
│   └── ask.py                 # ← POST /api/ask  (AI analyst)
├── services/
│   ├── file_handler.py        # ← data layer (in-memory DataFrame)
│   ├── eda.py                 # ← EDA computations + Plotly charts
│   ├── preprocessing.py       # ← data cleaning operations
│   ├── analyst_service.py     # ← sandwich architecture orchestrator
│   └── llm_service.py         # ← LLM fallback (Gemini / OpenAI)
├── utils/
│   └── helpers.py             # ← serialization, memory, filenames
├── models/                    # ← persisted ML artefacts
│   ├── intent_model.pkl
│   └── vectorizer.pkl
├── sample_data/
│   └── employees.csv          # ← sample dataset for testing
├── run.py                     # ← uvicorn entry point
├── train_model.py             # ← standalone model training script
├── test_api.py                # ← API test suite
├── requirements.txt
├── .env.example
└── README.md

Quick Start

Prerequisites

  • Python 3.9+
  • pip

1. Clone & install dependencies

cd backend

# Create virtual environment (recommended)
python -m venv venv

# Activate — Windows:
venv\Scripts\activate
# Activate — macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Configure environment (optional — for LLM fallback)

# Copy the template
copy .env.example .env      # Windows
cp .env.example .env        # macOS/Linux

# Edit .env and add your API key for LLM fallback:
# GEMINI_API_KEY=your_key_here
# or
# OPENAI_API_KEY=your_key_here

Note: The LLM fallback is optional. The custom ML model handles most queries on its own. The LLM only activates when model confidence drops below 0.75.

3. Train the intent model

python train_model.py

Expected output:

  Training accuracy:  1.0000
  Cross-val mean:     0.6607 ± 0.0319
  Samples:            280
  Classes:            13

  ✓ Model saved to models/intent_model.pkl
  ✓ Vectorizer saved to models/vectorizer.pkl

Tip: If you skip this step, the model auto-trains on first server startup.

4. Start the server

python run.py

The API is now live at http://localhost:8000

5. Run the test suite

# In a second terminal (server must be running)
pip install httpx   # one-time
python test_api.py

API Reference

Upload

Method Endpoint Description
POST /api/upload/file Upload CSV/Excel/JSON/Parquet
GET /api/upload/current Get current dataset info

EDA

Method Endpoint Description
GET /api/eda/summary Summary statistics
GET /api/eda/correlation Correlation matrix
GET /api/eda/skewness-kurtosis Distribution shape

AI Analyst

Method Endpoint Description
POST /api/ask Ask a natural language question
GET /api/ask/model-info Model metadata + supported intents

POST /api/ask — Request

{
  "query": "what is the average salary by department",
  "confidence_threshold": 0.75
}

POST /api/ask — Response

{
  "success": true,
  "answer": "Mean of salary grouped by department.",
  "data": [
    {"department": "Engineering", "mean_salary": 105100.0},
    {"department": "Sales", "mean_salary": 73125.0},
    {"department": "Marketing", "mean_salary": 74833.33},
    {"department": "HR", "mean_salary": 64600.0}
  ],
  "source": "model",
  "intent": "groupby",
  "confidence": 0.9234,
  "processing_time_ms": 12.45,
  "operation": {
    "operation": "groupby",
    "params": {
      "group_column": "department",
      "target_column": "salary",
      "agg": "mean"
    },
    "description": "Group by department, mean of salary"
  }
}

Preprocessing

Method Endpoint Description
POST /api/preprocess/missing-values Handle missing values
POST /api/preprocess/remove-duplicates Remove duplicates
POST /api/preprocess/encode Encode categoricals
POST /api/preprocess/scale Scale numericals
GET /api/preprocess/export Export cleaned data

cURL & Postman Examples

Below are copy-paste-ready examples. The server must be running at http://localhost:8000.

1. Upload a CSV

curl -X POST http://localhost:8000/api/upload/file \
  -F "file=@sample_data/employees.csv"

Postman: POST http://localhost:8000/api/upload/file → Body → form-data → key: file (File), value: select employees.csv.

2. Ask a natural language question

# Average salary
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "what is the average salary"}'

# Top 5 employees by salary
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "show top 5 employees by salary"}'

# Filter with operator
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "show rows where salary > 100000"}'

# Group by
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "salary by department"}'

# Correlation
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "correlation between salary and age"}'

# Value distribution
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "distribution of department column"}'

# Custom confidence threshold (lower = more rule engine, higher = more LLM)
curl -X POST http://localhost:8000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "sort by salary descending", "confidence_threshold": 0.5}'

Postman: POST http://localhost:8000/api/ask → Body → raw → JSON:

{
  "query": "show top 5 employees by salary",
  "confidence_threshold": 0.75
}

3. Health check & model info

curl http://localhost:8000/api/health
curl http://localhost:8000/api/ask/model-info

4. EDA endpoints

curl http://localhost:8000/api/eda/summary
curl http://localhost:8000/api/eda/correlation

Sample Queries

Once you've uploaded a dataset, try these at POST /api/ask:

Query Intent What it does
"what is the average salary" mean Computes df['salary'].mean()
"show top 5 employees by salary" top_n df.nlargest(5, 'salary')
"total salary" sum df['salary'].sum()
"how many rows are there" count len(df)
"show rows where salary > 100000" filter df[df['salary'] > 100000]
"salary by department" groupby df.groupby('department')['salary'].sum()
"minimum salary" min df['salary'].min()
"correlation between salary and age" correlation df['salary'].corr(df['age'])
"unique values in city" unique df['city'].unique()
"sort by salary descending" sort df.sort_values('salary', ascending=False)
"distribution of department" value_counts df['department'].value_counts()
"describe the dataset" describe df.describe()

Configuration

All settings can be overridden via environment variables or .env:

Variable Default Description
CONFIDENCE_THRESHOLD 0.75 Min model confidence for rule engine
LLM_PROVIDER gemini gemini or openai
GEMINI_API_KEY (empty) Google Gemini API key
OPENAI_API_KEY (empty) OpenAI API key
OPENAI_MODEL gpt-4o-mini OpenAI model name
DEBUG true Enable debug logging

How It Works

1. Intent Classification (NLP Layer)

The system uses a TF-IDF vectorizer with unigram + bigram features fed into a Logistic Regression classifier. The model is trained on 280 labelled examples spanning 13 intents.

2. Entity Extraction

A regex-based extractor pulls structured entities from the query:

  • Column names — fuzzy-matched against the loaded DataFrame columns
  • Numeric values"top 5"n=5
  • Conditions"greater than", ">", "above">
  • Filter values"salary > 100000"filter_value=100000
  • Sort direction"descending", "high to low"descending

3. Confidence Layer

If the classifier's top probability is ≥ 0.75, the query is routed to the rule engine. Otherwise, it's forwarded to the LLM fallback for interpretation.

4. Rule Engine

Deterministic mapping from (intent, entities) → Pandas operation specification. Each intent has a dedicated handler.

5. LLM Fallback

When the model is uncertain, the system sends a token-optimized prompt to Gemini or OpenAI, requesting only a structured JSON operation (no free-text responses allowed). If the LLM is also unavailable, the system gracefully falls back to the NLP model's best-guess intent via the rule engine — it's typically correct even at lower confidence.

6. Execution Engine

Both paths produce an operation dict:

{"operation": "mean", "params": {"column": "salary"}, "description": "..."}

This is safely executed against the in-memory DataFrame.


License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors