
SAHAYAK – Internship Extraction AI Assistant

A multilingual internship search assistant that uses LangChain + Llama 3 (Groq) to convert natural language questions into SQL queries over a local SQLite database of internships. It returns clickable internship links, formatted answers, and an HTML table view, all accessible through a simple Flask + Jinja web UI and deployed as a Hugging Face Space.

👉 Live Demo: https://huggingface.co/spaces/joshi-deepak08/Internship_extraction_chatbot-Sahayak


Table of Contents

  1. Overview
  2. Features
  3. Folder Structure
  4. How to Run Locally
  5. Architecture & Design Decisions
  6. Approach
  7. Pipeline Design
  8. Challenges & Trade-Offs

Overview

SAHAYAK is an AI assistant that helps students discover internships from a structured local database using natural language queries.

Key capabilities:

  • Understands user questions in any language (auto-detected & translated).

  • Uses LangChain + Llama 3 (Groq) to generate SQL queries dynamically.

  • Executes those SQL queries against a local internship.db SQLite database.

  • Returns:

    • a human-friendly answer with Markdown links
    • an optional HTML table of internships (title, link, stipend).
  • Frontend is a Flask app with a simple chat-style interface showing previous conversation history.


Features

  • NL → SQL using LangChain + Llama 3
  • Multilingual support via deep_translator (input + output).
  • SQLite database of internships (internship.db).
  • Tabular view of selected query results (e.g., internships with low stipend).
  • Conversation history maintained in memory for UX.
  • Deployed on Hugging Face Space for easy access.
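
To illustrate the tabular view, here is a minimal stdlib sketch of turning (title, link, stipend) rows into an HTML table with clickable links. The column set is an assumption for illustration; the real internship.db schema and the app's actual rendering code may differ (the project could equally use pandas' to_html).

```python
import html

def internships_to_html(rows):
    """Render (title, link, stipend) tuples as an HTML table with clickable links."""
    header = "<tr><th>Title</th><th>Stipend</th></tr>"
    body = "".join(
        f"<tr><td><a href=\"{html.escape(link, quote=True)}\">{html.escape(title)}</a></td>"
        f"<td>{html.escape(str(stipend))}</td></tr>"
        for title, link, stipend in rows
    )
    return f"<table>{header}{body}</table>"
```

Escaping every cell matters here because the table is injected into a Jinja page; unescaped database content could otherwise break the markup.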

Folder Structure

Internship-Extraction-Chatbot/
│
├── SAHAYAK: An AI Assistant/
│   ├── static/               # CSS, JS, assets for UI
│   ├── templates/
│   │   └── index.html        # Chat UI (Flask + Jinja)
│   ├── app.py                # Flask app + LangChain pipeline
│   ├── internship.db         # SQLite database with internship records
│   └── README.md

(Your GitHub repo root may directly contain these files depending on layout.)


βš™οΈ How to Run Locally

1️⃣ Clone the repo

git clone https://github.com/JoshiDeepak08/Internship-Extraction-Chatbot.git
cd "Internship-Extraction-Chatbot/SAHAYAK: An AI Assistant"

(or navigate into the folder that contains app.py and internship.db.)

2️⃣ Create & activate a virtual environment (recommended)

python -m venv venv
source venv/bin/activate      # macOS / Linux
# or
venv\Scripts\activate         # Windows

3️⃣ Install dependencies

pip install -r requirements.txt   # if present

If there is no requirements.txt at this level, install the key libs manually:

pip install flask pandas langchain langchain-community langchain-groq \
            deep-translator

(sqlite3 ships with Python's standard library, so it does not need to be installed via pip.)

4️⃣ Set environment variables (Groq API key)

In app.py you currently have:

os.environ["OPENAI_API_KEY"] = ""
os.environ["GROQ_API_KEY"] = ""

Instead of hardcoding, export these before running:

export GROQ_API_KEY="your_groq_api_key_here"
# (OPENAI_API_KEY can remain empty; it's not used right now)

On Windows (PowerShell):

$env:GROQ_API_KEY="your_groq_api_key_here"

Alternatively, you can directly set them in the code (not recommended for production).
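
A small helper along these lines could replace the hardcoded assignments in app.py, failing fast at startup if the key is missing. The function name is hypothetical, not from the repo:

```python
import os

def require_groq_key() -> str:
    """Fail fast at startup if GROQ_API_KEY is missing, instead of hardcoding it."""
    key = os.environ.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set. Export it before running: "
            "export GROQ_API_KEY=your_key"
        )
    return key
```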

5️⃣ Run the Flask app

python app.py

By default it will start in debug mode at:

http://127.0.0.1:5000

Open this in your browser to use the chatbot.


Architecture & Design Decisions

Tech Stack

  • Backend: Flask
  • Database: SQLite (internship.db)
  • LLM: llama3-8b-8192 via ChatGroq
  • Orchestration: LangChain (SQLDatabase, create_sql_query_chain, QuerySQLDataBaseTool)
  • Translation: deep_translator.GoogleTranslator
  • Frontend: Jinja2 HTML template (templates/index.html) + simple CSS/Bootstrap.

Why LangChain SQL Chain?

  • Automatically maps natural language to SQL given the DB schema.
  • Handles query generation + execution pipeline in a few lines.
  • Easy to change LLM or database backend later.

Why Groq Llama 3?

  • Fast inference, good cost-performance.
  • Open-weight model with strong reasoning over structured tasks like SQL generation.

Why SQLite?

  • Lightweight, file-based DB.
  • Perfect for a single-file internship dataset.
  • Easy to ship with the repo and deploy on Hugging Face Space.
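
As a self-contained illustration of the SQLite layer, the sketch below builds an in-memory table with an assumed (title, link, stipend) schema and runs the kind of query the LLM might generate for "internships with low stipend". The real internship.db columns may differ:

```python
import sqlite3

# Assumed schema for illustration only; internship.db may use different columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE internships (title TEXT, link TEXT, stipend INTEGER)")
conn.executemany(
    "INSERT INTO internships VALUES (?, ?, ?)",
    [
        ("Data Science Intern", "https://example.com/ds", 10000),
        ("Web Dev Intern", "https://example.com/web", 4000),
    ],
)

# The sort of query the LLM might generate for a low-stipend question:
rows = conn.execute(
    "SELECT title, link, stipend FROM internships WHERE stipend < 5000"
).fetchall()
```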

Approach

  1. User types a question (can be in Hindi, English, or any language).

  2. The app:

    • Translates the question → English.
    • Uses create_sql_query_chain with ChatGroq and SQLDatabase to generate a SQL query.
    • Extracts the SQL text from the LLM output (generated_query.split("SQLQuery: ")[-1].strip()).
    • Executes SQL using QuerySQLDataBaseTool.
  3. The raw SQL result (rows) is:

    • Optionally converted into an HTML table (especially for internship listings).
    • Passed to another LLM prompt (answer_prompt) to generate a short, readable summary.
  4. The answer (in English) is translated back to the user’s original language using GoogleTranslator.

  5. The final, translated response and HTML table are rendered on the page.
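
The SQL-extraction step in the pipeline above is just string parsing; a helper mirroring the split used in app.py looks like this:

```python
def extract_sql(generated_query: str) -> str:
    """Pull the bare SQL out of the chain's output, which may be
    prefixed with 'SQLQuery: ' (mirrors the split used in app.py)."""
    return generated_query.split("SQLQuery: ")[-1].strip()
```

Because the split takes the last element, output without the prefix passes through unchanged.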


Pipeline Design

High-Level Flow

flowchart TD
A["User Question (Any Language)"] --> B["Flask Route"]
B --> C["Translate to English\ndeep-translator"]
C --> D["LangChain SQL Query Chain"]
D --> E["Generated SQL Query"]
E --> F["Execute on SQLite DB"]
F --> G["Raw SQL Result Rows"]
G --> H["LLM Summary (answer_prompt)"]
H --> I["Translate Back to Original Language"]
I --> J["Render Answer + HTML Table"]


Challenges & Trade-Offs

1. LLM-Generated SQL Safety

  • Direct LLM-to-SQL can be risky if DB has write/drop access.
  • Here, DB is read-only and local, so impact is contained.

2. Translation Accuracy

  • Using GoogleTranslator for both directions introduces possible semantic drift, but brings the large benefit of a multilingual UX.
  • Trade-off: slight inaccuracy vs accessibility for non-English users.

3. Schema-Dependent Queries

  • The quality of SQL generation depends heavily on:

    • Clear column names in internship.db.
    • Proper metadata exposure via SQLDatabase.from_uri.
  • If schema changes, prompts may need updating.

4. Stateless vs Stateful Conversations

  • Currently, only a simple in-memory list, previous_conversations, is used.
  • There is no complex context or multi-turn reasoning yet, but this is good enough for a first prototype.
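
If the plain list ever becomes a memory concern, a bounded deque is a drop-in alternative; this is a suggested sketch, with the 20-turn cap chosen arbitrarily:

```python
from collections import deque

# A bounded alternative to a plain list: keeps only the last N turns,
# so the in-memory history cannot grow without bound.
previous_conversations = deque(maxlen=20)

def add_turn(question: str, answer: str) -> None:
    previous_conversations.append({"question": question, "answer": answer})
```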
