Unlock the power of your structured data with natural language queries. Whether you're exploring vast data lakes or querying relational databases, this NLQ data analyst Gen AI Agent provides an intuitive interface for data discovery and analysis. The agent automatically converts natural language queries into SQL to extract insights from structured data sources.
- Query structured data using natural language
- Supports multiple data sources:
  - Amazon S3 data lakes in JSON, CSV, Parquet, and other file formats (via Amazon Athena and AWS Glue)
  - SQL databases (MySQL, PostgreSQL, etc.)
  - Data warehouses
- Powered by Amazon Bedrock LLMs
- Interactive UI built with Chainlit
- Flexible agent architecture using LangChain/LangGraph
- Real-time query generation & execution
- Transparency with Agent chain of thought and steps
- Automatic data source schema inference and mapping
For example, the question "Who were the biggest game publishers in the decade 2000-2010?" might be translated into:

```sql
SELECT publisher, SUM(global_sales) AS total_sales
FROM video_game_sales
WHERE year BETWEEN '2000' AND '2010'
GROUP BY publisher
ORDER BY total_sales DESC
LIMIT 5;
```
- Users input natural language queries through the Chainlit UI
- The LangChain/LangGraph agent, powered by Amazon Bedrock, interprets the query
- The agent determines the appropriate data source(s) to query
- It generates the necessary SQL to retrieve data to answer the user's question
- Results are processed and presented back to the user in a readable format
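The exact wiring lives in the repository's Chainlit app and CDK stacks, but the core loop can be sketched with LangChain's SQL agent tooling and a Bedrock chat model. Everything below is illustrative: the model ID, Glue database, and Athena staging bucket are placeholders, and the project's real (LangGraph-based) implementation will differ in detail.

```python
# Illustrative sketch of the NL -> SQL loop; not the repository's exact code.
# Requires: pip install langchain-aws langchain-community "pyathena[sqlalchemy]"
from urllib.parse import quote_plus

from langchain_aws import ChatBedrock
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent

# Athena is reachable through SQLAlchemy via the PyAthena driver.
staging_dir = quote_plus("s3://my-athena-results/")  # placeholder bucket
db = SQLDatabase.from_uri(
    f"awsathena+rest://@athena.us-west-2.amazonaws.com:443/"
    f"my_glue_database?s3_staging_dir={staging_dir}"  # placeholder database
)

llm = ChatBedrock(
    model_id="anthropic.claude-3-5-haiku-20241022-v1:0",  # any enabled Bedrock model
    region_name="us-west-2",
)

# The agent inspects the table schemas, writes SQL, executes it via Athena,
# and summarises the result for the user.
agent = create_sql_agent(llm, db=db, agent_type="tool-calling", verbose=True)
agent.invoke({"input": "Which turbine has the highest vibration value?"})
```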
- Democratize data access across your organization
- Reduce dependency on data analysts for simple queries
- Accelerate data-driven decision making
- Seamlessly integrate with existing AWS infrastructure
- Scale effortlessly with serverless architecture
- Docker Installed. See here for installation instructions: https://docs.docker.com/desktop/
- Amazon Bedrock model access enabled:
  - (Required) Follow the instructions here on enabling model access. By default you need to enable at least Claude 3.5 Haiku in us-west-2.
  - (Optional) You can change the Amazon Bedrock inference region in `cdk/app.py`. For example, you may deploy the application infrastructure (ECS, VPC etc.) in ap-southeast-2 but use us-west-2 for Amazon Bedrock inference (see the sketch after this list). See here for model region availability.
- AWS CLI Credentials - see here for instructions
- AWS CDK CLI installed: https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html
- (Optional) Add your own datasets, or more from Kaggle. Follow the instructions under the Adding Your Own Data to the S3 Data Lake section. Otherwise only the wind farm dataset is included. Datasets can also be added after deployment.
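To illustrate the optional region split above, here is a hypothetical sketch of what `cdk/app.py` might look like when the infrastructure deploys to one region while Bedrock inference uses another. The stack class name and the `bedrock_region` parameter are assumptions for illustration, not the repository's actual code.

```python
# Hypothetical sketch only; the stack class and bedrock_region parameter are assumptions.
import os

import aws_cdk as cdk
from cdk.main_stack import MainStack  # assumed import, per cdk/cdk/main_stack.py

app = cdk.App()

MainStack(
    app,
    "NlqDataAnalystStack",
    # Application infrastructure (ECS, VPC, etc.) deploys here...
    env=cdk.Environment(
        account=os.environ.get("CDK_DEFAULT_ACCOUNT"),
        region="ap-southeast-2",
    ),
    # ...while Bedrock inference calls target this region.
    bedrock_region="us-west-2",
)

app.synth()
```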
- Change to the cdk directory:
  ```
  cd cdk
  ```
- Extract the example-data.zip to a folder specifically named 'example-data':
  - MacOS/Linux:
    ```
    unzip example-data.zip
    ```
  - Windows:
    ```
    # (Command Prompt):
    expand example-data.zip -F:*
    # or
    tar -xf example-data.zip

    # (PowerShell):
    Expand-Archive -Path example-data.zip -DestinationPath .
    ```
- Create a virtualenv:
  ```
  python -m venv .venv
  ```
- Activate the virtualenv:
  - MacOS/Linux:
    ```
    source .venv/bin/activate
    ```
  - Windows:
    ```
    # In cmd.exe
    .venv\Scripts\activate.bat
    # or in PowerShell
    .venv\Scripts\Activate.ps1
    ```
- Install the required dependencies:
  ```
  pip install -r requirements.txt
  ```
- Generate the chainlit auth secret (required for Cognito Auth). Follow either of these approaches:
  - Approach 1 - Run the respective command to automatically generate and save the secret:
    ```
    # Linux/Mac
    chainlit create-secret | grep "CHAINLIT_AUTH_SECRET" > .env

    # Windows
    chainlit create-secret | Select-String "CHAINLIT_AUTH_SECRET" | Out-File -FilePath ".env" -Encoding utf8NoBOM
    ```
    NOTE: Ignore any warnings/errors like the below:
    ```
    WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
    E0000 00:00:1738205048.894250 2795 init.cc:232] grpc_wait_for_shutdown_with_timeout() timed out.
    ```
  - Approach 2 - Create the file manually:
    - If it exists, delete the existing `cdk/.env` file
    - Generate the secret:
      ```
      chainlit create-secret
      ```
    - Copy only the line containing the generated secret (i.e. CHAINLIT_AUTH_SECRET="XXXXX")
    - Create a new file at `cdk/.env` and paste the secret into it.

  Important: The .env file should be placed directly in the top level `cdk` folder you are currently in, not in `cdk/cdk`.
- Bootstrap (only required on first run):
  ```
  cdk bootstrap
  ```
- Update the CDK generated ECR repo. Change the tag immutability setting to `mutable`:
  ```
  aws ecr put-image-tag-mutability --repository-name $(aws ecr describe-repositories --query 'repositories[?contains(repositoryName, `cdk-hnb659fds-container-assets`)].repositoryName' --output text) --image-tag-mutability MUTABLE
  ```
  See here for more info: https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html
- Deploy (10-15 mins):
  ```
  cdk deploy
  ```
  You may need to accept IAM statement changes by entering `y` or `n`.
- Run the Glue Crawler in the AWS Console: https://console.aws.amazon.com/glue/home#/v2/data-catalog/crawlers
- In order to access the site, create a Cognito user via the CognitoUserPoolConsoleLink in the CloudFormation Outputs.
- Access the chatbot at the CloudFrontDomain in the CloudFormation Outputs.
- Ask questions of the chatbot! Examples include:
  - How many turbines are in the database and what are their asset ids?
  - Which of these turbines has had the highest average temperature and what was it?
  - How was this average temp determined?
  - Which turbine has the highest vibration value, and when was it recorded?
  - Is this statistically anomalous, compared to the average value for turbine3?
  - Why would a turbine be producing more vibrations than usual?
  - When was the turbine last serviced?
  - What city would turbine3 be located in, based on long and lat values for the turbine?
Note: This will delete all resources within the stacks, including S3 buckets and all the data inside them.

- From the `~/cdk` folder, run:
  ```
  cdk destroy --all
  ```
- Create a folder in the `example-data` folder with your chosen table name:
  ```
  mkdir cdk/example-data/<new-table-name>
  ```
- Add your CSV, JSON Lines etc. files within this new folder, e.g. `~/cdk/example-data/health-data/health-stats.csv`.
  Note: it doesn't have to be a single file; it can be multiple (e.g. Parquet) files in the folder, similar to the wind farm data.
- Re-run CDK Deploy:
  ```
  cdk deploy
  ```
- Re-run the crawler to automatically create your tables: https://console.aws.amazon.com/glue/home#/v2/data-catalog/crawlers (or use the SDK, as sketched below).
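Because datasets can also be added after deployment, the same result can be achieved without re-running `cdk deploy`: upload the new files under a folder (one folder per table) in the data lake bucket and start the crawler via the SDK. The bucket and crawler names below are assumptions; take the real values from the CloudFormation Outputs or the Glue console.

```python
# Sketch only: bucket and crawler names are assumptions, not the stack's real names.
import boto3

DATA_BUCKET = "my-nlq-data-lake-bucket"     # see CloudFormation Outputs for the real name
CRAWLER_NAME = "my-nlq-data-lake-crawler"   # see the Glue console for the real name
TABLE_NAME = "health-data"                  # one folder (prefix) per table

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Upload the file under the table's prefix, mirroring the example-data layout.
s3.upload_file("health-stats.csv", DATA_BUCKET, f"{TABLE_NAME}/health-stats.csv")

# Re-run the crawler so the new table is registered in the Glue Data Catalog.
glue.start_crawler(Name=CRAWLER_NAME)
```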
Download one or all of the below Kaggle datasets, create a folder for each, and place the file(s) inside, e.g.:
```
mkdir cdk/example-data/video-game-sales
# then add the CSV as cdk/example-data/video-game-sales/video-game-sales.csv
```
From Kaggle:
Already included:
- Synthetic Wind Farm Turbines Data
(Kaggle Dataset Questions)
- who were the biggest game publishers in the decade 2000-2010? And give me the total sales
- what years were the biggest sales for nintendo?
- and list the biggest game for each year. put this all in table format
- on the titanic, for each class what was the percentage of passengers who survived?
- what were the latest turbine temps?
- what was the average temp for turbine 1 on 20 August?
- Update the `db_connection_string` parameter in `cdk/cdk/main_stack.py`. Click here to go to the specific line.
- Update `dialect` to your DB's dialect in `chainlit-app.py`. Click here to go to the specific line.
- Ensure the Fargate task has the necessary IAM permissions and networking to access your RDS instance.
- Configure RDS security groups to allow access from the Fargate service.
- Re-deploy using `cdk deploy`.
The connection string follows this format:
```
{dialect}://{username}:{password}@{host}:{port}/{database_name}
```
Example:
```
db_connection_string = "postgresql://myuser:mypassword@<rds-endpoint>:5432/mydatabase"
```
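Before redeploying, it can be worth sanity-checking the connection string locally with SQLAlchemy (which the SQL agent tooling relies on). This is only an illustrative sketch: the endpoint, credentials, and database name are placeholders, and the matching driver (e.g. psycopg2 for PostgreSQL) must be installed.

```python
# Illustrative connectivity check; endpoint, credentials and DB name are placeholders.
from sqlalchemy import create_engine, text

db_connection_string = "postgresql://myuser:mypassword@<rds-endpoint>:5432/mydatabase"

engine = create_engine(db_connection_string)
with engine.connect() as conn:
    # A trivial query proves the dialect, credentials and networking all line up.
    print(conn.execute(text("SELECT 1")).scalar())
```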
- Never commit credentials to version control
- Use AWS Secrets Manager for storing database credentials (see the sketch after this list)
- Ensure your RDS security group allows inbound traffic from the Fargate service's security group
- Use SSL/TLS for database connections when possible
- Consider using IAM Authentication
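To illustrate the Secrets Manager recommendation above, here is a minimal sketch of assembling the connection string from a secret at runtime instead of hard-coding credentials. The secret name and its JSON keys are assumptions (they follow the usual layout of RDS-managed secrets), not values defined by this stack.

```python
# Sketch only: secret name and key names are assumptions, not values used by the stack.
import json

import boto3


def build_connection_string(secret_name: str = "nlq/rds-credentials",
                            region: str = "us-west-2") -> str:
    client = boto3.client("secretsmanager", region_name=region)
    secret = json.loads(client.get_secret_value(SecretId=secret_name)["SecretString"])
    # Keys follow the typical RDS-generated secret layout
    # (username, password, host, port, dbname).
    return (f"postgresql://{secret['username']}:{secret['password']}"
            f"@{secret['host']}:{secret['port']}/{secret['dbname']}")
```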
- PostgreSQL
- MySQL
- Amazon Aurora
- Other SQL databases supported by SQLAlchemy
The LLM system prompts are generalised to work with any dataset; however, you may find it beneficial to customise the prompts to your dataset.
Edit the prompts in either:
- `cdk/prompts_stack.py`, then redeploy with `cdk deploy` (Recommended)
- The AWS Console, under Bedrock Prompt Management
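If you want to inspect the deployed prompts programmatically rather than through the console, the Bedrock Prompt Management API (exposed via the boto3 `bedrock-agent` client) can fetch them by ID. The prompt IDs are available in the CloudFormation Outputs; the region and response handling below are assumptions, so treat this as a sketch.

```python
# Sketch: fetch a managed prompt to see what the agent is currently instructed with.
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-west-2")  # assumed region

# Prompt IDs are exposed in the CloudFormation Outputs (BEDROCK_PROMPT_ID_1 / _2 below).
response = bedrock_agent.get_prompt(promptIdentifier="ABC123456")

for variant in response.get("variants", []):
    print(variant.get("name"))
    text_config = variant.get("templateConfiguration", {}).get("text", {})
    print(text_config.get("text"))
```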
- From the root folder of the project repository, install the requirements:
  ```
  pip install -r requirements.txt
  ```
- Create a `.env` file in the top level folder. You can find the required environment variables in the CloudFormation Outputs (a sketch of reading these variables follows these steps):
  ```
  ATHENA_CONNECTION_STRING="<PASTE_ATHENA_CONNECTION_STRING_HERE>"
  BEDROCK_PROMPT_ID_1="ABC123456"
  BEDROCK_PROMPT_ID_2="XYZ123456"
  AWS_DEFAULT_REGION="<REGION>"
  OAUTH_COGNITO_CLIENT_ID="<CLIENT_ID>"
  OAUTH_COGNITO_CLIENT_SECRET="<CLIENT_SECRET>"
  OAUTH_COGNITO_DOMAIN="<COGNITO_DOMAIN>"
  CHAINLIT_AUTH_SECRET="<CHAINLIT_SECRET>"
  ```
- Run the Chainlit App:
  ```
  chainlit run -w chainlit-app.py
  ```
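For reference, the app reads its configuration from these environment variables, and Chainlit picks up a local `.env` file automatically. If you need the same values in your own scripts, a generic sketch (not the repository's code) using python-dotenv:

```python
# Generic sketch (not the repository's code): load the .env created above and
# read the variables the app depends on.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

athena_uri = os.environ["ATHENA_CONNECTION_STRING"]
prompt_ids = [os.environ["BEDROCK_PROMPT_ID_1"], os.environ["BEDROCK_PROMPT_ID_2"]]
region = os.environ.get("AWS_DEFAULT_REGION", "us-west-2")

print(f"Region: {region}, prompt IDs: {prompt_ids}")
```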
- Amazon S3 for data storage
- AWS Glue for data cataloging and ETL
- Amazon Athena for querying data lakes
- Amazon Elastic Container Service (ECS) Fargate for scalable, serverless deployment
- Amazon Elastic Container Registry (ECR) for registering Docker images
- Amazon Bedrock for Large Language Models (LLMs) on demand
- Application Load Balancer
- Amazon CloudFront
- Amazon Cognito for user management
- Amazon CloudWatch for logging
- AWS Identity and Access Management (IAM) for access controls
- AWS Secrets Manager
- Amazon Virtual Private Cloud (VPC)
- CDK generated resources:
- AWS Lambda
- Customers should consider using AWS WAF with Amazon CloudFront and their Application Load Balancers (ALB). AWS WAF is a web application firewall that helps protect web applications from attacks by allowing you to configure rules that allow, block, or monitor (count) web requests.
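As a hedged illustration of that recommendation (not part of this repository's CDK stacks): in CDK, an existing WAF web ACL can be associated with the CloudFront distribution via the distribution's `web_acl_id` property. A CLOUDFRONT-scoped web ACL must be created in us-east-1; a REGIONAL one can similarly be associated with the ALB. Construct names, the ALB domain, and the web ACL ARN below are placeholders.

```python
# Illustrative only; assumes a CLOUDFRONT-scoped web ACL already exists (in us-east-1),
# e.g. created in the AWS WAF console, and that this code runs inside a Stack class.
from aws_cdk import aws_cloudfront as cloudfront
from aws_cdk import aws_cloudfront_origins as origins

web_acl_arn = "arn:aws:wafv2:us-east-1:123456789012:global/webacl/nlq-web-acl/EXAMPLE"

distribution = cloudfront.Distribution(
    self, "NlqDistribution",  # construct names are hypothetical
    default_behavior=cloudfront.BehaviorOptions(
        origin=origins.HttpOrigin("my-alb-dns-name.elb.amazonaws.com"),  # assumed ALB DNS
    ),
    # Associate the web ACL so WAF rules can allow, block, or count requests.
    web_acl_id=web_acl_arn,
)
```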