Unlock the power of your structured data with natural language queries. Whether you're exploring vast data lakes or querying relational databases, this NLQ data analyst Gen AI Agent provides an intuitive interface for data discovery and analysis. The agent automatically converts natural language queries into SQL to extract insights from structured data sources.
- Query structured data using natural language
- Supports multiple data sources:
  - Amazon S3 data lakes in JSON, CSV, Parquet, and other file formats (via Amazon Athena and AWS Glue)
  - SQL databases (MySQL, PostgreSQL, etc.)
  - Data warehouses
- Powered by Amazon Bedrock LLMs
- Interactive UI built with Chainlit
- Flexible agent architecture using LangChain/LangGraph
- Real-time query generation & execution
- Transparency with Agent chain of thought and steps
- Automatic data source schema inference and mapping
For example, the question "Who were the biggest game publishers in the decade 2000-2010?" might be translated into:

```sql
SELECT publisher, SUM(global_sales) AS total_sales
FROM video_game_sales
WHERE year BETWEEN '2000' AND '2010'
GROUP BY publisher
ORDER BY total_sales DESC
LIMIT 5;
```
- Users input natural language queries through the Chainlit UI
- The LangChain/LangGraph agent, powered by Amazon Bedrock, interprets the query
- The agent determines the appropriate data source(s) to query
- It generates the necessary SQL to retrieve data to answer the user's question
- Results are processed and presented back to the user in a readable format
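The exact wiring lives in the repository's Chainlit app and CDK stacks, but the core loop can be sketched with LangChain's SQL agent tooling and a Bedrock chat model. Everything below is illustrative: the model ID, Glue database, and Athena staging bucket are placeholders, and the project's real (LangGraph-based) implementation will differ in detail.

```python
# Illustrative sketch of the NL -> SQL loop; not the repository's exact code.
# Requires: pip install langchain-aws langchain-community "pyathena[sqlalchemy]"
from urllib.parse import quote_plus

from langchain_aws import ChatBedrock
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent

# Athena is reachable through SQLAlchemy via the PyAthena driver.
staging_dir = quote_plus("s3://my-athena-results/")  # placeholder bucket
db = SQLDatabase.from_uri(
    f"awsathena+rest://@athena.us-west-2.amazonaws.com:443/"
    f"my_glue_database?s3_staging_dir={staging_dir}"  # placeholder database
)

llm = ChatBedrock(
    model_id="anthropic.claude-3-5-haiku-20241022-v1:0",  # any enabled Bedrock model
    region_name="us-west-2",
)

# The agent inspects the table schemas, writes SQL, executes it via Athena,
# and summarises the result for the user.
agent = create_sql_agent(llm, db=db, agent_type="tool-calling", verbose=True)
agent.invoke({"input": "Which turbine has the highest vibration value?"})
```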
- Democratize data access across your organization
- Reduce dependency on data analysts for simple queries
- Accelerate data-driven decision making
- Seamlessly integrate with existing AWS infrastructure
- Scale effortlessly with serverless architecture
- Docker Installed. See here for installation instructions: https://docs.docker.com/desktop/
- Amazon Bedrock model access enabled:
  - (Required) Follow the instructions here on enabling model access. By default you need to enable at least Claude 3.5 Haiku in us-west-2.
  - (Optional) You can change the Amazon Bedrock inference region in `cdk/app.py`. For example, you may deploy the application infrastructure (ECS, VPC etc.) in ap-southeast-2 but use us-west-2 for Amazon Bedrock inference (see the sketch after this list). See here for model region availability.
- AWS CLI Credentials - see here for instructions
- AWS CDK CLI installed: https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html
- (Optional) Add your own datasets, or more from Kaggle. Follow the instructions under the Adding Your Own Data to the S3 Data Lake section. Otherwise only the wind farm dataset is included. Datasets can also be added after deployment.
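To illustrate the optional region split above, here is a hypothetical sketch of what `cdk/app.py` might look like when the infrastructure deploys to one region while Bedrock inference uses another. The stack class name and the `bedrock_region` parameter are assumptions for illustration, not the repository's actual code.

```python
# Hypothetical sketch only; the stack class and bedrock_region parameter are assumptions.
import os

import aws_cdk as cdk
from cdk.main_stack import MainStack  # assumed import, per cdk/cdk/main_stack.py

app = cdk.App()

MainStack(
    app,
    "NlqDataAnalystStack",
    # Application infrastructure (ECS, VPC, etc.) deploys here...
    env=cdk.Environment(
        account=os.environ.get("CDK_DEFAULT_ACCOUNT"),
        region="ap-southeast-2",
    ),
    # ...while Bedrock inference calls target this region.
    bedrock_region="us-west-2",
)

app.synth()
```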
- Change to the cdk directory:
  ```
  cd cdk
  ```
- Extract the example-data.zip to a folder specifically named 'example-data':
  - MacOS/Linux:
    ```
    unzip example-data.zip
    ```
  - Windows:
    ```
    # (Command Prompt):
    expand example-data.zip -F:*
    # or
    tar -xf example-data.zip

    # (PowerShell):
    Expand-Archive -Path example-data.zip -DestinationPath .
    ```
- Create a virtualenv:
  ```
  python -m venv .venv
  ```
- Activate the virtualenv:
  - MacOS/Linux:
    ```
    source .venv/bin/activate
    ```
  - Windows:
    ```
    # In cmd.exe
    .venv\Scripts\activate.bat
    # or in PowerShell
    .venv\Scripts\Activate.ps1
    ```
- Install the required dependencies:
  ```
  pip install -r requirements.txt
  ```
- Generate the chainlit auth secret (required for Cognito Auth). Follow either of these approaches:
  - Approach 1 - Run the respective command to automatically generate and save the secret:
    ```
    # Linux/Mac
    chainlit create-secret | grep "CHAINLIT_AUTH_SECRET" > .env

    # Windows
    chainlit create-secret | Select-String "CHAINLIT_AUTH_SECRET" | Out-File -FilePath ".env" -Encoding utf8NoBOM
    ```
    NOTE: Ignore any warnings/errors like the below:
    ```
    WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
    E0000 00:00:1738205048.894250 2795 init.cc:232] grpc_wait_for_shutdown_with_timeout() timed out.
    ```
  - Approach 2 - Create the file manually:
    - If it exists, delete the existing `cdk/.env` file
    - Generate the secret:
      ```
      chainlit create-secret
      ```
    - Copy only the line containing the generated secret (i.e. CHAINLIT_AUTH_SECRET="XXXXX")
    - Create a new file at `cdk/.env` and paste the secret into it.

  Important: The .env file should be placed directly in the top level `cdk` folder you are currently in, not in `cdk/cdk`.
- Bootstrap (only required on first run):
  ```
  cdk bootstrap
  ```
- Update the CDK generated ECR repo. Change the tag immutability setting to `mutable`:
  ```
  aws ecr put-image-tag-mutability --repository-name $(aws ecr describe-repositories --query 'repositories[?contains(repositoryName, `cdk-hnb659fds-container-assets`)].repositoryName' --output text) --image-tag-mutability MUTABLE
  ```
  See here for more info: https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html
- Deploy (10-15 mins):
  ```
  cdk deploy
  ```
  You may need to accept IAM statement changes by entering `y` or `n`.
- Run the Glue Crawler in the AWS Console: https://console.aws.amazon.com/glue/home#/v2/data-catalog/crawlers
- In order to access the site, create a Cognito user via the CognitoUserPoolConsoleLink in the CloudFormation Outputs.
- Access the chatbot at the CloudFrontDomain in the CloudFormation Outputs.
- Ask questions of the chatbot! Examples include:
  - How many turbines are in the database and what are their asset ids?
  - Which of these turbines has had the highest average temperature and what was it?
  - How was this average temp determined?
  - Which turbine has the highest vibration value, and when was it recorded?
  - Is this statistically anomalous, compared to the average value for turbine3?
  - Why would a turbine be producing more vibrations than usual?
  - When was the turbine last serviced?
  - What city would turbine3 be located in, based on long and lat values for the turbine?
Note: This will delete all resources within the stacks, including S3 buckets and all the data inside them.

- From the `~/cdk` folder, run:
  ```
  cdk destroy --all
  ```
- Create a folder in the `example-data` folder with your chosen table name:
  ```
  mkdir cdk/example-data/<new-table-name>
  ```
- Add your CSV, JSON Lines etc. files within this new folder, e.g. `~/cdk/example-data/health-data/health-stats.csv`.
  Note: it doesn't have to be a single file; it can be multiple (e.g. Parquet) files in the folder, similar to the wind farm data.
- Re-run CDK Deploy:
  ```
  cdk deploy
  ```
- Re-run the crawler to automatically create your tables: https://console.aws.amazon.com/glue/home#/v2/data-catalog/crawlers (or use the SDK, as sketched below).
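Because datasets can also be added after deployment, the same result can be achieved without re-running `cdk deploy`: upload the new files under a folder (one folder per table) in the data lake bucket and start the crawler via the SDK. The bucket and crawler names below are assumptions; take the real values from the CloudFormation Outputs or the Glue console.

```python
# Sketch only: bucket and crawler names are assumptions, not the stack's real names.
import boto3

DATA_BUCKET = "my-nlq-data-lake-bucket"     # see CloudFormation Outputs for the real name
CRAWLER_NAME = "my-nlq-data-lake-crawler"   # see the Glue console for the real name
TABLE_NAME = "health-data"                  # one folder (prefix) per table

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Upload the file under the table's prefix, mirroring the example-data layout.
s3.upload_file("health-stats.csv", DATA_BUCKET, f"{TABLE_NAME}/health-stats.csv")

# Re-run the crawler so the new table is registered in the Glue Data Catalog.
glue.start_crawler(Name=CRAWLER_NAME)
```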
Download one or all of the below Kaggle datasets, create a folder for each, and place the file(s) inside, e.g.:
```
mkdir cdk/example-data/video-game-sales
# then add the CSV as cdk/example-data/video-game-sales/video-game-sales.csv
```
From Kaggle:
Already included:
- Synthetic Wind Farm Turbines Data
(Kaggle Dataset Questions)
- who were the biggest game publishers in the decade 2000-2010? And give me the total sales
- what years were the biggest sales for nintendo?
- and list the biggest game for each year. put this all in table format
- on the titanic, for each class what was the percentage of passengers who survived?
- what were the latest turbine temps?
- what was the average temp for turbine 1 on 20 August?
- Update the `db_connection_string` parameter in `cdk/cdk/main_stack.py`. Click here to go to the specific line.
- Update `dialect` to your DB's dialect in `chainlit-app.py`. Click here to go to the specific line.
- Ensure the Fargate task has the necessary IAM permissions and networking to access your RDS instance.
- Configure RDS security groups to allow access from the Fargate service.
- Re-deploy using `cdk deploy`.
The connection string follows this format:
```
{dialect}://{username}:{password}@{host}:{port}/{database_name}
```
Example:
```
db_connection_string = "postgresql://myuser:mypassword@<rds-endpoint>:5432/mydatabase"
```
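Before redeploying, it can be worth sanity-checking the connection string locally with SQLAlchemy (which the SQL agent tooling relies on). This is only an illustrative sketch: the endpoint, credentials, and database name are placeholders, and the matching driver (e.g. psycopg2 for PostgreSQL) must be installed.

```python
# Illustrative connectivity check; endpoint, credentials and DB name are placeholders.
from sqlalchemy import create_engine, text

db_connection_string = "postgresql://myuser:mypassword@<rds-endpoint>:5432/mydatabase"

engine = create_engine(db_connection_string)
with engine.connect() as conn:
    # A trivial query proves the dialect, credentials and networking all line up.
    print(conn.execute(text("SELECT 1")).scalar())
```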
- Never commit credentials to version control
- Use AWS Secrets Manager for storing database credentials (see the sketch after this list)
- Ensure your RDS security group allows inbound traffic from the Fargate service's security group
- Use SSL/TLS for database connections when possible
- Consider using IAM Authentication
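To illustrate the Secrets Manager recommendation above, here is a minimal sketch of assembling the connection string from a secret at runtime instead of hard-coding credentials. The secret name and its JSON keys are assumptions (they follow the usual layout of RDS-managed secrets), not values defined by this stack.

```python
# Sketch only: secret name and key names are assumptions, not values used by the stack.
import json

import boto3


def build_connection_string(secret_name: str = "nlq/rds-credentials",
                            region: str = "us-west-2") -> str:
    client = boto3.client("secretsmanager", region_name=region)
    secret = json.loads(client.get_secret_value(SecretId=secret_name)["SecretString"])
    # Keys follow the typical RDS-generated secret layout
    # (username, password, host, port, dbname).
    return (f"postgresql://{secret['username']}:{secret['password']}"
            f"@{secret['host']}:{secret['port']}/{secret['dbname']}")
```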
- PostgreSQL
- MySQL
- Amazon Aurora
- Other SQL databases supported by SQLAlchemy
The LLM system prompts are generalised to work with any dataset; however, you may find it beneficial to customise the prompts to your dataset.
Edit the prompts in either:
- `cdk/prompts_stack.py`, then redeploy with `cdk deploy` (Recommended)
- The AWS Console, under Bedrock Prompt Management
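If you want to inspect the deployed prompts programmatically rather than through the console, the Bedrock Prompt Management API (exposed via the boto3 `bedrock-agent` client) can fetch them by ID. The prompt IDs are available in the CloudFormation Outputs; the region and response handling below are assumptions, so treat this as a sketch.

```python
# Sketch: fetch a managed prompt to see what the agent is currently instructed with.
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-west-2")  # assumed region

# Prompt IDs are exposed in the CloudFormation Outputs (BEDROCK_PROMPT_ID_1 / _2 below).
response = bedrock_agent.get_prompt(promptIdentifier="ABC123456")

for variant in response.get("variants", []):
    print(variant.get("name"))
    text_config = variant.get("templateConfiguration", {}).get("text", {})
    print(text_config.get("text"))
```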
- From the root folder of the project repository, install the requirements:
  ```
  pip install -r requirements.txt
  ```
- Create a `.env` file in the top level folder. You can find the required environment variables in the CloudFormation Outputs (a sketch of reading these variables follows these steps):
  ```
  ATHENA_CONNECTION_STRING="<PASTE_ATHENA_CONNECTION_STRING_HERE>"
  BEDROCK_PROMPT_ID_1="ABC123456"
  BEDROCK_PROMPT_ID_2="XYZ123456"
  AWS_DEFAULT_REGION="<REGION>"
  OAUTH_COGNITO_CLIENT_ID="<CLIENT_ID>"
  OAUTH_COGNITO_CLIENT_SECRET="<CLIENT_SECRET>"
  OAUTH_COGNITO_DOMAIN="<COGNITO_DOMAIN>"
  CHAINLIT_AUTH_SECRET="<CHAINLIT_SECRET>"
  ```
- Run the Chainlit App:
  ```
  chainlit run -w chainlit-app.py
  ```
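For reference, the app reads its configuration from these environment variables, and Chainlit picks up a local `.env` file automatically. If you need the same values in your own scripts, a generic sketch (not the repository's code) using python-dotenv:

```python
# Generic sketch (not the repository's code): load the .env created above and
# read the variables the app depends on.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

athena_uri = os.environ["ATHENA_CONNECTION_STRING"]
prompt_ids = [os.environ["BEDROCK_PROMPT_ID_1"], os.environ["BEDROCK_PROMPT_ID_2"]]
region = os.environ.get("AWS_DEFAULT_REGION", "us-west-2")

print(f"Region: {region}, prompt IDs: {prompt_ids}")
```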
- Amazon S3 for data storage
- AWS Glue for data cataloging and ETL
- Amazon Athena for querying data lakes
- Amazon Elastic Container Service (ECS) Fargate for scalable, serverless deployment
- Amazon Elastic Container Registry (ECR) for registering Docker images
- Amazon Bedrock for Large Language Models (LLMs) on demand
- Application Load Balancer
- Amazon CloudFront
- Amazon Cognito for user management
- Amazon CloudWatch for logging
- AWS Identity and Access Management (IAM) for access controls
- AWS Secrets Manager
- Amazon Virtual Private Cloud (VPC)
- CDK generated resources:
- AWS Lambda
- Customers should consider using AWS WAF with Amazon CloudFront and their Application Load Balancers (ALB). AWS WAF is a web application firewall that helps protect web applications from attacks by allowing you to configure rules that allow, block, or monitor (count) web requests.
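As a hedged illustration of that recommendation (not part of this repository's CDK stacks): in CDK, an existing WAF web ACL can be associated with the CloudFront distribution via the distribution's `web_acl_id` property. A CLOUDFRONT-scoped web ACL must be created in us-east-1; a REGIONAL one can similarly be associated with the ALB. Construct names, the ALB domain, and the web ACL ARN below are placeholders.

```python
# Illustrative only; assumes a CLOUDFRONT-scoped web ACL already exists (in us-east-1),
# e.g. created in the AWS WAF console, and that this code runs inside a Stack class.
from aws_cdk import aws_cloudfront as cloudfront
from aws_cdk import aws_cloudfront_origins as origins

web_acl_arn = "arn:aws:wafv2:us-east-1:123456789012:global/webacl/nlq-web-acl/EXAMPLE"

distribution = cloudfront.Distribution(
    self, "NlqDistribution",  # construct names are hypothetical
    default_behavior=cloudfront.BehaviorOptions(
        origin=origins.HttpOrigin("my-alb-dns-name.elb.amazonaws.com"),  # assumed ALB DNS
    ),
    # Associate the web ACL so WAF rules can allow, block, or count requests.
    web_acl_id=web_acl_arn,
)
```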