chinedunewbirth/Data-Analysis-AI-Agent


Data Analysis AI Agent

An intelligent data analysis tool powered by GPT-4

Transform your data exploration with natural language queries, automated insights, and interactive visualizations


Features

AI-Powered Analysis

Feature | Description
Natural Language Queries | Ask questions about your data in plain English
GPT-4 Integration | Leverage advanced AI for intelligent data insights
Code Generation | Get Python code suggestions for custom analysis
Analysis History | Track your analysis queries and results

Comprehensive Analytics

Analysis Type | Capabilities
Descriptive Statistics | Automatic statistical summaries and data profiling
Correlation Analysis | Identify relationships between variables with heatmaps
Outlier Detection | Spot anomalies using IQR and Z-score methods
Clustering | K-means clustering with interactive visualizations
PCA | Principal Component Analysis for dimensionality reduction
Group Comparisons | Statistical testing (t-test, ANOVA, Kruskal-Wallis)

Interactive Visualizations

Visualization | Description
Distribution Plots | Histograms, bar charts, and density plots
Correlation Heatmaps | Visual correlation matrices with color coding
Scatter Plots | Relationship analysis with trend lines and grouping
Time Series | Temporal trend analysis and seasonal patterns
Custom Charts | Plotly-powered interactive and responsive visualizations

Data Processing

Feature | Details
Multi-format Support | CSV, Excel (.xlsx, .xls), JSON, Parquet
Data Cleaning | Automated cleaning, missing value handling, duplicate removal
Type Detection | Smart data type inference and conversion
Quality Reports | Comprehensive data quality assessment and profiling

Quick Start

Prerequisites

Before you begin, ensure you have:

  • Python 3.8+ installed on your system
  • An OpenAI API key (available from your OpenAI account)
  • Internet connection for AI analysis features

1. Installation

Option A: Using Git (Recommended)

# Clone the repository
git clone <repository-url>
cd Data-Analysis-AI-Agent

# Create a virtual environment (recommended)
python -m venv data-analysis-env

# Activate virtual environment
# On Windows:
data-analysis-env\Scripts\activate
# On macOS/Linux:
source data-analysis-env/bin/activate

# Install dependencies
pip install -r requirements.txt

Option B: Direct Download

# Download and extract the project
cd Data-Analysis-AI-Agent

# Install dependencies
pip install -r requirements.txt

2. Configuration

Step 1: Create Environment File

# Copy the environment template
cp .env.template .env

Step 2: Configure API Key

Edit the .env file and add your OpenAI API key:

# Required: Your OpenAI API key
OPENAI_API_KEY=sk-your_actual_openai_api_key_here

# Optional: Customize other settings
STREAMLIT_SERVER_PORT=8501
MAX_FILE_SIZE_MB=100

Important: Never commit your .env file to version control. Keep your API key secure!

Step 3: Test Configuration (Optional)

# Run the setup script to verify everything is working
python run.py

3. Launch the Application

Method 1: Using Streamlit directly

streamlit run app.py

Method 2: Using the startup script (Recommended)

python run.py

4. Access the Application

The web interface will automatically open at:

http://localhost:8501

If it doesn't open automatically, copy and paste the URL into your browser.

5. First Time Setup

  1. API Key Verification: The sidebar will show "✅ AI Agent ready!" if your API key is configured correctly
  2. Upload Sample Data: Try uploading the included data/sample_sales_data.csv file
  3. Test Analysis: Ask a simple question like "What are the main patterns in this data?"

You're Ready!

Congratulations! Your Data Analysis AI Agent is now running. Start exploring your data with natural language queries!

Usage Guide

Getting Started

Step | Action | Description
1 | Configure API Key | Enter your OpenAI API key in the sidebar
2 | Upload Data | Use the file uploader to load your dataset
3 | Start Analyzing | Use any of the analysis tabs to explore your data

AI Analysis Tab

Purpose: Natural language querying and AI-powered insights

Features:

  • Natural Language Interface: Type questions in plain English
  • GPT-4 Analysis: Get intelligent insights and recommendations
  • Code Generation: Receive Python code suggestions
  • Analysis History: Track your previous queries and results

Example Queries:

"What are the main trends in this dataset?"
"Which variables are most correlated?"
"Are there any unusual patterns or outliers?"
"Compare sales performance across regions"
"Show me seasonal patterns in the data"
"What insights can you provide about customer behavior?"

Quick Stats Tab

Purpose: Instant dataset overview and basic statistics

What You'll See:

  • Dataset Overview: Rows, columns, missing values, duplicates
  • Data Preview: First 10 rows of your dataset
  • Column Information: Data types, null counts, unique values
  • Statistical Summary: Mean, median, std dev for numeric columns

Key Metrics:

Metric | Description
Total Rows | Number of records in your dataset
Total Columns | Number of variables/features
Missing Values | Count of null/empty values
Duplicates | Number of duplicate rows
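
The four key metrics can be illustrated with a short standard-library sketch. It is not the project's implementation: `dataset_overview` and the inline sample data are hypothetical, and treating empty strings as missing is an assumption.

```python
import csv
import io


def dataset_overview(rows):
    """Summarise a list of dict rows: row/column counts, missing cells, duplicate rows."""
    columns = list(rows[0].keys()) if rows else []
    # Assumption: None and empty or whitespace-only strings count as missing.
    missing = sum(
        1
        for row in rows
        for col in columns
        if row.get(col) is None or str(row.get(col)).strip() == ""
    )
    # A row is a duplicate if an identical row appeared earlier.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "total_rows": len(rows),
        "total_columns": len(columns),
        "missing_values": missing,
        "duplicates": duplicates,
    }


# Hypothetical sample data, not the bundled sample_sales_data.csv.
SAMPLE_CSV = """region,units,price
North,10,2.5
South,,3.0
North,10,2.5
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
overview = dataset_overview(rows)
print(overview)
```

Here the second row has one empty cell and the third row repeats the first, so the sketch reports one missing value and one duplicate.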

Visualizations Tab

Purpose: Create interactive charts and visual analysis

Available Visualizations:

Visualization | Use Case | Features
Distribution Plots | Understand data distribution | Histograms, bar charts, automatic binning
Correlation Heatmaps | Find variable relationships | Color-coded correlation matrix
Scatter Plots | Explore relationships | Trend lines, color grouping, interactive zoom
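
For readers who want to see what a correlation heatmap encodes, the following sketch computes a Pearson correlation matrix by hand. It assumes Pearson correlation is the measure in use, which the README does not state; `pearson`, `correlation_matrix`, and the column names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical columns: sales rise with spend, returns fall with it.
data = {
    "spend": [1, 2, 3, 4],
    "sales": [2, 4, 6, 8],
    "returns": [8, 6, 4, 2],
}
matrix = correlation_matrix(data)
print(matrix["spend"]["sales"], matrix["spend"]["returns"])
```

Each cell of the heatmap corresponds to one entry of `matrix`; values near 1 or -1 indicate a strong linear relationship.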

How to Use:

  1. Select your desired visualization type
  2. Choose columns from dropdown menus
  3. Click generate to create interactive plots
  4. Hover over plots for detailed information

Advanced Analysis Tab

Purpose: Sophisticated statistical analysis and machine learning

Analysis Options:

Analysis | Description | When to Use
Clustering | Group similar data points using K-means | Finding customer segments, data patterns
PCA | Reduce dimensionality, find principal components | High-dimensional data, feature reduction
Outlier Detection | Identify anomalous data points | Data quality, fraud detection
Group Comparison | Statistical testing between categories | A/B testing, group differences

Configuration Options:

  • Clustering: Choose number of clusters (2-10)
  • PCA: Select number of components to extract
  • Outlier Detection: Automatic using IQR and Z-score methods
  • Group Comparison: Automatic test selection (t-test, ANOVA, Kruskal-Wallis)
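
The IQR and Z-score rules named above can be sketched with the standard library. This is an illustration rather than the code in analysis_modules.py: the function names, the 1.5 multiplier, and the threshold of 3 are conventional defaults assumed here.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the usual convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values identical: nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical measurements with one obviously extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(values))
print(zscore_outliers(values))
print(zscore_outliers(values, threshold=2.5))
```

On this nine-point sample the IQR rule flags 95, while the Z-score rule at a threshold of 3 flags nothing, because the extreme value inflates the standard deviation it is measured against. Lowering the threshold to 2.5 flags it. This is one reason to report both methods on small datasets.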

Data Quality Tab

Purpose: Assess and improve data quality

Quality Report Features:

  • Overview Metrics: Basic dataset information
  • Missing Data Analysis: Identify and quantify missing values
  • Column Analysis: Detailed per-column quality assessment

Data Cleaning Options:

Option | Description | Effect
Remove Duplicates | Eliminate duplicate rows | Reduces dataset size, improves quality
Auto-convert Types | Smart data type conversion | Better analysis performance
Handle Missing Values | Various strategies for null values | Choose: auto, drop, fill, or none

Missing Value Strategies:

  • Auto: Intelligent handling based on data type
  • Drop: Remove rows/columns with missing values
  • Fill: Replace with mean/median/mode
  • None: Keep data as-is
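
A minimal sketch of the four strategies on plain dict rows follows. It is not the project's `DataProcessor`. `handle_missing` is a hypothetical name, "drop" removes rows only, and "auto" and "fill" share one assumed rule (median for numeric columns, mode otherwise).

```python
import statistics


def is_missing(value):
    """Assumption: None and the empty string both mean 'missing'."""
    return value is None or value == ""


def handle_missing(rows, strategy="auto"):
    """Return new rows with missing values handled; the input is not modified."""
    if strategy == "none":
        return [dict(row) for row in rows]
    if strategy == "drop":
        return [dict(row) for row in rows if not any(is_missing(v) for v in row.values())]
    if strategy not in ("auto", "fill"):
        raise ValueError(f"unknown strategy: {strategy}")

    columns = list(rows[0].keys()) if rows else []
    fill_values = {}
    for col in columns:
        present = [row[col] for row in rows if not is_missing(row[col])]
        if not present:
            fill_values[col] = None  # nothing to infer a fill value from
        elif all(isinstance(v, (int, float)) for v in present):
            fill_values[col] = statistics.median(present)  # numeric column
        else:
            fill_values[col] = statistics.mode(present)  # categorical column
    return [
        {col: fill_values[col] if is_missing(row[col]) else row[col] for col in columns}
        for row in rows
    ]


# Hypothetical rows with one incomplete record.
rows = [
    {"region": "North", "units": 10},
    {"region": None, "units": None},
    {"region": "North", "units": 30},
    {"region": "South", "units": 20},
]
filled = handle_missing(rows, "fill")
dropped = handle_missing(rows, "drop")
print(filled[1], len(dropped))
```

With these rows, filling replaces the incomplete record with the most common region and the median unit count, while dropping removes that record entirely.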

Project Structure

Data-Analysis-AI-Agent/
├── app.py                      # Main Streamlit web application
├── run.py                      # Startup script with environment checks
├── requirements.txt            # Python dependencies and versions
├── .env.template               # Environment variables template
├── README.md                   # Project documentation (this file)
│
├── src/                        # Core application modules
│   ├── data_analysis_agent.py  # Main AI agent with GPT-4 integration
│   ├── data_processor.py       # Data loading, cleaning, and quality assessment
│   └── analysis_modules.py     # Statistical analysis and ML algorithms
│
├── config/                     # Configuration management
│   └── config.py               # Environment and application settings
│
├── data/                       # Sample datasets and user uploads
│   └── sample_sales_data.csv   # Example dataset for testing
│
└── tests/                      # Unit tests and test utilities
    └── (Coming soon)           # Test files for quality assurance

Key Files Description

File/Directory | Purpose | Key Features
app.py | Main web interface | Streamlit UI, tabs, file upload, visualization
run.py | Application launcher | Environment validation, dependency checks
data_analysis_agent.py | Core AI logic | GPT-4 integration, natural language processing
data_processor.py | Data handling | File loading, cleaning, quality reports
analysis_modules.py | Analytics engine | Statistics, ML algorithms, visualizations
config.py | Configuration | Environment variables, application settings

API Reference

DataAnalysisAgent Class

Main class for AI-powered data analysis:

from src.data_analysis_agent import DataAnalysisAgent

# Initialize agent
agent = DataAnalysisAgent(api_key="your_openai_key")

# Load data
df = agent.load_data("path/to/your/data.csv")

# Analyze with natural language
result = agent.analyze_with_gpt4("What patterns do you see in this data?")

# Generate code suggestions
code = agent.generate_code_suggestion("Create a correlation matrix")

Analysis Modules

Statistical analysis utilities:

from src.analysis_modules import StatisticalAnalyzer, VisualizationGenerator

# Descriptive statistics
stats = StatisticalAnalyzer.descriptive_statistics(df)

# Create visualizations
fig = VisualizationGenerator.create_correlation_heatmap(df)

Data Processing

Data cleaning and preprocessing:

from src.data_processor import DataProcessor

# Clean data automatically
cleaned_df = DataProcessor.clean_data(df)

# Generate quality report
report = DataProcessor.get_data_quality_report(df)

Example Queries

Here are categorized examples of natural language queries you can use with the AI agent:

General Analysis

Query | Expected Insight
"Give me an overview of this dataset" | Basic statistics, data types, structure summary
"What are the key insights from this data?" | Main patterns, trends, notable findings
"Summarize the main trends and patterns" | Statistical trends, correlations, distributions
"What story does this data tell?" | High-level narrative and business insights
"What are the most important variables?" | Feature importance and impact analysis

Specific Analysis

Query | Analysis Type
"Which product category has the highest average sales?" | Categorical comparison
"Is there a correlation between marketing spend and sales?" | Correlation analysis
"Show me sales trends over time" | Time series analysis
"Are there any outliers in the sales data?" | Anomaly detection
"What factors predict customer churn?" | Predictive analysis
"How seasonal is this business?" | Seasonal pattern analysis

Comparative Analysis

Query | Comparison Type
"Compare sales performance across different regions" | Geographic analysis
"How do enterprise and consumer segments differ?" | Segment comparison
"Which month had the best sales performance?" | Temporal comparison
"Compare customer behavior between age groups" | Demographic analysis
"What's the difference between top and bottom performers?" | Performance analysis

Advanced Queries

Query | Advanced Feature
"Generate code to create a predictive model" | Code generation
"Create a visualization showing the relationship between X and Y" | Custom visualization
"Find clusters in customer data and describe them" | Machine learning analysis
"Perform statistical significance testing on these groups" | Statistical testing
"Identify the most important features for predicting sales" | Feature analysis

Supported Data Formats

Format | Extensions | Description
CSV | .csv | Comma-separated values
Excel | .xlsx, .xls | Microsoft Excel formats
JSON | .json | JavaScript Object Notation
Parquet | .parquet | Apache Parquet format

Format Details

Format | Max File Size | Best Use Case | Loading Speed
CSV | 100 MB | Simple datasets, universal compatibility | Fast
Excel | 100 MB | Business reports, formatted data | Medium
JSON | 100 MB | Nested/hierarchical data, web APIs | Medium
Parquet | 100 MB | Large datasets, analytics workloads | Very Fast

Requirements

System Requirements

Requirement | Minimum | Purpose
Python | 3.8+ | Core runtime environment
Memory | 4 GB RAM | Data processing and AI analysis
Storage | 1 GB | Application and dependencies
Internet | Stable connection | OpenAI API calls

API Requirements

  • OpenAI API Key: Required for GPT-4 analysis features
  • API Credits: Pay-per-use pricing for analysis queries
  • Rate Limits: Standard OpenAI API rate limits apply

Configuration

The following environment variables can be configured in your .env file:

# OpenAI Settings
OPENAI_API_KEY=your_api_key
OPENAI_MODEL=gpt-4
MAX_TOKENS=2000

# Streamlit Settings
STREAMLIT_SERVER_PORT=8501
STREAMLIT_SERVER_ADDRESS=localhost

# Data Settings
MAX_FILE_SIZE_MB=100
SUPPORTED_FORMATS=csv,xlsx,xls,json,parquet

# Analysis Settings
DEFAULT_CORRELATION_THRESHOLD=0.7
DEFAULT_OUTLIER_METHOD=iqr
MAX_CLUSTERS=10
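
One way an application could read these variables is sketched below. This is not the project's config/config.py: `Settings` and `load_settings` are hypothetical names, and the fallback values simply mirror the listing above. The sketch reads variables already present in the environment; parsing the .env file itself is left to whatever loader the project uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    openai_model: str
    max_tokens: int
    server_port: int
    server_address: str
    max_file_size_mb: int
    supported_formats: tuple
    correlation_threshold: float
    outlier_method: str
    max_clusters: int


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (os.environ by default)."""
    env = os.environ if env is None else env
    formats = env.get("SUPPORTED_FORMATS", "csv,xlsx,xls,json,parquet")
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY", ""),
        openai_model=env.get("OPENAI_MODEL", "gpt-4"),
        max_tokens=int(env.get("MAX_TOKENS", "2000")),
        server_port=int(env.get("STREAMLIT_SERVER_PORT", "8501")),
        server_address=env.get("STREAMLIT_SERVER_ADDRESS", "localhost"),
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "100")),
        # Normalise "CSV, json" style values to lowercase names without spaces.
        supported_formats=tuple(f.strip().lower() for f in formats.split(",") if f.strip()),
        correlation_threshold=float(env.get("DEFAULT_CORRELATION_THRESHOLD", "0.7")),
        outlier_method=env.get("DEFAULT_OUTLIER_METHOD", "iqr"),
        max_clusters=int(env.get("MAX_CLUSTERS", "10")),
    )


defaults = load_settings({})
custom = load_settings({"MAX_CLUSTERS": "5", "SUPPORTED_FORMATS": "CSV, json"})
print(defaults.openai_model, custom.max_clusters, custom.supported_formats)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.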

Troubleshooting

Common Issues

API Key Error

Symptoms:

  • "Please set your OpenAI API key" error message
  • "Error setting up agent" in sidebar
  • Analysis queries failing

Solutions:

  • Ensure your OpenAI API key is correctly set in the .env file
  • Verify the API key starts with sk- and is complete
  • Check that you have sufficient API credits in your OpenAI account
  • Test your API key with a minimal request, following OpenAI's API documentation
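
The format checks in this list can be automated before the application starts. The sketch below is hypothetical (`api_key_problems` is not part of the project) and checks only the shape of the key. It cannot tell whether the key is valid or has credits.

```python
# Placeholder value shown in the Configuration step of this README.
TEMPLATE_PLACEHOLDER = "sk-your_actual_openai_api_key_here"


def api_key_problems(key):
    """Return a list of human-readable problems with an API key string."""
    if key is None or not key.strip():
        return ["OPENAI_API_KEY is empty or not set"]
    problems = []
    if key != key.strip():
        problems.append("key has leading or trailing whitespace")
    stripped = key.strip()
    if not stripped.startswith("sk-"):
        problems.append("key does not start with 'sk-'")
    if stripped == TEMPLATE_PLACEHOLDER:
        problems.append("key is still the placeholder from .env.template")
    return problems


print(api_key_problems(None))
print(api_key_problems(" abc123 "))
print(api_key_problems("sk-example"))
```

An empty result means the key looks plausible. A real request is still needed to confirm that it works.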

File Upload Error

Symptoms:

  • "Error loading data" message
  • File upload fails silently
  • "Unsupported file format" error

Solutions:

  • Verify your file format is supported (CSV, Excel, JSON, Parquet)
  • Check file size doesn't exceed limit (default: 100MB)
  • Ensure the file is not corrupted or password-protected
  • Try uploading a different file to isolate the issue
  • Check file encoding (UTF-8 recommended for CSV files)
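
The first two checks can be run on a file before uploading it. This sketch is an assumption about how such a check might look, not the application's own validation. `validate_upload` is a hypothetical helper, and the extension list and 100 MB default are taken from this README.

```python
import os

# Extensions listed under Supported Data Formats.
SUPPORTED_EXTENSIONS = {".csv", ".xlsx", ".xls", ".json", ".parquet"}


def validate_upload(path, max_size_mb=100):
    """Return None if the file looks acceptable, otherwise a short error message."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return f"Unsupported file format: {extension or '(no extension)'}"
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_size_mb:
        return f"File is {size_mb:.1f} MB, above the {max_size_mb} MB limit"
    return None
```

Calling `validate_upload("data/sample_sales_data.csv")` from the project root should return `None` if the bundled sample file is present. Corruption, password protection, and encoding problems are outside what this check can detect.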

Analysis Errors

Symptoms:

  • "Analysis failed" error message
  • Blank or incomplete analysis results
  • Timeout errors during analysis

Solutions:

  • Make sure your data has appropriate column types
  • Check for sufficient data points for the requested analysis
  • Verify stable internet connection for API calls
  • Try simpler queries first to test the system
  • Clean your data using the Data Quality tab before analysis

Installation Issues

Symptoms:

  • Package installation failures
  • Import errors when running the application
  • Version conflicts

Solutions:

  • Use a virtual environment to avoid conflicts
  • Ensure Python 3.8+ is installed
  • Update pip: pip install --upgrade pip
  • Try installing packages individually if batch install fails
  • Check for system-specific requirements (e.g., Visual C++ on Windows)

Performance Tips

Tip | Benefit | Implementation
Use smaller datasets | Faster processing | Sample large datasets before upload
Clean data first | Better analysis quality | Use Data Quality tab before analysis
Limit clusters | Avoid timeouts | Use 2-5 clusters for large datasets
Sample large files | Reduce memory usage | Use representative subsets of data
Cache results | Faster re-analysis | Save analysis history for reference
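
Sampling a large CSV before upload can be done without loading the whole file into memory. The sketch below uses reservoir sampling (Algorithm R), which is one common choice rather than something the project prescribes. `reservoir_sample` and `sample_csv` are hypothetical names.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Uniformly sample up to k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(items):
        if index < k:
            sample.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample


def sample_csv(src_path, dst_path, k, seed=None):
    """Write the header plus k randomly chosen data lines of src_path to dst_path.

    Assumes one record per line; CSV files with line breaks inside quoted
    fields need a csv.reader-based variant instead.
    """
    with open(src_path, newline="", encoding="utf-8") as src:
        header = src.readline()
        lines = reservoir_sample(src, k, seed)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        dst.write(header)
        dst.writelines(lines)


picked = reservoir_sample(range(1000), 10, seed=1)
print(len(picked))
```

The sampled rows are not kept in their original order, so sort them afterwards if the order matters, for example for time series data.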

Need More Help?

If you're still experiencing issues:

  1. Check the example queries section
  2. Review the usage guide for detailed instructions
  3. Verify your configuration settings
  4. Ensure all requirements are met

Contributing

We welcome contributions from the community! Here's how you can help make the Data Analysis AI Agent even better.

Getting Started

  1. Fork the repository on GitHub
  2. Clone your fork locally
  3. Create a feature branch: git checkout -b feature/amazing-feature
  4. Make your changes and test them
  5. Commit your changes: git commit -m 'Add amazing feature'
  6. Push to the branch: git push origin feature/amazing-feature
  7. Submit a pull request with a clear description

Types of Contributions

Type | Examples | Impact
Bug Fixes | Fix calculation errors, UI issues | High
New Features | Additional analysis methods, visualizations | High
Documentation | README improvements, code comments | Medium
UI/UX | Better interface design, usability | Medium
Code Quality | Refactoring, optimization | Medium
Testing | Unit tests, integration tests | Medium

Development Guidelines

  • Code Style: Follow PEP 8 for Python code
  • Testing: Add tests for new functionality
  • Documentation: Update relevant documentation
  • Commits: Use clear, descriptive commit messages
  • Features: Ensure new features are user-friendly

Reporting Issues

Found a bug? Please include:

  • Description: Clear description of the issue
  • Steps to reproduce: Detailed reproduction steps
  • Environment: OS, Python version, dependencies
  • Sample data: If possible, provide sample data (anonymized)
  • Screenshots: Visual evidence of the issue

License

MIT License

This project is licensed under the MIT License - see the details below:

MIT License

Copyright (c) 2024 Data Analysis AI Agent

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

What This Means

  • Use: Free to use for personal and commercial projects
  • Modify: Free to modify and adapt the code
  • Distribute: Free to distribute original or modified versions
  • Private Use: Free to use privately without restrictions
  • Attribution: Must include original license in distributions
  • No Warranty: Software provided "as-is" without warranties

Support

Need Help? We're Here to Support You!

Quick Help

For immediate assistance, try these resources in order:

  1. Troubleshooting - Common issues and solutions
  2. Usage Guide - Detailed feature explanations
  3. Example Queries - Sample questions to try
  4. Configuration - Setup and customization

Knowledge Base

Resource | What You'll Find | Best For
README.md | Complete project documentation | General understanding
Code Comments | Detailed implementation notes | Development questions
Example Data | Sample dataset for testing | Learning the interface
.env.template | Configuration options | Setup assistance

Still Need Help?

If you can't find what you need:

  • Create an Issue: Report bugs or request features
  • Ask Questions: Get help from the community
  • Check Documentation: Review inline code documentation
  • Contribute: Help improve the project for everyone

Important Notes

  • API Keys: We cannot provide OpenAI API keys - get yours from OpenAI
  • API Costs: You are responsible for OpenAI API usage costs
  • Data Privacy: Your data is processed locally; the only external service it is sent to is OpenAI's API
  • Updates: Check back regularly for new features and improvements

Ready to Explore Your Data?

Happy analyzing!

Transform your data into insights with the power of AI


Made with ❤️ by the Data Analysis AI Agent team

About

An intelligent data analysis tool powered by GPT-4 that provides natural language querying, automated insights, and interactive visualizations for your datasets.
