Transform your data exploration with natural language queries, automated insights, and interactive visualizations
- Features
- Quick Start
- Usage Guide
- Project Structure
- API Reference
- Example Queries
- Supported Data Formats
- Configuration
- Troubleshooting
- Contributing
- License
- Support
| Feature | Description |
|---|---|
| Natural Language Queries | Ask questions about your data in plain English |
| GPT-4 Integration | Leverage advanced AI for intelligent data insights |
| Code Generation | Get Python code suggestions for custom analysis |
| Analysis History | Track your analysis queries and results |
| Analysis Type | Capabilities |
|---|---|
| Descriptive Statistics | Automatic statistical summaries and data profiling |
| Correlation Analysis | Identify relationships between variables with heatmaps |
| Outlier Detection | Spot anomalies using IQR and Z-score methods |
| Clustering | K-means clustering with interactive visualizations |
| PCA | Principal Component Analysis for dimensionality reduction |
| Group Comparisons | Statistical testing (t-test, ANOVA, Kruskal-Wallis) |
| Visualization | Description |
|---|---|
| Distribution Plots | Histograms, bar charts, and density plots |
| Correlation Heatmaps | Visual correlation matrices with color coding |
| Scatter Plots | Relationship analysis with trend lines and grouping |
| Time Series | Temporal trend analysis and seasonal patterns |
| Custom Charts | Plotly-powered interactive and responsive visualizations |
| Feature | Supported Formats |
|---|---|
| Multi-format Support | CSV, Excel (.xlsx, .xls), JSON, Parquet |
| Data Cleaning | Automated cleaning, missing value handling, duplicate removal |
| Type Detection | Smart data type inference and conversion |
| Quality Reports | Comprehensive data quality assessment and profiling |
Before you begin, ensure you have:
- Python 3.8+ installed on your system
- An OpenAI API key (Get one here)
- Internet connection for AI analysis features
```bash
# Clone the repository
git clone <repository-url>
cd Data-Analysis-AI-Agent

# Create a virtual environment (recommended)
python -m venv data-analysis-env

# Activate virtual environment
# On Windows:
data-analysis-env\Scripts\activate
# On macOS/Linux:
source data-analysis-env/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Alternatively, without git:

```bash
# Download and extract the project
cd Data-Analysis-AI-Agent

# Install dependencies
pip install -r requirements.txt
```

Next, set up your environment file:

```bash
# Copy the environment template
cp .env.template .env
```

Edit the `.env` file and add your OpenAI API key:
```bash
# Required: Your OpenAI API key
OPENAI_API_KEY=sk-your_actual_openai_api_key_here

# Optional: Customize other settings
STREAMLIT_SERVER_PORT=8501
MAX_FILE_SIZE_MB=100
```
⚠️ **Important:** Never commit your `.env` file to version control. Keep your API key secure!
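Under the hood, values like these are read from the environment at startup (commonly via a dotenv-style loader; this project's exact mechanism lives in `config/config.py`). As a rough sketch of what that loading amounts to, assuming plain `KEY=VALUE` lines:

```python
import os

def load_dotenv_minimal(path=".env"):
    """Load simple KEY=VALUE lines into os.environ.

    A simplified stand-in for a real dotenv loader: skips blank lines
    and comments, ignores quoting/escaping rules, and never overwrites
    variables that are already set.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Usage (after creating your .env):
# load_dotenv_minimal()
# port = int(os.environ.get("STREAMLIT_SERVER_PORT", "8501"))
```

Real loaders such as python-dotenv handle quoting, export prefixes, and variable expansion; this sketch only shows the core idea.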
```bash
# Run the setup script to verify everything is working
python run.py
```

Or launch the app directly:

```bash
streamlit run app.py
```

The web interface will automatically open at `http://localhost:8501`.
If it doesn't open automatically, copy and paste the URL into your browser.
- API Key Verification: The sidebar will show "✅ AI Agent ready!" if your API key is configured correctly
- Upload Sample Data: Try uploading the included `data/sample_sales_data.csv` file
- Test Analysis: Ask a simple question like "What are the main patterns in this data?"

Congratulations! Your Data Analysis AI Agent is now running. Start exploring your data with natural language queries!
| Step | Action | Description |
|---|---|---|
| 1 | Configure API Key | Enter your OpenAI API key in the sidebar |
| 2 | Upload Data | Use the file uploader to load your dataset |
| 3 | Start Analyzing | Use any of the analysis tabs to explore your data |
Purpose: Natural language querying and AI-powered insights
- Natural Language Interface: Type questions in plain English
- GPT-4 Analysis: Get intelligent insights and recommendations
- Code Generation: Receive Python code suggestions
- Analysis History: Track your previous queries and results
- "What are the main trends in this dataset?"
- "Which variables are most correlated?"
- "Are there any unusual patterns or outliers?"
- "Compare sales performance across regions"
- "Show me seasonal patterns in the data"
- "What insights can you provide about customer behavior?"
Purpose: Instant dataset overview and basic statistics
- Dataset Overview: Rows, columns, missing values, duplicates
- Data Preview: First 10 rows of your dataset
- Column Information: Data types, null counts, unique values
- Statistical Summary: Mean, median, std dev for numeric columns
| Metric | Description |
|---|---|
| Total Rows | Number of records in your dataset |
| Total Columns | Number of variables/features |
| Missing Values | Count of null/empty values |
| Duplicates | Number of duplicate rows |
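For intuition, the four overview metrics reduce to simple counting. Here is a pure-Python sketch over a list of row dicts (the sample `rows` below are invented for illustration; the app itself presumably computes these with pandas):

```python
from collections import Counter

def dataset_overview(rows):
    """Compute basic overview metrics for a list of row dicts."""
    columns = sorted({key for row in rows for key in row})
    missing = sum(1 for row in rows for col in columns if row.get(col) is None)
    # Duplicates: extra copies of rows whose full contents repeat
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "total_rows": len(rows),
        "total_columns": len(columns),
        "missing_values": missing,
        "duplicates": duplicates,
    }

rows = [
    {"region": "North", "sales": 100},
    {"region": "North", "sales": 100},   # duplicate row
    {"region": "South", "sales": None},  # missing value
]
print(dataset_overview(rows))
```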
Purpose: Create interactive charts and visual analysis
| Visualization | Use Case | Features |
|---|---|---|
| Distribution Plots | Understand data distribution | Histograms, bar charts, automatic binning |
| Correlation Heatmaps | Find variable relationships | Color-coded correlation matrix |
| Scatter Plots | Explore relationships | Trend lines, color grouping, interactive zoom |
- Select your desired visualization type
- Choose columns from dropdown menus
- Click generate to create interactive plots
- Hover over plots for detailed information
Purpose: Sophisticated statistical analysis and machine learning
| Analysis | Description | When to Use |
|---|---|---|
| Clustering | Group similar data points using K-means | Finding customer segments, data patterns |
| PCA | Reduce dimensionality, find principal components | High-dimensional data, feature reduction |
| Outlier Detection | Identify anomalous data points | Data quality, fraud detection |
| Group Comparison | Statistical testing between categories | A/B testing, group differences |
- Clustering: Choose number of clusters (2-10)
- PCA: Select number of components to extract
- Outlier Detection: Automatic using IQR and Z-score methods
- Group Comparison: Automatic test selection (t-test, ANOVA, Kruskal-Wallis)
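To make the outlier options concrete: the IQR method flags values outside `[Q1 - 1.5·IQR, Q3 + 1.5·IQR]`, while the Z-score method flags values more than a set number of standard deviations from the mean. A stdlib-only sketch (the app's actual implementation lives in `src/analysis_modules.py` and may differ):

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 is an obvious anomaly
print(iqr_outliers(data))  # → [95]
```

Note the two methods can disagree on small samples: a single extreme value inflates the standard deviation, so the Z-score test is often the less sensitive of the two.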
Purpose: Assess and improve data quality
- Overview Metrics: Basic dataset information
- Missing Data Analysis: Identify and quantify missing values
- Column Analysis: Detailed per-column quality assessment
| Option | Description | Effect |
|---|---|---|
| Remove Duplicates | Eliminate duplicate rows | Reduces dataset size, improves quality |
| Auto-convert Types | Smart data type conversion | Better analysis performance |
| Handle Missing Values | Various strategies for null values | Choose: auto, drop, fill, or none |
- Auto: Intelligent handling based on data type
- Drop: Remove rows/columns with missing values
- Fill: Replace with mean/median/mode
- None: Keep data as-is
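The fill strategies above are easy to picture on a single numeric column. A minimal sketch using only the stdlib `statistics` module (illustrative only; not the app's actual `DataProcessor` code):

```python
import statistics

def fill_missing(values, strategy="mean"):
    """Handle None entries in a numeric column using the chosen strategy."""
    present = [v for v in values if v is not None]
    if strategy == "drop":
        return present
    if strategy == "none":
        return list(values)
    if strategy == "mean":
        fill = statistics.mean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

col = [10, None, 20, 30]
print(fill_missing(col, "mean"))  # fills the None with the column mean (20)
print(fill_missing(col, "drop"))  # drops the missing entry instead
```

The "auto" option in the table would sit above a helper like this, choosing a strategy per column based on its type and missing-value ratio.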
```
Data-Analysis-AI-Agent/
├── app.py                       # Main Streamlit web application
├── run.py                       # Startup script with environment checks
├── requirements.txt             # Python dependencies and versions
├── .env.template                # Environment variables template
├── README.md                    # Project documentation (this file)
│
├── src/                         # Core application modules
│   ├── data_analysis_agent.py   # Main AI agent with GPT-4 integration
│   ├── data_processor.py        # Data loading, cleaning, and quality assessment
│   └── analysis_modules.py      # Statistical analysis and ML algorithms
│
├── config/                      # Configuration management
│   └── config.py                # Environment and application settings
│
├── data/                        # Sample datasets and user uploads
│   └── sample_sales_data.csv    # Example dataset for testing
│
└── tests/                       # Unit tests and test utilities
    └── (Coming soon)            # Test files for quality assurance
```
| File/Directory | Purpose | Key Features |
|---|---|---|
| `app.py` | Main web interface | Streamlit UI, tabs, file upload, visualization |
| `run.py` | Application launcher | Environment validation, dependency checks |
| `data_analysis_agent.py` | Core AI logic | GPT-4 integration, natural language processing |
| `data_processor.py` | Data handling | File loading, cleaning, quality reports |
| `analysis_modules.py` | Analytics engine | Statistics, ML algorithms, visualizations |
| `config.py` | Configuration | Environment variables, application settings |
Main class for AI-powered data analysis:

```python
from src.data_analysis_agent import DataAnalysisAgent

# Initialize agent
agent = DataAnalysisAgent(api_key="your_openai_key")

# Load data
df = agent.load_data("path/to/your/data.csv")

# Analyze with natural language
result = agent.analyze_with_gpt4("What patterns do you see in this data?")

# Generate code suggestions
code = agent.generate_code_suggestion("Create a correlation matrix")
```

Statistical analysis utilities:
```python
from src.analysis_modules import StatisticalAnalyzer, VisualizationGenerator

# Descriptive statistics
stats = StatisticalAnalyzer.descriptive_statistics(df)

# Create visualizations
fig = VisualizationGenerator.create_correlation_heatmap(df)
```

Data cleaning and preprocessing:
```python
from src.data_processor import DataProcessor

# Clean data automatically
cleaned_df = DataProcessor.clean_data(df)

# Generate quality report
report = DataProcessor.get_data_quality_report(df)
```

Here are categorized examples of natural language queries you can use with the AI agent:
| Query | Expected Insight |
|---|---|
| "Give me an overview of this dataset" | Basic statistics, data types, structure summary |
| "What are the key insights from this data?" | Main patterns, trends, notable findings |
| "Summarize the main trends and patterns" | Statistical trends, correlations, distributions |
| "What story does this data tell?" | High-level narrative and business insights |
| "What are the most important variables?" | Feature importance and impact analysis |
| Query | Analysis Type |
|---|---|
| "Which product category has the highest average sales?" | Categorical comparison |
| "Is there a correlation between marketing spend and sales?" | Correlation analysis |
| "Show me sales trends over time" | Time series analysis |
| "Are there any outliers in the sales data?" | Anomaly detection |
| "What factors predict customer churn?" | Predictive analysis |
| "How seasonal is this business?" | Seasonal pattern analysis |
| Query | Comparison Type |
|---|---|
| "Compare sales performance across different regions" | Geographic analysis |
| "How do enterprise and consumer segments differ?" | Segment comparison |
| "Which month had the best sales performance?" | Temporal comparison |
| "Compare customer behavior between age groups" | Demographic analysis |
| "What's the difference between top and bottom performers?" | Performance analysis |
| Query | Advanced Feature |
|---|---|
| "Generate code to create a predictive model" | Code generation |
| "Create a visualization showing the relationship between X and Y" | Custom visualization |
| "Find clusters in customer data and describe them" | Machine learning analysis |
| "Perform statistical significance testing on these groups" | Statistical testing |
| "Identify the most important features for predicting sales" | Feature analysis |
| Format | Extensions | Description |
|---|---|---|
| CSV Files | `.csv` | Comma-separated values |
| Excel Files | `.xlsx`, `.xls` | Microsoft Excel formats |
| JSON Files | `.json` | JavaScript Object Notation |
| Parquet Files | `.parquet` | Apache Parquet format |
| Format | Max File Size | Best Use Case | Loading Speed |
|---|---|---|---|
| CSV | 100MB | Simple datasets, universal compatibility | Fast |
| Excel | 100MB | Business reports, formatted data | Medium |
| JSON | 100MB | Nested/hierarchical data, web APIs | Medium |
| Parquet | 100MB | Large datasets, analytics workloads | Very Fast |
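Loading by extension is a simple dispatch. The sketch below handles CSV and JSON with the stdlib alone; Excel and Parquet need third-party readers (e.g. openpyxl, pyarrow), so this is illustrative rather than the app's actual loader:

```python
import csv
import json
from pathlib import Path

def load_rows(path):
    """Load tabular data by file extension (stdlib-only sketch)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    if suffix == ".json":
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    if suffix in {".xlsx", ".xls", ".parquet"}:
        # In practice: pandas.read_excel / pandas.read_parquet
        raise NotImplementedError(f"install a third-party reader for {suffix}")
    raise ValueError(f"unsupported format: {suffix}")
```

Note that `csv.DictReader` returns every field as a string, which is one reason a type-inference pass (like the app's "Auto-convert Types" option) is useful after loading.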
| Requirement | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Core runtime environment |
| Memory | 4GB RAM+ | Data processing and AI analysis |
| Storage | 1GB+ | Application and dependencies |
| Internet | Stable connection | OpenAI API calls |
- OpenAI API Key: Required for GPT-4 analysis features
- API Credits: Pay-per-use pricing for analysis queries
- Rate Limits: Standard OpenAI API rate limits apply
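When you do hit rate limits, the standard remedy is retrying with exponential backoff. A generic sketch (the `flaky` function below is a stand-in for a real API call, not part of this project):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a flaky call with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            # delays of base, 2*base, 4*base, ... plus jitter
            # so concurrent clients don't retry in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

# Hypothetical flaky call that succeeds on the third attempt
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # → ok
```

The OpenAI client libraries ship their own retry handling; a wrapper like this is only needed if you are making raw HTTP calls or want custom retry policy.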
The following environment variables can be configured in your `.env` file:
```bash
# OpenAI Settings
OPENAI_API_KEY=your_api_key
OPENAI_MODEL=gpt-4
MAX_TOKENS=2000

# Streamlit Settings
STREAMLIT_SERVER_PORT=8501
STREAMLIT_SERVER_ADDRESS=localhost

# Data Settings
MAX_FILE_SIZE_MB=100
SUPPORTED_FORMATS=csv,xlsx,xls,json,parquet

# Analysis Settings
DEFAULT_CORRELATION_THRESHOLD=0.7
DEFAULT_OUTLIER_METHOD=iqr
MAX_CLUSTERS=10
```

**API Key Error**
Symptoms:
- "Please set your OpenAI API key" error message
- "Error setting up agent" in sidebar
- Analysis queries failing
Solutions:
- Ensure your OpenAI API key is correctly set in the `.env` file
- Verify the API key starts with `sk-` and is complete
- Check that you have sufficient API credits in your OpenAI account
- Test your API key using OpenAI's API documentation
**File Upload Error**
Symptoms:
- "Error loading data" message
- File upload fails silently
- "Unsupported file format" error
Solutions:
- Verify your file format is supported (CSV, Excel, JSON, Parquet)
- Check that the file size doesn't exceed the limit (default: 100MB)
- Ensure the file is not corrupted or password-protected
- Try uploading a different file to isolate the issue
- Check the file encoding (UTF-8 recommended for CSV files)
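If you suspect an encoding problem with a CSV, a quick stdlib check confirms whether the file decodes cleanly as UTF-8 before you upload it:

```python
def check_utf8(path):
    """Return True if the file decodes cleanly as UTF-8."""
    try:
        with open(path, encoding="utf-8") as f:
            for _ in f:  # stream line by line; no need to hold the file in memory
                pass
        return True
    except UnicodeDecodeError:
        return False

# Usage:
# if not check_utf8("data.csv"):
#     print("Re-save the file as UTF-8 (most editors and spreadsheet tools can)")
```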
**Analysis Errors**
Symptoms:
- "Analysis failed" error message
- Blank or incomplete analysis results
- Timeout errors during analysis
Solutions:
- โ Make sure your data has appropriate column types
- โ Check for sufficient data points for the requested analysis
- โ Verify stable internet connection for API calls
- โ Try simpler queries first to test the system
- โ Clean your data using the Data Quality tab before analysis
**Installation Issues**
Symptoms:
- Package installation failures
- Import errors when running the application
- Version conflicts
Solutions:
- Use a virtual environment to avoid conflicts
- Ensure Python 3.8+ is installed
- Update pip: `pip install --upgrade pip`
- Try installing packages individually if the batch install fails
- Check for system-specific requirements (e.g., Visual C++ on Windows)
| Tip | Benefit | Implementation |
|---|---|---|
| Use smaller datasets | Faster processing | Sample large datasets before upload |
| Clean data first | Better analysis quality | Use the Data Quality tab before analysis |
| Limit clusters | Avoid timeouts | Use 2-5 clusters for large datasets |
| Sample large files | Reduce memory usage | Use representative subsets of data |
| Cache results | Faster re-analysis | Save analysis history for reference |
If you're still experiencing issues:
- Check the example queries section
- Review the usage guide for detailed instructions
- Verify your configuration settings
- Ensure all requirements are met
We welcome contributions from the community! Here's how you can help make the Data Analysis AI Agent even better.
- Fork the repository on your platform
- Clone your fork locally
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes with proper testing
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Submit a pull request with a clear description
| Type | Examples | Impact |
|---|---|---|
| Bug Fixes | Fix calculation errors, UI issues | High |
| New Features | Additional analysis methods, visualizations | High |
| Documentation | README improvements, code comments | Medium |
| UI/UX | Better interface design, usability | Medium |
| Code Quality | Refactoring, optimization | Medium |
| Testing | Unit tests, integration tests | Medium |
- Code Style: Follow PEP 8 for Python code
- Testing: Add tests for new functionality
- Documentation: Update relevant documentation
- Commits: Use clear, descriptive commit messages
- Features: Ensure new features are user-friendly
Found a bug? Please include:
- Description: Clear description of the issue
- Steps to reproduce: Detailed reproduction steps
- Environment: OS, Python version, dependencies
- Sample data: If possible, provide sample data (anonymized)
- Screenshots: Visual evidence of the issue
This project is licensed under the MIT License - see the details below:
MIT License
Copyright (c) 2024 Data Analysis AI Agent
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- Use: Free to use for personal and commercial projects
- Modify: Free to modify and adapt the code
- Distribute: Free to distribute original or modified versions
- Private Use: Free to use privately without restrictions
- Attribution: Must include the original license in distributions
- No Warranty: Software provided "as-is" without warranties
For immediate assistance, try these resources in order:
- Troubleshooting - Common issues and solutions
- Usage Guide - Detailed feature explanations
- Example Queries - Sample questions to try
- Configuration - Setup and customization
| Resource | What You'll Find | Best For |
|---|---|---|
| README.md | Complete project documentation | General understanding |
| Code Comments | Detailed implementation notes | Development questions |
| Example Data | Sample dataset for testing | Learning the interface |
| .env.template | Configuration options | Setup assistance |
If you can't find what you need:
- Create an Issue: Report bugs or request features
- Ask Questions: Get help from the community
- Check Documentation: Review inline code documentation
- Contribute: Help improve the project for everyone
- API Keys: We cannot provide OpenAI API keys - get yours from OpenAI
- API Costs: You are responsible for OpenAI API usage costs
- Data Privacy: Your data is processed locally and sent only to OpenAI's API
- Updates: Check back regularly for new features and improvements
Happy analyzing!
Transform your data into insights with the power of AI