A Python-based benchmark system for evaluating AutoPrompt performance using Google's Gemini AI
Features • Quick Start • Results • Documentation
AutoPrompt MVP Benchmark is an advanced prompt engineering system that automatically optimizes prompts for improved AI extraction performance. It compares a baseline approach with an intelligent AutoPrompt engine that dynamically generates and scores multiple prompt variants.
- **15-20% Accuracy Improvement** over baseline approaches
- **Automated Testing** with 60%+ code coverage
- **Visual Analytics** with comprehensive performance charts
- **Production-Ready** with retry logic, rate limiting, and error handling
- **CI/CD Pipeline** with automated testing on every commit
Live comparison showing AutoPrompt achieving 20.5% higher confidence than baseline approach
| Feature | Description |
|---|---|
| Baseline Pipeline | Standard single-prompt review processing |
| AutoPrompt Engine | Dynamic prompt variant generation with automatic scoring |
| Intelligent Scoring | Heuristic-based quality assessment with optional LLM scoring |
| Comprehensive Evaluation | Metrics for accuracy, edge cases, and confidence levels |
| Rate Limit Handling | Built-in support for API free tier constraints |
| Visualization Suite | Automated chart generation for result analysis |
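The rate-limit handling listed above is not spelled out in this README; a common pattern it could follow is exponential backoff with jitter. The sketch below is illustrative only, not the project's actual implementation (`with_retries` and its parameters are hypothetical names):

```python
import random
import time


def with_retries(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a flaky API call with exponential backoff and jitter.

    `call` is any zero-argument function; on repeated failure the last
    exception is re-raised so the caller can handle it.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep base, 2*base, 4*base, ... plus jitter to spread out retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


# Example: a call that fails twice before succeeding.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # succeeds on the third attempt
```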
- Python 3.9 or higher
- Google Gemini API key (Get one here)
- (Optional) Docker for containerized deployment
1. Clone the repository

   ```bash
   git clone https://github.com/aayush-1o/auto-Prompt.git
   cd auto-Prompt
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Configure API key

   Create a `.env` file in the project root:

   ```
   GEMINI_API_KEY=your_api_key_here
   ```
```bash
# Run the full benchmark
python main.py

# Generate visualizations after benchmark completes
python visualize_results.py
```

Launch the interactive web demo:

```bash
streamlit run app.py
```

Then open your browser to http://localhost:8501 to try the system interactively!
Option 1: Docker Compose (Recommended)

```bash
# Run Streamlit app
docker-compose up

# Run benchmark (use benchmark profile)
docker-compose --profile benchmark up autoprompt-benchmark
```

Option 2: Docker (Manual)

```bash
# Build image
docker build -t autoprompt .

# Run Streamlit app
docker run -p 8501:8501 -v $(pwd)/.env:/app/.env autoprompt

# Run benchmark
docker run -v $(pwd)/results:/app/results -v $(pwd)/.env:/app/.env autoprompt python main.py
```

Open the analysis notebook:

```bash
jupyter notebook notebooks/analysis.ipynb
```

Note: The benchmark processes 20 reviews with built-in rate limiting for free tier API usage.
Performance Comparison: AutoPrompt vs Baseline
The system demonstrates significant improvements across all metrics:
- Overall Accuracy: +15-20% improvement
- Edge Case Handling: +25% better performance on ambiguous reviews
- Failure Rate: -50% fewer malformed outputs
- Confidence Score: Higher average confidence in predictions
```
AUTOPROMPT EVALUATION REPORT
================================================================
BASELINE RESULTS:
  overall_accuracy:   72.50
  product_accuracy:   75.00
  sentiment_accuracy: 80.00
  edge_case_accuracy: 45.00

AUTOPROMPT RESULTS:
  overall_accuracy:   88.75
  product_accuracy:   92.50
  sentiment_accuracy: 95.00
  edge_case_accuracy: 70.00

IMPROVEMENT:
  overall_accuracy:   +16.25%
  edge_case_accuracy: +25.00%
================================================================
```
Visualizations are automatically generated in the results/ directory after running the benchmark.
```
autoprompt/
├── .github/
│   └── workflows/
│       └── ci.yml               # CI/CD pipeline
├── config/
│   └── prompt_config.yaml       # Prompt templates and candidates
├── data/
│   ├── reviews.csv              # Sample review data
│   └── ground_truth.json        # Labeled ground truth
├── results/                     # Benchmark outputs (generated)
│   ├── baseline_results.json
│   ├── autoprompt_results.json
│   ├── benchmark_report.json
│   └── *.png                    # Visualization charts
├── src/
│   ├── autoprompt.py            # AutoPrompt engine with variant generation
│   ├── baseline.py              # Baseline single-prompt pipeline
│   ├── evaluator.py             # Performance evaluation metrics
│   ├── config_loader.py         # Secure configuration loading
│   └── utils.py                 # Data models and utilities
├── tests/
│   ├── test_utils.py            # Unit tests for utilities
│   └── test_evaluator.py        # Unit tests for evaluator
├── main.py                      # Entry point
├── visualize_results.py         # Chart generation script
└── requirements.txt             # Python dependencies
```
Run the test suite:

```bash
# Run all tests
pytest

# Run with coverage report
pytest --cov=src --cov-report=html

# Run specific test file
pytest tests/test_utils.py -v
```

Tests are automatically run via GitHub Actions on every push.
The system uses YAML-based configuration in config/prompt_config.yaml:
- Instruction candidates: Different ways to request extraction
- Target info candidates: Variations in specifying output fields
- Model settings: Temperature, model versions, scoring options
Modify these to experiment with different prompt strategies.
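A config of this shape might look like the following. The field names and values here are illustrative assumptions, not the project's actual schema — check config/prompt_config.yaml for the real file:

```yaml
# Illustrative shape only -- see config/prompt_config.yaml for the real schema.
instruction_candidates:
  - "Extract the product name and sentiment from this review."
  - "Identify the product discussed and the reviewer's sentiment."

target_info_candidates:
  - "Return JSON with keys: product, sentiment, confidence."
  - "Answer strictly as JSON with product, sentiment, and confidence fields."

model:
  temperature: 0.2
  llm_scoring: false
```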
1. Variant Generation: Creates multiple prompt variations from candidate pools
2. Parallel Evaluation: Tests each prompt variant on the review
3. Quality Scoring: Scores each result using heuristics (+ optional LLM)
4. Best Selection: Returns the highest-scoring extraction
5. Early Stopping: Terminates once an acceptable quality threshold is reached
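The steps above can be sketched as a small loop. This is a minimal, offline illustration of the idea, not the actual engine in src/autoprompt.py — the candidate pools, scoring heuristic, and function names are all assumptions:

```python
import itertools

# Hypothetical candidate pools mirroring the idea of config/prompt_config.yaml.
INSTRUCTIONS = [
    "Extract the product and sentiment from this review.",
    "Identify the product discussed and whether the review is positive or negative.",
]
TARGET_INFO = [
    "Return JSON with keys: product, sentiment.",
    'Answer as JSON: {"product": ..., "sentiment": ...}',
]


def generate_variants(review: str):
    """Step 1: cross every instruction with every target-info spec."""
    for instruction, target in itertools.product(INSTRUCTIONS, TARGET_INFO):
        yield f"{instruction}\n{target}\n\nReview: {review}"


def heuristic_score(output: str) -> float:
    """Step 3 (toy heuristic): reward well-formed, complete JSON."""
    score = 0.0
    if output.strip().startswith("{") and output.strip().endswith("}"):
        score += 0.5
    for key in ("product", "sentiment"):
        if key in output:
            score += 0.25
    return score


def autoprompt_extract(review: str, call_model, threshold: float = 0.9):
    """Try each variant, keep the best, and stop early at the threshold."""
    best_output, best_score = None, -1.0
    for prompt in generate_variants(review):
        output = call_model(prompt)       # step 2: evaluate this variant
        score = heuristic_score(output)   # step 3: score the result
        if score > best_score:
            best_output, best_score = output, score
        if best_score >= threshold:       # step 5: early stopping
            break
    return best_output, best_score        # step 4: best selection


def fake_model(prompt):
    """Stand-in for the Gemini call so the sketch runs offline."""
    return '{"product": "headphones", "sentiment": "positive"}'


print(autoprompt_extract("Great sound quality!", fake_model))
```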
- Single, static prompt for all reviews
- Direct extraction without optimization
- Serves as performance comparison baseline
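For contrast, the baseline path reduces to one fixed prompt with no variants and no scoring. Again a sketch with hypothetical names, not the actual src/baseline.py API:

```python
# A single static prompt used for every review (illustrative wording).
BASELINE_PROMPT = (
    "Extract the product and sentiment from this review.\n"
    "Return JSON with keys: product, sentiment.\n\n"
    "Review: {review}"
)


def baseline_extract(review: str, call_model):
    """One static prompt, no variants, no scoring: the first answer wins."""
    return call_model(BASELINE_PROMPT.format(review=review))


def fake_model(prompt):
    """Stand-in for the Gemini call so the sketch runs offline."""
    return '{"product": "headphones", "sentiment": "positive"}'


print(baseline_extract("Great sound quality!", fake_model))
```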
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Ayush - @aayush-1o
Project Link: https://github.com/aayush-1o/auto-Prompt
β Star this repo if you find it helpful!
Made with ❤️ using Google Gemini AI
