A Python-based benchmark system for evaluating AutoPrompt performance using Google's Gemini AI
Features • Quick Start • Results • Documentation
AutoPrompt MVP Benchmark is an advanced prompt engineering system that automatically optimizes prompts for improved AI extraction performance. It compares a baseline approach with an intelligent AutoPrompt engine that dynamically generates and scores multiple prompt variants.
- **15-20% Accuracy Improvement** over baseline approaches
- **Automated Testing** with 60%+ code coverage
- **Visual Analytics** with comprehensive performance charts
- **Production-Ready** with retry logic, rate limiting, and error handling
- **CI/CD Pipeline** with automated testing on every commit
Live comparison showing AutoPrompt achieving 20.5% higher confidence than baseline approach
| Feature | Description |
|---|---|
| Baseline Pipeline | Standard single-prompt review processing |
| AutoPrompt Engine | Dynamic prompt variant generation with automatic scoring |
| Intelligent Scoring | Heuristic-based quality assessment with optional LLM scoring |
| Comprehensive Evaluation | Metrics for accuracy, edge cases, and confidence levels |
| Rate Limit Handling | Built-in support for API free tier constraints |
| Visualization Suite | Automated chart generation for result analysis |
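The rate-limit handling listed above is not spelled out in this README; a common pattern it could follow is exponential backoff with jitter. The sketch below is illustrative only, not the project's actual implementation (`with_retries` and its parameters are hypothetical names):

```python
import random
import time


def with_retries(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a flaky API call with exponential backoff and jitter.

    `call` is any zero-argument function; on repeated failure the last
    exception is re-raised so the caller can handle it.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep base, 2*base, 4*base, ... plus jitter to spread out retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


# Example: a call that fails twice before succeeding.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # succeeds on the third attempt
```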
- Python 3.9 or higher
- Google Gemini API key (Get one here)
- (Optional) Docker for containerized deployment
1. Clone the repository

   ```bash
   git clone https://github.com/aayush-1o/auto-Prompt.git
   cd auto-Prompt
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Configure API key

   Create a `.env` file in the project root:

   ```
   GEMINI_API_KEY=your_api_key_here
   ```
```bash
# Run the full benchmark
python main.py

# Generate visualizations after benchmark completes
python visualize_results.py
```

Launch the interactive web demo:

```bash
streamlit run app.py
```

Then open your browser to http://localhost:8501 to try the system interactively!
Option 1: Docker Compose (Recommended)

```bash
# Run Streamlit app
docker-compose up

# Run benchmark (use benchmark profile)
docker-compose --profile benchmark up autoprompt-benchmark
```

Option 2: Docker (Manual)

```bash
# Build image
docker build -t autoprompt .

# Run Streamlit app
docker run -p 8501:8501 -v $(pwd)/.env:/app/.env autoprompt

# Run benchmark
docker run -v $(pwd)/results:/app/results -v $(pwd)/.env:/app/.env autoprompt python main.py
```

Open the analysis notebook:

```bash
jupyter notebook notebooks/analysis.ipynb
```

Note: The benchmark processes 20 reviews with built-in rate limiting for free tier API usage.
Performance Comparison: AutoPrompt vs Baseline
The system demonstrates significant improvements across all metrics:
- Overall Accuracy: +15-20% improvement
- Edge Case Handling: +25% better performance on ambiguous reviews
- Failure Rate: -50% fewer malformed outputs
- Confidence Score: Higher average confidence in predictions
```
AUTOPROMPT EVALUATION REPORT
================================================================
BASELINE RESULTS:
  overall_accuracy:   72.50
  product_accuracy:   75.00
  sentiment_accuracy: 80.00
  edge_case_accuracy: 45.00

AUTOPROMPT RESULTS:
  overall_accuracy:   88.75
  product_accuracy:   92.50
  sentiment_accuracy: 95.00
  edge_case_accuracy: 70.00

IMPROVEMENT:
  overall_accuracy:   +16.25%
  edge_case_accuracy: +25.00%
================================================================
```
Visualizations are automatically generated in the results/ directory after running the benchmark.
```
autoprompt/
├── .github/
│   └── workflows/
│       └── ci.yml               # CI/CD pipeline
├── config/
│   └── prompt_config.yaml       # Prompt templates and candidates
├── data/
│   ├── reviews.csv              # Sample review data
│   └── ground_truth.json        # Labeled ground truth
├── results/                     # Benchmark outputs (generated)
│   ├── baseline_results.json
│   ├── autoprompt_results.json
│   ├── benchmark_report.json
│   └── *.png                    # Visualization charts
├── src/
│   ├── autoprompt.py            # AutoPrompt engine with variant generation
│   ├── baseline.py              # Baseline single-prompt pipeline
│   ├── evaluator.py             # Performance evaluation metrics
│   ├── config_loader.py         # Secure configuration loading
│   └── utils.py                 # Data models and utilities
├── tests/
│   ├── test_utils.py            # Unit tests for utilities
│   └── test_evaluator.py        # Unit tests for evaluator
├── main.py                      # Entry point
├── visualize_results.py         # Chart generation script
└── requirements.txt             # Python dependencies
```
Run the test suite:

```bash
# Run all tests
pytest

# Run with coverage report
pytest --cov=src --cov-report=html

# Run specific test file
pytest tests/test_utils.py -v
```

Tests are automatically run via GitHub Actions on every push.
The system uses YAML-based configuration in config/prompt_config.yaml:
- Instruction candidates: Different ways to request extraction
- Target info candidates: Variations in specifying output fields
- Model settings: Temperature, model versions, scoring options
Modify these to experiment with different prompt strategies.
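A config of this shape might look like the following. The field names and values here are illustrative assumptions, not the project's actual schema — check config/prompt_config.yaml for the real file:

```yaml
# Illustrative shape only -- see config/prompt_config.yaml for the real schema.
instruction_candidates:
  - "Extract the product name and sentiment from this review."
  - "Identify the product discussed and the reviewer's sentiment."

target_info_candidates:
  - "Return JSON with keys: product, sentiment, confidence."
  - "Answer strictly as JSON with product, sentiment, and confidence fields."

model:
  temperature: 0.2
  llm_scoring: false
```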
1. Variant Generation: Creates multiple prompt variations from candidate pools
2. Parallel Evaluation: Tests each prompt variant on the review
3. Quality Scoring: Scores each result using heuristics (+ optional LLM)
4. Best Selection: Returns the highest-scoring extraction
5. Early Stopping: Terminates once an acceptable quality threshold is reached
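The steps above can be sketched as a small loop. This is a minimal, offline illustration of the idea, not the actual engine in src/autoprompt.py — the candidate pools, scoring heuristic, and function names are all assumptions:

```python
import itertools

# Hypothetical candidate pools mirroring the idea of config/prompt_config.yaml.
INSTRUCTIONS = [
    "Extract the product and sentiment from this review.",
    "Identify the product discussed and whether the review is positive or negative.",
]
TARGET_INFO = [
    "Return JSON with keys: product, sentiment.",
    'Answer as JSON: {"product": ..., "sentiment": ...}',
]


def generate_variants(review: str):
    """Step 1: cross every instruction with every target-info spec."""
    for instruction, target in itertools.product(INSTRUCTIONS, TARGET_INFO):
        yield f"{instruction}\n{target}\n\nReview: {review}"


def heuristic_score(output: str) -> float:
    """Step 3 (toy heuristic): reward well-formed, complete JSON."""
    score = 0.0
    if output.strip().startswith("{") and output.strip().endswith("}"):
        score += 0.5
    for key in ("product", "sentiment"):
        if key in output:
            score += 0.25
    return score


def autoprompt_extract(review: str, call_model, threshold: float = 0.9):
    """Try each variant, keep the best, and stop early at the threshold."""
    best_output, best_score = None, -1.0
    for prompt in generate_variants(review):
        output = call_model(prompt)       # step 2: evaluate this variant
        score = heuristic_score(output)   # step 3: score the result
        if score > best_score:
            best_output, best_score = output, score
        if best_score >= threshold:       # step 5: early stopping
            break
    return best_output, best_score        # step 4: best selection


def fake_model(prompt):
    """Stand-in for the Gemini call so the sketch runs offline."""
    return '{"product": "headphones", "sentiment": "positive"}'


print(autoprompt_extract("Great sound quality!", fake_model))
```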
- Single, static prompt for all reviews
- Direct extraction without optimization
- Serves as performance comparison baseline
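For contrast, the baseline path reduces to one fixed prompt with no variants and no scoring. Again a sketch with hypothetical names, not the actual src/baseline.py API:

```python
# A single static prompt used for every review (illustrative wording).
BASELINE_PROMPT = (
    "Extract the product and sentiment from this review.\n"
    "Return JSON with keys: product, sentiment.\n\n"
    "Review: {review}"
)


def baseline_extract(review: str, call_model):
    """One static prompt, no variants, no scoring: the first answer wins."""
    return call_model(BASELINE_PROMPT.format(review=review))


def fake_model(prompt):
    """Stand-in for the Gemini call so the sketch runs offline."""
    return '{"product": "headphones", "sentiment": "positive"}'


print(baseline_extract("Great sound quality!", fake_model))
```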
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Ayush - @aayush-1o
Project Link: https://github.com/aayush-1o/auto-Prompt
β Star this repo if you find it helpful!
Made with ❤️ using Google Gemini AI
