Python benchmark system for AutoPrompt performance evaluation using Google Gemini AI for review processing and data extraction


AutoPrompt MVP Benchmark


A Python-based benchmark system for evaluating AutoPrompt performance using Google's Gemini AI

Features • Quick Start • Results • Documentation


🎯 Overview

AutoPrompt MVP Benchmark is an advanced prompt engineering system that automatically optimizes prompts for improved AI extraction performance. It compares a baseline approach with an intelligent AutoPrompt engine that dynamically generates and scores multiple prompt variants.

Key Highlights

  • 🚀 15-20% Accuracy Improvement over baseline approaches
  • 🧪 Automated Testing with 60%+ code coverage
  • 📊 Visual Analytics with comprehensive performance charts
  • ⚡ Production-Ready with retry logic, rate limiting, and error handling
  • 🔄 CI/CD Pipeline with automated testing on every commit

🎬 Interactive Demo

*[Demo GIF: live comparison showing AutoPrompt achieving 20.5% higher confidence than the baseline approach]*


✨ Features

| Feature | Description |
| --- | --- |
| Baseline Pipeline | Standard single-prompt review processing |
| AutoPrompt Engine | Dynamic prompt variant generation with automatic scoring |
| Intelligent Scoring | Heuristic-based quality assessment with optional LLM scoring |
| Comprehensive Evaluation | Metrics for accuracy, edge cases, and confidence levels |
| Rate Limit Handling | Built-in support for API free tier constraints |
| Visualization Suite | Automated chart generation for result analysis |

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • Google Gemini API key (Get one here)
  • (Optional) Docker for containerized deployment

Installation

  1. Clone the repository

    git clone https://github.com/aayush-1o/auto-Prompt.git
    cd auto-Prompt
  2. Install dependencies

    pip install -r requirements.txt
  3. Configure API key

    Create a .env file in the project root:

    GEMINI_API_KEY=your_api_key_here
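At runtime the key just needs to end up in the process environment. As an illustration, here is a minimal stdlib-only `.env` loader sketch (the project may well use the `python-dotenv` package instead; this helper and its name are assumptions):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader sketch: sets KEY=VALUE lines into os.environ.

    Illustrative only -- the project may rely on python-dotenv instead.
    Existing environment variables are not overwritten.
    """
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env file; fall back to the ambient environment

load_env()
api_key = os.environ.get("GEMINI_API_KEY")
```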

Running the Benchmark

```bash
# Run the full benchmark
python main.py

# Generate visualizations after benchmark completes
python visualize_results.py
```

🎨 Interactive Demo (Streamlit)

Launch the interactive web demo:

```bash
streamlit run app.py
```

Then open your browser to http://localhost:8501 to try the system interactively!

🐳 Docker Deployment

Option 1: Docker Compose (Recommended)

```bash
# Run Streamlit app
docker-compose up

# Run benchmark (use benchmark profile)
docker-compose --profile benchmark up autoprompt-benchmark
```

Option 2: Docker (Manual)

```bash
# Build image
docker build -t autoprompt .

# Run Streamlit app
docker run -p 8501:8501 -v $(pwd)/.env:/app/.env autoprompt

# Run benchmark
docker run -v $(pwd)/results:/app/results -v $(pwd)/.env:/app/.env autoprompt python main.py
```

📊 Jupyter Analysis

Open the analysis notebook:

```bash
jupyter notebook notebooks/analysis.ipynb
```

Note: The benchmark processes 20 reviews with built-in rate limiting for free tier API usage.
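The retry and rate-limit handling mentioned above can be pictured as exponential backoff around each API call. A sketch under that assumption (the function name and defaults are hypothetical; the actual logic lives in `src/`):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=2.0):
    """Retry a flaky API call with exponential backoff plus jitter.

    Illustrative sketch for free-tier rate limits; the project's real
    retry logic may differ in delays and which errors it retries.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Sleep 2s, 4s, 8s, ... plus up to 1s of jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```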


📊 Results

Performance Comparison: AutoPrompt vs Baseline

The system demonstrates significant improvements across all metrics:

  • Overall Accuracy: +15-20% improvement
  • Edge Case Handling: +25% better performance on ambiguous reviews
  • Failure Rate: -50% fewer malformed outputs
  • Confidence Score: Higher average confidence in predictions

Sample Output

```
🎯 AUTOPROMPT EVALUATION REPORT
================================================================
BASELINE RESULTS:
  overall_accuracy: 72.50
  product_accuracy: 75.00
  sentiment_accuracy: 80.00
  edge_case_accuracy: 45.00

AUTOPROMPT RESULTS:
  overall_accuracy: 88.75
  product_accuracy: 92.50
  sentiment_accuracy: 95.00
  edge_case_accuracy: 70.00

IMPROVEMENT:
  overall_accuracy: +16.25%
  edge_case_accuracy: +25.00%
================================================================
```

Visualizations are automatically generated in the results/ directory after running the benchmark.
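The IMPROVEMENT lines in the report are simple percentage-point deltas between the two runs. A sketch of that computation (the helper name is hypothetical; the metric keys mirror the sample output):

```python
def compute_improvement(baseline, autoprompt):
    """Return per-metric percentage-point deltas (autoprompt - baseline)."""
    return {
        key: round(autoprompt[key] - baseline[key], 2)
        for key in baseline
        if key in autoprompt
    }

baseline = {"overall_accuracy": 72.50, "edge_case_accuracy": 45.00}
autoprompt = {"overall_accuracy": 88.75, "edge_case_accuracy": 70.00}
print(compute_improvement(baseline, autoprompt))
# {'overall_accuracy': 16.25, 'edge_case_accuracy': 25.0}
```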


πŸ“ Project Structure

autoprompt/
β”œβ”€β”€ .github/
β”‚   └── workflows/
β”‚       └── ci.yml              # CI/CD pipeline
β”œβ”€β”€ config/
β”‚   └── prompt_config.yaml      # Prompt templates and candidates
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ reviews.csv             # Sample review data
β”‚   └── ground_truth.json       # Labeled ground truth
β”œβ”€β”€ results/                    # Benchmark outputs (generated)
β”‚   β”œβ”€β”€ baseline_results.json
β”‚   β”œβ”€β”€ autoprompt_results.json
β”‚   β”œβ”€β”€ benchmark_report.json
β”‚   └── *.png                   # Visualization charts
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ autoprompt.py           # AutoPrompt engine with variant generation
β”‚   β”œβ”€β”€ baseline.py             # Baseline single-prompt pipeline
β”‚   β”œβ”€β”€ evaluator.py            # Performance evaluation metrics
β”‚   β”œβ”€β”€ config_loader.py        # Secure configuration loading
β”‚   └── utils.py                # Data models and utilities
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_utils.py           # Unit tests for utilities
β”‚   └── test_evaluator.py       # Unit tests for evaluator
β”œβ”€β”€ main.py                     # Entry point
β”œβ”€β”€ visualize_results.py        # Chart generation script
└── requirements.txt            # Python dependencies

🧪 Testing

Run the test suite:

```bash
# Run all tests
pytest

# Run with coverage report
pytest --cov=src --cov-report=html

# Run specific test file
pytest tests/test_utils.py -v
```

Tests are automatically run via GitHub Actions on every push.


🔧 Configuration

The system uses YAML-based configuration in config/prompt_config.yaml:

  • Instruction candidates: Different ways to request extraction
  • Target info candidates: Variations in specifying output fields
  • Model settings: Temperature, model versions, scoring options

Modify these to experiment with different prompt strategies.
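Once the YAML is parsed (e.g. with `yaml.safe_load`), the candidate pools can be crossed to enumerate prompt variants. A sketch with hypothetical keys and values (the real `config/prompt_config.yaml` may name its fields differently):

```python
import itertools

# Hypothetical parsed contents of config/prompt_config.yaml
# (illustrative only; the real file's keys and values may differ).
config = {
    "instruction_candidates": [
        "Extract the product and sentiment from this review.",
        "Identify the product discussed and the reviewer's sentiment.",
    ],
    "target_info_candidates": [
        "Return JSON with keys: product, sentiment, confidence.",
        "Output fields: product name, sentiment label, confidence score.",
    ],
}

# Every (instruction, target-info) pair becomes one prompt variant.
variants = [
    f"{instruction}\n{target}"
    for instruction, target in itertools.product(
        config["instruction_candidates"], config["target_info_candidates"]
    )
]
# 2 instructions x 2 target specs -> 4 variants
```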


📈 How It Works

AutoPrompt Pipeline

  1. Variant Generation: Creates multiple prompt variations from candidate pools
  2. Parallel Evaluation: Tests each prompt variant on the review
  3. Quality Scoring: Scores each result using heuristics (+ optional LLM)
  4. Best Selection: Returns highest-scoring extraction
  5. Early Stopping: Terminates when acceptable quality threshold is reached
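The steps above can be sketched as a score-and-select loop (shown sequentially for clarity; the real engine may evaluate variants in parallel, and `extract` and `score` stand in for the project's model call and heuristic scorer — both names are hypothetical):

```python
def run_autoprompt(review, variants, extract, score, threshold=0.9):
    """Evaluate prompt variants on one review and keep the best extraction.

    Sketch of the pipeline above, not the project's actual code:
    generate -> evaluate -> score -> select best -> stop early.
    """
    best_score, best_result = float("-inf"), None
    for prompt in variants:
        result = extract(prompt, review)   # model call for this variant
        quality = score(result)            # heuristic quality score
        if quality > best_score:
            best_score, best_result = quality, result
        if quality >= threshold:           # early stopping on "good enough"
            break
    return best_result, best_score
```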

Baseline Pipeline

  • Single, static prompt for all reviews
  • Direct extraction without optimization
  • Serves as performance comparison baseline

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.


📧 Contact

Ayush - @aayush-1o

Project Link: https://github.com/aayush-1o/auto-Prompt


⭐ Star this repo if you find it helpful!

Made with ❤️ using Google Gemini AI
