ASTRA Benchmark

Overview

This project benchmarks models on multi-file project questions. It automates downloading the project files for each question, generating a solution with a model, applying the generated solution to the project files, running the project's tests in Docker containers, and capturing the test results.

Evaluation Pipeline

[Evaluation pipeline diagram]

Requirements

  • Python 3.11+
  • Docker version 4.37.1+
  • API keys for the models you plan to evaluate (OpenAI, Google, and/or AWS credentials with Bedrock access for Claude)
  • pip for installing Python packages

Prerequisites

Recommendation: Running the evaluation harness can be resource-intensive. For optimal performance, we recommend a machine with at least 16GB of RAM and 8 CPU cores. A configuration of 2-4 workers is generally recommended; you can adjust the number of workers in the project_questions_harness.py script to tune performance further. If running on Docker Desktop, ensure the virtual disk has approximately 120GB of free storage.

For reference, on a machine with 32GB of RAM, 10 CPU cores, and 3 workers, completing one iteration (processing 65 questions with a given model) takes approximately 1 hour.

  1. Clone the repository:

    git clone https://github.com/interviewstreet/astra-benchmark.git
    cd astra-benchmark
  2. Install the required Python packages:

    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  3. Ensure Docker is installed and running on your system.

  4. Create a .env file in the root directory and add your API keys and AWS configuration (a small credential-loading sketch follows after this list):

    OPENAI_API_KEY=your_openai_api_key
    GOOGLE_API_KEY=your_google_api_key
    AWS_REGION=your_aws_region
    AWS_PROFILE=your_aws_profile
  5. Ensure that access to the Claude model is enabled in AWS Bedrock before evaluating it.
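
The harness reads these values from the environment at runtime. As a minimal sketch of how the credentials could be loaded in Python (assuming the python-dotenv, openai, and boto3 packages, which may not exactly match what requirements.txt pins):

    import os

    import boto3                    # AWS SDK, used to reach Claude via Bedrock
    from dotenv import load_dotenv  # reads the .env file created in step 4
    from openai import OpenAI

    load_dotenv()  # populate os.environ from .env

    openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Session built from the AWS profile/region configured in .env
    session = boto3.Session(
        profile_name=os.environ.get("AWS_PROFILE"),
        region_name=os.environ.get("AWS_REGION"),
    )
    bedrock_client = session.client("bedrock-runtime")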

Dataset

The dataset used for astra-benchmark consists of multiple project questions, each with its own unique identifier and associated metadata. The dataset is stored in a CSV file named project_questions.csv located in the root directory of the project.

Structure of project_questions.csv

The CSV file should contain the following columns:

  • id: Unique identifier for the question
  • name: Name of the project
  • type: Type of the project
  • problem_statement: Description of the problem
  • project_url: URL to download the project files
  • sub_type: Sub-type of the project (e.g., nodejs, python)
  • test_command: Command to run the tests
  • testcases: List of test cases
  • testcase_files: List of test case files
  • total_testcases: Total number of test cases
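
A quick way to sanity-check a local copy of the dataset is to load it with Python's csv module and verify that the expected columns are present. The column names below come from the list above; how the harness itself parses the file may differ:

    import csv

    EXPECTED_COLUMNS = {
        "id", "name", "type", "problem_statement", "project_url",
        "sub_type", "test_command", "testcases", "testcase_files",
        "total_testcases",
    }

    with open("project_questions.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"project_questions.csv is missing columns: {missing}")
        questions = list(reader)

    print(f"Loaded {len(questions)} questions")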

Evaluation Steps

  1. Run the main script:

    python3 project_questions_harness.py
  2. Follow the prompts to enter the model name, number of iterations, question IDs, and the response format (json or xml). We recommend the XML format.

  3. The script runs the benchmarking process for each question; while it is running, partial results are stored in the output_csv/<model>/partial_output/project_questions_output_iteration_<k>_<timestamp>.csv file.

  4. Once an iteration completes, the results are saved to the output_csv/<model>/project_questions_output_iteration_<k>_<model_response_format>_<timestamp>.csv file.

  5. To calculate the final aggregated metrics, run the following command after all the iterations are completed:

    python3 aggregated_metrics.py
  6. The final aggregated results will be saved in the aggregated_results/<model>/aggregated_results_<model>_<pass_k>.csv file.
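
Before running aggregated_metrics.py, it can be useful to confirm that every iteration produced an output file. A small sketch based on the directory layout described above (the exact glob pattern is an assumption derived from the documented file names):

    from pathlib import Path

    model = "gpt-4o"  # example model name
    output_dir = Path("output_csv") / model

    # Completed iterations follow the naming pattern from step 4;
    # partial results live under partial_output/ as described in step 3.
    completed = sorted(output_dir.glob("project_questions_output_iteration_*.csv"))
    partial = sorted((output_dir / "partial_output").glob("project_questions_output_iteration_*.csv"))

    print(f"{len(completed)} completed iteration files, {len(partial)} partial files")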

XML Prompt

Complete the {sub_type} project. Provide me the response in xml format in a xml code block only for the required files needed to solve the problem.
Format:

<files>
    <file>
        <path>relative/path/to/file_1</path>
        <content><![CDATA[ // This is file_content_1 ]]></content>
    </file>
    <file>
        <path>relative/path/to/file_2</path>
        <content><![CDATA[ // This is file_content_2 ]]></content>
    </file>
</files>

Do not provide any additional instructions.
Provide the relative file path in xml object just like how it is provided in the attached files in prompt.
Ensure that the xml is properly escaped wherever it is required, and verify that the XML format is always well formed.

Problem Statement:{problem_description}

Project Files: {files}
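
For reference, here is a sketch of how a response in this format could be parsed and written back into the project directory, using only the standard library; the real harness may do this differently:

    import re
    import xml.etree.ElementTree as ET
    from pathlib import Path

    def apply_xml_response(response_text: str, project_root: Path) -> None:
        # Extract the <files>...</files> element, ignoring any surrounding code-fence markers
        match = re.search(r"<files>.*</files>", response_text, re.DOTALL)
        if match is None:
            raise ValueError("No <files> element found in the model response")
        root = ET.fromstring(match.group(0))  # CDATA sections are returned as plain text

        for file_el in root.findall("file"):
            rel_path = (file_el.findtext("path") or "").strip()
            content = file_el.findtext("content") or ""
            target = project_root / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)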

Supported Models

  1. gpt-4o
  2. o1-preview
  3. claude-3.5-sonnet
  4. gemini-1.5-pro
  5. o1

Aggregated Metrics

Average Score Calculation

The Average Score with k=32 evaluates the model's partial correctness and robustness by considering up to k=32 attempts per problem. For each problem, the score is the average proportion of passed test cases across the 32 runs; these per-problem scores are then averaged across the 65 problems to give the final Average Score.

Average_Score = (1/n) * Σ(i=1 to n) (1/k) * Σ(j=1 to k) (p[ij] / T[i])

Where:
n is the number of problems (e.g., 65)
k is the number of runs/solutions per problem (e.g., 32)
p[ij] is the number of passed test cases for the j-th run of the i-th problem
T[i] is the total number of test cases for the i-th problem
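
As a concrete illustration with toy numbers (not real benchmark data; numpy is assumed here for brevity):

    import numpy as np

    # p[i][j]: passed test cases for run j of problem i; T[i]: total test cases of problem i
    p = np.array([[3, 4, 4],
                  [1, 0, 2]])        # n=2 problems, k=3 runs (toy values)
    T = np.array([4, 2])

    scores = p / T[:, None]          # Score(i, j) = p[ij] / T[i]
    average_score = scores.mean()    # average over runs, then over problems (same k for all)
    print(average_score)             # ~0.708 for these toy values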

Pass@1 Calculation

Pass@1 measures the success rate of a model: the proportion of attempts for which all test cases pass. With k runs per problem, the fraction of fully correct runs is first averaged within each problem and then across all problems, which estimates the probability that a single attempt solves a task completely.

Formula used:

Pass@1 = (1/n) * Σ(i=1 to n) (1/k) * Σ(j=1 to k) I(p[ij] = T[i])

Where:
n is the number of problems (e.g., 65)
k is the number of runs/solutions per problem (e.g., 32)
p[ij] is the number of passed test cases for the j-th run of the i-th problem
T[i] is the total number of test cases for the i-th problem
I(⋅) is the indicator function, equal to 1 if the run achieves a perfect score (all test cases pass) and 0 otherwise
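
Continuing the toy example from the Average Score sketch, Pass@1 counts only runs that pass every test case:

    import numpy as np

    p = np.array([[3, 4, 4],
                  [1, 0, 2]])        # same toy data as in the Average Score sketch
    T = np.array([4, 2])

    perfect = (p == T[:, None])              # I(p[ij] = T[i]) for each run
    pass_at_1 = perfect.mean(axis=1).mean()  # fraction of perfect runs per problem, then averaged
    print(pass_at_1)                         # (2/3 + 1/3) / 2 = 0.5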

Standard Deviation Calculation

The Median Standard Deviation measures the consistency of the model's performance across n=65 problems. For each problem, the standard deviation of scores across k=32 solutions is computed. The final metric is the median of these standard deviations across all problems. The rationale for using the median instead of the mean is that the standard deviation of scores across different problems often deviates from a normal distribution.

Formula used:

SD(i) = sqrt((1/k) * Σ(j=1 to k) (Score(i,j) - Average_Score(i))²)
Median_Standard_Deviation = median(SD(i) for i in range(1, n+1))

Where:
Score(i,j) = p[ij] / T[i] is the score of the j-th solution for problem i
Average_Score(i) is the average of Score(i,j) over the k solutions for problem i
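
And for the same toy data, the per-problem standard deviations and their median:

    import numpy as np

    p = np.array([[3, 4, 4],
                  [1, 0, 2]])
    T = np.array([4, 2])

    scores = p / T[:, None]
    sd_per_problem = scores.std(axis=1)   # population SD over the k runs (the 1/k form above)
    median_sd = np.median(sd_per_problem)
    print(median_sd)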
