This project benchmarks large language models on multi-file project questions. It automates downloading the project files, generating solutions with a model, applying those solutions to the project files, running the tests in Docker containers, and capturing the test results.
- Python 3.11+
- Docker Desktop version 4.37.1+
- OpenAI API key
- pip for installing Python packages
Recommendation: Running the evaluation harness can be resource-intensive. For optimal performance, we recommend a machine with at least 16GB of RAM and 8 CPU cores. You may also need to adjust the number of workers in the project_questions_harness.py script; a configuration of 2-4 workers is generally recommended. If running on Docker Desktop, ensure the virtual disk has approximately 120GB of free storage.
For reference, on a machine with 32GB of RAM, 10 CPU cores, and 3 workers, completing one iteration (processing 65 questions with a given model) takes approximately 1 hour.
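As a rough illustration of how the worker setting maps to concurrency, here is a minimal sketch of a worker pool; NUM_WORKERS and run_question are hypothetical names, and the actual logic in project_questions_harness.py may be structured differently:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_WORKERS = 3  # hypothetical knob; 2-4 is a good starting range on most machines

def run_question(question_id: str) -> dict:
    # Placeholder for the per-question pipeline: download the project files,
    # query the model, apply the solution, run the tests in Docker, record results.
    return {"id": question_id, "passed": 0, "total": 0}

def run_iteration(question_ids: list[str]) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        futures = {pool.submit(run_question, qid): qid for qid in question_ids}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```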
- Clone the repository:

  git clone https://github.com/interviewstreet/astra-benchmark.git
  cd astra-benchmark
- Install the required Python packages:

  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
- Ensure Docker is installed and running on your system.
- Create a .env file in the root directory and add your API keys and AWS credentials:

  OPENAI_API_KEY=your_openai_api_key
  GOOGLE_API_KEY=your_google_api_key
  AWS_REGION=your_aws_region
  AWS_PROFILE=your_aws_profile
- Make sure access to the Claude model is enabled in AWS Bedrock before using it.
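For reference, a minimal sketch of how these variables could be read at runtime, assuming the harness loads them with python-dotenv (the actual loading code in the repository may differ):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is available via requirements.txt

load_dotenv()  # reads the .env file from the project root

openai_key = os.environ["OPENAI_API_KEY"]
google_key = os.environ.get("GOOGLE_API_KEY")   # needed only for Gemini models
aws_region = os.environ.get("AWS_REGION")       # needed only for Claude via AWS Bedrock
aws_profile = os.environ.get("AWS_PROFILE")
```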
The dataset used for astra-benchmark consists of multiple project questions, each with its own unique identifier and associated metadata. The dataset is stored in a CSV file named project_questions.csv, located in the root directory of the project. The CSV file should contain the following columns:
- id: Unique identifier for the question
- name: Name of the project
- type: Type of the project
- problem_statement: Description of the problem
- project_url: URL to download the project files
- sub_type: Sub-type of the project (e.g., nodejs, python)
- test_command: Command to run the tests
- testcases: List of test cases
- testcase_files: List of test case files
- total_testcases: Total number of test cases
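Before starting a run, it can be useful to sanity-check that the CSV exposes these columns. A minimal sketch using the standard library (the harness itself may load the file differently):

```python
import csv

REQUIRED_COLUMNS = {
    "id", "name", "type", "problem_statement", "project_url",
    "sub_type", "test_command", "testcases", "testcase_files", "total_testcases",
}

with open("project_questions.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise SystemExit(f"project_questions.csv is missing columns: {sorted(missing)}")
    rows = list(reader)

print(f"Loaded {len(rows)} questions")
```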
- Run the main script:

  python3 project_questions_harness.py

- Follow the prompts to enter the model name, number of iterations, question IDs, and the response format (json or xml). We recommend using the XML format.

- While the benchmarking is running, partial results for each question are stored in the output_csv/<model>/partial_output/project_questions_output_iteration_<k>_<timestamp>.csv file.

- Once an iteration is completed, its result is saved in the output_csv/<model>/project_questions_output_iteration_<k>_<model_response_format>_<timestamp>.csv file.

- To calculate the final aggregated metrics, run the following command after all iterations are completed:

  python3 aggregated_metrics.py

- The final aggregated results are saved in the aggregated_results/<model>/aggregated_results_<model>_<pass_k>.csv file.
Complete the {sub_type} project. Provide me the response in xml format in a xml code block only for the required files
needed to solve the problem.
Format:
<files>
<file>
<path>relative/path/to/file_1</path>
<content><![CDATA[ // This is file_content_1 ]]></content>
</file>
<file>
<path>relative/path/to/file_2</path>
<content><![CDATA[ // This is file_content_2 ]]></content>
</file>
</files>
Do not provide any additional instructions.
Provide the relative file path in xml object just like how it is provided in the attached files in prompt.
Ensure that the xml is properly escaped wherever it is required, and verify that the XML format is always well formed.
Problem Statement:{problem_description}
Project Files: {files}
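For illustration, here is a minimal sketch of how a response in this format could be parsed and written into the project directory (this helper is hypothetical and not necessarily how the harness applies solutions; it assumes the markdown code fence has already been stripped from the model output):

```python
import os
import xml.etree.ElementTree as ET

def apply_solution(xml_text: str, project_root: str) -> list[str]:
    """Parse a <files> response and write each file under project_root.
    Returns the relative paths that were written."""
    written = []
    root = ET.fromstring(xml_text)
    for file_node in root.findall("file"):
        rel_path = (file_node.findtext("path") or "").strip()
        content = file_node.findtext("content") or ""
        if not rel_path:
            continue
        dest = os.path.join(project_root, rel_path)
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        with open(dest, "w", encoding="utf-8") as f:
            f.write(content)
        written.append(rel_path)
    return written
```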
- gpt-4o
- o1-preview
- claude-3.5-sonnet
- gemini-1.5-pro
- o1
The Average Score with k=32 evaluates the model’s partial correctness and robustness by considering multiple attempts (up to k=32) for each problem. For each problem, the score is calculated as the average proportion of passed test cases across the 32 runs. Then, this score is aggregated across 65 problems to compute the final Average Score.
Average_Score = (1/n) * Σ(i=1 to n) (1/k) * Σ(j=1 to k) (p[ij] / T[i])
Where:
n is the number of problems (e.g., 65)
k is the number of runs/solutions per problem (e.g., 32)
p[ij] is the number of passed test cases for the j-th run of the i-th problem
T[i] is the total number of test cases for the i-th problem
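As a reference, the computation above can be sketched in a few lines of Python (the actual aggregated_metrics.py may be organized differently; passed and totals are hypothetical input names):

```python
def average_score(passed: list[list[int]], totals: list[int]) -> float:
    """passed[i][j] = test cases passed on run j of problem i; totals[i] = T[i]."""
    n = len(totals)
    per_problem = [
        sum(p_ij / totals[i] for p_ij in passed[i]) / len(passed[i])
        for i in range(n)
    ]
    return sum(per_problem) / n
```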
Pass@1 measures the success rate of a model: the probability that a single attempt fully solves a task (all test cases pass). It is estimated as the proportion of fully correct runs across the k runs for each problem, averaged over all problems.
Formula used:
Pass@1 = (1/n) * Σ(i=1 to n) (1/k) * Σ(j=1 to k) I(p[ij] = T[i])
Where:
n is the number of problems (e.g., 65)
k is the number of runs/solutions per problem (e.g., 32)
p[ij] is the number of passed test cases for the j-th run of the i-th problem
T[i] is the total number of test cases for the i-th problem
I(⋅) is the indicator function, equal to 1 when the run achieves a perfect score (p[ij] = T[i]) and 0 otherwise
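A corresponding sketch for Pass@1, using the same hypothetical inputs as the Average Score example:

```python
def pass_at_1(passed: list[list[int]], totals: list[int]) -> float:
    """Fraction of runs that pass every test case, averaged over problems."""
    n = len(totals)
    per_problem = [
        sum(1 for p_ij in passed[i] if p_ij == totals[i]) / len(passed[i])
        for i in range(n)
    ]
    return sum(per_problem) / n
```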
The Median Standard Deviation measures the consistency of the model's performance across n=65 problems. For each problem, the standard deviation of scores across k=32 solutions is computed. The final metric is the median of these standard deviations across all problems. The rationale for using the median instead of the mean is that the standard deviation of scores across different problems often deviates from a normal distribution.
Formula used:
SD(i) = sqrt((1/k) * Σ(j=1 to k) (Score(i,j) - Average_Score(i))²)
Median_Standard_Deviation = median(SD(i) for i in range(1, n+1))
Where:
Score(i,j) is the score of the j-th solution for problem i (the proportion of passed test cases)
Average_Score(i) is the average score for problem i across its k solutions
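And a sketch for the Median Standard Deviation, again using the same hypothetical inputs; note that statistics.pstdev divides by k inside the square root, matching the formula above:

```python
import statistics

def median_standard_deviation(passed: list[list[int]], totals: list[int]) -> float:
    """Per-problem population standard deviation of run scores, then the median."""
    sds = []
    for i, total in enumerate(totals):
        scores = [p_ij / total for p_ij in passed[i]]
        sds.append(statistics.pstdev(scores))  # pstdev uses 1/k, as in SD(i)
    return statistics.median(sds)
```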