This project benchmarks large language models on multi-file project questions. It automates downloading the project files, generating solutions with a model, applying those solutions to the project files, running the tests in Docker containers, and capturing the test results.
- Python 3.11+
- Docker Desktop version 4.37.1+
- OpenAI API key
- pip for installing Python packages
Recommendation: Running the evaluation harness can be resource-intensive. For optimal performance, we recommend a machine with at least 16GB of RAM and 8 CPU cores. You may also need to adjust the number of workers in the project_questions_harness.py script; a configuration of 2-4 workers is generally recommended. If running on Docker Desktop, ensure the virtual disk has approximately 120GB of free storage.
For reference, on a machine with 32GB of RAM, 10 CPU cores, and 3 workers, completing one iteration (processing 65 questions with a given model) takes approximately 1 hour.
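As a rough illustration of how the worker setting maps to concurrency, here is a minimal sketch of a worker pool; NUM_WORKERS and run_question are hypothetical names, and the actual logic in project_questions_harness.py may be structured differently:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_WORKERS = 3  # hypothetical knob; 2-4 is a good starting range on most machines

def run_question(question_id: str) -> dict:
    # Placeholder for the per-question pipeline: download the project files,
    # query the model, apply the solution, run the tests in Docker, record results.
    return {"id": question_id, "passed": 0, "total": 0}

def run_iteration(question_ids: list[str]) -> list[dict]:
    results = []
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        futures = {pool.submit(run_question, qid): qid for qid in question_ids}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```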
- Clone the repository:

  git clone https://github.com/interviewstreet/astra-benchmark.git
  cd astra-benchmark
- Install the required Python packages:

  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
- Ensure Docker is installed and running on your system.
- Create a .env file in the root directory and add your API keys and AWS credentials:

  OPENAI_API_KEY=your_openai_api_key
  GOOGLE_API_KEY=your_google_api_key
  AWS_REGION=your_aws_region
  AWS_PROFILE=your_aws_profile
- Make sure access to the Claude model is enabled in AWS Bedrock before using it.
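For reference, a minimal sketch of how these variables could be read at runtime, assuming the harness loads them with python-dotenv (the actual loading code in the repository may differ):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is available via requirements.txt

load_dotenv()  # reads the .env file from the project root

openai_key = os.environ["OPENAI_API_KEY"]
google_key = os.environ.get("GOOGLE_API_KEY")   # needed only for Gemini models
aws_region = os.environ.get("AWS_REGION")       # needed only for Claude via AWS Bedrock
aws_profile = os.environ.get("AWS_PROFILE")
```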
The dataset used for astra-benchmark consists of multiple project questions, each with its own unique identifier and associated metadata. The dataset is stored in a CSV file named project_questions.csv, located in the root directory of the project. The CSV file should contain the following columns:
- id: Unique identifier for the question
- name: Name of the project
- type: Type of the project
- problem_statement: Description of the problem
- project_url: URL to download the project files
- sub_type: Sub-type of the project (e.g., nodejs, python)
- test_command: Command to run the tests
- testcases: List of test cases
- testcase_files: List of test case files
- total_testcases: Total number of test cases
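Before starting a run, it can be useful to sanity-check that the CSV exposes these columns. A minimal sketch using the standard library (the harness itself may load the file differently):

```python
import csv

REQUIRED_COLUMNS = {
    "id", "name", "type", "problem_statement", "project_url",
    "sub_type", "test_command", "testcases", "testcase_files", "total_testcases",
}

with open("project_questions.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise SystemExit(f"project_questions.csv is missing columns: {sorted(missing)}")
    rows = list(reader)

print(f"Loaded {len(rows)} questions")
```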
- Run the main script:

  python3 project_questions_harness.py

- Follow the prompts to enter the model name, number of iterations, question IDs, and the response format (json or xml). We recommend using the XML format.

- While the benchmarking is running, partial results for each question are stored in the output_csv/<model>/partial_output/project_questions_output_iteration_<k>_<timestamp>.csv file.

- Once an iteration is completed, its result is saved in the output_csv/<model>/project_questions_output_iteration_<k>_<model_response_format>_<timestamp>.csv file.

- To calculate the final aggregated metrics, run the following command after all iterations are completed:

  python3 aggregated_metrics.py

- The final aggregated results are saved in the aggregated_results/<model>/aggregated_results_<model>_<pass_k>.csv file.
Complete the {sub_type} project. Provide me the response in xml format in a xml code block only for the required files
needed to solve the problem.
Format:
<files>
<file>
<path>relative/path/to/file_1</path>
<content><![CDATA[ // This is file_content_1 ]]></content>
</file>
<file>
<path>relative/path/to/file_2</path>
<content><![CDATA[ // This is file_content_2 ]]></content>
</file>
</files>
Do not provide any additional instructions.
Provide the relative file path in xml object just like how it is provided in the attached files in prompt.
Ensure that the xml is properly escaped wherever it is required, and verify that the XML format is always well formed.
Problem Statement:{problem_description}
Project Files: {files}
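For illustration, here is a minimal sketch of how a response in this format could be parsed and written into the project directory (this helper is hypothetical and not necessarily how the harness applies solutions; it assumes the markdown code fence has already been stripped from the model output):

```python
import os
import xml.etree.ElementTree as ET

def apply_solution(xml_text: str, project_root: str) -> list[str]:
    """Parse a <files> response and write each file under project_root.
    Returns the relative paths that were written."""
    written = []
    root = ET.fromstring(xml_text)
    for file_node in root.findall("file"):
        rel_path = (file_node.findtext("path") or "").strip()
        content = file_node.findtext("content") or ""
        if not rel_path:
            continue
        dest = os.path.join(project_root, rel_path)
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        with open(dest, "w", encoding="utf-8") as f:
            f.write(content)
        written.append(rel_path)
    return written
```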
- gpt-4o
- o1-preview
- claude-3.5-sonnet
- gemini-1.5-pro
- o1
The Average Score with k=32 evaluates the model’s partial correctness and robustness by considering multiple attempts (up to k=32) for each problem. For each problem, the score is calculated as the average proportion of passed test cases across the 32 runs. Then, this score is aggregated across 65 problems to compute the final Average Score.
Average_Score = (1/n) * Σ(i=1 to n) (1/k) * Σ(j=1 to k) (p[ij] / T[i])
Where:
n is the number of problems (e.g., 65)
k is the number of runs/solutions per problem (e.g., 32)
p[ij] is the number of passed test cases for the j-th run of the i-th problem
T[i] is the total number of test cases for the i-th problem
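As a reference, the computation above can be sketched in a few lines of Python (the actual aggregated_metrics.py may be organized differently; passed and totals are hypothetical input names):

```python
def average_score(passed: list[list[int]], totals: list[int]) -> float:
    """passed[i][j] = test cases passed on run j of problem i; totals[i] = T[i]."""
    n = len(totals)
    per_problem = [
        sum(p_ij / totals[i] for p_ij in passed[i]) / len(passed[i])
        for i in range(n)
    ]
    return sum(per_problem) / n
```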
Pass@1 measures the success rate of a model: the probability that a single attempt fully solves a task (all test cases pass). It is estimated as the proportion of fully correct runs across the k runs for each problem, averaged over all problems.
Formula used:
Pass@1 = (1/n) * Σ(i=1 to n) (1/k) * Σ(j=1 to k) I(p[ij] = T[i])
Where:
n is the number of problems (e.g., 65)
k is the number of runs/solutions per problem (e.g., 32)
p[ij] is the number of passed test cases for the j-th run of the i-th problem
T[i] is the total number of test cases for the i-th problem
I(⋅) is the indicator function, equal to 1 when the run achieves a perfect score (p[ij] = T[i]) and 0 otherwise
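A corresponding sketch for Pass@1, using the same hypothetical inputs as the Average Score example:

```python
def pass_at_1(passed: list[list[int]], totals: list[int]) -> float:
    """Fraction of runs that pass every test case, averaged over problems."""
    n = len(totals)
    per_problem = [
        sum(1 for p_ij in passed[i] if p_ij == totals[i]) / len(passed[i])
        for i in range(n)
    ]
    return sum(per_problem) / n
```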
The Median Standard Deviation measures the consistency of the model's performance across n=65 problems. For each problem, the standard deviation of scores across k=32 solutions is computed. The final metric is the median of these standard deviations across all problems. The rationale for using the median instead of the mean is that the standard deviation of scores across different problems often deviates from a normal distribution.
Formula used:
SD(i) = sqrt((1/k) * Σ(j=1 to k) (Score(i,j) - Average_Score(i))²)
Median_Standard_Deviation = median(SD(i) for i in range(1, n+1))
Where:
Score(i,j) is the score of the j-th solution for problem i (the proportion of passed test cases)
Average_Score(i) is the average score for problem i across its k solutions
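And a sketch for the Median Standard Deviation, again using the same hypothetical inputs; note that statistics.pstdev divides by k inside the square root, matching the formula above:

```python
import statistics

def median_standard_deviation(passed: list[list[int]], totals: list[int]) -> float:
    """Per-problem population standard deviation of run scores, then the median."""
    sds = []
    for i, total in enumerate(totals):
        scores = [p_ij / total for p_ij in passed[i]]
        sds.append(statistics.pstdev(scores))  # pstdev uses 1/k, as in SD(i)
    return statistics.median(sds)
```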