
Commit 0296b99

Update usage instructions
1 parent: 7e11196

3 files changed: +41 -25 lines


.gitignore (+1)
@@ -4,6 +4,7 @@ keys.cfg
 **/output/**
 **/eval_results/**
 eval/logs/**
+*.h5
 
 
 # -------

README.md (+7 -1)
@@ -14,6 +14,7 @@ SciCode is a challenging benchmark designed to evaluate the capabilities of lang
 SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on (1) numerical methods, (2) simulation of systems, and (3) scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test an LM's science capability.
 
 ## 🏆 Leaderboard
+
 | Model | Subproblem | Main Problem |
 |---------------------------|------------|--------------|
 | Claude3.5-Sonnet | **26** | **4.6** |
@@ -27,8 +28,13 @@ SciCode sources challenging and realistic research-level coding problems across
 | Mixtral-8x22B-Instruct | 16.3 | 0 |
 | Llama-3-70B-Chat | 14.6 | 0 |
 
+## Instructions to evaluate a new model
 
-
+1. Clone this repository: `git clone git@github.com:scicode-bench/SciCode.git`
+2. Install the `scicode` package with `pip install -e .`
+3. Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
+4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
+5. Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests
 
 ## Contact
 - Minyang Tian: [email protected]
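Taken together, the five steps added to `README.md` above correspond roughly to the following shell session (a minimal sketch: `gpt-4o` is only an example model name, and `test_data.h5` still has to be downloaded manually from the Drive link):

```bash
# Rough end-to-end sketch of the README steps above.
# gpt-4o is an example model name; test_data.h5 must be downloaded
# from the Drive link and placed at eval/data/test_data.h5 by hand.
git clone git@github.com:scicode-bench/SciCode.git
cd SciCode
pip install -e .
mkdir -p eval/data        # destination for the downloaded numeric test results
python eval/scripts/gencode_json.py --model gpt-4o
python eval/scripts/test_generated_code.py
```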

eval/scripts/README.md (+33 -24)
@@ -1,24 +1,33 @@
-- ## **Generate LLM code**
-
-To run the script, go to the root of this repo and use the following command:
-
-```bash
-python evaluation/scripts/gencode_json.py [options]
-```
-
-### Command-Line Arguments
-- `--model` - Specifies the model name used for generating responses.
-- `--output-dir` - Directory to store the generated code outputs (Default: `evaluation/eval_results/generated_code`).
-- `--input-path` - Directory containing the JSON files describing the problems (Default: `evaluation/problem_json`).
-- `--prompt-dir` - Directory where prompt files are saved (Default: `evaluation/eval_results/prompt`).
-- `--temperature` - Controls the randomness of the generation (Default: 0).
-
-- ## **Evaluate generated code**
-
-Download `test_data.h5` at the path `evaluation/test_data.h5`.
-
-To run the script, go to the root of this repo and use the following command:
-
-```bash
-python evaluation/scripts/test_generated_code.py
-```
+## **Generate LLM code**
+
+To run the script, go to the root of this repo and use the following command:
+
+```bash
+python evaluation/scripts/gencode_json.py [options]
+```
+
+For example, to create model results with `gpt-4o` and the default settings, run
+
+```bash
+python evaluation/scripts/gencode_json.py --model gpt-4o
+```
+
+### Command-Line Arguments
+
+- `--model` - Specifies the model name used for generating responses.
+- `--output-dir` - Directory to store the generated code outputs (Default: `eval_results/generated_code`).
+- `--input-path` - Path to the JSONL file describing the problems (Default: `eval/data/problems_all.jsonl`).
+- `--prompt-dir` - Directory where prompt files are saved (Default: `eval_results/prompt`).
+- `--temperature` - Controls the randomness of the generation (Default: 0).
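For illustration, several of these flags can be combined in one call (a sketch only; the directory values shown are just the documented defaults written out explicitly):

```bash
python evaluation/scripts/gencode_json.py \
    --model gpt-4o \
    --output-dir eval_results/generated_code \
    --prompt-dir eval_results/prompt \
    --temperature 0
```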
+
+## **Evaluate generated code**
+
+Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`.
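As a quick sanity check that the file is in place and readable (a sketch assuming the `h5py` package is installed; the file's internal layout is not documented here, so this only lists top-level keys):

```python
# Verify the downloaded numeric test results open as a valid HDF5 file.
# Assumes h5py is installed (pip install h5py).
import h5py

with h5py.File("eval/data/test_data.h5", "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} top-level entries, e.g. {keys[:5]}")
```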
+
+To run the script, go to the root of this repo and use the following command:
+
+```bash
+python evaluation/scripts/test_generated_code.py
+```
+
+Please edit the `test_generated_code.py` source file to specify your model name, results directory, and problem set (if not `problems_all.jsonl`).
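The exact variable names are defined in the script itself, so the snippet below is only a hypothetical illustration of the kind of values to adjust (the names are invented; the values shown are the defaults documented above):

```python
# Hypothetical settings inside test_generated_code.py -- the real variable
# names may differ, so check the script source before editing.
model_name = "gpt-4o"                          # model whose generated code is being evaluated
code_dir = "eval_results/generated_code"       # where gencode_json.py saved the outputs
problem_set = "eval/data/problems_all.jsonl"   # problem set to evaluate against
```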
