Question about humaneval #2648

Open
Shiguang-Guo opened this issue Jan 22, 2025 · 3 comments · May be fixed by #2650

@Shiguang-Guo

I tried to evaluate humaneval on meta-llama-3.1-instruct, but got a score close to 0. I printed the output and found:

```json
{
    "resps": [
        [
            "Here's a Python function that implements the required functionality:\n\n```python\nfrom typing import List\n"
        ]
    ]
}
```

I think this may be due to generation_kwargs.until in the configuration. So what is the correct way to evaluate?
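
For illustration, here is a minimal sketch (not the harness's actual code) of how stop strings like those in `generation_kwargs.until` can cut a chat-style completion off right after the opening code fence. The stop strings below are hypothetical, chosen only to roughly reproduce the truncation seen in `resps` above:

```python
# Minimal sketch (not lm-evaluation-harness code): applying an `until`-style
# stop list to a chat completion. Stop strings here are hypothetical examples.

FENCE = "`" * 3  # markdown code fence, built here to avoid a literal fence in this snippet


def apply_stop_strings(text: str, stop_strings: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


completion = (
    "Here's a Python function that implements the required functionality:\n\n"
    + FENCE
    + "python\nfrom typing import List\n\n\ndef has_close_elements(numbers, threshold):\n    ..."
)

# Code-completion style stops such as "\ndef" fire as soon as the model starts
# the function definition inside the markdown fence, dropping the actual code.
print(apply_stop_strings(completion, ["\nclass", "\ndef", "\nif", "\nprint"]))
```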

@baberabb (Contributor)

Hi! What model are you using?

@Shiguang-Guo (Author)

I use Meta-Llama-3.1-8B-Instruct and run it with:

```bash
lm_eval --model vllm \
  --model_args="pretrained=${model_name},dtype=auto,tensor_parallel_size=${GPUS_PER_NODE},max_model_len=16384,gpu_memory_utilization=0.9,enable_chunked_prefill=True" \
  --tasks=humaneval \
  --batch_size=auto \
  --output_path results \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --gen_kwargs="stop_token_ids=[128009]"
```

baberabb linked a pull request on Jan 22, 2025 that will close this issue
@baberabb (Contributor) commented Jan 22, 2025

Looks like a prompting/answer-extraction issue. I added the prompt from the Llama evals (as humaneval_instruct) in the PR, but the score is still lower than the official one (0.5976 vs. 0.726).
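
For reference, answer extraction for an instruct-style HumanEval task means pulling the code out of the markdown fence before running the tests. This is only an illustrative sketch; the actual prompt and filter are whatever the humaneval_instruct task in the PR defines:

```python
# Illustrative sketch of fenced-code extraction from a chat response
# (not the filter used by the humaneval_instruct task in the PR).
import re

FENCE = "`" * 3  # markdown code fence, built here to avoid a literal fence in this snippet
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the contents of the first fenced code block, or the raw response."""
    match = CODE_BLOCK.search(response)
    return match.group(1) if match else response


response = (
    "Here's a Python function:\n\n"
    + FENCE + "python\nfrom typing import List\n\ndef add(a, b):\n    return a + b\n" + FENCE
)
print(extract_code(response))  # only the code inside the fence is kept
```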
