Question about humaneval #2648

Open
Shiguang-Guo opened this issue Jan 22, 2025 · 3 comments · May be fixed by #2650

@Shiguang-Guo

I tried to evaluate humaneval on meta-llama-3.1-instruct, but got a score close to 0. I printed the output and found:

```json
{
    "resps": [
        [
            "Here's a Python function that implements the required functionality:\n\n```python\nfrom typing import List\n"
        ]
    ]
}
```

I think this may be due to generation_kwargs.until in the configuration. So what is the correct way to evaluate?
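
For illustration, here is a minimal sketch (not the harness's actual code) of how stop strings like those in `generation_kwargs.until` can cut a chat-style completion off right after the opening code fence. The stop strings below are hypothetical, chosen only to roughly reproduce the truncation seen in `resps` above:

```python
# Minimal sketch (not lm-evaluation-harness code): applying an `until`-style
# stop list to a chat completion. Stop strings here are hypothetical examples.

FENCE = "`" * 3  # markdown code fence, built here to avoid a literal fence in this snippet


def apply_stop_strings(text: str, stop_strings: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


completion = (
    "Here's a Python function that implements the required functionality:\n\n"
    + FENCE
    + "python\nfrom typing import List\n\n\ndef has_close_elements(numbers, threshold):\n    ..."
)

# Code-completion style stops such as "\ndef" fire as soon as the model starts
# the function definition inside the markdown fence, dropping the actual code.
print(apply_stop_strings(completion, ["\nclass", "\ndef", "\nif", "\nprint"]))
```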

@baberabb (Contributor)

Hi! What model are you using?

@Shiguang-Guo (Author)

I use Meta-Llama-3.1-8B-Instruct and run it with:

```bash
lm_eval --model vllm \
  --model_args="pretrained=${model_name},dtype=auto,tensor_parallel_size=${GPUS_PER_NODE},max_model_len=16384,gpu_memory_utilization=0.9,enable_chunked_prefill=True" \
  --tasks=humaneval \
  --batch_size=auto \
  --output_path results \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --gen_kwargs="stop_token_ids=[128009]"
```

baberabb linked a pull request on Jan 22, 2025 that will close this issue
@baberabb (Contributor) commented Jan 22, 2025

Looks like a prompting/answer-extraction issue. I added the prompt from the Llama evals (as humaneval_instruct) in the PR, but the score is still lower than the official one (0.5976 vs. 0.726).
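
For reference, answer extraction for an instruct-style HumanEval task means pulling the code out of the markdown fence before running the tests. This is only an illustrative sketch; the actual prompt and filter are whatever the humaneval_instruct task in the PR defines:

```python
# Illustrative sketch of fenced-code extraction from a chat response
# (not the filter used by the humaneval_instruct task in the PR).
import re

FENCE = "`" * 3  # markdown code fence, built here to avoid a literal fence in this snippet
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the contents of the first fenced code block, or the raw response."""
    match = CODE_BLOCK.search(response)
    return match.group(1) if match else response


response = (
    "Here's a Python function:\n\n"
    + FENCE + "python\nfrom typing import List\n\ndef add(a, b):\n    return a + b\n" + FENCE
)
print(extract_code(response))  # only the code inside the fence is kept
```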
