
Evaluation script fails to work for Qwen-Math models. #174

Open

ChenDRAG opened this issue Feb 4, 2025 · 4 comments · May be fixed by #297

Comments


ChenDRAG commented Feb 4, 2025

I tried running the Qwen-Math models instead of the Qwen models, but found that the evaluation script does not work for Qwen-Math:

ValueError: User-specified max_model_len (32768) is greater than the derived max_model_len (max_position_embeddings=4096 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

MODEL=Qwen/Qwen2.5-Math-1.5B-Instruct
MODEL=Qwen/Qwen2.5-Math-1.5B

NUM_GPUS=1
MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL
CUDA_VISIBLE_DEVICES=1 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --output-dir $OUTPUT_DIR 

I hope you can help me out.

@Maxwell-Jia

@ChenDRAG The problem you’re encountering is due to a mismatch between the model’s maximum supported context length and the value you’ve specified for max_model_length in MODEL_ARGS. The Qwen2.5 (and -Instruct) models support a maximum context length of 32768, but the Qwen2.5-Math models have a much lower limit of 4096. This discrepancy is causing the error you’re seeing.
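
One quick way to confirm these limits is to read max_position_embeddings straight from each model's config.json on the Hub (a minimal check, assuming curl and grep are available and the repos are public; the expected numbers are the ones quoted above):

# Qwen2.5-Math models report max_position_embeddings = 4096
curl -s https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/raw/main/config.json | grep max_position_embeddings
curl -s https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct/raw/main/config.json | grep max_position_embeddings

# The general-purpose Qwen2.5-1.5B-Instruct model reports 32768, per the comment above
curl -s https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/raw/main/config.json | grep max_position_embeddings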

To resolve this issue, set the max_model_length parameter to no more than 4096 when working with the Qwen2.5-Math models:

MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=4096,gpu_memory=..."

You can also override this maximum context length by setting the environment variable VLLM_ALLOW_LONG_MAX_MODEL_LEN=1, as shown in your ValueError message. This allows the model to accept context lengths larger than the model’s default maximum (4096 for the Math models). You can try adding this line to your environment configuration:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

However, this approach is not recommended because it could lead to incorrect model outputs or CUDA errors due to the mismatch in the expected maximum sequence length. Forcing a larger context length may result in performance issues or even instability, especially when working with Math models that are designed to handle only shorter sequences.


ChenDRAG commented Feb 4, 2025

@Maxwell-Jia Thank you so much for your response. I tried setting max_model_length to 4096. However, another issue arises:

[rank0]: ValueError: please provide at least one prompt

[rank0]: │ /home/huayuc/miniconda3/envs/openr1/lib/python3.11/site-packages/vllm/entrypoints/llm.py:1249 in │
[rank0]: │ _convert_v1_inputs                                                                               │
[rank0]: │                                                                                                  │
[rank0]: │   1246 │   │   │   prompts = [p["content"] for p in parse_and_batch_prompt(prompts)]             │
[rank0]: │   1247 │   │   if prompt_token_ids is not None:                                                  │
[rank0]: │   1248 │   │   │   prompt_token_ids = [                                                          │
[rank0]: │ ❱ 1249 │   │   │   │   p["content"] for p in parse_and_batch_prompt(prompt_token_ids)            │
[rank0]: │   1250 │   │   │   ]                                                                             │
[rank0]: │   1251 │   │                                                                                     │
[rank0]: │   1252 │   │   num_requests = None                                                               │
[rank0]: │                                                                                                  │
[rank0]: │ ╭──────────────────────────────── locals ────────────────────────────────╮                       │
[rank0]: │ │ prompt_token_ids = [[], [], [], [], [], [], [], [], [], [], ... +490]  │                       │
[rank0]: │ │          prompts = None                                                │                       │
[rank0]: │ │             self = <vllm.entrypoints.llm.LLM object at 0x7f47d1dca490> │                       │
[rank0]: │ ╰────────────────────────────────────────────────────────────────────────╯                       │
[rank0]: │                                                                                                  │
[rank0]: │ /home/huayuc/miniconda3/envs/openr1/lib/python3.11/site-packages/vllm/inputs/parse.py:58 in      │
[rank0]: │ parse_and_batch_prompt                                                                           │
[rank0]: │                                                                                                  │
[rank0]: │    55 │   │   if is_list_of(prompt, list):                                                       │
[rank0]: │    56 │   │   │   prompt = cast(List[List[int]], prompt)                                         │
[rank0]: │    57 │   │   │   if len(prompt[0]) == 0:                                                        │
[rank0]: │ ❱  58 │   │   │   │   raise ValueError("please provide at least one prompt")                     │
[rank0]: │    59 │   │   │                                                                                  │
[rank0]: │    60 │   │   │   if is_list_of(prompt[0], int):                                                 │
[rank0]: │    61 │   │   │   │   # case 4: array of token arrays                                            │
[rank0]: │                                                                                                  │
[rank0]: │ ╭────────────────────────── locals ───────────────────────────╮                                  │
[rank0]: │ │ prompt = [[], [], [], [], [], [], [], [], [], [], ... +490] │                                  │
[rank0]: │ ╰─────────────────────────────────────────────────────────────╯                                  │
[rank0]: ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
[rank0]: ValueError: please provide at least one prompt
[rank0]:[W203 21:19:49.205203614 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

I used the following command:

MODEL=Qwen/Qwen2.5-Math-1.5B

NUM_GPUS=1
MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=4096,gpu_memory_utilisation=0.8"
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=2 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --output-dir $OUTPUT_DIR 

Any ideas, please?
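
As a side note, the locals in the traceback show prompt_token_ids arriving as a list of empty lists, so it can help to confirm that the model itself loads and generates with a 4096-token window outside of lighteval; if the snippet below works, the problem is more likely in how the evaluation harness builds or truncates the prompts. This is only a minimal sanity check, assuming the same environment, and the prompt is just a placeholder:

python - <<'EOF'
from vllm import LLM, SamplingParams

# Load the Math model with the 4096-token limit from its config.json
llm = LLM(model="Qwen/Qwen2.5-Math-1.5B", dtype="float16", max_model_len=4096)

# A throwaway prompt, just to confirm tokenization and generation work end to end
outputs = llm.generate(["1 + 1 ="], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
EOF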


ChenDRAG commented Feb 4, 2025

Also, I have another question. The Qwen-Math series has a context length limit of 4096, and deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B is fine-tuned from Qwen-Math-Base. Why doesn't DeepSeek-R1-Distill-Qwen-1.5B have the same 4096 context limit, as the official demonstration shows?

@jiaxiang-wu

@ChenDRAG I think DeepSeek extended the context length of the Qwen2.5-Math models from 4096 to 131072 during the distillation process, although they do not explicitly mention this in their technical report.

Qwen2.5-MATH-1.5B
https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/blob/main/config.json#L12

DeepSeek-R1-Distill-Qwen-1.5B
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/config.json#L12
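
For completeness, the two linked config.json files can also be compared from the command line; per the links above, the Math model reports 4096 while the distilled model reports 131072 (a quick check, assuming curl and grep):

curl -s https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/raw/main/config.json | grep max_position_embeddings
curl -s https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/raw/main/config.json | grep max_position_embeddings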

Some-random linked pull request #297 on Feb 12, 2025 that will close this issue.