
Evaluation script fails to work for Qwen-Math models. #174

Open

ChenDRAG opened this issue Feb 4, 2025 · 4 comments · May be fixed by #297

Comments


ChenDRAG commented Feb 4, 2025

I tried running the Qwen-Math models instead of the Qwen models, but found that the evaluation script does not work for Qwen-Math:

ValueError: User-specified max_model_len (32768) is greater than the derived max_model_len (max_position_embeddings=4096 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

MODEL=Qwen/Qwen2.5-Math-1.5B-Instruct
MODEL=Qwen/Qwen2.5-Math-1.5B

NUM_GPUS=1
MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL
CUDA_VISIBLE_DEVICES=1 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --output-dir $OUTPUT_DIR 

I hope you can help me out.

@Maxwell-Jia

@ChenDRAG The problem you’re encountering is due to a mismatch between the model’s maximum supported context length and the value you’ve specified for max_model_length in MODEL_ARGS. The Qwen2.5 (and -Instruct) models support a maximum context length of 32768, but the Qwen2.5-Math models have a much lower limit of 4096. This discrepancy is causing the error you’re seeing.
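
One quick way to confirm these limits is to read max_position_embeddings straight from each model's config.json on the Hub (a minimal check, assuming curl and grep are available and the repos are public; the expected numbers are the ones quoted above):

# Qwen2.5-Math models report max_position_embeddings = 4096
curl -s https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/raw/main/config.json | grep max_position_embeddings
curl -s https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct/raw/main/config.json | grep max_position_embeddings

# The general-purpose Qwen2.5-1.5B-Instruct model reports 32768, per the comment above
curl -s https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/raw/main/config.json | grep max_position_embeddings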

To resolve this issue, set the max_model_length parameter to no more than 4096 when working with the Qwen2.5-Math models:

MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=4096,gpu_memory=..."

You can also override this maximum context length by setting the environment variable VLLM_ALLOW_LONG_MAX_MODEL_LEN=1, as shown in your ValueError message. This allows the model to accept context lengths larger than the model’s default maximum (4096 for the Math models). You can try adding this line to your environment configuration:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

However, this approach is not recommended because it could lead to incorrect model outputs or CUDA errors due to the mismatch in the expected maximum sequence length. Forcing a larger context length may result in performance issues or even instability, especially when working with Math models that are designed to handle only shorter sequences.


ChenDRAG commented Feb 4, 2025

@Maxwell-Jia Thank you so much for your response. I tried setting max_model_length to 4096. However, another issue arises:

[rank0]: ValueError: please provide at least one prompt

[rank0]: │ /home/huayuc/miniconda3/envs/openr1/lib/python3.11/site-packages/vllm/entrypoints/llm.py:1249 in │
[rank0]: │ _convert_v1_inputs                                                                               │
[rank0]: │                                                                                                  │
[rank0]: │   1246 │   │   │   prompts = [p["content"] for p in parse_and_batch_prompt(prompts)]             │
[rank0]: │   1247 │   │   if prompt_token_ids is not None:                                                  │
[rank0]: │   1248 │   │   │   prompt_token_ids = [                                                          │
[rank0]: │ ❱ 1249 │   │   │   │   p["content"] for p in parse_and_batch_prompt(prompt_token_ids)            │
[rank0]: │   1250 │   │   │   ]                                                                             │
[rank0]: │   1251 │   │                                                                                     │
[rank0]: │   1252 │   │   num_requests = None                                                               │
[rank0]: │                                                                                                  │
[rank0]: │ ╭──────────────────────────────── locals ────────────────────────────────╮                       │
[rank0]: │ │ prompt_token_ids = [[], [], [], [], [], [], [], [], [], [], ... +490]  │                       │
[rank0]: │ │          prompts = None                                                │                       │
[rank0]: │ │             self = <vllm.entrypoints.llm.LLM object at 0x7f47d1dca490> │                       │
[rank0]: │ ╰────────────────────────────────────────────────────────────────────────╯                       │
[rank0]: │                                                                                                  │
[rank0]: │ /home/huayuc/miniconda3/envs/openr1/lib/python3.11/site-packages/vllm/inputs/parse.py:58 in      │
[rank0]: │ parse_and_batch_prompt                                                                           │
[rank0]: │                                                                                                  │
[rank0]: │    55 │   │   if is_list_of(prompt, list):                                                       │
[rank0]: │    56 │   │   │   prompt = cast(List[List[int]], prompt)                                         │
[rank0]: │    57 │   │   │   if len(prompt[0]) == 0:                                                        │
[rank0]: │ ❱  58 │   │   │   │   raise ValueError("please provide at least one prompt")                     │
[rank0]: │    59 │   │   │                                                                                  │
[rank0]: │    60 │   │   │   if is_list_of(prompt[0], int):                                                 │
[rank0]: │    61 │   │   │   │   # case 4: array of token arrays                                            │
[rank0]: │                                                                                                  │
[rank0]: │ ╭────────────────────────── locals ───────────────────────────╮                                  │
[rank0]: │ │ prompt = [[], [], [], [], [], [], [], [], [], [], ... +490] │                                  │
[rank0]: │ ╰─────────────────────────────────────────────────────────────╯                                  │
[rank0]: ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
[rank0]: ValueError: please provide at least one prompt
[rank0]:[W203 21:19:49.205203614 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

I used the following command:

MODEL=Qwen/Qwen2.5-Math-1.5B

NUM_GPUS=1
MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=4096,gpu_memory_utilisation=0.8"
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=2 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --output-dir $OUTPUT_DIR 

Any ideas, please?
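
As a side note, the locals in the traceback show prompt_token_ids arriving as a list of empty lists, so it can help to confirm that the model itself loads and generates with a 4096-token window outside of lighteval; if the snippet below works, the problem is more likely in how the evaluation harness builds or truncates the prompts. This is only a minimal sanity check, assuming the same environment, and the prompt is just a placeholder:

python - <<'EOF'
from vllm import LLM, SamplingParams

# Load the Math model with the 4096-token limit from its config.json
llm = LLM(model="Qwen/Qwen2.5-Math-1.5B", dtype="float16", max_model_len=4096)

# A throwaway prompt, just to confirm tokenization and generation work end to end
outputs = llm.generate(["1 + 1 ="], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
EOF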


ChenDRAG commented Feb 4, 2025

Also, I have another question. The Qwen-Math series has a context length limit of 4096, and deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B is fine-tuned from Qwen-Math-Base. Why doesn't DeepSeek-R1-Distill-Qwen-1.5B have the same 4096 context limit, as the official demonstration shows?

@jiaxiang-wu

@ChenDRAG I think DeepSeek extended the context length of the Qwen2.5-Math models from 4096 to 131072 during the distillation process, although they do not explicitly mention this in their technical report.

Qwen2.5-MATH-1.5B
https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/blob/main/config.json#L12

DeepSeek-R1-Distill-Qwen-1.5B
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/config.json#L12
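
For completeness, the two linked config.json files can also be compared from the command line; per the links above, the Math model reports 4096 while the distilled model reports 131072 (a quick check, assuming curl and grep):

curl -s https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/raw/main/config.json | grep max_position_embeddings
curl -s https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/raw/main/config.json | grep max_position_embeddings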

Some-random linked pull request #297 on Feb 12, 2025 that will close this issue.