Qwen2.5-7B-Instruct baseline results are significantly lower than reported in Table 1 #10

@IcyFeather233

Description

I attempted to reproduce the Qwen2.5-7B-Instruct baseline results reported in Table 1 of the paper using the released evaluation code and dataset. My results are significantly lower than the paper's reported numbers across all benchmarks.

Environment

  • Model: Qwen2.5-7B-Instruct (no fine-tuning, used as-is)
  • Evaluation script: examples/eval_ssp.sh (adapted to skip checkpoint merging, directly set
    actor_rollout_ref.model.path to the HuggingFace model)
  • Test set: test.parquet from Quark-LLM/SSP
    (3,125 samples total)
  • Search backend: local dense retriever with e5-base-v2 + wiki-18 FAISS index (as described in README)
  • Judge model: Qwen2.5-32B-Instruct via LLM-as-a-Judge
  • VAL_N=1, VAL_TEMPERATURE=0.0 (greedy decoding)
  • MAX_TURNS=10
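
One note on the decoding settings: even though the script passes `do_sample=True`, `VAL_TEMPERATURE=0.0` should still give greedy decoding, since (as far as I can tell) sglang and most inference engines special-case temperature 0 as pure argmax. A minimal sketch of why temperature 0 collapses sampling onto the argmax token:

```python
import math

def sample_probs(logits, temperature):
    """Softmax over logits / temperature. As temperature -> 0 the
    distribution collapses onto the argmax token (greedy decoding)."""
    if temperature == 0.0:
        # Inference engines typically special-case T=0 as pure argmax.
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=logits.__getitem__)] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

So unless the paper's baseline sampled at a nonzero temperature, decoding strategy should not explain the gap.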

My Results

| Dataset | reward/mean@1 | EM | F1 |
| --- | --- | --- | --- |
| NQ | 0.378 | 0.218 | 0.309 |
| TriviaQA | 0.546 | 0.422 | 0.503 |
| PopQA | 0.266 | 0.224 | 0.263 |
| HotpotQA | 0.272 | 0.190 | 0.253 |
| 2WikiMultihopQA | 0.176 | 0.150 | 0.190 |
| MuSiQue | 0.106 | 0.050 | 0.111 |
| Bamboogle | 0.328 | 0.240 | 0.315 |
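
For quick comparison against the paper, here are the unweighted macro averages over the seven datasets (Table 1 may of course average differently, e.g. per-sample rather than per-dataset):

```python
# My per-dataset results from the table above: (reward/mean@1, EM, F1)
results = {
    "NQ":              (0.378, 0.218, 0.309),
    "TriviaQA":        (0.546, 0.422, 0.503),
    "PopQA":           (0.266, 0.224, 0.263),
    "HotpotQA":        (0.272, 0.190, 0.253),
    "2WikiMultihopQA": (0.176, 0.150, 0.190),
    "MuSiQue":         (0.106, 0.050, 0.111),
    "Bamboogle":       (0.328, 0.240, 0.315),
}

metrics = ["reward/mean@1", "EM", "F1"]
# Unweighted mean across datasets for each metric column.
macro = {m: sum(v[i] for v in results.values()) / len(results)
         for i, m in enumerate(metrics)}
for m in metrics:
    print(f"{m}: {macro[m]:.3f}")
```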

Questions

  1. Which metric does Table 1 report? Is it reward/mean@1 (which appears to be a soft LLM-judge score), EM, or F1? The paper mentions multiple scoring methods, and the table caption does not make clear which one is reported.

  2. What search backend was used for the baseline evaluation? Was it the same local dense retriever (e5-base-v2 + wiki-18) described in the README, or a different retrieval system?

  3. What judge model was used during evaluation? The reward uses LLM-as-a-Judge; the judge model and its configuration can significantly affect scores.

  4. Is the test set the same? The released test.parquet contains 3,125 samples. Is this the identical test set used in the paper?

  5. Any other configuration differences? For example, MAX_TURNS, MAX_RESPONSE_LENGTH, or chat template settings that might affect the Qwen2.5-7B-Instruct baseline specifically.
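
On question 1 specifically: my EM and F1 columns were computed with the usual SQuAD-style convention (lowercase, strip punctuation and articles, token-level F1). This is my assumption about the scoring convention, not something I have confirmed against the repo's scorer; a sketch for clarity:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    remove articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> int:
    return int(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)  # token-level overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

If the paper instead reports the soft LLM-judge score, that alone could explain a large part of the gap, since the judge credits paraphrased answers that EM rejects.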

Any clarification would be greatly appreciated. Thank you for the great work!

Evaluation Script (examples/eval_ssp.sh)

#!/bin/bash

set -x
ulimit -n 65535

export WORKSPACE=/path/to/SSP
export REPO_DIR=$WORKSPACE/quarl
export PYTHONPATH=$WORKSPACE/verl:$REPO_DIR

EVAL_STEP=${1:-100}

CHECKPOINT_DIR=/path/to/SSP/checkpoints
FSDP_CKPT_PATH=${CHECKPOINT_DIR}/global_step_${EVAL_STEP}/actor
HF_MODEL_PATH=${CHECKPOINT_DIR}/global_step_${EVAL_STEP}/actor_hf

if [ ! -d "${FSDP_CKPT_PATH}" ]; then
    echo "ERROR: Checkpoint not found: ${FSDP_CKPT_PATH}"
    echo "Available checkpoints:"
    ls ${CHECKPOINT_DIR} | grep global_step | sort -t_ -k3 -n
    exit 1
fi

if [ ! -d "${HF_MODEL_PATH}" ]; then
    echo "============================================"
    echo "  HF model not found, merging FSDP checkpoint..."
    echo "  Source: ${FSDP_CKPT_PATH}"
    echo "  Target: ${HF_MODEL_PATH}"
    echo "============================================"
    python -m verl.model_merger merge \
        --backend fsdp \
        --local_dir ${FSDP_CKPT_PATH} \
        --target_dir ${HF_MODEL_PATH}
    if [ $? -ne 0 ]; then
        echo "ERROR: Model merging failed!"
        exit 1
    fi
    echo "Merging done."
else
    echo "HF model already exists at ${HF_MODEL_PATH}, skipping merge."
fi

echo "============================================"
echo "  Evaluating checkpoint: global_step_${EVAL_STEP}"
echo "  Model path: ${HF_MODEL_PATH}"
echo "============================================"

export SOURCE_ACTOR_CHECKPOINT_DIR=/path/to/model_infer
export SOURCE_ACTOR_CHECKPOINT_ITERATION_DIRNAME=Qwen2.5-7B-Instruct
export SOURCE_REWARD_CHECKPOINT_DIR=/placeholder
export SOURCE_REWARD_CHECKPOINT_ITERATION_DIRNAME=placeholder
export SOURCE_CRITIC_CHECKPOINT_DIR=/path/to/model_infer
export SOURCE_CRITIC_CHECKPOINT_ITERATION_DIRNAME=Qwen2.5-7B-Instruct

export TEST_DATA_PATH="/path/to/dataset/SSP/test"
export DATA_PATH="/path/to/dataset/SSP/test"

export EXP_NAME="eval_step${EVAL_STEP}"
BASE_OUTPUT_DIR=/path/to/SSP/output
BASE_SAVE_CHECKPOINT_DIR=/path/to/SSP/checkpoints
BASE_TENSORBOARD_LOG_DIR=/path/to/SSP/tensorboard/logs

export OUTPUT_DIR="${BASE_OUTPUT_DIR}_${EXP_NAME}"
export SAVE_CHECKPOINT_DIR="${BASE_SAVE_CHECKPOINT_DIR}_${EXP_NAME}"
export TENSORBOARD_LOG_DIR="${BASE_TENSORBOARD_LOG_DIR}_${EXP_NAME}"

export NNODES=${NNODES:-${PET_NNODES:-1}}
export RANK=${RANK:-${PET_NODE_RANK:-0}}
export MASTER_ADDR=${MASTER_ADDR:-${PET_MASTER_ADDR:-localhost}}
export MASTER_PORT=${MASTER_PORT:-${PET_MASTER_PORT:-6379}}

export QUARK_BASE_URL=http://your-judge-service:5000/v1
export QUARK_MODEL=qwen25-32b
export QUARK_SEARCH_CHAT_TEMPLATE=default
export SEARCH_IP=your-search-service-ip

export WORKSPACE=/path/to/SSP
export LOG_LEVEL=DEBUG
export SELF_PLAY_DEBUG=True

export NCCL_TIMEOUT=72000000
export TORCH_DISTRIBUTED_TIMEOUT=72000
export NCCL_WORK_FIFO_DEPTH=4194304
export LANG=C.UTF-8
export LANGUAGE=C.UTF-8

export PYTHONPATH=$WORKSPACE/verl
export REPO_DIR=$WORKSPACE/quarl
export PYTHONPATH=$PYTHONPATH:$REPO_DIR
export TENSORBOARD_DIR=${TENSORBOARD_LOG_DIR}

ACTOR_PATH=${HF_MODEL_PATH}
REWARD_PATH=${SOURCE_REWARD_CHECKPOINT_DIR}/${SOURCE_REWARD_CHECKPOINT_ITERATION_DIRNAME}
CRITIC_PATH=${HF_MODEL_PATH}

read -ra TEST_PATHS <<< "$TEST_DATA_PATH"
TEST_FILES=()
for path in "${TEST_PATHS[@]}"; do
    TEST_FILES+=("${path}.parquet")
done
TEST_FILES_STR="[$(IFS=,; echo "${TEST_FILES[*]}")]"

read -ra TRAIN_PATHS <<< "$DATA_PATH"
TRAIN_FILES=()
for path in "${TRAIN_PATHS[@]}"; do
    TRAIN_FILES+=("${path}.parquet")
done
TRAIN_FILES_STR="[$(IFS=,; echo "${TRAIN_FILES[*]}")]"

TOOL_CONFIG=${WORKSPACE}/examples/sglang_multiturn/config/tool_config/search_tool_config.yaml
MAX_TURNS=10

TRAIN_METHOD=grpo
BATCH_SIZE=256
MAX_PROMPT_LENGTH=4096
MAX_RESPONSE_LENGTH=8192
PPO_MAX_TOKEN_LEN_PER_GPU=$(( MAX_PROMPT_LENGTH + MAX_RESPONSE_LENGTH ))
PPO_MICRO_BATCH_PER_GPU=1
LOG_PROB_MIRCO_BATCH_SIZE_PER_GPU=1
REWARD_MIRCO_BATCH_SIZE_PER_GPU=1
CRITIC_MIRCO_BATCH_SIZE_PER_GPU=1
ROLLOUT_TP=4
ACTOR_SP=1
REWARD_SP=1
CRITIC_SP=1
ACTOR_FSDP_SIZE=-1
FSDP_TYPE=fsdp
REWARD_MODEL_ENABLE=False

VAL_N=1
VAL_TEMPERATURE=0.0

ROLLOUT_N=5
ADV_ESTIMATOR=grpo
ACTOR_USE_KL_LOSS=True
USE_KL_IN_REWARD=False

TASK_TYPE=quark_deep_search
RM_MANAGER=quark
CUSTOM_RM_ARGS="reward_model.reward_manager=${RM_MANAGER} \
    +custom_reward_functions.quark_score.labels=['unknown'] \
    +custom_reward_functions.quark_score.integration=sum \
    quark.diff_val_reward_fn_config.reward_model.reward_manager=naive_with_prompt \
    quark.diff_val_reward_fn_config.custom_reward_function.path=$REPO_DIR/reward/score/search_eval_score.py \
    quark.diff_val_reward_fn_config.custom_reward_function.name=compute_score"

mkdir -p $OUTPUT_DIR/logs
mkdir -p $OUTPUT_DIR/val

echo "Output dir   : $OUTPUT_DIR"
echo "TensorBoard  : $TENSORBOARD_LOG_DIR"

export NODE_RANK=${RANK}
export HEAD_STARTUP_WAIT_SECONDS=${HEAD_STARTUP_WAIT_SECONDS:-15}
export HEAD_WORKER_JOIN_WAIT_SECONDS=${HEAD_WORKER_JOIN_WAIT_SECONDS:-120}
export WORKER_CONNECT_RETRY_INTERVAL_SECONDS=${WORKER_CONNECT_RETRY_INTERVAL_SECONDS:-5}
export WORKER_CONNECT_MAX_RETRIES=${WORKER_CONNECT_MAX_RETRIES:-60}

RESOLVED_MASTER_ADDR=$(getent ahostsv4 "${MASTER_ADDR}" 2>/dev/null | awk 'NR==1 {print $1}')
if [[ -n "${RESOLVED_MASTER_ADDR}" ]]; then
    export RAY_MASTER_ADDR="${RESOLVED_MASTER_ADDR}"
else
    export RAY_MASTER_ADDR="${MASTER_ADDR}"
fi

echo "Distributed config:"
echo "  NNODES=${NNODES}, NODE_RANK=${NODE_RANK}"
echo "  MASTER_ADDR=${MASTER_ADDR}, RAY_MASTER_ADDR=${RAY_MASTER_ADDR}, MASTER_PORT=${MASTER_PORT}"

if [ $NODE_RANK -eq 0 ]; then
    ray start --head --node-ip-address=${RAY_MASTER_ADDR} --port=${MASTER_PORT}

    sleep ${HEAD_STARTUP_WAIT_SECONDS}

    if [ "${NNODES}" -ne 1 ]; then
        sleep ${HEAD_WORKER_JOIN_WAIT_SECONDS}
    fi

    python3 -m quarl.main_rl \
        quark.task_type=$TASK_TYPE \
        algorithm.adv_estimator=${ADV_ESTIMATOR} \
        data.train_files=${TRAIN_FILES_STR} \
        data.val_files=${TEST_FILES_STR} \
        data.train_batch_size=${BATCH_SIZE} \
        data.val_batch_size=${BATCH_SIZE} \
        data.max_prompt_length=${MAX_PROMPT_LENGTH} \
        data.max_response_length=${MAX_RESPONSE_LENGTH} \
        data.return_raw_chat=True \
        data.filter_overlong_prompts=True \
        data.truncation='left' \
        data.shuffle=True \
        data.custom_cls.name=null \
        actor_rollout_ref.model.path=${ACTOR_PATH} \
        actor_rollout_ref.actor.optim.lr=1e-6 \
        actor_rollout_ref.actor.optim.lr_warmup_steps=5 \
        actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.05 \
        actor_rollout_ref.model.use_remove_padding=True \
        actor_rollout_ref.actor.strategy=${FSDP_TYPE} \
        actor_rollout_ref.actor.ppo_mini_batch_size=128 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${PPO_MICRO_BATCH_PER_GPU} \
        actor_rollout_ref.actor.use_dynamic_bsz=False \
        actor_rollout_ref.actor.use_kl_loss=${ACTOR_USE_KL_LOSS} \
        actor_rollout_ref.actor.kl_loss_coef=0.01 \
        actor_rollout_ref.actor.kl_loss_type=low_var_kl \
        actor_rollout_ref.actor.entropy_coeff=0 \
        actor_rollout_ref.actor.ulysses_sequence_parallel_size=${ACTOR_SP} \
        actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$PPO_MAX_TOKEN_LEN_PER_GPU \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.fsdp_config.param_offload=True \
        actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
        actor_rollout_ref.actor.fsdp_config.offload_policy=True \
        actor_rollout_ref.actor.fsdp_config.fsdp_size=$ACTOR_FSDP_SIZE \
        actor_rollout_ref.actor.clip_ratio_high=0.285 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=${ROLLOUT_TP} \
        actor_rollout_ref.rollout.name=sglang_async \
        actor_rollout_ref.rollout.multi_turn.format=quark \
        actor_rollout_ref.rollout.multi_turn.enable=True \
        actor_rollout_ref.rollout.multi_turn.max_assistant_turns=${MAX_TURNS} \
        actor_rollout_ref.rollout.multi_turn.use_inference_chat_template=False \
        actor_rollout_ref.rollout.multi_turn.tokenization_sanity_check_mode=disable \
        actor_rollout_ref.hybrid_engine=True \
        actor_rollout_ref.rollout.max_num_batched_tokens=$PPO_MAX_TOKEN_LEN_PER_GPU \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${LOG_PROB_MIRCO_BATCH_SIZE_PER_GPU} \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
        actor_rollout_ref.rollout.n=${ROLLOUT_N} \
        actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
        actor_rollout_ref.ref.strategy=${FSDP_TYPE} \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${LOG_PROB_MIRCO_BATCH_SIZE_PER_GPU} \
        actor_rollout_ref.ref.fsdp_config.param_offload=True \
        algorithm.use_kl_in_reward=${USE_KL_IN_REWARD} \
        algorithm.kl_ctrl.kl_coef=0.001 \
        algorithm.kl_penalty='low_var_kl' \
        algorithm.gamma=1.0 \
        critic.optim.lr=1e-5 \
        critic.ppo_micro_batch_size_per_gpu=${CRITIC_MIRCO_BATCH_SIZE_PER_GPU} \
        critic.strategy=${FSDP_TYPE} \
        critic.model.use_remove_padding=True \
        critic.model.path=${CRITIC_PATH} \
        critic.model.enable_gradient_checkpointing=True \
        critic.model.fsdp_config.param_offload=True \
        critic.model.fsdp_config.optimizer_offload=True \
        critic.ulysses_sequence_parallel_size=${CRITIC_SP} \
        reward_model.enable=${REWARD_MODEL_ENABLE} \
        reward_model.strategy=${FSDP_TYPE} \
        reward_model.model.type=default \
        reward_model.model.path=${REWARD_PATH} \
        reward_model.model.use_remove_padding=True \
        reward_model.max_length=$PPO_MAX_TOKEN_LEN_PER_GPU \
        reward_model.micro_batch_size_per_gpu=${REWARD_MIRCO_BATCH_SIZE_PER_GPU} \
        reward_model.model.fsdp_config.param_offload=True \
        reward_model.ulysses_sequence_parallel_size=${REWARD_SP} \
        ${CUSTOM_RM_ARGS} \
        trainer.critic_warmup=0 \
        trainer.logger=['console','tensorboard'] \
        trainer.project_name=quark_ssp_eval \
        trainer.experiment_name=eval_step${EVAL_STEP} \
        trainer.val_before_train=True \
        trainer.val_only=True \
        trainer.resume_mode=disable \
        trainer.n_gpus_per_node=8 \
        trainer.nnodes=${NNODES} \
        trainer.save_freq=999999 \
        trainer.test_freq=1 \
        trainer.default_local_dir=${SAVE_CHECKPOINT_DIR} \
        trainer.rollout_data_dir=$OUTPUT_DIR/rollout \
        trainer.validation_data_dir=$OUTPUT_DIR/val \
        trainer.total_epochs=1 \
        actor_rollout_ref.rollout.val_kwargs.n=${VAL_N} \
        actor_rollout_ref.rollout.val_kwargs.temperature=${VAL_TEMPERATURE} \
        actor_rollout_ref.rollout.val_kwargs.top_k=-1 \
        actor_rollout_ref.rollout.val_kwargs.top_p=1.0 \
        actor_rollout_ref.rollout.val_kwargs.do_sample=True \
        +actor_rollout_ref.rollout.val_kwargs.frequency_penalty=0 \
        +actor_rollout_ref.rollout.val_kwargs.repetition_penalty=1 \
        +actor_rollout_ref.rollout.repetition_penalty=1 \
        self_play.enable=True \
        self_play.lang=${LANG} \
        self_play.save_freq=999999 \
        self_play.use_rag_filter=True \
        self_play.noisy_RAG_materials=4 \
        self_play.proposer.enable=True \
        self_play.proposer.warm_up_steps=-1 \
        self_play.proposer.format_penalty=0 \
        self_play.proposer.n=1 \
        self_play.proposer.reward_type=1-acc \
        self_play.proposer.adv_estimator=grpo \
        self_play.proposer.right=1.0 \
        self_play.proposer.left=0.0 \
        self_play.extraction_failure.strategy=reuse \
        self_play.solver.enable=True \
        self_play.dynamic_sampling.enable=False \
        self_play.reward_dynamic_sampling.enable=False \
        self_play.use_search_terms_filter=False \
        self_play.extraction_failure.reuse_success_rate_threshold=1.0 \
        self_play.combine_update=False \
        self_play.mini_epochs=1 \
        self_play.extraction_failure.pool_clear_interval=10 \
        self_play.answer_pattern=question \
        self_play.validate_config=True \
        self_play.extraction_failure.keep_ratio=0

else
    retry_count=0
    while true; do
        echo "Worker rank ${NODE_RANK} attempting to join Ray head at ${RAY_MASTER_ADDR}:${MASTER_PORT} (attempt $((retry_count + 1)))"
        ray start --address=${RAY_MASTER_ADDR}:${MASTER_PORT} --block && break

        retry_count=$((retry_count + 1))
        if [ ${retry_count} -ge ${WORKER_CONNECT_MAX_RETRIES} ]; then
            echo "Worker rank ${NODE_RANK} failed to connect to Ray head after ${retry_count} attempts"
            exit 1
        fi

        echo "Worker rank ${NODE_RANK} failed to connect, retrying in ${WORKER_CONNECT_RETRY_INTERVAL_SECONDS}s"
        sleep ${WORKER_CONNECT_RETRY_INTERVAL_SECONDS}
    done
fi
