## Description
I attempted to reproduce the Qwen2.5-7B-Instruct baseline results reported in Table 1 of the paper using the released evaluation code and dataset. My results are significantly lower than the paper's reported numbers across all benchmarks.
## Environment

- Model: Qwen2.5-7B-Instruct (no fine-tuning, used as-is)
- Evaluation script: `examples/eval_ssp.sh` (adapted to skip checkpoint merging and set `actor_rollout_ref.model.path` directly to the HuggingFace model; see the sketch after this list)
- Test set: `test.parquet` from Quark-LLM/SSP (3,125 samples total)
- Search backend: local dense retriever with e5-base-v2 + wiki-18 FAISS index (as described in the README)
- Judge model: Qwen2.5-32B-Instruct via LLM-as-a-Judge
- Validation decoding: `VAL_N=1`, `VAL_TEMPERATURE=0.0` (greedy)
- Turn limit: `MAX_TURNS=10`
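Since the released script assumes an FSDP checkpoint, my only change was to bypass the merge branch and point the actor and critic paths straight at the stock model. A minimal sketch of that adaptation (the local path below is illustrative):

```bash
# Adaptation of examples/eval_ssp.sh: skip the FSDP -> HF merge and use the
# released model directly (a local download or the hub ID works).
HF_MODEL_PATH=/path/to/models/Qwen2.5-7B-Instruct

# The launch command below then picks this up unchanged:
#   actor_rollout_ref.model.path=${HF_MODEL_PATH} \
#   critic.model.path=${HF_MODEL_PATH} \
```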
## My Results

| Dataset | `reward/mean@1` | EM | F1 |
|---|---|---|---|
| NQ | 0.378 | 0.218 | 0.309 |
| TriviaQA | 0.546 | 0.422 | 0.503 |
| PopQA | 0.266 | 0.224 | 0.263 |
| HotpotQA | 0.272 | 0.190 | 0.253 |
| 2WikiMultihopQA | 0.176 | 0.150 | 0.190 |
| MuSiQue | 0.106 | 0.050 | 0.111 |
| Bamboogle | 0.328 | 0.240 | 0.315 |
## Questions

- **Which metric does Table 1 report?** Is it `reward/mean@1` (which appears to be a soft LLM-judge score), EM, or F1? The paper mentions several scoring methods, and the table caption does not make the choice explicit.
- **What search backend was used for the baseline evaluation?** Was it the same local dense retriever (e5-base-v2 + wiki-18) described in the README, or a different retrieval system?
- **What judge model was used during evaluation?** The reward uses LLM-as-a-Judge, and the judge model and its configuration can significantly affect scores.
- **Is the test set the same?** The released `test.parquet` contains 3,125 samples (its per-dataset composition can be checked with the snippet after this list). Is this the identical test set used in the paper?
- **Are there any other configuration differences?** For example, `MAX_TURNS`, `MAX_RESPONSE_LENGTH`, or chat-template settings that might affect the Qwen2.5-7B-Instruct baseline specifically.
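To rule out a data mismatch on my side, I also looked at the composition of the released test split. A quick check, assuming each row carries the standard verl `data_source` field (an assumption on my part):

```bash
# Count samples per source dataset in the released test split.
# Assumes a `data_source` column, as in standard verl parquet datasets.
python3 -c "
import pandas as pd
df = pd.read_parquet('/path/to/dataset/SSP/test.parquet')
print(len(df), 'samples total')
print(df['data_source'].value_counts())
"
```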
Any clarification would be greatly appreciated. Thank you for the great work!
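One more detail that may help narrow things down: the script consumes the judge through what looks like an OpenAI-compatible endpoint (note the `/v1` suffix of `QUARK_BASE_URL` below). A minimal sanity check against such an endpoint, using my placeholder host and model name:

```bash
# Ping the LLM-as-a-Judge service used by the reward manager.
curl -s http://your-judge-service:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen25-32b", "messages": [{"role": "user", "content": "ping"}]}'
```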
## Evaluation Script (`examples/eval_ssp.sh`)

<details>
<summary>Click to expand</summary>

```bash
#!/bin/bash
set -x
ulimit -n 65535
export WORKSPACE=/path/to/SSP
export REPO_DIR=$WORKSPACE/quarl
export PYTHONPATH=$WORKSPACE/verl:$REPO_DIR
EVAL_STEP=${1:-100}
CHECKPOINT_DIR=/path/to/SSP/checkpoints
FSDP_CKPT_PATH=${CHECKPOINT_DIR}/global_step_${EVAL_STEP}/actor
HF_MODEL_PATH=${CHECKPOINT_DIR}/global_step_${EVAL_STEP}/actor_hf
if [ ! -d "${FSDP_CKPT_PATH}" ]; then
echo "ERROR: Checkpoint not found: ${FSDP_CKPT_PATH}"
echo "Available checkpoints:"
ls ${CHECKPOINT_DIR} | grep global_step | sort -t_ -k3 -n
exit 1
fi
if [ ! -d "${HF_MODEL_PATH}" ]; then
echo "============================================"
echo " HF model not found, merging FSDP checkpoint..."
echo " Source: ${FSDP_CKPT_PATH}"
echo " Target: ${HF_MODEL_PATH}"
echo "============================================"
python -m verl.model_merger merge \
--backend fsdp \
--local_dir ${FSDP_CKPT_PATH} \
--target_dir ${HF_MODEL_PATH}
if [ $? -ne 0 ]; then
echo "ERROR: Model merging failed!"
exit 1
fi
echo "Merging done."
else
echo "HF model already exists at ${HF_MODEL_PATH}, skipping merge."
fi
echo "============================================"
echo " Evaluating checkpoint: global_step_${EVAL_STEP}"
echo " Model path: ${HF_MODEL_PATH}"
echo "============================================"
export SOURCE_ACTOR_CHECKPOINT_DIR=/path/to/model_infer
export SOURCE_ACTOR_CHECKPOINT_ITERATION_DIRNAME=Qwen2.5-7B-Instruct
export SOURCE_REWARD_CHECKPOINT_DIR=/placeholder
export SOURCE_REWARD_CHECKPOINT_ITERATION_DIRNAME=placeholder
export SOURCE_CRITIC_CHECKPOINT_DIR=/path/to/model_infer
export SOURCE_CRITIC_CHECKPOINT_ITERATION_DIRNAME=Qwen2.5-7B-Instruct
export TEST_DATA_PATH="/path/to/dataset/SSP/test"
export DATA_PATH="/path/to/dataset/SSP/test"
export EXP_NAME="eval_step${EVAL_STEP}"
BASE_OUTPUT_DIR=/path/to/SSP/output
BASE_SAVE_CHECKPOINT_DIR=/path/to/SSP/checkpoints
BASE_TENSORBOARD_LOG_DIR=/path/to/SSP/tensorboard/logs
export OUTPUT_DIR="${BASE_OUTPUT_DIR}_${EXP_NAME}"
export SAVE_CHECKPOINT_DIR="${BASE_SAVE_CHECKPOINT_DIR}_${EXP_NAME}"
export TENSORBOARD_LOG_DIR="${BASE_TENSORBOARD_LOG_DIR}_${EXP_NAME}"
export NNODES=${NNODES:-${PET_NNODES:-1}}
export RANK=${RANK:-${PET_NODE_RANK:-0}}
export MASTER_ADDR=${MASTER_ADDR:-${PET_MASTER_ADDR:-localhost}}
export MASTER_PORT=${MASTER_PORT:-${PET_MASTER_PORT:-6379}}
export QUARK_BASE_URL=http://your-judge-service:5000/v1
export QUARK_MODEL=qwen25-32b
export QUARK_SEARCH_CHAT_TEMPLATE=default
export SEARCH_IP=your-search-service-ip
export WORKSPACE=/path/to/SSP
export LOG_LEVEL=DEBUG
export SELF_PLAY_DEBUG=True
export NCCL_TIMEOUT=72000000
export TORCH_DISTRIBUTED_TIMEOUT=72000
export NCCL_WORK_FIFO_DEPTH=4194304
export LANG=C.UTF-8
export LANGUAGE=C.UTF-8
export PYTHONPATH=$WORKSPACE/verl
export REPO_DIR=$WORKSPACE/quarl
export PYTHONPATH=$PYTHONPATH:$REPO_DIR
export TENSORBOARD_DIR=${TENSORBOARD_LOG_DIR}
ACTOR_PATH=${HF_MODEL_PATH}
REWARD_PATH=${SOURCE_REWARD_CHECKPOINT_DIR}/${SOURCE_REWARD_CHECKPOINT_ITERATION_DIRNAME}
CRITIC_PATH=${HF_MODEL_PATH}
read -ra TEST_PATHS <<< "$TEST_DATA_PATH"
TEST_FILES=()
for path in "${TEST_PATHS[@]}"; do
TEST_FILES+=("${path}.parquet")
done
TEST_FILES_STR="[$(IFS=,; echo "${TEST_FILES[*]}")]"
read -ra TRAIN_PATHS <<< "$DATA_PATH"
TRAIN_FILES=()
for path in "${TRAIN_PATHS[@]}"; do
TRAIN_FILES+=("${path}.parquet")
done
TRAIN_FILES_STR="[$(IFS=,; echo "${TRAIN_FILES[*]}")]"
TOOL_CONFIG=${WORKSPACE}/examples/sglang_multiturn/config/tool_config/search_tool_config.yaml
MAX_TURNS=10
TRAIN_METHOD=grpo
BATCH_SIZE=256
MAX_PROMPT_LENGTH=4096
MAX_RESPONSE_LENGTH=8192
PPO_MAX_TOKEN_LEN_PER_GPU=$(( MAX_PROMPT_LENGTH + MAX_RESPONSE_LENGTH ))
PPO_MICRO_BATCH_PER_GPU=1
LOG_PROB_MICRO_BATCH_SIZE_PER_GPU=1
REWARD_MICRO_BATCH_SIZE_PER_GPU=1
CRITIC_MICRO_BATCH_SIZE_PER_GPU=1
ROLLOUT_TP=4
ACTOR_SP=1
REWARD_SP=1
CRITIC_SP=1
ACTOR_FSDP_SIZE=-1
FSDP_TYPE=fsdp
REWARD_MODEL_ENABLE=False
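# Validation decoding: a single greedy rollout per prompt.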
VAL_N=1
VAL_TEMPERATURE=0.0
ROLLOUT_N=5
ADV_ESTIMATOR=grpo
ACTOR_USE_KL_LOSS=True
USE_KL_IN_REWARD=False
TASK_TYPE=quark_deep_search
RM_MANAGER=quark
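# Reward: the "quark" manager scores rollouts with LLM-as-a-Judge (served at
# QUARK_BASE_URL); validation uses a separate scorer, search_eval_score.py,
# wired in via diff_val_reward_fn_config below.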
CUSTOM_RM_ARGS="reward_model.reward_manager=${RM_MANAGER} \
+custom_reward_functions.quark_score.labels=['unknown'] \
+custom_reward_functions.quark_score.integration=sum \
quark.diff_val_reward_fn_config.reward_model.reward_manager=naive_with_prompt \
quark.diff_val_reward_fn_config.custom_reward_function.path=$REPO_DIR/reward/score/search_eval_score.py \
quark.diff_val_reward_fn_config.custom_reward_function.name=compute_score"
mkdir -p $OUTPUT_DIR/logs
mkdir -p $OUTPUT_DIR/val
echo "Output dir : $OUTPUT_DIR"
echo "TensorBoard : $TENSORBOARD_LOG_DIR"
export NODE_RANK=${RANK}
export HEAD_STARTUP_WAIT_SECONDS=${HEAD_STARTUP_WAIT_SECONDS:-15}
export HEAD_WORKER_JOIN_WAIT_SECONDS=${HEAD_WORKER_JOIN_WAIT_SECONDS:-120}
export WORKER_CONNECT_RETRY_INTERVAL_SECONDS=${WORKER_CONNECT_RETRY_INTERVAL_SECONDS:-5}
export WORKER_CONNECT_MAX_RETRIES=${WORKER_CONNECT_MAX_RETRIES:-60}
RESOLVED_MASTER_ADDR=$(getent ahostsv4 "${MASTER_ADDR}" 2>/dev/null | awk 'NR==1 {print $1}')
if [[ -n "${RESOLVED_MASTER_ADDR}" ]]; then
export RAY_MASTER_ADDR="${RESOLVED_MASTER_ADDR}"
else
export RAY_MASTER_ADDR="${MASTER_ADDR}"
fi
echo "Distributed config:"
echo " NNODES=${NNODES}, NODE_RANK=${NODE_RANK}"
echo " MASTER_ADDR=${MASTER_ADDR}, RAY_MASTER_ADDR=${RAY_MASTER_ADDR}, MASTER_PORT=${MASTER_PORT}"
if [ $NODE_RANK -eq 0 ]; then
ray start --head --node-ip-address=${RAY_MASTER_ADDR} --port=${MASTER_PORT}
sleep ${HEAD_STARTUP_WAIT_SECONDS}
if [ ! $NNODES -eq 1 ]; then
sleep ${HEAD_WORKER_JOIN_WAIT_SECONDS}
fi
python3 -m quarl.main_rl \
quark.task_type=$TASK_TYPE \
algorithm.adv_estimator=${ADV_ESTIMATOR} \
data.train_files=${TRAIN_FILES_STR} \
data.val_files=${TEST_FILES_STR} \
data.train_batch_size=${BATCH_SIZE} \
data.val_batch_size=${BATCH_SIZE} \
data.max_prompt_length=${MAX_PROMPT_LENGTH} \
data.max_response_length=${MAX_RESPONSE_LENGTH} \
data.return_raw_chat=True \
data.filter_overlong_prompts=True \
data.truncation='left' \
data.shuffle=True \
data.custom_cls.name=null \
actor_rollout_ref.model.path=${ACTOR_PATH} \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=5 \
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.05 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.strategy=${FSDP_TYPE} \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${PPO_MICRO_BATCH_PER_GPU} \
actor_rollout_ref.actor.use_dynamic_bsz=False \
actor_rollout_ref.actor.use_kl_loss=${ACTOR_USE_KL_LOSS} \
actor_rollout_ref.actor.kl_loss_coef=0.01 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${ACTOR_SP} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$PPO_MAX_TOKEN_LEN_PER_GPU \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.actor.fsdp_config.offload_policy=True \
actor_rollout_ref.actor.fsdp_config.fsdp_size=$ACTOR_FSDP_SIZE \
actor_rollout_ref.actor.clip_ratio_high=0.285 \
actor_rollout_ref.rollout.tensor_model_parallel_size=${ROLLOUT_TP} \
actor_rollout_ref.rollout.name=sglang_async \
actor_rollout_ref.rollout.multi_turn.format=quark \
actor_rollout_ref.rollout.multi_turn.enable=True \
actor_rollout_ref.rollout.multi_turn.max_assistant_turns=${MAX_TURNS} \
actor_rollout_ref.rollout.multi_turn.use_inference_chat_template=False \
actor_rollout_ref.rollout.multi_turn.tokenization_sanity_check_mode=disable \
actor_rollout_ref.hybrid_engine=True \
actor_rollout_ref.rollout.max_num_batched_tokens=$PPO_MAX_TOKEN_LEN_PER_GPU \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${LOG_PROB_MICRO_BATCH_SIZE_PER_GPU} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=${ROLLOUT_N} \
actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
actor_rollout_ref.ref.strategy=${FSDP_TYPE} \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${LOG_PROB_MICRO_BATCH_SIZE_PER_GPU} \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=${USE_KL_IN_REWARD} \
algorithm.kl_ctrl.kl_coef=0.001 \
algorithm.kl_penalty='low_var_kl' \
algorithm.gamma=1.0 \
critic.optim.lr=1e-5 \
critic.ppo_micro_batch_size_per_gpu=${CRITIC_MICRO_BATCH_SIZE_PER_GPU} \
critic.strategy=${FSDP_TYPE} \
critic.model.use_remove_padding=True \
critic.model.path=${CRITIC_PATH} \
critic.model.enable_gradient_checkpointing=True \
critic.model.fsdp_config.param_offload=True \
critic.model.fsdp_config.optimizer_offload=True \
critic.ulysses_sequence_parallel_size=${CRITIC_SP} \
reward_model.enable=${REWARD_MODEL_ENABLE} \
reward_model.strategy=${FSDP_TYPE} \
reward_model.model.type=default \
reward_model.model.path=${REWARD_PATH} \
reward_model.model.use_remove_padding=True \
reward_model.max_length=$PPO_MAX_TOKEN_LEN_PER_GPU \
reward_model.micro_batch_size_per_gpu=${REWARD_MICRO_BATCH_SIZE_PER_GPU} \
reward_model.model.fsdp_config.param_offload=True \
reward_model.ulysses_sequence_parallel_size=${REWARD_SP} \
${CUSTOM_RM_ARGS} \
trainer.critic_warmup=0 \
trainer.logger=['console','tensorboard'] \
trainer.project_name=quark_ssp_eval \
trainer.experiment_name=eval_step${EVAL_STEP} \
trainer.val_before_train=True \
trainer.val_only=True \
trainer.resume_mode=disable \
trainer.n_gpus_per_node=8 \
trainer.nnodes=${NNODES} \
trainer.save_freq=999999 \
trainer.test_freq=1 \
trainer.default_local_dir=${SAVE_CHECKPOINT_DIR} \
trainer.rollout_data_dir=$OUTPUT_DIR/rollout \
trainer.validation_data_dir=$OUTPUT_DIR/val \
trainer.total_epochs=1 \
actor_rollout_ref.rollout.val_kwargs.n=${VAL_N} \
actor_rollout_ref.rollout.val_kwargs.temperature=${VAL_TEMPERATURE} \
actor_rollout_ref.rollout.val_kwargs.top_k=-1 \
actor_rollout_ref.rollout.val_kwargs.top_p=1.0 \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
+actor_rollout_ref.rollout.val_kwargs.frequency_penalty=0 \
+actor_rollout_ref.rollout.val_kwargs.repetition_penalty=1 \
+actor_rollout_ref.rollout.repetition_penalty=1 \
self_play.enable=True \
self_play.lang=${LANG} \
self_play.save_freq=999999 \
self_play.use_rag_filter=True \
self_play.noisy_RAG_materials=4 \
self_play.proposer.enable=True \
self_play.proposer.warm_up_steps=-1 \
self_play.proposer.format_penalty=0 \
self_play.proposer.n=1 \
self_play.proposer.reward_type=1-acc \
self_play.proposer.adv_estimator=grpo \
self_play.proposer.right=1.0 \
self_play.proposer.left=0.0 \
self_play.extraction_failure.strategy=reuse \
self_play.solver.enable=True \
self_play.dynamic_sampling.enable=False \
self_play.reward_dynamic_sampling.enable=False \
self_play.use_search_terms_filter=False \
self_play.extraction_failure.reuse_success_rate_threshold=1.0 \
self_play.combine_update=False \
self_play.mini_epochs=1 \
self_play.extraction_failure.pool_clear_interval=10 \
self_play.answer_pattern=question \
self_play.validate_config=True \
self_play.extraction_failure.keep_ratio=0
else
retry_count=0
while true; do
echo "Worker rank ${NODE_RANK} attempting to join Ray head at ${RAY_MASTER_ADDR}:${MASTER_PORT} (attempt $((retry_count + 1)))"
ray start --address=${RAY_MASTER_ADDR}:${MASTER_PORT} --block && break
retry_count=$((retry_count + 1))
if [ ${retry_count} -ge ${WORKER_CONNECT_MAX_RETRIES} ]; then
echo "Worker rank ${NODE_RANK} failed to connect to Ray head after ${retry_count} attempts"
exit 1
fi
echo "Worker rank ${NODE_RANK} failed to connect, retrying in ${WORKER_CONNECT_RETRY_INTERVAL_SECONDS}s"
sleep ${WORKER_CONNECT_RETRY_INTERVAL_SECONDS}
done
fi
```

</details>