Hi DSR_Suite Authors,
Thanks for the fantastic work on dynamic spatial reasoning and the comprehensive DSR_Suite framework!
I am currently working on reproducing some of the baseline results presented in your paper. Specifically, I am evaluating Qwen3-VL-8B-Instruct on the VLM4D benchmark using the Direct Output (DO) setting. However, I have noticed some discrepancies between my evaluation results and the numbers reported in your paper.
To help me align my experimental setup perfectly with yours, could you kindly clarify a few evaluation details?
- Input Frame Sampling: Section 7.2 (Evaluation Details) of the supplementary material states: "During evaluation, we uniformly sample 32 frames from each input video as visual inputs for all VLMs." Could you confirm whether this 32-frame uniform sampling strategy was also applied, exactly as described, to the VLM4D benchmark evaluation? (The sampling code I currently use is sketched after this list.)
- Prompting Details: Could you share the exact text prompt/template used for the Direct Output evaluation on VLM4D (e.g., how the video frames and the question were formatted together)? The template I am currently using is included in the sketch below.
- Evaluation Code: Do you have any plans to open-source the evaluation scripts for these benchmarks? Releasing the evaluation code would be incredibly helpful for the community to ensure fair comparisons, standardized benchmarking, and easier reproduction of your great work.
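For context, here is a minimal sketch of my current setup. The sampling helper and the `DO_PROMPT` template below are my own assumptions made for this reproduction attempt, not code from your repository; the prompt wording in particular is exactly what I would like you to confirm or correct.

```python
# Minimal sketch of my reproduction setup (my assumptions, not the authors' code):
# uniform 32-frame sampling with OpenCV, plus the Direct Output prompt I used.
import cv2
import numpy as np

NUM_FRAMES = 32  # per Section 7.2 of the supplementary material


def sample_frames_uniform(video_path: str, num_frames: int = NUM_FRAMES):
    """Uniformly sample `num_frames` frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over [0, total - 1].
    indices = np.linspace(0, total - 1, num_frames).round().astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


# The DO prompt template I currently use (hypothetical -- this is the part
# I would like you to confirm or correct):
DO_PROMPT = (
    "{question}\n"
    "Options:\n{options}\n"
    "Answer with the option's letter from the given choices directly."
)
```

If your pipeline differs from this in any detail (frame indexing, option formatting, or the answer-format instruction), knowing the exact setup would likely explain the gap I am seeing.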
Thanks in advance for your time and clarification!