Hi DSR_Suite Authors,
Thanks for the fantastic work on dynamic spatial reasoning and the comprehensive DSR_Suite framework!
I am currently working on reproducing some of the baseline results presented in your paper. Specifically, I am evaluating Qwen3-VL-8B-Instruct on the VLM4D benchmark using the Direct Output (DO) setting. However, I have noticed some discrepancies between my evaluation results and the numbers reported in your paper.
To help me align my experimental setup perfectly with yours, could you kindly clarify a few evaluation details?
- Input Frame Sampling: Section 7.2 (Evaluation Details) of the supplementary material states: "During evaluation, we uniformly sample 32 frames from each input video as visual inputs for all VLMs." Could you confirm whether this 32-frame uniform sampling strategy was also applied, exactly as described, to the VLM4D benchmark evaluation? (The sampling code I currently use is sketched after this list.)
- Prompting Details: Could you share the exact text prompt/template used for the Direct Output evaluation on VLM4D (e.g., how the video frames and the question were formatted together)? The template I am currently using is included in the sketch below.
- Evaluation Code: Do you have any plans to open-source the evaluation scripts for these benchmarks? Releasing the evaluation code would be incredibly helpful for the community to ensure fair comparisons, standardized benchmarking, and easier reproduction of your great work.
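For context, here is a minimal sketch of my current setup. The sampling helper and the `DO_PROMPT` template below are my own assumptions made for this reproduction attempt, not code from your repository; the prompt wording in particular is exactly what I would like you to confirm or correct.

```python
# Minimal sketch of my reproduction setup (my assumptions, not the authors' code):
# uniform 32-frame sampling with OpenCV, plus the Direct Output prompt I used.
import cv2
import numpy as np

NUM_FRAMES = 32  # per Section 7.2 of the supplementary material


def sample_frames_uniform(video_path: str, num_frames: int = NUM_FRAMES):
    """Uniformly sample `num_frames` frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over [0, total - 1].
    indices = np.linspace(0, total - 1, num_frames).round().astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


# The DO prompt template I currently use (hypothetical -- this is the part
# I would like you to confirm or correct):
DO_PROMPT = (
    "{question}\n"
    "Options:\n{options}\n"
    "Answer with the option's letter from the given choices directly."
)
```

If your pipeline differs from this in any detail (frame indexing, option formatting, or the answer-format instruction), knowing the exact setup would likely explain the gap I am seeing.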
Thanks in advance for your time and clarification!