Discrepancy in Qwen3-VL-8B-Instruct evaluation on VLM4D / Request for evaluation code #7

@Zhu-Yakun


Hi DSR_Suite Authors,

Thanks for the fantastic work on dynamic spatial reasoning and the comprehensive DSR_Suite framework!

I am currently working on reproducing some of the baseline results presented in your paper. Specifically, I am evaluating Qwen3-VL-8B-Instruct on the VLM4D benchmark using the Direct Output (DO) setting. However, I have noticed some discrepancies between my evaluation results and the numbers reported in your paper.

To help me align my experimental setup exactly with yours, could you kindly clarify a few evaluation details?

  • Input Frame Sampling: In Section 7.2 (Evaluation Details) of the supplementary material, it states: "During evaluation, we uniformly sample 32 frames from each input video as visual inputs for all VLMs." Could you confirm if this 32-frame uniform sampling strategy was strictly applied to the VLM4D benchmark evaluation as well?
  • Prompting Details: Could you share the exact text prompt/template used for the Direct Output evaluation on VLM4D? (e.g., how the video frames and the question were formatted together).
  • Evaluation Code: Do you have any plans to open-source the evaluation scripts for these benchmarks? Releasing the evaluation code would be incredibly helpful for the community to ensure fair comparisons, standardized benchmarking, and easier reproduction of your great work.
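For reference, this is the uniform frame sampling I am currently using on my side (a minimal sketch with midpoint-of-bin indexing; your implementation may differ, e.g. endpoint-inclusive `linspace`-style sampling, which is exactly what I am hoping to confirm):

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 32) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a video.

    Uses the midpoint of each of `num_samples` equal bins.
    NOTE: this is my assumption, not necessarily the paper's scheme.
    """
    if total_frames <= num_samples:
        # Short video: just take every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Example: a 900-frame video sampled down to 32 indices.
indices = uniform_frame_indices(900, 32)
```

Knowing whether you sample this way (or, say, with `np.linspace(0, total_frames - 1, 32)`) would help me rule out frame selection as the source of the discrepancy.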

Thanks in advance for your time and clarification!
