Hi, thanks for the great work and for releasing the paper and code!
I have one clarification question regarding the experimental setup in Section 4.1.
In the paper, the in-domain evaluation uses a 1:9 split of the five training datasets
(ActivityNet, LVBench, ScaleLong, STAR, YouCook2), and the model is trained on the Video-Thinker-10K dataset during both the SFT and GRPO stages.
My question is:
Were any of the baseline models (either the vanilla models or the reasoning models) trained, fine-tuned, or otherwise adapted using the Video-Thinker-10K dataset, or were they all evaluated zero-shot?