This is the dedicated word-selection pipeline for VLM-3R-DATA.
It does not depend on spar.
The pipeline reads `vsibench_train` and/or `vstibench_train`, normalizes the records, exports question-only artifacts, and runs LLM-based word selection through an OpenAI-compatible endpoint such as vLLM.
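The normalize/export step can be pictured with a minimal sketch. The raw record field names (`id`, `question`) and the output shape below are assumptions for illustration, not the pipeline's real schema:

```python
# Hypothetical sketch of the "normalize + export question-only" step.
# Field names here (id, question, dataset) are assumptions, not the real schema.
def to_question_only(record: dict, dataset: str) -> dict:
    """Reduce a raw record to a question-only row tagged with its source split."""
    return {
        "id": record["id"],
        "dataset": dataset,  # e.g. "vsibench_train" or "vstibench_train"
        "question": record["question"].strip(),
    }

print(to_question_only({"id": "q1", "question": " How many chairs? "}, "vsibench_train"))
# → {'id': 'q1', 'dataset': 'vsibench_train', 'question': 'How many chairs?'}
```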
Outputs are written to:
- artifacts/vlm3r_word_selection/dataset_manifest.json
- artifacts/vlm3r_word_selection/normalized_train.jsonl
- artifacts/vlm3r_word_selection/normalized_summary.json
- artifacts/vlm3r_word_selection/questions_only.jsonl
- artifacts/vlm3r_word_selection/preview_50.json
- artifacts/vlm3r_word_selection/selected_words.jsonl
- artifacts/vlm3r_word_selection/selection_errors.jsonl
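After a run, a quick way to confirm the output directory is complete is to check for the files listed above. A minimal sketch using only the paths from this README:

```python
from pathlib import Path

# File names taken from the artifact list in this README.
EXPECTED = [
    "dataset_manifest.json",
    "normalized_train.jsonl",
    "normalized_summary.json",
    "questions_only.jsonl",
    "preview_50.json",
    "selected_words.jsonl",
    "selection_errors.jsonl",
]

def missing_artifacts(output_dir: str) -> list[str]:
    """Return the expected artifact names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED if not (root / name).exists()]

print(missing_artifacts("artifacts/vlm3r_word_selection"))
```

An empty list means the run produced every expected artifact.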
Each row in `selected_words.jsonl` contains:

- `visible_grounded_words`: words that should correspond to directly observable entities or properties in the image/video
- `reasoning_words`: words that are useful for reasoning but are not themselves directly visible entities, such as relations, motion, order, counting, or route constraints
- `selected_words`: backward-compatible union of the two buckets
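The union relationship between the three keys can be sanity-checked per row. A minimal sketch, assuming each key holds a list of strings (the README does not specify the value types):

```python
import json
from pathlib import Path

def check_row(row: dict) -> bool:
    """Return True if selected_words equals the union of the two buckets.

    Assumes each key holds a list of strings; adjust if the real schema differs.
    """
    union = set(row.get("visible_grounded_words", [])) | set(row.get("reasoning_words", []))
    return set(row.get("selected_words", [])) == union

# Example row shaped like the description above (values are illustrative):
row = {
    "visible_grounded_words": ["table", "chair"],
    "reasoning_words": ["between", "count"],
    "selected_words": ["table", "chair", "between", "count"],
}
print(check_row(row))  # → True

# To check a real run (path from this README):
# for line in Path("artifacts/vlm3r_word_selection/selected_words.jsonl").read_text().splitlines():
#     assert check_row(json.loads(line))
```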
Full run:

```bash
python scripts/vlm3r_word_selection_pipeline.py \
  --dataset-root VLM-3R-DATA \
  --model Qwen/Qwen2.5-7B-Instruct \
  --api-base http://127.0.0.1:8000/v1
```

To prepare only:
```bash
python scripts/vlm3r_word_selection_pipeline.py \
  --dataset-root VLM-3R-DATA \
  --model Qwen/Qwen2.5-7B-Instruct \
  --prepare-only
```

If the cluster job should start vLLM itself:
```bash
sbatch hpc/sbatch_vlm3r_word_selection.sh
```

This full-run job uses:

- `partition=boost_usr_prod`
- `qos=normal`
- `time=10:00:00`
- `gpus-per-node=4`
- `cpus-per-task=32`
- `exclusive`
For a short single/debug run:
```bash
sbatch hpc/sbatch_vlm3r_word_selection_single.sh
```

This single-run job uses:

- `partition=boost_usr_prod`
- `qos=boost_qos_dbg`
- `time=00:30:00`
- `gpus-per-node=4`
- `cpus-per-task=32`
- default `LIMIT=50`
If you already have an endpoint running:
```bash
PYTHON_BIN=python \
DATASET_ROOT=/path/to/VLM-3R-DATA \
OUTPUT_DIR=/path/to/output \
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct \
API_BASE=http://127.0.0.1:8000/v1 \
START_VLLM=0 \
bash hpc/run_vlm3r_word_selection.sh
```

- `vsibench_train` and `vstibench_train` are both supported.
- The selector reads from the normalized question text and returns `selected_words` plus a short justification.
- To rerun after a partial job, add `--resume` to the Python command.
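Since the endpoint is OpenAI-compatible, each selection call is a standard chat-completions request. A minimal sketch of building one request payload; the prompt wording and `temperature` value here are assumptions, not the pipeline's actual prompt:

```python
import json

def build_selection_request(question: str, model: str) -> dict:
    """Build a chat-completions payload for one normalized question.

    The system prompt below is illustrative only; the real pipeline's
    prompt may differ.
    """
    return {
        "model": model,
        "temperature": 0.0,
        "messages": [
            {"role": "system",
             "content": "Select visible-grounded words and reasoning words from the question."},
            {"role": "user", "content": question},
        ],
    }

payload = build_selection_request("How many chairs are visible?", "Qwen/Qwen2.5-7B-Instruct")
print(json.dumps(payload, indent=2))
```

The payload would be POSTed to `{API_BASE}/chat/completions`, e.g. `http://127.0.0.1:8000/v1/chat/completions` for a local vLLM server.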