Try VQASynth on your image in the HF Space
Spatial reasoning is fundamental to interacting with and navigating physical environments in embodied AI applications like robotics. However, data samples suitable for learning these capabilities are rare in AI pretraining datasets. Don't be limited by what your model can do out-of-the-box: curate any image dataset from the Huggingface Hub for spatial VQA with tools for scene understanding.
VLMs trained using VQASynth can:
- estimate 3D distances between objects in an image
- describe distances colloquially and convert between common units of measurement
- answer queries about the orientation and spatial relationships between objects
- base responses on consistent references like floors and surfaces
- apply chain-of-thought (CoT) "thinking" for more robust reasoning and better estimates
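For a feel of what these capabilities look like in practice, here is a hypothetical question-answer pair of the kind a VQASynth-tuned model is meant to handle; the wording and distance values are illustrative assumptions, not output from an actual run.

```python
# Hypothetical spatial VQA exchange; the question, answer, and values are
# illustrative only and not produced by an actual VQASynth-tuned model.
spatial_vqa_example = {
    "question": "How far is the coffee mug from the edge of the table?",
    "answer": "The mug is roughly 10 centimeters, or about 4 inches, from the table's edge.",
}
print(spatial_vqa_example["answer"])
```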
By fusing semantic and metric data into templated VQA chats, vision-language models can be instruction-tuned with low-rank adapters to enhance their baseline spatial reasoning capabilities; a minimal fine-tuning sketch follows the list below. VQASynth provides an open-source reproduction of SpatialVLM, which describes a 3D scene reconstruction pipeline and prompt templates for enhancing the spatial reasoning abilities of VLMs. The VQASynth pipeline includes:
- Object-grounded captions by point prompting with Molmo
- Faster, more accurate metric depth estimation by replacing DepthPro with VGGT
- SAM2 in place of SAM for the localization refinement stage
- Multimodal "thinking" via CoT reasoning
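The snippet below is a minimal sketch of the low-rank adapter setup mentioned above, assuming the Hugging Face transformers and peft libraries; the base model ID, target modules, and hyperparameters are illustrative assumptions rather than values prescribed by VQASynth.

```python
# Minimal LoRA setup sketch (assumed libraries: transformers, peft).
# Model ID, target modules, and hyperparameters are illustrative only.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"  # hypothetical base VLM
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```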
Before running the demo scripts, ensure you have the following installed:
- Python 3.10 or later
- Docker, Docker Compose V2
- NVIDIA Container Toolkit
Use Docker Compose to transform image datasets from the Huggingface Hub into VQA datasets describing spatial relations between objects. You can process different datasets by updating config.yaml.
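As a quick sanity check before editing config.yaml, you can confirm that a candidate Hub dataset actually exposes an image column; this is a minimal sketch using the datasets library, and the dataset name shown is only a placeholder.

```python
# Minimal sketch: inspect a candidate Hub dataset before pointing config.yaml at it.
# The dataset name below is a placeholder; substitute the repo you want to process.
from datasets import load_dataset

ds = load_dataset("your-username/your-image-dataset", split="train")
print(ds.features)   # look for an Image feature among the columns
print(ds[0].keys())  # confirm the column names you will reference in config.yaml
```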
Then run the spatial VQA pipeline locally with Docker:
```bash
# Authenticate to push to hub
huggingface-cli login

# Run the pipeline
cd /path/to/VQASynth
bash run.sh
```
You can run the Colab notebook on the free tier with CPU or GPU acceleration, or customize your own pipeline:
```python
from vqasynth.datasets import Dataloader
from vqasynth.embeddings import EmbeddingGenerator, TagFilter

# Example values; point these at your own cache directory and Hub dataset
cache_dir = "/path/to/cache"
dataset_name = "your-username/your-image-dataset"
images = "image"  # name of the image column in your dataset

dataloader = Dataloader(cache_dir)
dataset = dataloader.load_dataset(dataset_name)

embedding_generator = EmbeddingGenerator()
tag_filter = TagFilter()

# Comma-separated tags controlling which images are kept or dropped
include_tags = "indoor,table,chair".strip().split(",")
exclude_tags = "outdoor".strip().split(",")

# Extract embeddings
dataset = dataset.map(lambda example: embedding_generator.apply_transform(example, images))

# Extract tags
dataset = dataset.map(lambda example: tag_filter.apply_transform(example, include_tags + exclude_tags))

# Filter by tags
dataset_filtered = dataset.filter(
    lambda example: tag_filter.filter_by_tag(
        example['tag'], include_tags, exclude_tags
    )
)
```
The resulting Huggingface dataset is written to the cache directory, and you can push it to the hub with:
```python
# final_dataset is the fully processed dataset (e.g. dataset_filtered from above);
# target_repo_name is the Hub repo to create or update, e.g. "your-username/spatial-vqa"
dataloader.push_to_hub(final_dataset, target_repo_name)
```
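Once pushed, the dataset can be pulled back down like any other Hub dataset; here is a minimal sketch with the datasets library, where the repo name is a placeholder for the one you pushed to above.

```python
# Minimal sketch: reload the pushed spatial VQA dataset from the Hub.
# Replace the repo name with the one used in push_to_hub above.
from datasets import load_dataset

spatial_vqa = load_dataset("your-username/spatial-vqa", split="train")
print(spatial_vqa[0])  # inspect one generated sample
```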
Here are some examples:
Try SpaceLLaVA in Discord
We've hosted some notebooks for visualizing and experimenting with the techniques included in this repo.
| Notebook | Description | Launch |
|---|---|---|
| Generate Spatial VQA Dataset | Augment an HF image dataset with spatial VQA | |
| Spatial Reasoning with Point Clouds | Visualize point clouds and evaluate spatial relationships | |
This project was inspired by or utilizes concepts discussed in the following research papers:
```bibtex
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168}
}

@inproceedings{wang2025vggt,
  title     = {VGGT: Visual Geometry Grounded Transformer},
  author    = {Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025}
}

@article{ravi2024sam2,
  title   = {SAM 2: Segment Anything in Images and Videos},
  author  = {Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
  journal = {arXiv preprint arXiv:2408.00714},
  url     = {https://arxiv.org/abs/2408.00714},
  year    = {2024}
}
```