DVGT: Driving Visual Geometry Transformer

DVGT is a comprehensive autonomous driving framework that leverages dense 3D geometry as the foundation for perception and planning. This repository hosts the official implementation of the DVGT series: from reconstructing metric-scaled dense point maps across diverse datasets (DVGT-1) to introducing an efficient Vision-Geometry-Action (VGA) paradigm for online joint reconstruction and planning (DVGT-2).

Check our project pages (DVGT-1, DVGT-2) for full demo videos and interactive results!

🚀 DVGT-2 Demos

Demonstration highlights:

  • Dense Scene Representation: Unlike models relying on inverse perspective mapping or sparse perception results, DVGT-2 reconstructs dense 3D geometry to provide a comprehensive and detailed scene representation.
  • Streaming Reconstruction & Planning: Given unposed multi-view image sequences, DVGT-2 performs joint geometry reconstruction and trajectory planning in a fully online manner for continuous and robust driving.
  • Global Geometry Consistency: Operating on online input sequences, DVGT-2 incrementally infers the global geometry of the entire scene, demonstrating high fidelity and temporal consistency.

📖 Publications

DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

arXiv Project Page

DVGT-2 introduces an efficient Vision-Geometry-Action (VGA) paradigm for autonomous driving. By processing multi-view inputs in an online manner, it jointly performs dense 3D geometry reconstruction and trajectory planning, achieving superior 3D perception and planning capabilities with high efficiency.

DVGT-1: Driving Visual Geometry Transformer

arXiv Project Page

DVGT-1 is a universal driving visual geometry model that directly reconstructs metric-scaled dense 3D point maps from unposed multi-view images. It demonstrates robust performance and remarkable generalizability across diverse camera setups and driving scenarios, eliminating the need for post-alignment with external data.

✨ News

  • [2026/03/31] DVGT-1 & DVGT-2: Training, evaluation, and data annotation code released.
  • [2026/03/31] DVGT-2 paper released.
  • [2026/02/24] DVGT-1 is accepted to CVPR26!
  • [2025/12/19] We have released the paper, inference code, and visualization checkpoints.

📦 Installation

We tested the code with CUDA 12.8, Python 3.11, and PyTorch 2.8.0.

git clone https://github.com/wzzheng/DVGT.git
cd DVGT

conda create -n dvgt python=3.11
conda activate dvgt

pip install -r requirements.txt

cd third_party/
git clone https://github.com/facebookresearch/dinov3.git

🤗 Pretrained Models

Our pretrained models are available on the Hugging Face Hub:

Version   Hugging Face Model   Metric scale   Streaming   #Params
DVGT-1    RainyNight/DVGT-1    ✅             -           1.7B
DVGT-2    RainyNight/DVGT-2    ✅             ✅          1.8B

💡 Minimal Code Example

Now, try the model with just a few lines of code:

import torch
from dvgt.models.architectures.dvgt1 import DVGT1
# from dvgt.models.architectures.dvgt2 import DVGT2
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr

checkpoint_path = 'ckpt/dvgt1.pt'

device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+) 
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

# Initialize the model and load the pretrained weights.
model = DVGT1()
# model = DVGT2()  # To try DVGT-2 instead
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()

# Load and preprocess example images (replace with your own image paths)
image_dir = 'visual_demo_examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir).to(device)

with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        # Predict attributes including ego pose and point maps.
        predictions = model(images)
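The exact structure of `predictions` is not documented above; if it is a nested container of tensors (a common pattern), a small duck-typed helper such as the following (our assumption, not a repo API) can detach everything to CPU for saving or inspection:

```python
def to_cpu(obj):
    """Recursively detach tensor-like leaves of a nested output to CPU.

    Works on any object exposing .detach()/.cpu() (e.g. torch.Tensor),
    recursing through dicts, lists, and tuples; other values pass through.
    """
    if hasattr(obj, "detach"):  # tensor-like leaf (duck-typed)
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(v) for v in obj)
    return obj
```

For example, `to_cpu(predictions)` would return the same nesting with every tensor moved off the GPU.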

💡 Detailed Usage


You can also optionally choose which attributes (branches) to predict, as shown below. This achieves the same result as the example above. This example uses a batch size of 1 (processing a single scene), but it naturally works for multiple scenes.

import torch
from dvgt.models.architectures.dvgt1 import DVGT1
# from dvgt.models.architectures.dvgt2 import DVGT2
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr
from dvgt.utils.pose_encoding import decode_pose
from dvgt.evaluation.utils.geometry import convert_point_in_ego_0_to_ray_depth_in_ego_n

checkpoint_path = 'ckpt/dvgt1.pt'

device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+) 
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

# Initialize the model and load the pretrained weights.
model = DVGT1()
# model = DVGT2()  # To try DVGT-2 instead
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()

# Load and preprocess example images (replace with your own image paths)
image_dir = 'visual_demo_examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir).to(device)

with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        aggregated_tokens_list, ps_idx = model.aggregator(images)
                
    # Predict the transformation from ego-frame n to ego-frame 0
    pose_enc = model.ego_pose_head(aggregated_tokens_list)[-1]
    # Ego poses follow the OpenCV convention, relative to the ego-frame of the first time step.
    ego_n_to_ego_0, _ = decode_pose(pose_enc)

    # Predict point maps in the ego-frame of the first time step
    point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)

    # The predicted ray depth maps originate from each ego-vehicle's position in its corresponding frame.
    ray_depth_in_ego_n = convert_point_in_ego_0_to_ray_depth_in_ego_n(point_map, ego_n_to_ego_0)
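The repo's `convert_point_in_ego_0_to_ray_depth_in_ego_n` operates on batched tensors, but the underlying per-point geometry is simple. Assuming `ego_n_to_ego_0` is a 4x4 rigid transform mapping ego-n coordinates into ego-0 (the actual tensor layout may differ), a point in ego-0 is mapped into ego-n by the inverse transform, and the ray depth is its distance to the ego-n origin. A minimal pure-Python sketch:

```python
import math

def invert_se3(T):
    """Invert a 4x4 rigid transform given as nested lists: [R^T | -R^T t]."""
    R = [[T[i][j] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose
    t_inv = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[0] + [t_inv[0]],
            Rt[1] + [t_inv[1]],
            Rt[2] + [t_inv[2]],
            [0.0, 0.0, 0.0, 1.0]]

def ray_depth_in_ego_n(p_ego0, ego_n_to_ego_0):
    """Map a 3D point from ego-frame 0 into ego-frame n and return its ray depth.

    ego_n_to_ego_0 maps ego-n coords to ego-0, so we apply its inverse,
    then take the Euclidean distance to the ego-n origin.
    """
    T_inv = invert_se3(ego_n_to_ego_0)
    p_h = p_ego0 + [1.0]  # homogeneous coordinates
    p_n = [sum(T_inv[i][j] * p_h[j] for j in range(4)) for i in range(3)]
    return math.sqrt(sum(c * c for c in p_n))
```

For instance, if ego-n sits 1 m ahead of ego-0 along x (identity rotation, translation [1, 0, 0]), a point at [4, 0, 0] in ego-0 has ray depth 3 in ego-n.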

Visualization

Run the following command to perform reconstruction and visualize the point clouds in Viser. This script requires a path to an image folder formatted as follows:

data_dir/
├── frame_0/   (contains view images, e.g., CAM_F.jpg, CAM_B.jpg, ...)
├── frame_1/
└── ...

Note on Data Requirements:

  1. Consistency: The data must be sampled at 2 Hz. All frames must contain the same number of views arranged in a fixed order.
  2. Capacity: DVGT-1 inference supports up to 24 frames, while DVGT-2 supports arbitrary sequence lengths. Both models support an arbitrary number and order of views per frame.
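Before running the demo, the layout and per-frame consistency requirements above can be checked with a short script (a sketch, not part of the repository):

```python
from pathlib import Path

def check_sequence(data_dir):
    """Verify every frame_*/ folder holds the same set of view images.

    Returns the frame folders sorted numerically (frame_0, frame_1, ...,
    frame_10); raises ValueError if views differ between frames.
    """
    frames = sorted(Path(data_dir).glob("frame_*"),
                    key=lambda p: int(p.name.split("_")[1]))
    if not frames:
        raise ValueError(f"no frame_*/ folders under {data_dir}")
    reference = sorted(p.name for p in frames[0].iterdir())
    for frame in frames[1:]:
        views = sorted(p.name for p in frame.iterdir())
        if views != reference:
            raise ValueError(f"{frame.name} has views {views}, "
                             f"expected {reference}")
    return frames
```

Because filenames are compared as sorted sets, identical sets across frames guarantee the same number of views in a consistent order.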

You can directly download our example dataset to get started:

# around 80MB
wget https://huggingface.co/datasets/RainyNight/DVGT_demo_dataset/resolve/main/visual_demo_examples.zip
unzip visual_demo_examples.zip
python demo_viser.py \
  --model_name=DVGT1 \
  --image_folder=visual_demo_examples/openscene_log-0104-scene-0007

🌟 Data Preparation

See docs/data_preparation.md

πŸ‹οΈβ€β™‚οΈ Training & Finetuning

See docs/train.md

🧪 Evaluation

See docs/eval.md

🌋 Visualization

See docs/visualization.md

Acknowledgements

Our code is based on the following brilliant repositories:

Moge-2 CUT3R Driv3R VGGT MapAnything Pi3

Many thanks to these authors!

Citation

If you find this project helpful, please consider citing the following papers:

@article{zuo2025dvgt,
  title={DVGT: Driving Visual Geometry Transformer}, 
  author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Jiang, Shengyin and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
  journal={arXiv preprint arXiv:2512.16919},
  year={2025}
}

@article{zuo2026dvgt-2,
  title={DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale}, 
  author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Li, Hanbing and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
  journal={arXiv preprint arXiv:2603.xxxxx},
  year={2026}
}

About

[CVPR 2026] Visual Geometry Transformer for Autonomous Driving
