
PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

CVPR 2026 Workshop · 📄 Paper

Niccolò Cavagnero, Narges Norouzi, Gijs Dubbelman, Daan de Geus

Eindhoven University of Technology

Overview

We present the Plain Mask Transformer (PMT), a fast Transformer-based segmentation model that operates on top of frozen Vision Foundation Model (VFM) features.

Encoder-only models like EoMT and VidEoMT achieve competitive accuracy with low latency, but they require fine-tuning the full encoder, which prevents the VFM from being reused for other downstream tasks.

PMT addresses this by introducing the Plain Mask Decoder (PMD): a lightweight Transformer decoder that mimics the last encoder layers of EoMT, processing queries and frozen patch tokens jointly — without touching the encoder weights.

The result: a model that keeps the encoder frozen and shareable across tasks while matching the accuracy and speed of fine-tuned alternatives.
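The joint processing idea behind the PMD can be sketched as follows. This is a minimal, hypothetical illustration with assumed shapes and a single unprojected attention step, not PMT's actual layers: learnable queries are concatenated with frozen patch tokens, and one self-attention pass mixes the two token sets while the patch features themselves come from the untouched encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention_step(queries, patch_tokens):
    # Concatenate learnable queries with frozen patch tokens and attend
    # over the combined sequence -- the core idea of joint processing.
    x = np.concatenate([queries, patch_tokens], axis=0)   # (Q + N, D)
    scores = x @ x.T / np.sqrt(x.shape[-1])               # (Q + N, Q + N)
    return softmax(scores) @ x                            # (Q + N, D)

Q, N, D = 8, 100, 32                       # assumed sizes for illustration
rng = np.random.default_rng(0)
queries = rng.normal(size=(Q, D))          # learnable object queries
patches = rng.normal(size=(N, D))          # frozen VFM patch features
out = joint_attention_step(queries, patches)
```

Only the decoder's parameters would be trained in this setup; the patch tokens arrive precomputed from the frozen encoder and are never updated.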

Repository Structure

The codebase is organized by task domain. Image segmentation code is available now; video segmentation will be added in a future release.

pmt/
├── requirements.txt          # shared dependencies
├── image/                    # image segmentation
├── video/                    # video segmentation (coming soon)
├── model_zoo/                # pre-trained weight catalogues
│   ├── image/                # image model weights (DINOv3)
│   └── video/                # video model weights (coming soon)
└── docs/                     # project page

Installation

If you don't have Conda installed, install Miniconda and restart your shell:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Then create the environment, activate it, and install the dependencies:

conda create -n pmt python==3.13.2
conda activate pmt
python3 -m pip install -r requirements.txt

Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:

wandb login

Data Preparation

  • Image datasets (COCO, ADE20K, Cityscapes): follow the instructions in the EoMT repository.
  • Video datasets (YouTube-VIS, VIPSeg, VSPW): follow the instructions in the VidEoMT repository.

Image Segmentation

Training

To train PMT from scratch, run:

python3 image/main.py fit \
  -c image/configs/coco/panoptic/pmt_l_640.yaml \
  --trainer.devices 4 \
  --data.batch_size 4 \
  --data.path /path/to/dataset

This trains PMT-L with a 640×640 input on COCO panoptic segmentation using 4 GPUs, for a total batch size of 16.

✅ Make sure the total batch size is devices × batch_size = 16
🔧 Replace /path/to/dataset with the directory containing the dataset zip files.

This configuration takes ~6 hours on 4×NVIDIA H100 GPUs, each using ~26GB VRAM.
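If you train on a different number of GPUs, adjust the per-device batch size so the total stays at 16. The arithmetic is simple enough to sanity-check in a couple of lines:

```python
def per_device_batch_size(total, devices):
    """Per-GPU batch size needed to keep the effective batch size fixed."""
    assert total % devices == 0, "total batch size must divide evenly across devices"
    return total // devices

# The recipe above: 4 GPUs x 4 per device = 16 total.
assert per_device_batch_size(16, 4) == 4
# On 2 GPUs, double the per-device batch size instead.
assert per_device_batch_size(16, 2) == 8
```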

To fine-tune a pre-trained PMT model, add:

  --model.ckpt_path /path/to/pytorch_model.bin \
  --model.load_ckpt_class_head False

🔧 Replace /path/to/pytorch_model.bin with the path to the checkpoint to fine-tune.

--model.load_ckpt_class_head False skips loading the classification head when fine-tuning on a dataset with different classes.
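Conceptually, skipping the head amounts to dropping its parameters from the checkpoint's state dict before loading. A hypothetical sketch (the key prefix `class_head` and the dict contents are assumptions for illustration, not PMT's actual parameter names):

```python
def drop_class_head(state_dict, prefix="class_head"):
    """Return a copy of the state dict without the classification head entries."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

# Toy state dict standing in for a real checkpoint.
ckpt = {
    "decoder.layer0.weight": [0.1],
    "class_head.weight": [0.2],
    "class_head.bias": [0.3],
}
filtered = drop_class_head(ckpt)
# The decoder weights survive; the class head is left to be re-initialized
# for the new dataset's label space.
```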

Evaluating

To evaluate a pre-trained PMT model, run:

python3 image/main.py validate \
  -c image/configs/coco/panoptic/pmt_l_640.yaml \
  --model.network.masked_attn_enabled False \
  --trainer.devices 4 \
  --data.batch_size 4 \
  --data.path /path/to/dataset \
  --model.ckpt_path /path/to/pytorch_model.bin

🔧 Replace /path/to/dataset with the directory containing the dataset zip files.
🔧 Replace /path/to/pytorch_model.bin with the path to the checkpoint to evaluate.

Video Segmentation

Video segmentation code will be added in a future release. The video/ directory is reserved for this purpose.

Model Zoo

We provide pre-trained weights for PMT models with DINOv3 encoders.

  • Image Models - Image segmentation with DINOv3 encoder.
  • Video Models - Coming soon.

Citation

If you find this work useful in your research, please cite it using the BibTeX entry below:

@inproceedings{cavagnero2026pmt,
  author    = {Cavagnero, Niccol\`{o} and Norouzi, Narges and Dubbelman, Gijs and {de Geus}, Daan},
  title     = {{PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders}},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year      = {2026},
}

Acknowledgements

This project builds upon code from the EoMT and VidEoMT repositories.
