CVPR 2026 Workshop
Niccolò Cavagnero, Narges Norouzi, Gijs Dubbelman, Daan de Geus
Eindhoven University of Technology
We present the Plain Mask Transformer (PMT), a fast Transformer-based segmentation model that operates on top of frozen Vision Foundation Model (VFM) features.
Encoder-only models like EoMT and VidEoMT achieve competitive accuracy with low latency, but they require fine-tuning the full encoder, which prevents the VFM from being reused for other downstream tasks.
PMT addresses this with the Plain Mask Decoder (PMD): a lightweight Transformer decoder that mimics the last encoder layers of EoMT, processing queries and frozen patch tokens jointly without touching the encoder weights.
The result is a model that keeps the encoder frozen and shareable across tasks while matching the accuracy and speed of fine-tuned alternatives.
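As a rough illustration of the idea (not the actual PMT implementation), a PMD block can be thought of as self-attention over the concatenation of learnable queries and frozen patch tokens, so both update jointly while the encoder stays untouched. All names and dimensions below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(queries, patch_tokens):
    """One simplified decoder step: queries and frozen patch tokens
    attend to each other as a single concatenated sequence."""
    tokens = np.concatenate([queries, patch_tokens], axis=0)   # (Q + P, D)
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d))             # (Q + P, Q + P)
    out = attn @ tokens
    num_queries = queries.shape[0]
    return out[:num_queries], out[num_queries:]                # updated queries / patches

rng = np.random.default_rng(0)
q = rng.standard_normal((100, 64))    # 100 object queries, dim 64
p = rng.standard_normal((1600, 64))   # 40x40 grid of frozen patch tokens
q_out, p_out = joint_self_attention(q, p)
print(q_out.shape, p_out.shape)  # (100, 64) (1600, 64)
```

The real model uses multi-head attention, MLPs, and normalization as in a standard Transformer layer; the sketch only shows the joint query/patch-token processing that distinguishes the decoder.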
The codebase is organized by task domain. Image segmentation code is available now; video segmentation will be added in a future release.
```
pmt/
├── requirements.txt   # shared dependencies
├── image/             # image segmentation
├── video/             # video segmentation (coming soon)
├── model_zoo/         # pre-trained weight catalogues
│   ├── image/         # image model weights (DINOv3)
│   └── video/         # video model weights (coming soon)
└── docs/              # project page
```
If you don't have Conda installed, install Miniconda and restart your shell:
```shell
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

Then create the environment, activate it, and install the dependencies:

```shell
conda create -n pmt python==3.13.2
conda activate pmt
python3 -m pip install -r requirements.txt
```

Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:

```shell
wandb login
```

To prepare the datasets:
- Image datasets (COCO, ADE20K, Cityscapes): follow the instructions in the EoMT repository.
- Video datasets (YouTube-VIS, VIPSeg, VSPW): follow the instructions in the VidEoMT repository.
To train PMT from scratch, run:
```shell
python3 image/main.py fit \
  -c image/configs/coco/panoptic/pmt_l_640.yaml \
  --trainer.devices 4 \
  --data.batch_size 4 \
  --data.path /path/to/dataset
```

This trains PMT-L with a 640×640 input on COCO panoptic segmentation using 4 GPUs, for a total batch size of 16.
✅ Make sure the total batch size is devices × batch_size = 16
🔧 Replace /path/to/dataset with the directory containing the dataset zip files.
This configuration takes ~6 hours on 4×NVIDIA H100 GPUs, each using ~26GB VRAM.
To fine-tune a pre-trained PMT model, add:
```shell
--model.ckpt_path /path/to/pytorch_model.bin \
--model.load_ckpt_class_head False
```

🔧 Replace /path/to/pytorch_model.bin with the path to the checkpoint to fine-tune.
`--model.load_ckpt_class_head False` skips loading the classification head when fine-tuning on a dataset with different classes.
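Conceptually, skipping the classification head amounts to filtering those keys out of the checkpoint's state dict before loading, so the randomly initialized head can match the new dataset's class count. A minimal sketch (the key names and prefix are hypothetical, not the actual PMT checkpoint layout):

```python
def filter_class_head(state_dict, head_prefix="class_head."):
    """Drop classification-head weights so a checkpoint trained on one
    dataset can be fine-tuned on a dataset with different classes."""
    return {k: v for k, v in state_dict.items() if not k.startswith(head_prefix)}

# Toy checkpoint standing in for a real state dict of tensors.
ckpt = {
    "encoder.layer0.weight": [0.1],
    "decoder.layer0.weight": [0.2],
    "class_head.weight": [0.3],  # shape depends on the number of classes
}
filtered = filter_class_head(ckpt)
print(sorted(filtered))  # class-head keys removed, all others kept
```

The remaining weights would then be loaded with `strict=False`-style semantics, leaving the head at its fresh initialization.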
To evaluate a pre-trained PMT model, run:
```shell
python3 image/main.py validate \
  -c image/configs/coco/panoptic/pmt_l_640.yaml \
  --model.network.masked_attn_enabled False \
  --trainer.devices 4 \
  --data.batch_size 4 \
  --data.path /path/to/dataset \
  --model.ckpt_path /path/to/pytorch_model.bin
```

🔧 Replace /path/to/dataset with the directory containing the dataset zip files.
🔧 Replace /path/to/pytorch_model.bin with the path to the checkpoint to evaluate.
Video segmentation code will be added in a future release. The video/ directory is reserved for this purpose.
We provide pre-trained weights for PMT models with DINOv3 encoders.
- Image Models - Image segmentation with DINOv3 encoder.
- Video Models - Coming soon.
If you find this work useful in your research, please cite it using the BibTeX entry below:
```bibtex
@inproceedings{cavagnero2026pmt,
  author    = {Cavagnero, Niccol\`{o} and Norouzi, Narges and Dubbelman, Gijs and {de Geus}, Daan},
  title     = {{PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders}},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year      = {2026},
}
```

This project builds upon code from the following libraries and repositories:
- EoMT (MIT License)
- VidEoMT (MIT License)
- Hugging Face Transformers (Apache-2.0 License)
- PyTorch Image Models (timm) (Apache-2.0 License)
- PyTorch Lightning (Apache-2.0 License)
- TorchMetrics (Apache-2.0 License)
- Mask2Former (Apache-2.0 License)
- Detectron2 (Apache-2.0 License)