
📚 Paper - 🤖 Code - 🤗 Model

Star ⭐ us if you like it!

News

  • 25/March/2025. The arXiv version of the paper is released.

This is the official repository for the paper Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings.

This repository provides open access to the Surg-3M dataset, the SurgFM foundation model, and the training code.

Surg-3M is a dataset of over 4,000 high-resolution surgical videos (3M frames when the videos are sampled at 1 fps) covering 35 diverse surgical procedure types. Each video carries a multi-label annotation indicating the surgical procedures performed in it, and a binary annotation indicating whether the surgery is robotic or non-robotic. The dataset's annotations can be found in labels.json.
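To get a quick feel for the annotations, here is a minimal inspection sketch. It only assumes that labels.json parses as standard JSON and prints a few entries without relying on specific field names:

import json

# Load the Surg-3M annotation file and peek at a few entries.
# We only assume labels.json is standard JSON; the exact schema
# (field names, nesting) is whatever the file defines.
with open('labels.json') as f:
    labels = json.load(f)

print('number of annotated entries:', len(labels))

# Normalize dict/list top levels into (key, entry) pairs for printing.
items = list(labels.items()) if isinstance(labels, dict) else list(enumerate(labels))
for key, entry in items[:3]:
    print(key, '->', entry)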

SurgFM is an image foundation model for surgery: it receives an image as input and produces a 1536-dimensional feature vector as output.

If you use our dataset, model, or code in your research, please cite our paper:

@misc{che2025surg3mdatasetfoundationmodel,
      title={Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings}, 
      author={Chengan Che and Chao Wang and Tom Vercauteren and Sophia Tsoka and Luis C. Garcia-Peraza-Herrera},
      year={2025},
      eprint={2503.19740},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.19740}, 
}

Abstract

Advancements in computer-assisted surgical procedures heavily rely on accurate visual data interpretation from camera systems used during surgeries. Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos with less than 100K images. To address these constraints, a new dataset called Surg-3M has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos and more than 3 million high-quality images from multiple procedure types, Surg-3M offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel tasks. To demonstrate the effectiveness of this dataset, we present SurgFM, a self-supervised foundation model pretrained on Surg-3M that achieves impressive results in downstream tasks such as surgical phase recognition, action recognition, and tool presence detection. Combining key components from ConvNeXt, DINO, and an innovative augmented distillation method, SurgFM exhibits exceptional performance compared to specialist architectures across various benchmarks. Our experimental results show that SurgFM outperforms state-of-the-art models in multiple downstream tasks, including significant gains in surgical phase recognition (+8.9pp, +4.7pp, and +3.9pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), action recognition (+3.1pp of mAP in CholecT50) and tool presence detection (+4.6pp of mAP in Cholec80). Moreover, even when using only half of the data, SurgFM outperforms state-of-the-art models in AutoLaparo and achieves state-of-the-art performance in Cholec80. Both Surg-3M and SurgFM have significant potential to accelerate progress towards developing autonomous robotic surgery systems.


(Figure: diversity and procedure prevalence in Surg-3M.)

Install dependencies to recreate our Surg-3M dataset

  • Clone the repository and install the dependencies in your local setup:

    $ git clone [email protected]:visurg-ai/surg-3m.git
    $ cd surg-3m && pip install -r requirements.txt
  • Models used in data curation. We provide the models used in our data curation pipeline to assist with constructing the Surg-3M dataset, including video storyboard classification models, frame classification models, and non-surgical object detection models. The models can be downloaded from 🤗 Surg3M curation models.

Surg-3M dataset

You can use our data curation pipeline code and the provided annotation file ("labels.json") to recreate the whole Surg-3M dataset.

  1. Get your YouTube cookie:

    You need to provide a "cookies.txt" file if you want to download videos that require YouTube login.

    Use the cookies extension to export your YouTube cookies as "cookies.txt".

  2. Download the annotation file ("labels.json") and use the video downloader to download the selected original YouTube videos.

    $ python3 src/video_downloader.py --video-path '../labels.json' --output 'your path to store the downloaded videos' --cookies 'your YouTube cookie file'
  3. Curate the downloaded videos into the Surg-3M video dataset. In detail, use the video_processor to classify each frame as either 'surgical' or 'non-surgical', remove the non-surgical segments at the beginning and end of each video, mask the non-surgical regions within 'surgical' frames, and fully mask the 'non-surgical' frames.

    $ python3 src/video_processor.py --input 'your original downloaded video storage path' --input-json '../labels.json' --output 'your path to store the curated videos and their corresponding frame annotation files' --classify-models 'frame classification model' --segment-models 'non-surgical object detection models'
  4. Convert the Surg-3M video dataset into the Surg-3M image dataset (for foundation model pre-training); a quick sanity check on the resulting LMDB is sketched after this list.

    $ python3 src/create_lmdb_Surg-3M.py --video-folder 'your directory containing the curated videos and their corresponding frame annotation files' --output-json 'your path for the json file to verify the videos and labels alignment' --lmdb-path 'your lmdb storage path'
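Once step 4 finishes, you can sanity-check the resulting LMDB. This is a minimal sketch assuming the py-lmdb package; the key/value encoding is specific to create_lmdb_Surg-3M.py, so it only counts entries and peeks at raw keys:

import lmdb

# Open the generated LMDB read-only and report basic statistics.
env = lmdb.open('your lmdb storage path', readonly=True, lock=False)
with env.begin() as txn:
    print('entries:', txn.stat()['entries'])
    # Peek at the first few raw key/value pairs without decoding them.
    for i, (key, value) in enumerate(txn.cursor()):
        print(key[:60], '->', len(value), 'bytes')
        if i >= 4:
            break
env.close()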

(Figure: the video processing pipeline leading to the clean videos in the Surg-3M dataset.)

SurgFM model

You can download the SurgFM full checkpoint, which contains backbone and projection head weights for both the student and teacher networks, at 🤗 SurgFM.
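If you only need the encoder for feature extraction, the sketch below pulls backbone weights out of a DINO-style full checkpoint. The 'teacher'/'student' top-level keys and the 'module.'/'backbone.' prefixes are assumptions based on common DINO checkpoint conventions; the actual SurgFM key layout may differ, so inspect ckpt.keys() first:

import torch

# Load the full checkpoint on CPU; DINO-style checkpoints typically hold
# separate 'student' and 'teacher' state dicts (assumption, check ckpt.keys()).
ckpt = torch.load('path/to/surgfm_full_checkpoint.pth', map_location='cpu')
state = ckpt.get('teacher', ckpt)  # fall back to a flat state dict

# Strip common wrapper prefixes and drop projection-head weights, keeping
# only the backbone parameters.
backbone = {}
for k, v in state.items():
    k = k.replace('module.', '')
    if k.startswith('head'):
        continue  # skip projection-head weights
    backbone[k.replace('backbone.', '')] = v

torch.save(backbone, 'surgfm_backbone_only.pth')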

SurgFM training:

Launch your own SurgFM training with the following command:

$ python3 -m torch.distributed.run --nproc_per_node=8 --nnodes=1 surgfm/surgfm.py --arch convnext_large --data_path 'Surg-3M dataset lmdb path' --output_dir 'your path to store the trained foundation model' --batch_size_per_gpu 40 --num_workers 10

How to run our SurgFM foundation model to extract features from your video frames

import numpy as np
import torch
from PIL import Image
from model_loader import build_SurgFM

# Load the pre-trained SurgFM model and move it to the GPU
surgfm = build_SurgFM(pretrained_weights='your path to the SurgFM')
surgfm.eval()
surgfm.to('cuda')

# Load the image and resize it to the model's input resolution
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))

# HWC uint8 -> NCHW float in [0, 1] (adjust normalization to match
# your training preprocessing if needed)
img_tensor = torch.tensor(np.array(img), dtype=torch.float32) / 255.0
img_tensor = img_tensor.permute(2, 0, 1).unsqueeze(0).to('cuda')

# Extract a 1536-dimensional feature vector from the image
with torch.no_grad():
    outputs = surgfm(img_tensor)
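To featurize whole videos rather than single frames, the following sketch decodes with OpenCV and reuses the preprocessing above, sampling roughly one frame per second (the same rate used to count frames in Surg-3M). The video path and the (N, 1536) output shape are illustrative assumptions:

import cv2
import torch
from model_loader import build_SurgFM

surgfm = build_SurgFM(pretrained_weights='your path to the SurgFM')
surgfm.eval()
surgfm.to('cuda')

cap = cv2.VideoCapture('path/to/your/video.mp4')
fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS is unavailable
step = max(int(round(fps)), 1)           # ~1 sampled frame per second

features, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # OpenCV decodes as BGR
        frame = cv2.resize(frame, (224, 224))
        t = torch.tensor(frame, dtype=torch.float32) / 255.0
        t = t.permute(2, 0, 1).unsqueeze(0).to('cuda')   # NCHW
        with torch.no_grad():
            features.append(surgfm(t).cpu())
    idx += 1
cap.release()

features = torch.cat(features)  # expected shape: (num_sampled_frames, 1536)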
