Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios
📄 arXiv:2508.05829
📦 GitHub Repository
TSMS-SAM2 is a novel framework for promptable video object segmentation and tracking (VOST) in surgical videos, built on the foundation of Segment Anything Model 2 (SAM2).
The framework addresses two critical challenges in surgical scenarios:
- Rapid, irregular object motion across video frames.
- Redundancy in memory features that slows down inference and increases computational load.
To overcome these issues, TSMS-SAM2 introduces two key innovations:
- Multi-scale Temporal Sampling Augmentation (TS): dynamically samples training frames at multiple temporal intervals to enhance robustness across various motion speeds.
- Memory-Splitting & Pruning (MS): divides stored memory features into short-term and long-term groups and prunes redundant features to maintain efficiency and performance.
- 🧠 A promptable, SAM2-based video segmentation framework tailored for surgical videos.
- 🎞️ Multi-scale temporal sampling strategy for robust motion understanding.
- 🧩 Memory-splitting and pruning for efficient memory management and inference.
- 📈 Superior accuracy and efficiency on benchmark datasets (EndoVis2017, EndoVis2018).
- 🔬 Comprehensive ablation studies confirming the effectiveness of each module.
The TSMS-SAM2 framework extends the SAM2 backbone with:
- Temporal Sampling Module (TS): to diversify temporal intervals during training.
- Memory-Splitting Unit (MSU): separates recent and historical memory embeddings.
- Pruning Gate: removes redundant feature vectors based on cosine similarity.
- Promptable decoder: supports mask, box, and point prompts for flexible inference.
| Dataset | Metric | SAM2 | SAM2-TS | SAM2-MS | TSMS-SAM2 (Ours) |
|---|---|---|---|---|---|
| EndoVis2017 | Dice (%) | 92.31 | 93.78 | 94.22 | 95.24 ± 0.96 |
| EndoVis2018 | Dice (%) | 83.02 | 84.61 | 85.20 | 86.73 ± 15.46 |
- TSMS-SAM2 achieves consistent improvements in both accuracy and efficiency.
- Memory usage reduced by ~30% while maintaining high segmentation fidelity.
For detailed results and ablations, refer to the paper.
TSMS-SAM2/
├── configs/ # Configuration files (training/inference)
├── data/ # Dataset preparation scripts
├── docs/ # Documentation, diagrams, figures
├── eval/ # Evaluation metrics and benchmarking
├── inference/ # Inference and tracking scripts
├── models/ # Model definitions (SAM2 + TSMS modules)
├── training/ # Training scripts and utilities
├── utils/ # Helper functions and modules
└── README.md
git clone https://github.com/apple1986/TSMS-SAM2.git
cd TSMS-SAM2conda create -n tsms-sam2 python=3.10 -y
conda activate tsms-sam2Follow the official SAM2 installation guide
and place pretrained SAM2 weights in the checkpoints/ directory.
python training/train_tsms_sam2.py --config configs/tsms_sam2_surg.yaml --epochs 100 --batch_size 8Key Configs
temporal_scales: list of frame intervals for multi-scale sampling.memory_bank_size: maximum stored memory features.pruning_threshold: cosine similarity threshold for pruning.
python inference/predict_video.py --video_path ./samples/surgery.mp4 --checkpoint ./checkpoints/tsms_sam2_best.pth --prompt_type maskSupported prompts:
maskboxpoint
python eval/evaluate.py --pred_dir ./results/ --gt_dir ./datasets/EndoVis2017/ground_truth/- EndoVis2017: Dice = 95.24% ± 0.96
- EndoVis2018: Dice = 86.73% ± 15.46
- Memory compression: reduces memory by ≈ 30%
- Inference speed: up to 1.4× faster than SAM2 baseline
- Temporal sampling currently uses fixed scales (e.g., ×1, ×2).
Future work will explore adaptive temporal sampling guided by motion cues. - Memory pruning parameters are manually set; dynamic thresholding could further enhance efficiency.
- Evaluation limited to surgical videos — extending to general medical and natural videos is promising.
If you use TSMS-SAM2 in your research, please cite:
@article{xu2025tsms_sam2,
title = {TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios},
author = {Xu, Guoping and Shao, Hua-Chieh and Zhang, You},
journal = {arXiv preprint arXiv:2508.05829},
year = {2025}
}This work builds upon the Segment Anything Model 2 (SAM2) from Meta AI and leverages insights from the EndoVis surgical video datasets.
We thank the open-source community for their contributions to video segmentation research.
- Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning
- https://github.com/jinlab-imvr/Surgical-SAM-2
This project is released under the MIT License.
See the LICENSE file for details.
We welcome contributions!
If you find this project helpful:
- ⭐ Star the repo
- 🐛 Open an issue for bugs or feature requests
- 🔧 Submit a pull request for improvements
For questions or collaboration opportunities, please reach out via the GitHub Issues page or email the corresponding author listed in the paper.
TSMS-SAM2 — Enhancing SAM2 for Temporal Understanding in Surgical Video Segmentation
Developed by Guoping Xu, Hua-Chieh Shao, and You Zhang (2025)

