GuoleiSun/Awesome-SAM2

Awesome SAM2 (Segment Anything in Images and Videos)

This repo collects materials (papers, code, slides) on SAM2 (Segment Anything in Images and Videos), a vision foundation model released by Meta AI. The project is continuously updated; pull requests adding any works (papers, repos) we have missed are welcome.

SAM2

Contents

Papers/Projects

Surveys & Reviews

Release Title Code
2024.07 Segment Anything for Videos: A Systematic Survey 📖 Repo
2024.08 Unleashing the Potential of SAM2 for Biomedical Images and Videos: A Survey 📖 Repo
2024.10 On Efficient Variants of Segment Anything Model: A Survey NA
2025.03 SAM2 for Image and Video Segmentation: A Comprehensive Survey NA
2025.07 Survey on deep learning-based weakly supervised salient object detection NA

Traditional Segmentation

Image Segmentation

Release Title Code
2024.10 Towards Natural Image Matting in the Wild via Real-Scenario Prior 🔗 Code
2024.11 CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation NA
2025.03 Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement 🌐Project page
2025.04 MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation 🔗 Code
2025.04 Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding 🌐Project page

Segmentation Applications

Release Title Code
2025.03 Unveiling the Potential of Segment Anything Model 2 for RGB-Thermal Semantic Segmentation with Language Guidance 🔗 Code
2025.03 MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation 🕒Soon
2025.03 Segment Any-Quality Images with Generative Latent Space Enhancement NA
2025.03 Superpowering Open-Vocabulary Object Detectors for X-ray Vision 🔗 Code
2025.04 Semi-automated segmentation of magnitude images in 4D flow MR scans using segment anything model 2 (SAM 2) NA
2025.04 S4M: Boosting Semi-Supervised Instance Segmentation with SAM 🌐Project page
2025.04 KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection NA
2025.04 MovSAM: A Single-image Moving Object Segmentation Framework 🔗 Code
2025.04 Few-Shot Adaptation of Grounding DINO for Agricultural Domain NA
2025.10 SinkSAM-Net: Knowledge-driven self-supervised sinkhole segmentation using topographic priors and Segment Anything Model 🌐Project page

Other Image Tasks

Release Title Code
2025.04 MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking 🔗 Code
2025.05 Vision Foundation Model Embedding-Based Semantic Anomaly Detection NA
2025.05 Synthetic Data Pre-Training for Runway Damage Assessment NA
2025.05 Single-sided estimates of surface breaking porosity in additive manufacturing using multiple inspection techniques NA
2025.05 TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models 🔗 Code
2025.05 PixelThink: Towards Efficient Chain-of-Pixel Reasoning 🌐Project page
2025.05 SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection 🔗 Code
2025.06 Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval 🔗 Code
2025.07 HFS-SAM2: Segment Anything Model 2 with High-Frequency Feature Supplementation for Camouflaged Object Detection 🔗 Code

Video Segmentation

Release Title Code
2024.08 Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track NA
2024.08 The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation NA
2024.08 LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS NA
2024.09 Temporally Propagated Masks and Boxes: Combining the Best of Both Worlds for Multi-Object Tracking NA
2024.10 SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree 🔗 Code
2024.11 A Distractor-Aware Memory for Visual Object Tracking with SAM2 🔗 Code
2024.11 There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks 🔗 Code
2024.12 VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLMs 🌐Project page
2025.02 Towards Fine-grained Interactive Segmentation in Images and Videos NA
2025.03 WeGen: A Unified Model for Interactive Multimodal Generation as We Chat 🔗 Code
2025.03 Pseudo-LiDAR With Two-Dimensional Instance for Monocular Three-Dimensional Object Tracking NA
2025.03 WeakMedSAM: Weakly-Supervised Medical Image Segmentation via SAM with Sub-Class Exploration and Prompt Affinity Mining 🔗 Code
2025.03 Segment Any Motion in Videos 🌐Project page
2025.03 MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation 🔗 Code
2025.04 SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation 🔗 Code
2025.04 DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency 🔗 Code
2025.07 HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking 🔗 Code

Referring/Reasoning Video Object Segmentation

Release Title Code
2024.08 Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation 🔗 Code
2024.11 SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation 🔗 Code
2024.11 SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory 🔗 Code
2025.02 Text-Promptable Propagation for Referring Medical Image Sequence Segmentation NA
2025.03 Online Reasoning Video Segmentation with Just-in-Time Digital Twins NA
2025.04 The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation NA
2025.04 GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation 🌐Project page

Other Video Tasks

Release Title Code
2025.03 MMCD: Memory-Based Multimodal Change Detection NA
2025.03 EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting 🕒Soon
2025.03 Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene 🌐Project page
2025.03 EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining 🔗 Code
2025.03 High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight 🔗 Code
2025.03 FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation NA
2025.04 Segment Any Motion in Videos 🌐Project page
2025.04 Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting 🔗 Code
2025.04 How Can Objects Help Video-Language Understanding? 🕒Soon
2025.04 SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos NA
2025.05 SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation 🔗 Code
2025.05 Research on a traffic flow statistical algorithm based on YBOVDT and SAM2 📊 Data
2025.05 One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory NA
2025.06 ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer 🔗 Code
2025.06 Track Any Object: A Granular Video Anomaly Detection Pipeline 🌐Project page
2025.06 Open-World Object Counting in Videos 🔗 Code
2025.06 Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations NA
2025.06 SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2 NA
2025.07 Visual tracking by matching points using diffusion model 🔗 Code
2025.07 Intelligent and quantitative ligament breakup event analysis in 65 kHz off-axis holographic video of swirl spray 🔗 Code
2025.07 Towards Blind Bitstream-corrupted Video Recovery: A Visual Foundation Model-driven Framework NA
2025.07 SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking 🔗 Code

Audio-visual segmentation (AVS)

Release Title Code
2025.02 Audio visual segmentation through text embeddings NA

Medical Domain

Medical Video & 3D Segmentation

Release Title Code
2024.08 Segment anything in medical images and videos: Benchmark and deployment 🔗 Code
2024.08 SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation NA
2024.08 Performance and Non-adversarial Robustness of the Segment Anything Model 2 in Surgical Video Segmentation NA
2024.08 Novel adaptation of video segmentation to 3D MRI: efficient zero-shot knee segmentation with SAM2 NA
2024.08 Biomedical SAM 2: Segment anything in biomedical images and videos 🔗 Code
2024.08 Polyp SAM 2: Advancing Zero-shot Polyp Segmentation in Colorectal Cancer Detection 🔗 Code
2024.08 Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning 🔗 Code
2024.09 SAM-OCTA2: Layer Sequence OCTA Segmentation with Fine-tuned Segment Anything Model 2 🔗 Code
2024.09 Self-Prompting Polyp Segmentation in Colonoscopy using Hybrid Yolo-SAM 2 Model 🔗 Code
2024.10 A-MFST: Adaptive Multi-Flow Sparse Tracker for Real-Time Tissue Tracking Under Occlusion NA
2024.10 ECHOPulse: ECG controlled echocardio-grams video generation 🔗 Code
2024.11 Phase-Informed Tool Segmentation for Manual Small-Incision Cataract Surgery NA
2025.02 SASVi - Segment Any Surgical Video 🔗 Code
2025.02 Less is More? Revisiting the Importance of Frame Rate in Real-Time Zero-Shot Surgical Video Segmentation
2025.03 SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection 🔗 Code(& dataset)
2025.03 Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering 🔗 Code
2025.03 Rethinking Few-Shot Medical Image Segmentation by SAM2: A Training-Free Framework with Augmentative Prompting and Dynamic Matching NA
2025.03 Self-Prompting Driven SAM2 for 3D Medical Image Segmentation NA
2025.04 RP-SAM2: Refining Point Prompts for Stable Surgical Instrument Segmentation 🔗 Code
2025.04 Agglomerating Large Vision Encoders via Distillation for VFSS Segmentation NA
2025.04 MedSAM2: Segment Anything in 3D Medical Images and Videos 🔗 Code
2025.05 Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos 🕒Soon
2025.05 Adapting Segment Anything 2 for Diabetic Retinopathy Lesion Segmentation NA
2025.07 Beyond Rigid AI: Towards Natural Human-Machine Symbiosis for Interoperative Surgical Assistance NA
2025.07 Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2 NA

Medical Image Segmentation

Release Title Code
2024.08 SAM & SAM 2 in 3D Slicer: SegmentWithSAM Extension for Annotating Medical Images 🔗 Code
2024.08 SAM2-PATH: A better segment anything model for semantic segmentation in digital pathology 🔗 Code
2024.08 Is SAM 2 Better than SAM in Medical Image Segmentation? NA
2024.08 A Short Review and Evaluation of SAM2's Performance in 3D CT Image Segmentation 🔗 Code
2024.08 Interactive 3D Medical Image Segmentation 🔗 Code
2024.08 SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation 🔗 Code
2024.08 SAM2-Adapter: Evaluating & Adapting Segment Anything 2 in Downstream Tasks: Camouflage, Shadow, Medical Image Segmentation, and More 🔗 Code
2024.08 Retrieval-augmented Few-shot Medical Image Segmentation with Foundation Models
2024.10 SAM-Swin: SAM-Driven Dual-Swin Transformers with Adaptive Lesion Enhancement for Laryngo-Pharyngeal Tumor Detection 🔗 Code
2024.11 A multi-task learning model for clinically interpretable sesamoiditis grading NA
2024.11 Zero-shot capability of SAM-family models for bone segmentation in CT scans NA
2024.11 SAM-I2I: Unleash the Power of Segment Anything Model for Medical Image Translation NA
2024.12 Medical SAM 2: Segment Medical Images As Video Via Segment Anything Model 2 🔗 Code
2025.03 Self-Prompting Driven SAM2 for 3D Medical Image Segmentation NA
2025.03 Research on recognition of diabetic retinopathy hemorrhage lesions based on fine tuning of segment anything model NA
2025.03 RP-SAM2: Refining Point Prompts for Stable Surgical Instrument Segmentation 🔗 Code
2025.04 HRMedSeg: Unlocking High-resolution Medical Image segmentation via Memory-efficient Attention Modeling 🔗 Code
2025.04 Prompt Once, Segment Everything: Leveraging SAM 2 Potential for Infinite Medical Image Segmentation with a Single Prompt 🔗 Code
2025.05 ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking 🔗 Code
2025.06 MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation NA
2025.06 SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts 🌐Project page
2025.06 Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning 🔗 Code
2025.07 Speckle2Self: Self-Supervised Ultrasound Speckle Reduction Without Clean Data 🕒Soon

Other Medical Applications

Release Title Code
2025.03 Flip Learning: Weakly supervised erase to segment nodules in breast ultrasound NA
2025.03 From Monocular Vision to Autonomous Action:Guiding Tumor Resection via 3D Reconstruction NA
2025.03 Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins NA
2025.03 Early Detection and Classification of Lung Cancer using Segment Anything Model 2 and DenseNet NA
2025.04 Zero-Shot 4D Lidar Panoptic Segmentation NA
2025.04 SYNTHFM: Training Modality-Agnostic Foundation Models for Medical Image Segmentation Without Real Medical Data NA
2025.04 VoxelFeat: Voxel-wise foundation model features NA
2025.06 Leadership Assessment in Pediatric Intensive Care Unit Team Training NA

Graph Learning

Release Title Code
2025.03 Universal Scene Graph Generation 🌐Project page

Camouflaged Object Detection (COD)

Video COD

Release Title Code
2024.07 Evaluating SAM2's Role in Camouflaged Object Detection: From SAM to SAM2 🔗 Code
2024.09 When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation 🔗 Code
2025.03 CamSAM2: Segment Anything Accurately in Camouflaged Videos 🔗 Code
2025.04 CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection NA

Image COD

Release Title Code
2024.08 SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation 🔗 Code
2024.08 SAM2-Adapter: Evaluating & Adapting Segment Anything 2 in Downstream Tasks: Camouflage, Shadow, Medical Image Segmentation, and More 🔗 Code

Remote Sensing

Release Title Code
2024.11 DED-SAM: Adapting Segment Anything Model 2 for Dual Encoder-Decoder Change Detection NA
2025.01 Prompt-Based Segmentation at Multiple Resolutions and Lighting Conditions using Segment Anything Model 2 NA
2025.03 Customized SAM 2 for Referring Remote Sensing Image Segmentation NA
2025.05 InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition 🔗 Code
2025.06 Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification NA
2025.06 Bundle adjustment for multi-source Mars orbiter imagery with generalized control constraints NA
2025.07 Leveraging SAM 2 and LiDAR for Automated Individual Tree Crown Delineation: A Comparative Evaluation of Prompting Methods 🔗 Code
2025.07 Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment 🔗 Code
2025.07 CSW-SAM: a cross-scale algorithm for very-high-resolution water body segmentation based on segment anything model 2 NA
2025.07 A Fine Agricultural Flood Segmentation Model For HJ-2E S-band SAR Data NA

Mesh or Point Cloud / 3D Processing

Mesh or Point Cloud Segmentation

Release Title Code
2024.08 Segment Any Mesh: Zero-shot Mesh Part Segmentation via Lifting Segment Anything 2 to 3D 🔗 Code
2024.11 Object and Contact Point Tracking in Demonstrations Using 3D Gaussian Splatting NA
2024.11 Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking 🌐Project page
2025.03 Segment-then-Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting 🌐Project page
2025.04 DSM: Building A Diverse Semantic Map for 3D Visual Grounding 🌐Project page
2025.07 GraphSeg: Constructing Segmented 3D Representations via Graph Edge Addition and Contraction NA

Mesh or Point Cloud Reconstruction

Release Title Code
2024.11 Updating Dynamic 3D Scene Graphs from Egocentric Observations 🌐Project page
2024.12 Deblur4DGS: 4D Gaussian Splatting from Blurry Monocular Videos 🌐Project page
2025.02 Inter3D: A Benchmark and Strong Baseline for Human-Interactive 3D Object Reconstruction 🔗 Code

Other Applications

Release Title Code
2025.03 LP-Gaussians: Learnable Parametric Gaussian Splatting for Efficient Dynamic Reconstruction of Single-View Scenes 🌐Project page
2025.03 DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction 🌐Project page
2025.03 Free Your Hands: Lightweight Relightable Turntable Capture Pipeline NA
2025.03 WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images 🕒Soon
2025.03 Pseudo-LiDAR With Two-Dimensional Instance for Monocular Three-Dimensional Object Tracking NA
2025.03 SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint NA
2025.03 SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining 🔗 Code
2025.03 COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting 🔗 Code
2025.03 Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields 🌐Project page
2025.03 Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying 🌐Project page
2025.04 Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting 🌐Project page
2025.04 FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents NA
2025.05 Constructing a 3D Town from a Single Image 🌐Project page
2025.06 GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation 🌐Project page
2025.06 GenMOJO: Robust Multi-Object 4D Generation for In-the-wild Videos 🌐Project page
2025.06 CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image 🌐Project page
2025.06 BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing 🌐Project page
2025.07 LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion 🌐Project page
2025.07 Consistent Bokeh for Multi-View Images With 3D Gaussian Splatting 🔗 Code
2025.07 Defect segmentation and 3D reconstruction in concrete structures using SAM 2 and 3D Gaussian splatting Upon Request
2025.07 Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints 🔗 Code
2025.07 MG-Mono: A Lightweight Multi-Granularity Method for Self-Supervised Monocular Depth Estimation 🔗 Code

Image or Video Generation & Editing

Release Title Code
2024.10 AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing 🔗 Code
2024.11 VideoDirector: Precise Video Editing via Text-to-Video Models 🌐Project page
2024.11 VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing 🌐Project page
2024.11 Generative Omnimatte: Learning to Decompose Video into Layers 🌐Project page
2024.12 InterDyn: Controllable Interactive Dynamics with Video Diffusion Models 🌐Project page
2025.01 MovieCharacter: A Tuning-Free Framework for Controllable Character Video Synthesis 🌐Project page
2025.01 BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations 🌐Project page
2025.03 TransVDM: Motion-Constrained Video Diffusion Model for Transparent Video Synthesis NA
2025.03 Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias 🔗 Code
2025.03 Unified Dense Prediction of Video Diffusion NA
2025.03 DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image 🕒Soon
2025.03 ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis 🌐Project page
2025.03 FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing 🌐Project page
2025.03 MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance 🌐Project page
2025.03 Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting 🌐Project page
2025.03 Multi-Subject and Motion Customization of Text-to-Video Diffusion Models 🌐Project page
2025.04 DreamFuse: Adaptive Image Fusion with Diffusion Transformer 🌐Project page
2025.04 Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution 🔗 Code
2025.06 Keyframe-Guided Creative Video Inpainting 🌐Project page
2025.06 OmniGen2: Exploration to Advanced Multimodal Generation 🌐Project page
2025.07 Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning 🔗 Code
2025.07 Enhanced Velocity Field Modeling for Gaussian Video Reconstruction NA

Simultaneous Localization and Mapping (SLAM / VO)

Release Title Code
2024.11 OVO-SLAM: Open-Vocabulary Online Simultaneous Localization and Mapping NA
2025.06 MCOO-SLAM: A Multi-Camera Omnidirectional Object SLAM System NA
2025.07 VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization NA

Light Field Segmentation

Release Title Code
2024.11 Segment Anything in Light Fields for Real-Time Applications via Constrained Prompting 🔗 Code

Robotics

Release Title Code
2024.10 A Pipeline for Segmenting and Structuring RGB-D Data for Robotics Applications NA
2024.10 VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model 🌐Project page
2025.02 Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos 🌐Project page
2025.02 Map Space Belief Prediction for Manipulation-Enhanced Mapping (To be released)
2025.02 Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids 🌐Project page
2025.03 DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping 🌐Project page
2025.03 Autonomous Dissection in Robotic Cholecystectomy NA
2025.03 MetaFold: Language-Guided Multi-Category Garment Folding Framework via Trajectory Generation and Foundation Model 🌐Project page
2025.03 LuciBot: Automated Robot Policy Learning from Generated Videos 🌐Project page
2025.03 IMPACT : Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models 🌐Project page
2025.03 VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility NA
2025.04 Slot-Level Robotic Placement via Visual Imitation from Single Human Video 🌐Project page
2025.04 Entangled chip removal utilizing mass-spring model with mobile manipulator NA
2025.05 Symbolically-Guided Visual Plan Inference from Uncurated Video Data NA
2025.05 Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer 🌐Project page
2025.05 Grasp the Invisibility by Vision-Language guided Active View Planning NA
2025.07 Geometry-aware 4D Video Generation for Robot Manipulation 🌐Project page
2025.07 Object-Centric Mobile Manipulation through SAM2-Guided Perception and Imitation Learning NA
2025.07 GraspGen: A Diffusion-based Framework for 6-DOF Grasping 🌐Project page
2025.07 RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping 🔗 Code

Adaptation, Compression & Edge Applications

Release Title Code
2025.03 LVMScissor: Split and Schedule Large Vision Model Inference on Mobile Edges via Salp Swarm Algorithm NA
2025.03 SALT: Parameter-Efficient Fine-Tuning via Singular Value Adaptation with Low-Rank Transformation 🔗 Code
2025.04 Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models NA
2025.05 Deploying Vision Foundation AI Models on the Edge. The SAM2 Experience NA

Training

Datasets

Release Title Code
2025.02 SurgPose: a Dataset for Articulated Robotic Surgical Tool Pose Estimation and Tracking 🌐Project page
2025.02 The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition NA
2025.02 Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents 🔗 Code
2025.03 Phantom: Training Robots Without Robots Using Only Human Videos 🌐Project page
2025.03 Scalable Real2Sim: Physics-Aware Asset Generation Via Robotic Pick-and-Place Setups 🌐Project page
2025.03 What Are You Doing? A Closer Look at Controllable Human Video Generation 🔗 Code
2025.03 Instrument-Splatting: Controllable Photorealistic Reconstruction of Surgical Instruments Using Gaussian Splatting 🕒Soon
2025.03 Referring to Any Person 🌐Project page
2025.03 AUTV: Creating Underwater Video Datasets with Pixel-wise Annotations NA
2025.03 DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios 🌐Project page
2025.04 InteractAvatar: Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians NA
2025.05 A fusion network for multi-modality medical image registration with progressive feature alignment 🔗 Code
2025.04 VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing 🌐Project page
2025.04 UrbanWaste: In-the-Bin Dataset for Waste Disposal Inspection with Multi-Granularity Hierarchical Labels 🌐Project page
2025.06 HD-EPIC: A Highly-Detailed Egocentric Video Dataset 🌐Project page
2025.06 GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities 🌐Project page
2025.06 INTERNSPATIAL: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models NA
2025.06 BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos 🌐Project page
2025.06 SAM4D: Segment Anything in Camera and LiDAR Streams 🌐Project page
2025.06 XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation 🌐Project page
2025.07 A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers 🔗 Code
2025.07 Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation 🌐Project page

Used for Data Augmentation (/Tool)

Release Title Code
2025.02 Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach 🌐Project page
2025.03 A Taxonomy for Evaluating Generalist Robot Policies 🌐Project page
2025.03 CRESTE: Scalable Mapless Navigation with Internet Scale Priors and Counterfactual Guidance 🌐Project page
2025.03 Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks 🌐Project page
2025.03 YOLOE: Real-Time Seeing Anything 🔗 Code
2025.03 VACE: All-in-One Video Creation and Editing 🌐Project page
2025.03 V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video 🌐Project page
2025.03 Better Together: Unified Motion Capture and 3D Avatar Reconstruction NA
2025.03 CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance NA
2025.03 RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance 🌐Project page
2025.03 Evaluating the FLUX.1 Synthetic Data on YOLOv9 for AI-Powered Poultry Farming NA
2025.03 Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation 🌐Project page
2025.04 VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning 🌐Project page
2025.05 LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration 🔗 Code
2025.04 M2Flow: A Motion Information Fusion Framework for Enhanced Unsupervised Optical Flow Estimation in Autonomous Driving NA
2025.05 Interspatial Attention for Efficient 4D Human Video Generation 🌐Project page
2025.06 Real-Time Per-Garment Virtual Try-On with Temporal Consistency for Loose-Fitting Garments NA
2025.06 Impact of Synthetic Data from Diffusion Models on Weed Detection Performance NA
2025.06 VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models 🌐Project page
2025.06 Building Software for Analyzing Muck Piles After Blasting in Laboratory Conditions with Integrated Artificial Intelligence NA
2025.06 WeedSwin hierarchical vision transformer with SAM-2 for multi-stage weed detection and classification On Request
2025.07 Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data 🔗 Code
2025.07 RCG: Safety-Critical Scenario Generation for Robust Autonomous Driving via Real-World Crash Grounding 🕒Soon

Training Helper

Release Title Code
2025.03 DINeMo: Learning Neural Mesh Models with no 3D Annotations 🌐Project page
2025.04 CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model NA
2025.04 Aligning Anime Video Generation with Human Feedback 🕒Soon
2025.04 OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding 🌐Project page
2025.06 HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation 🌐Project page
2025.06 Enhancing Visual Localization with Cross-Domain Image Generation 🌐Project page
2025.07 RoboBrain 2.0: See Better. Think Harder. Do Smarter. 🌐Project page
2025.07 Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents 🔗 Code

Performance Evaluations

Release Title Code
2025.02 Vector-Quantized Vision Foundation Models for Object-Centric Learning NA
2025.04 WorldScore: A Unified Evaluation Benchmark for World Generation 🌐Project page
2025.04 BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting NA
2025.05 UWSAM: Segment Anything Model Guided Underwater Instance Segmentation and A Large-scale Benchmark Dataset 🔗 Code
2025.05 Leveraging Segment Anything Model 2 (SAM 2) to optimize segmentation for synthetic data quality in high-clutter baggage NA
2025.05 Synergistic Enhancement: A Study on the Design of Large Models Assisted by End-to-End Road Damage Prompt Network and Methods for Quantification of Damage Morphological Features NA
2025.06 Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition NA
2025.06 AI-Driven MRI-based Brain Tumour Segmentation Benchmarking NA
2025.07 A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level 🔗 Code
2025.07 enLLASD: An ensemble deep learning framework to automate derivation of lower-limb alignments for skeletal dysplasia NA
2025.07 Semantic Segmentation of iPS Cells: Case Study on Model Complexity in Biomedical Imaging NA

Post Processing

Release Title Code
2025.03 Easi3R: Estimating Disentangled Motion from DUSt3R Without Training 🌐Project page
2025.04 Multi-identity Human Image Animation with Structural Video Diffusion NA
2025.06 Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry 🌐Project page
2025.06 Leader360V: A Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environments NA
2025.07 Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction 🌐Project page

Robustness

Release Title Code
2025.04 Robust SAM: On the Adversarial Robustness of Vision Foundation Models NA

Unique Applications/Usage

Release Title Code
2024.09 Towards Robust Automation of Surgical Systems via Digital Twin-based Scene Representations from Foundation Models NA
2024.09 Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models 🔗 Code
2024.09 Point of Interest Recognition and Tracking in Aerial Video during Live Cycling Broadcasts NA
2024.10 ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting 🔗 Code
2024.10 GRS: Generating Robotic Simulation Tasks from Real-World Images NA
2024.10 Iterative Optimization Annotation Pipeline and ALSS-YOLO-Seg for Efficient Banana Plantation Segmentation in UAV Imagery NA
2024.10 Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting 🔗 Code
2025.01 Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images 📊Data
2025.02 Best Foot Forward: Robust Foot Reconstruction in-the-wild
2025.03 ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment 🌐Project page
2025.03 JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse 🌐Project page
2025.04 MORPHEUS: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments 🌐Project page
2025.05 Air-Ground Collaboration for Language-Specified Missions in Unknown Environments NA
2025.06 In Situ Detection and Measurement of Broccoli Heads Under Different Lighting Conditions Using Proximal Remote Sensing NA
2025.07 Zero-Shot Recognition of Test Tube Types by Automatically Collecting and Labeling RGB Data NA
2025.07 Box Pose and Shape Estimation and Domain Adaptation for Large-Scale Warehouse Automation NA
2025.07 Phys2Real: Physically-Informed Gaussian Splatting for Adaptive Sim-to-Real Transfer in Robotic Manipulation NA
