Overview
Image feature extraction models encode images into dense vector representations using self-supervised vision transformers. DINOv2 (Meta) produces high-quality visual embeddings without task-specific fine-tuning, achieving strong results on classification, segmentation, and retrieval through linear probing alone. DINO-ViTS16 is a ViT-Small/16 checkpoint from the original DINO self-distillation method, the predecessor of DINOv2, and has strong spatial feature quality.
Agent Scenarios
- Visual search agent: retrieve visually similar images from a large gallery by comparing DINOv2 embeddings — product search, reverse image lookup, or asset management
- Few-shot recognition agent: classify new object categories from just 1–5 examples by comparing embeddings, without retraining
- Multimodal RAG agent: index image embeddings alongside text embeddings in a shared vector store for cross-modal retrieval
- Anomaly detection agent: flag manufacturing defects or unusual scenes by measuring embedding distance from a reference distribution of normal images
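Three of the scenarios above (visual search, few-shot recognition, anomaly detection) reduce to comparing embedding vectors. The following is a minimal sketch of that comparison step only, assuming the embeddings have already been extracted by the model. The function names (`top_k`, `classify_few_shot`, `anomaly_score`) and the toy 3-dimensional vectors are hypothetical and stand in for real, much higher-dimensional DINO/DINOv2 embeddings. Cosine similarity and mean-embedding prototypes are one common choice, not necessarily what a given agent uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_k(query, gallery, k=3):
    """Visual search: ids of the k gallery embeddings most similar to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]


def classify_few_shot(query, support):
    """Few-shot recognition: pick the class whose mean support embedding is closest."""
    prototypes = {label: mean_vector(vecs) for label, vecs in support.items()}
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


def anomaly_score(query, normal_embeddings):
    """Anomaly detection: 1 - cosine similarity to the mean of the normal embeddings."""
    return 1.0 - cosine(query, mean_vector(normal_embeddings))


# Toy data (hypothetical): 3-dimensional vectors in place of real embeddings.
gallery = {
    "shoe_a": [0.9, 0.1, 0.0],
    "shoe_b": [0.8, 0.2, 0.1],
    "mug": [0.0, 0.1, 0.9],
}
support = {
    "cat": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "dog": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
normal = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]

nearest = top_k([1.0, 0.0, 0.0], gallery, k=2)
predicted = classify_few_shot([0.8, 0.2, 0.0], support)
score_normal = anomaly_score([1.0, 0.0, 0.0], normal)
score_odd = anomaly_score([0.0, 0.0, 1.0], normal)
```

In practice the linear scan in `top_k` would usually be replaced by a vector index once the gallery grows, and the anomaly threshold would need to be calibrated on held-out normal images.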
ModelKit Integration
wmk config → wmk build (ONNX export) → wmk perf → wmk eval
EP Coverage Status
| Model | QNN | OV | VitisAI |
| --- | --- | --- | --- |
| facebook/dinov2-giant | FAIL | FAIL | PASS |
| facebook/dino-vits16 | PASS | PASS | FAIL |
Two distinct failure patterns emerge: dinov2-giant (the largest model) fails on QNN and OV, while dino-vits16 fails only on VitisAI.
Acceptance Criteria