[Task] image-feature-extraction model support #276

Description

@chinazhangchao

Overview

Image feature extraction models encode images into dense vector representations (embeddings). Both target models are self-supervised vision transformers. DINOv2 (Meta) produces visual embeddings that need no task-specific fine-tuning: with the backbone frozen, a linear probe or nearest-neighbor lookup is enough for strong results on classification, segmentation, and retrieval. facebook/dino-vits16 is a ViT-S/16 checkpoint of the original DINO self-distillation method, the predecessor of DINOv2, with strong spatial feature quality.

Agent Scenarios

  • Visual search agent: retrieve visually similar images from a large gallery by comparing DINOv2 embeddings — product search, reverse image lookup, or asset management
  • Few-shot recognition agent: classify new object categories from just 1–5 examples by comparing embeddings, without retraining
  • Multimodal RAG agent: index image embeddings alongside text embeddings in a shared vector store for cross-modal retrieval
  • Anomaly detection agent: flag manufacturing defects or unusual scenes by measuring embedding distance from a reference distribution of normal images
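The scenarios above all reduce to comparing embedding vectors. The following is a minimal sketch of three of them (visual search, few-shot recognition, anomaly detection), assuming embeddings have already been extracted and are available as plain lists of floats. The function names, the use of cosine similarity, and the 3-dimensional toy vectors are illustrative assumptions, not part of ModelKit or of the DINOv2 release; real DINO and DINOv2 embeddings have hundreds of dimensions or more.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query: Vector, gallery: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """Visual search: rank gallery items by similarity to the query."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in gallery.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify_few_shot(query: Vector, support: Dict[str, Sequence[Vector]]) -> str:
    """Few-shot recognition: pick the class whose mean support embedding is closest."""
    prototypes = {label: mean_vector(examples) for label, examples in support.items()}
    return max(prototypes, key=lambda label: cosine_similarity(query, prototypes[label]))


def anomaly_score(query: Vector, normal: Sequence[Vector]) -> float:
    """Anomaly detection: 1 - cosine similarity to the mean of the normal embeddings."""
    return 1.0 - cosine_similarity(query, mean_vector(normal))


# Toy embeddings (hypothetical values, for illustration only).
gallery = {
    "red_mug": [0.9, 0.1, 0.0],
    "blue_mug": [0.8, 0.2, 0.1],
    "tree": [0.0, 0.1, 0.9],
}
support = {
    "mug": [gallery["red_mug"], gallery["blue_mug"]],
    "tree": [gallery["tree"]],
}

nearest = [name for name, _ in top_k([1.0, 0.0, 0.0], gallery, k=2)]
label = classify_few_shot([0.1, 0.0, 1.0], support)
print(nearest, label)
```

At gallery scale, the linear scan in `top_k` would normally be replaced by a vector index, and the distance to a single mean in `anomaly_score` is the simplest possible model of a "reference distribution"; both are simplifications chosen to keep the sketch short.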

ModelKit Integration

wmk config → wmk build (ONNX export) → wmk perf → wmk eval

EP Coverage Status

Model                  QNN   OV    VitisAI
facebook/dinov2-giant  FAIL  FAIL  PASS
facebook/dino-vits16   PASS  PASS  FAIL

The two models fail in complementary ways: dinov2-giant (the larger of the two) fails on QNN and OV but passes on VitisAI, while dino-vits16 passes on QNN and OV and fails only on VitisAI.
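For tracking, the coverage status can be kept as a small lookup and the failing execution providers derived from it. This is a bookkeeping sketch only: the `COVERAGE` dictionary and the `failing_eps` helper are hypothetical names, and the statuses are copied from the table above rather than produced by any wmk command.

```python
from typing import Dict, List

# Status matrix copied by hand from the EP coverage table in this issue.
COVERAGE: Dict[str, Dict[str, str]] = {
    "facebook/dinov2-giant": {"QNN": "FAIL", "OV": "FAIL", "VitisAI": "PASS"},
    "facebook/dino-vits16": {"QNN": "PASS", "OV": "PASS", "VitisAI": "FAIL"},
}


def failing_eps(coverage: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each model to the execution providers on which it currently fails."""
    return {
        model: [ep for ep, status in row.items() if status == "FAIL"]
        for model, row in coverage.items()
    }


failures = failing_eps(COVERAGE)
for model, eps in failures.items():
    print(f"{model}: fails on {', '.join(eps)}")
```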

Acceptance Criteria

  • facebook/dinov2-giant
  • facebook/dino-vits16
