feat: implement image-feature-extraction evaluator by zhenchaoni · Pull Request #346 · microsoft/winml-cli

zhenchaoni · 2026-04-14T09:09:13Z

Image Feature Extraction Evaluator Design

1. Overview

Image feature extraction models (DINOv2, DINO, ViT-in21k, etc.) are vision backbone encoders. They take an image and produce a dense embedding vector — they have no classification head and cannot directly predict class labels.

Example:

Input:  PIL Image of a golden retriever (224×224)
Output: [0.12, -0.45, 0.78, ...]   (768-dimensional embedding vector)

To evaluate embedding quality, we use a k-Nearest Neighbor (kNN) classifier: for each image, we find the most similar images in the dataset and predict the label by weighted majority vote among the neighbors.

2. Usage

uv run winml eval \
    -m ~/.cache/winml/artifacts/facebook_dinov2-small/imgfeat_f79371ee88ad37b4_model.onnx \
    --model-id facebook/dinov2-small \
    --device npu \
    --task image-feature-extraction \
    --dataset timm/mini-imagenet \
    --split test \
    --samples 1000

Output:

╭───────────────────────────────────╮
│ Evaluation: facebook/dinov2-small │
╰───────────────────────────────────╯

Task:       image-feature-extraction
Device:     npu
Dataset:    timm/mini-imagenet
Samples:    1000

┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Metric            ┃   Value ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ knn_top1_accuracy │ 85.8000 │
│ knn_top5_accuracy │ 97.0000 │
└───────────────────┴─────────┘

3. Model Output

The HuggingFace image-feature-extraction pipeline returns a nested list:

pipe(image) -> [[[float, ...]]]    shape: [1, num_tokens, hidden_dim]

For a ViT-base model with 224×224 input and 16×16 patches, this is [1, 197, 768]:

Token 0: CLS token — a single vector representing the entire image
Tokens 1–196: patch tokens — local representations for each 16×16 region

The evaluator extracts token 0 (the CLS token) as the image embedding. This is the standard protocol used by DINOv2, DINO, and ViT papers for downstream evaluation.
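A minimal sketch of the CLS-token extraction described above. It assumes only the nested-list shape `[1, num_tokens, hidden_dim]` stated in this section; `cls_embedding` is a hypothetical helper name, not the evaluator's actual function, and the pipeline output is replaced by random values so the snippet runs without a model.

```python
import numpy as np

def cls_embedding(pipeline_output):
    """Return token 0 (CLS) from a [1, num_tokens, hidden_dim] nested list."""
    tokens = np.asarray(pipeline_output, dtype=np.float32)
    if tokens.ndim != 3 or tokens.shape[0] != 1:
        raise ValueError(f"expected [1, num_tokens, hidden_dim], got {tokens.shape}")
    return tokens[0, 0]

# ViT-base, 224x224 input, 16x16 patches: 14 * 14 patch tokens + 1 CLS token.
num_tokens = (224 // 16) ** 2 + 1

# Stand-in for pipe(image): random values with the documented shape.
fake_output = np.random.default_rng(0).normal(size=(1, num_tokens, 768)).tolist()

embedding = cls_embedding(fake_output)
print(num_tokens, embedding.shape)  # 197 (768,)
```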

4. Metric

kNN top-1 accuracy is the primary metric, with top-5 accuracy as a secondary metric.

Given $N$ images with embeddings $\{e_i\}$ and ground-truth labels $\{y_i\}$:

Compute cosine similarity between all pairs: $\text{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\|e_i\| \, \|e_j\|}$
For each image $i$, find its $k$ nearest neighbors (excluding itself)
Predict the label via similarity-weighted majority vote: each neighbor votes for its own label, and its vote is weighted by its cosine similarity to the query image. The class with the highest total vote weight wins.

Example (k=5 neighbors for a query image):

Neighbor	Label	Similarity (weight)
img_42	dog	0.95
img_87	dog	0.91
img_11	cat	0.88
img_55	dog	0.85
img_23	fox	0.80

Vote totals: dog = 0.95 + 0.91 + 0.85 = 2.71, cat = 0.88, fox = 0.80
→ Predicted label = dog

Report accuracy:

top1_accuracy = correct_predictions / N × 100

For top-5, the prediction is correct if $y_i$ appears among the 5 classes with highest accumulated vote weight.
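The k=5 example can be reproduced in a few lines. This is only an illustration of the weighted vote for a single query, using the neighbor labels and similarities from the example; the variable names are made up for the sketch.

```python
from collections import defaultdict

# (neighbor, label, cosine similarity) for one query image, from the example.
neighbors = [
    ("img_42", "dog", 0.95),
    ("img_87", "dog", 0.91),
    ("img_11", "cat", 0.88),
    ("img_55", "dog", 0.85),
    ("img_23", "fox", 0.80),
]

# Each neighbor adds its similarity to the running total of its own label.
votes = defaultdict(float)
for _name, label, similarity in neighbors:
    votes[label] += similarity

# Classes ranked by accumulated vote weight, highest first.
ranked = sorted(votes, key=votes.get, reverse=True)
top1 = ranked[0]    # top-1 prediction
top5 = ranked[:5]   # top-5 is correct if the true label is in this list

print(top1, round(votes[top1], 2))  # dog 2.71
```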

Why kNN accuracy?

It is a standard evaluation protocol. The DINO and DINOv2 papers report kNN accuracy, alongside linear probing, as a measure of embedding quality. Following the same kind of protocol makes our numbers easier to relate to published results, although they are not identical because the dataset and sample size differ.
No training required. Unlike linear probing (which trains a classifier on top of frozen embeddings), kNN requires no training: it measures the raw quality of the embedding geometry. If similar images land near each other in embedding space, accuracy is high.
Sensitive to ONNX conversion quality. If quantization or graph optimization degrades the embedding space, nearest neighbors change and accuracy drops. A delta within ±0.5 percentage points between the ONNX model and the PyTorch baseline is taken to indicate that the conversion preserved embedding quality.

Parameters:

$k = 10$ neighbors (default, auto-capped to $N-1$ if dataset is smaller)
Cosine similarity (embeddings are L2-normalized before comparison)
Weighted voting (closer neighbors have more influence)
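The first two parameters can be illustrated as follows. This is a sketch under the assumptions stated in the list (cap $k$ at $N-1$, L2-normalize before comparing); `l2_normalize` and `effective_k` are hypothetical helper names, not functions from the PR.

```python
import numpy as np

def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)

def effective_k(k, num_samples):
    """Cap k at N - 1, since a query cannot be its own neighbor."""
    return min(k, num_samples - 1)

x = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit = l2_normalize(x)

# For unit-length rows, the dot product equals the cosine similarity.
sims = unit @ unit.T

print(effective_k(10, len(x)))  # 2
print(round(float(sims[0, 1]), 2))  # 0.6
```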

5. Implementation

Evaluator (`WinMLImageFeatureExtractionEvaluator`)

Iterates over the dataset and computes one embedding per image using the HF image-feature-extraction pipeline. The pipeline returns all token vectors; the evaluator takes the first token (CLS token) as the image-level embedding. All embeddings and labels are collected into numpy arrays and passed to the kNN metric.

kNN Metric (`KNNAccuracyMetric`)

The metric has two steps:

Predict labels (`_predict_labels`): L2-normalize all embeddings, compute the full cosine similarity matrix via a single matrix multiply, then for each image find its k nearest neighbors using `argpartition`. Each neighbor votes for its label weighted by similarity. The class with the highest total vote weight becomes the predicted label. A ranked list of top-5 classes is also produced.
Compute accuracy (`_compute_accuracy`): Compare predicted labels against ground-truth labels. Count how many top-1 predictions match (top-1 accuracy) and how many ground-truth labels appear in the top-5 voted classes (top-5 accuracy). Report both as percentages.
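The two steps can be sketched end to end with NumPy. This is an illustrative reimplementation written from the description above, not the code of `KNNAccuracyMetric`; the function name `knn_accuracy`, its signature, and the synthetic two-cluster data are assumptions made for the example.

```python
import numpy as np

def knn_accuracy(embeddings, labels, k=10, top_n=5):
    """Leave-one-out kNN accuracy with similarity-weighted voting."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    labels = list(labels)
    n = len(labels)
    k = min(k, n - 1)  # cap k when the dataset is small

    # L2-normalize, then one matrix multiply gives all cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # an image is never its own neighbor

    # argpartition places the k largest similarities in the last k columns
    # (in arbitrary order), avoiding a full sort of each row.
    nn_idx = np.argpartition(sims, -k, axis=1)[:, -k:]

    top1_hits = topn_hits = 0
    for i in range(n):
        votes = {}
        for j in nn_idx[i]:
            votes[labels[j]] = votes.get(labels[j], 0.0) + float(sims[i, j])
        ranked = sorted(votes, key=votes.get, reverse=True)[:top_n]
        top1_hits += int(ranked[0] == labels[i])
        topn_hits += int(labels[i] in ranked)

    return {
        "knn_top1_accuracy": 100.0 * top1_hits / n,
        "knn_top5_accuracy": 100.0 * topn_hits / n,
    }

# Synthetic check: two well-separated clusters, 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
demo_embeddings = np.repeat(centers, 20, axis=0) + 0.05 * rng.normal(size=(40, 3))
demo_labels = [0] * 20 + [1] * 20
result = knn_accuracy(demo_embeddings, demo_labels, k=5)
print(result)
```

Only classes that actually received votes are ranked here, so a class with no neighbors among the k nearest cannot count toward top-5; whether the PR's metric makes the same choice is not stated in the description.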

Default Dataset

timm/mini-imagenet (test split), 1000 samples drawn after shuffling. The split is perfectly balanced: 100 classes with 50 images each (5000 total). After shuffling and sampling 1000 images, each class has about 10 images, which gives sufficient density for k=10 kNN voting and keeps accuracy from being skewed by class imbalance.

Time breakdown (1000 images, DINOv2-small):

Embedding: ~30s on NPU, ~120s on CPU (the bottleneck)
kNN computation: <10ms (single matrix multiply + voting loop)
Memory: under 10MB (1000×hidden_dim float32 embeddings, where hidden_dim is 384 for DINOv2-small and 768 for base-size models, plus a 1000×1000 float32 similarity matrix)
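The memory figure can be checked with simple arithmetic. The sketch below assumes float32 storage (4 bytes per value) and uses hidden sizes of 384 for DINOv2-small and 768 for a base-size ViT, taken from the public model configurations rather than from this PR.

```python
def float32_megabytes(*shape):
    """Size in MB (10^6 bytes) of a float32 array with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * 4 / 1e6

n = 1000
similarity_mb = float32_megabytes(n, n)  # full N x N similarity matrix

totals = {}
for name, hidden_dim in [("DINOv2-small", 384), ("ViT-base", 768)]:
    embeddings_mb = float32_megabytes(n, hidden_dim)
    totals[name] = embeddings_mb + similarity_mb
    print(f"{name}: {embeddings_mb:.3f} MB embeddings + "
          f"{similarity_mb:.3f} MB similarities = {totals[name]:.3f} MB")
```

Both totals stay below 10 MB; the similarity matrix grows quadratically with the sample count, so it dominates for larger evaluations.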

6. Evaluation Results

Model	ONNX (NPU)	Baseline (CPU)	Delta	Verdict
facebook/dinov2-small	85.80%	86.20%	-0.4%	PASS
facebook/dinov2-base	89.20%	89.50%	-0.3%	PASS
facebook/dinov2-large	90.80%	91.10%	-0.3%	PASS
facebook/dino-vits16	79.90%	79.70%	+0.2%	PASS
facebook/dino-vitb16	82.70%	83.20%	-0.5%	PASS
google/vit-base-patch16-224-in21k	92.10%	91.90%	+0.2%	PASS
microsoft/rad-dino	94.85%	94.67%	+0.17%	PASS
StanfordAIMI/dinov2-base-xray-224	93.47%	93.64%	-0.17%	PASS

All models pass, with deltas within ±0.5 percentage points: quantized ONNX on NPU preserves embedding quality across general vision and radiology domain models.
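The delta and verdict columns can be recomputed from the two accuracy columns. The sketch below assumes a pass threshold of ±0.5 percentage points, which matches the summary above but is not confirmed as the exact rule used by the test set; `verdict` is a hypothetical helper, and only three rows of the table are included.

```python
TOLERANCE_PP = 0.5  # assumed pass threshold, in percentage points

def verdict(onnx_acc, baseline_acc, tolerance=TOLERANCE_PP):
    """Return (delta in percentage points, "PASS" or "FAIL")."""
    # Round to two decimals so float noise does not push -0.5 over the limit.
    delta = round(onnx_acc - baseline_acc, 2)
    return delta, ("PASS" if abs(delta) <= tolerance else "FAIL")

# (model, ONNX on NPU, baseline on CPU) copied from the results table.
rows = [
    ("facebook/dinov2-small", 85.80, 86.20),
    ("facebook/dino-vits16", 79.90, 79.70),
    ("facebook/dino-vitb16", 82.70, 83.20),
]

for model, onnx_acc, baseline_acc in rows:
    delta, status = verdict(onnx_acc, baseline_acc)
    print(f"{model}: {delta:+.2f} pp -> {status}")
```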

zhenchaoni added 2 commits April 14, 2026 15:20

Implement image-feature-extraction eval

8464657

Resolve comments

30be655

zhenchaoni requested a review from a team as a code owner April 14, 2026 09:09

zhenchaoni linked an issue Apr 14, 2026 that may be closed by this pull request

[Task] image-feature-extraction model support #276

Closed

2 tasks

zhenchaoni removed a link to an issue Apr 14, 2026

[Task] image-feature-extraction model support #276

Closed

2 tasks

zhenchaoni linked an issue Apr 14, 2026 that may be closed by this pull request

Support image-feature-extraction model evaluation #347

Closed

DingmaomaoBJTU mentioned this pull request Apr 21, 2026

feat: implement fill-mask model task evaluator #341

Merged

DingmaomaoBJTU reviewed Apr 21, 2026

Comment thread src/winml/modelkit/eval/image_feature_extraction_evaluator.py Outdated

Comment thread scripts/e2e_eval/testsets/models_with_acc.json

Comment thread tests/unit/eval/test_image_feature_extraction_evaluator.py

zhenchaoni added 2 commits April 21, 2026 16:58

Resolve comments

1a74de0

Merge branch 'main' into private/zhenni/image-feature-extraction-eval

bbf86fe

DingmaomaoBJTU approved these changes Apr 22, 2026

Merge remote-tracking branch 'origin/main' into private/zhenni/image-feature-extraction-eval

df1841a

DingmaomaoBJTU approved these changes Apr 22, 2026

zhenchaoni merged commit c4dd355 into main Apr 22, 2026
9 checks passed

zhenchaoni deleted the private/zhenni/image-feature-extraction-eval branch April 22, 2026 04:27

zhenchaoni merged 5 commits into main from private/zhenni/image-feature-extraction-eval
