feat: implement image-feature-extraction evaluator #346

Merged
zhenchaoni merged 5 commits into main from private/zhenni/image-feature-extraction-eval
Apr 22, 2026

@zhenchaoni

Image Feature Extraction Evaluator Design

1. Overview

Image feature extraction models (DINOv2, DINO, ViT-in21k, etc.) are vision backbone encoders. They take an image and produce a dense embedding vector — they have no classification head and cannot directly predict class labels.

Example:

Input:  PIL Image of a golden retriever (224×224)
Output: [0.12, -0.45, 0.78, ...]   (768-dimensional embedding vector)

To evaluate embedding quality, we use a k-Nearest Neighbor (kNN) classifier: for each image, we find the most similar images in the dataset and predict the label by weighted majority vote among the neighbors.

2. Usage

uv run winml eval \
    -m ~/.cache/winml/artifacts/facebook_dinov2-small/imgfeat_f79371ee88ad37b4_model.onnx \
    --model-id facebook/dinov2-small \
    --device npu \
    --task image-feature-extraction \
    --dataset timm/mini-imagenet \
    --split test \
    --samples 1000

Output:

╭───────────────────────────────────╮
│ Evaluation: facebook/dinov2-small │
╰───────────────────────────────────╯

Task:       image-feature-extraction
Device:     npu
Dataset:    timm/mini-imagenet
Samples:    1000

┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Metric            ┃   Value ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ knn_top1_accuracy │ 85.8000 │
│ knn_top5_accuracy │ 97.0000 │
└───────────────────┴─────────┘

3. Model Output

The HuggingFace image-feature-extraction pipeline returns a nested list:

pipe(image) -> [[[float, ...]]]    shape: [1, num_tokens, hidden_dim]

For a ViT-base model with 224×224 input and 16×16 patches, this is [1, 197, 768]:

  • Token 0: CLS token — a single vector representing the entire image
  • Tokens 1–196: patch tokens — local representations for each 16×16 region

The evaluator extracts token 0 (the CLS token) as the image embedding. This is the standard protocol used by DINOv2, DINO, and ViT papers for downstream evaluation.
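
A minimal sketch of the CLS extraction step, assuming the pipeline output has already been obtained as a nested list of shape [1, num_tokens, hidden_dim]. The helper name `cls_embedding` and the random stand-in for `pipe(image)` are illustrative, not part of the evaluator:

```python
import numpy as np


def cls_embedding(pipeline_output):
    """Return token 0 from a [1, num_tokens, hidden_dim] nested list."""
    tokens = np.asarray(pipeline_output, dtype=np.float32)
    if tokens.ndim != 3 or tokens.shape[0] != 1:
        raise ValueError(
            f"expected shape [1, num_tokens, hidden_dim], got {tokens.shape}"
        )
    # Token 0 is the CLS token; tokens 1..N-1 are patch tokens and are ignored.
    return tokens[0, 0]


# Stand-in for pipe(image): 197 tokens of width 768, as in the ViT-base example.
fake_output = np.random.default_rng(0).normal(size=(1, 197, 768)).tolist()
embedding = cls_embedding(fake_output)
print(embedding.shape)  # (768,)
```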

4. Metric

kNN top-1 accuracy is the primary metric, with top-5 accuracy as a secondary metric.

Given $N$ images with embeddings $\{e_i\}$ and ground-truth labels $\{y_i\}$:

  1. Compute cosine similarity between all pairs: $\text{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\|e_i\| \, \|e_j\|}$

  2. For each image $i$, find its $k$ nearest neighbors (excluding itself)

  3. Predict label via similarity-weighted vote: each neighbor votes for its own label, and its vote is weighted by its cosine similarity to the query image. The class with the highest total vote weight wins.

    Example (k=5 neighbors for a query image):

    | Neighbor | Label | Similarity (weight) |
    |----------|-------|---------------------|
    | img_42   | dog   | 0.95                |
    | img_87   | dog   | 0.91                |
    | img_11   | cat   | 0.88                |
    | img_55   | dog   | 0.85                |
    | img_23   | fox   | 0.80                |

    Vote totals: dog = 0.95 + 0.91 + 0.85 = 2.71, cat = 0.88, fox = 0.80
    → Predicted label = dog

  4. Report accuracy:

top1_accuracy = correct_predictions / N × 100

For top-5, the prediction is correct if $y_i$ appears among the 5 classes with highest accumulated vote weight.
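
The worked example in step 3 can be reproduced with a few lines of Python. This is only a sketch of the voting rule as described above; `weighted_vote` is a hypothetical helper name, and the neighbor list is copied from the example:

```python
from collections import defaultdict

# (neighbor, label, similarity) rows from the k=5 example above.
neighbors = [
    ("img_42", "dog", 0.95),
    ("img_87", "dog", 0.91),
    ("img_11", "cat", 0.88),
    ("img_55", "dog", 0.85),
    ("img_23", "fox", 0.80),
]


def weighted_vote(neighbors):
    """Sum similarities per label and rank labels by total vote weight."""
    totals = defaultdict(float)
    for _name, label, similarity in neighbors:
        totals[label] += similarity
    ranked = sorted(totals, key=totals.get, reverse=True)
    return dict(totals), ranked


totals, ranked = weighted_vote(neighbors)
print(ranked[0])  # dog
```

The ranked list doubles as the top-5 candidate list: a query counts as a top-5 hit when its ground-truth label is among the first five entries.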

Why kNN accuracy?

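  • It follows a widely used evaluation protocol. The DINO and DINOv2 papers report kNN accuracy, alongside linear probing, as a measure of embedding quality. Using the same kind of metric makes our results easier to relate to published ones, although absolute values are not directly comparable: the numbers here come from a 1000-image sample of timm/mini-imagenet, with each image's neighbors drawn from that same sample.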

  • No training required. Unlike linear probing (which trains a classifier on top of frozen embeddings), kNN requires no training — it measures the raw quality of the embedding geometry. If similar images land near each other in embedding space, accuracy is high.

  • Sensitive to ONNX conversion quality. If quantization or graph optimization distorts the embedding space, nearest neighbors change and accuracy drops. A delta within ±0.5 percentage points between the ONNX model and the PyTorch baseline indicates that the conversion preserved embedding quality.

Parameters:

  • $k = 10$ neighbors (default, auto-capped to $N-1$ if dataset is smaller)
  • Cosine similarity (embeddings are L2-normalized before comparison)
  • Weighted voting (closer neighbors have more influence)
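
To illustrate the first two parameters, the sketch below checks that a plain dot product of L2-normalized embeddings equals cosine similarity, and shows the cap on $k$. The helper names `effective_k` and `l2_normalize` and the `eps` guard against zero-length vectors are assumptions made for this example:

```python
import numpy as np


def effective_k(k, n):
    """Cap k at n - 1, since a query cannot be its own neighbor."""
    return min(k, n - 1)


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps avoids division by zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))

# After normalization, one matrix multiply yields all pairwise cosine similarities.
unit = l2_normalize(emb)
sim = unit @ unit.T

# Reference: cosine similarity computed directly from the definition.
row_norms = np.linalg.norm(emb, axis=1)
ref = (emb @ emb.T) / (row_norms[:, None] * row_norms[None, :])

print(np.allclose(sim, ref))  # True
print(effective_k(10, 4))     # 3
```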

5. Implementation

Evaluator (WinMLImageFeatureExtractionEvaluator)

Iterates over the dataset and computes one embedding per image using the HF image-feature-extraction pipeline. The pipeline returns all token vectors; the evaluator takes the first token (CLS token) as the image-level embedding. All embeddings and labels are collected into numpy arrays and passed to the kNN metric.
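
The collection loop could look roughly like the following. This is a sketch rather than the evaluator's actual code: `collect_embeddings` is a hypothetical name, `pipe` is any callable that returns a [1, num_tokens, hidden_dim] nested list, and `fake_pipe` replaces the real HF pipeline so the example runs without a model:

```python
import numpy as np


def collect_embeddings(pipe, samples):
    """Run pipe on each (image, label) pair and keep the CLS token per image."""
    embeddings, labels = [], []
    for image, label in samples:
        output = np.asarray(pipe(image), dtype=np.float32)
        embeddings.append(output[0, 0])  # first token = CLS token
        labels.append(label)
    return np.stack(embeddings), np.asarray(labels)


def fake_pipe(image):
    """Stand-in pipeline: 5 tokens of width 4, CLS token filled with the input value."""
    tokens = np.zeros((1, 5, 4))
    tokens[0, 0, :] = image
    return tokens.tolist()


# Each "image" is just a float here, so the result is easy to inspect.
samples = [(1.0, 0), (2.0, 1), (3.0, 0)]
embeddings, labels = collect_embeddings(fake_pipe, samples)
print(embeddings.shape, labels.tolist())  # (3, 4) [0, 1, 0]
```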

kNN Metric (KNNAccuracyMetric)

The metric has two steps:

  1. Predict labels (_predict_labels): L2-normalize all embeddings, compute the full cosine similarity matrix via a single matrix multiply, then for each image find its k nearest neighbors using argpartition. Each neighbor votes for its label weighted by similarity. The class with the highest total vote weight becomes the predicted label. A ranked list of top-5 classes is also produced.

  2. Compute accuracy (_compute_accuracy): Compare predicted labels against ground-truth labels. Count how many top-1 predictions match (top-1 accuracy) and how many ground-truth labels appear in the top-5 voted classes (top-5 accuracy). Report both as percentages.
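
The two steps can be sketched in NumPy as below. This illustrates the described algorithm and is not the code of `KNNAccuracyMetric`. The function name `knn_accuracy`, the choice to rank only classes that received at least one vote, and the toy two-cluster data are all assumptions made for the example:

```python
import numpy as np


def knn_accuracy(embeddings, labels, k=10, top_n=5):
    """Leave-one-out kNN top-1 / top-n accuracy with similarity-weighted votes."""
    labels = np.asarray(labels)
    n = len(labels)
    k = min(k, n - 1)  # cap k when the dataset is small

    # Cosine similarity matrix via one matrix multiply on unit vectors.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # an image is never its own neighbor

    # Indices of the k most similar images per row (order within the k is arbitrary).
    nn_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]

    classes = np.unique(labels)
    class_index = np.searchsorted(classes, labels)

    top1_hits = 0
    topn_hits = 0
    for i in range(n):
        neighbor_classes = class_index[nn_idx[i]]
        votes = np.zeros(len(classes))
        np.add.at(votes, neighbor_classes, sim[i, nn_idx[i]])
        # Rank only classes that actually received a vote.
        voted = np.unique(neighbor_classes)
        ranked = voted[np.argsort(-votes[voted])][:top_n]
        top1_hits += int(ranked[0] == class_index[i])
        topn_hits += int(class_index[i] in ranked)

    return {
        "knn_top1_accuracy": 100.0 * top1_hits / n,
        "knn_top5_accuracy": 100.0 * topn_hits / n,
    }


# Toy data: two tight clusters pointing in different directions.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[5, 0], scale=0.1, size=(10, 2))
cluster_b = rng.normal(loc=[0, 5], scale=0.1, size=(10, 2))
emb = np.vstack([cluster_a, cluster_b])
lab = [0] * 10 + [1] * 10

result = knn_accuracy(emb, lab, k=5)
print(result)
```

Because the clusters are well separated in angle, every query's neighbors come from its own cluster and both accuracies are 100 on this toy input.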

Default Dataset

timm/mini-imagenet (test split), 1000 samples with shuffle. The dataset is perfectly balanced: 100 classes with 50 images each (5000 total). After shuffling and sampling 1000, each class has about 10 images on average. This gives enough same-class neighbors for k=10 kNN voting and keeps accuracy from being skewed by class imbalance.
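
A quick simulation of the sampling step, using integer labels in place of the real dataset, shows how the per-class count spreads around 10. The seed is an arbitrary assumption and the actual shuffle in the evaluator may differ:

```python
import random
from collections import Counter

# 100 classes x 50 images, represented by their labels only.
all_labels = [c for c in range(100) for _ in range(50)]

rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
sampled = rng.sample(all_labels, 1000)

counts = Counter(sampled)
mean_per_class = len(sampled) / 100
print(mean_per_class)  # 10.0
print(min(counts.values()), max(counts.values()))  # spread around the mean
```

The mean is exactly 10 by construction, but individual classes vary, so some queries will have fewer than 10 same-class candidates in the sample.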

Time breakdown (1000 images, DINOv2-small):

  • Embedding: ~30s on NPU, ~120s on CPU (the bottleneck)
  • kNN computation: <10ms (single matrix multiply + voting loop)
  • Memory: under 10MB (1000×384 float32 embeddings for DINOv2-small, about 1.5MB, plus a 1000×1000 float32 similarity matrix, about 4MB)
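
The memory figure can be checked with simple arithmetic. The sketch below assumes float32 storage (4 bytes per value) for both the embeddings and the similarity matrix. `knn_memory_bytes` is a hypothetical helper, and the hidden widths 384 and 768 are used as representative sizes for small and base models:

```python
def knn_memory_bytes(n, hidden_dim, bytes_per_value=4):
    """Bytes held by an n x hidden_dim embedding matrix plus an n x n similarity matrix."""
    embeddings = n * hidden_dim * bytes_per_value
    similarity = n * n * bytes_per_value
    return embeddings + similarity


for dim in (384, 768):
    print(dim, knn_memory_bytes(1000, dim) / 1e6, "MB")
```

The similarity matrix grows quadratically with the sample count, so it dominates once the number of samples exceeds the hidden width.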

6. Evaluation Results

| Model                             | ONNX (NPU) | Baseline (CPU) | Delta  | Verdict |
|-----------------------------------|-----------:|---------------:|-------:|---------|
| facebook/dinov2-small             |     85.80% |         86.20% |  -0.4% | PASS    |
| facebook/dinov2-base              |     89.20% |         89.50% |  -0.3% | PASS    |
| facebook/dinov2-large             |     90.80% |         91.10% |  -0.3% | PASS    |
| facebook/dino-vits16              |     79.90% |         79.70% |  +0.2% | PASS    |
| facebook/dino-vitb16              |     82.70% |         83.20% |  -0.5% | PASS    |
| google/vit-base-patch16-224-in21k |     92.10% |         91.90% |  +0.2% | PASS    |
| microsoft/rad-dino                |     94.85% |         94.67% | +0.17% | PASS    |
| StanfordAIMI/dinov2-base-xray-224 |     93.47% |         93.64% | -0.17% | PASS    |

All models pass with deltas within ±0.5% — quantized ONNX on NPU preserves embedding quality across general vision and radiology domain models.
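
A sketch of how the verdict column could be derived from the two accuracy columns. The pass rule (absolute delta of at most 0.5 percentage points) is an assumption inferred from the ±0.5% statement above, and `verdict` is a hypothetical helper. The three rows are copied from the table:

```python
THRESHOLD = 0.5  # assumed tolerance, in percentage points

# model -> (ONNX on NPU accuracy, CPU baseline accuracy), taken from the table above
results = {
    "facebook/dinov2-small": (85.80, 86.20),
    "facebook/dino-vits16": (79.90, 79.70),
    "facebook/dino-vitb16": (82.70, 83.20),
}


def verdict(onnx_acc, baseline_acc, threshold=THRESHOLD):
    """Return (delta, "PASS" or "FAIL") for one model."""
    # Round to two decimals so float noise does not push a -0.5 delta over the limit.
    delta = round(onnx_acc - baseline_acc, 2)
    status = "PASS" if abs(delta) <= threshold else "FAIL"
    return delta, status


for model, (onnx_acc, baseline_acc) in results.items():
    delta, status = verdict(onnx_acc, baseline_acc)
    print(f"{model}: {delta:+.2f} {status}")
```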

@zhenchaoni zhenchaoni requested a review from a team as a code owner April 14, 2026 09:09
@zhenchaoni zhenchaoni linked an issue Apr 14, 2026 that may be closed by this pull request
2 tasks
@zhenchaoni zhenchaoni linked an issue Apr 14, 2026 that may be closed by this pull request
Comment thread src/winml/modelkit/eval/image_feature_extraction_evaluator.py Outdated
Comment thread scripts/e2e_eval/testsets/models_with_acc.json
Comment thread tests/unit/eval/test_image_feature_extraction_evaluator.py
@zhenchaoni zhenchaoni merged commit c4dd355 into main Apr 22, 2026
9 checks passed
@zhenchaoni zhenchaoni deleted the private/zhenni/image-feature-extraction-eval branch April 22, 2026 04:27

Development

Successfully merging this pull request may close these issues.

Support image-feature-extraction model evaluation