feat: implement image-feature-extraction evaluator #346
Merged
2 tasks
DingmaomaoBJTU
approved these changes
Apr 22, 2026
…feature-extraction-eval
Image Feature Extraction Evaluator Design
1. Overview
Image feature extraction models (DINOv2, DINO, ViT-in21k, etc.) are vision backbone encoders. They take an image and produce a dense embedding vector — they have no classification head and cannot directly predict class labels.
Example:
To evaluate embedding quality, we use a k-Nearest Neighbor (kNN) classifier: for each image, we find the most similar images in the dataset and predict the label by weighted majority vote among the neighbors.
2. Usage
Output:
3. Model Output
The HuggingFace `image-feature-extraction` pipeline returns a nested list of per-token vectors. For a ViT-base model with 224×224 input and 16×16 patches, its shape is `[1, 197, 768]`: 1 image, 197 tokens (196 patch tokens plus 1 CLS token), and 768 hidden dimensions.
The evaluator extracts token 0 (the CLS token) as the image embedding. This is the standard protocol used by DINOv2, DINO, and ViT papers for downstream evaluation.
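As a minimal sketch of the CLS extraction step, the snippet below pulls token 0 out of a nested list shaped like the pipeline output. The helper name `cls_embedding` and the random stand-in data are assumptions for illustration only; they are not taken from the PR's code.

```python
import numpy as np

def cls_embedding(pipeline_output):
    """Return the token-0 (CLS) vector from a [1, num_tokens, hidden_dim] nested list."""
    tokens = np.asarray(pipeline_output, dtype=np.float32)
    if tokens.ndim != 3 or tokens.shape[0] != 1:
        raise ValueError(f"expected shape [1, tokens, dim], got {tokens.shape}")
    return tokens[0, 0]

# Stand-in for a real pipeline result with the ViT-base shape [1, 197, 768].
fake_output = np.random.default_rng(0).normal(size=(1, 197, 768)).tolist()
embedding = cls_embedding(fake_output)
print(embedding.shape)  # (768,)
```

Converting to a numpy array first keeps the shape check explicit, so a model that returns a pooled or differently shaped output fails loudly instead of silently yielding the wrong vector.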
4. Metric
kNN top-1 accuracy is the primary metric, with top-5 accuracy as a secondary metric.
Given $N$ images with embeddings $\{e_i\}$ and ground-truth labels $\{y_i\}$:
Compute cosine similarity between all pairs: $\text{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\|e_i\| \|e_j\|}$
For each image $i$, find its $k$ nearest neighbors (excluding itself)
Predict label via similarity-weighted vote: each neighbor votes for its own label, and its vote is weighted by its cosine similarity to the query image. The class with the highest total vote weight wins.
Example (k=5 neighbors for a query image):
Vote totals: dog = 0.95 + 0.91 + 0.85 = 2.71, cat = 0.88, fox = 0.80
→ Predicted label = dog
Report accuracy: `top1_accuracy = correct_predictions / N × 100`
For top-5, the prediction is correct if $y_i$ appears among the 5 classes with highest accumulated vote weight.
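The steps above can be sketched in numpy as follows. This is an illustrative sketch under stated assumptions, not the PR's `KNNAccuracyMetric`: the function names `weighted_vote` and `knn_accuracy` are hypothetical, raw cosine similarity is used as the vote weight, and `k=10` is assumed as the default. The final call replays the k=5 dog/cat/fox example, with per-neighbor similarities inferred from its vote totals.

```python
import numpy as np

def weighted_vote(neighbor_labels, neighbor_sims, top=5):
    """Rank classes by total similarity weight among the neighbors."""
    totals = {}
    for label, sim in zip(neighbor_labels, neighbor_sims):
        totals[label] = totals.get(label, 0.0) + float(sim)
    # Highest accumulated weight first; keep at most `top` classes.
    return sorted(totals, key=totals.get, reverse=True)[:top]

def knn_accuracy(embeddings, labels, k=10):
    """Leave-one-out kNN top-1 and top-5 accuracy, in percent."""
    emb = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(emb)
    # L2-normalize so a dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)  # an image is never its own neighbor
    k = min(k, n - 1)
    # Indices of the k most similar images per row (unordered within the k).
    nn = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    top1 = top5 = 0
    for i in range(n):
        ranked = weighted_vote(labels[nn[i]], sim[i, nn[i]])
        top1 += int(ranked[0] == labels[i])
        top5 += int(labels[i] in ranked)
    return 100.0 * top1 / n, 100.0 * top5 / n

# Worked example: dog = 0.95 + 0.91 + 0.85 = 2.71, cat = 0.88, fox = 0.80
print(weighted_vote(["dog", "dog", "cat", "dog", "fox"],
                    [0.95, 0.91, 0.88, 0.85, 0.80]))  # ['dog', 'cat', 'fox']
```

The full similarity matrix costs O(N²) memory, which is fine for the 1000-image default but would need chunking for much larger sets.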
Why kNN accuracy?
It is a standard evaluation protocol. The DINO and DINOv2 papers report kNN accuracy alongside linear probing as an embedding quality metric. Following the same style of protocol makes our results easier to relate to published numbers.
No training required. Unlike linear probing (which trains a classifier on top of frozen embeddings), kNN requires no training — it measures the raw quality of the embedding geometry. If similar images land near each other in embedding space, accuracy is high.
Sensitive to ONNX conversion quality. If quantization or graph optimization degrades the embedding space, nearest neighbors change and accuracy drops. A delta within ±0.5% between the ONNX model and the PyTorch baseline is taken to indicate that the conversion preserved embedding quality.
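A minimal sketch of the pass/fail check implied by the ±0.5% criterion is shown below. The function name `within_tolerance` and the accuracy values are made up for illustration; they are not results from this PR.

```python
def within_tolerance(onnx_acc, baseline_acc, tol=0.5):
    """True if the ONNX accuracy is within `tol` percentage points of the baseline."""
    return abs(onnx_acc - baseline_acc) <= tol

# Hypothetical accuracies in percent, for illustration only.
print(within_tolerance(78.3, 78.6))  # True: delta is about -0.3
print(within_tolerance(77.0, 78.6))  # False: delta is about -1.6
```

With 1000 evaluation images, one image corresponds to 0.1 percentage points, so a 0.5% tolerance allows roughly five predictions to differ.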
Parameters:
5. Implementation
Evaluator (`WinMLImageFeatureExtractionEvaluator`)
Iterates over the dataset and computes one embedding per image using the HF `image-feature-extraction` pipeline. The pipeline returns all token vectors; the evaluator takes the first token (CLS token) as the image-level embedding. All embeddings and labels are collected into numpy arrays and passed to the kNN metric.
kNN Metric (`KNNAccuracyMetric`)
The metric has two steps:
Predict labels (`_predict_labels`): L2-normalize all embeddings, compute the full cosine similarity matrix via a single matrix multiply, then for each image find its k nearest neighbors using `argpartition`. Each neighbor votes for its label weighted by similarity. The class with the highest total vote weight becomes the predicted label. A ranked list of top-5 classes is also produced.
Compute accuracy (`_compute_accuracy`): Compare predicted labels against ground-truth labels. Count how many top-1 predictions match (top-1 accuracy) and how many ground-truth labels appear in the top-5 voted classes (top-5 accuracy). Report both as percentages.
Default Dataset
`timm/mini-imagenet` (test split), 1000 samples with shuffle. The dataset is perfectly balanced: 100 classes with 50 images each (5000 total). After shuffling and sampling 1000, each class has ~10 images, which gives sufficient density for k=10 kNN voting and keeps accuracy from being skewed by class imbalance.
Time breakdown (1000 images, DINOv2-small):
6. Evaluation Results
All models pass with deltas within ±0.5% — quantized ONNX on NPU preserves embedding quality across general vision and radiology domain models.