feat: Implement zero-shot-image-classification evaluator #380
Merged
1e0b385 to 7b82bcf
vortex-captain approved these changes on Apr 24, 2026
vortex-captain approved these changes on Apr 29, 2026
Zero-Shot Image Classification — Design
1. What the task is
Zero-shot image classification predicts a label for an image without any task-specific fine-tuning. A dual-encoder model (CLIP-family, SigLIP) produces an image embedding and a set of per-class text embeddings, then picks the class whose text embedding has the highest cosine similarity with the image. The class vocabulary is defined at inference time, not at training time — hence zero-shot.
The same model supports any vocabulary. Swap the candidate class list and you have a different classifier; no re-training.
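
The selection rule described above can be sketched in a few lines. This is a minimal illustration and not the code in this PR: the embeddings are toy vectors standing in for real encoder outputs, and the helper names `cosine_similarity` and `zero_shot_classify` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embed, text_embeds_by_label):
    """Return (best_label, scores) where scores maps each label to its similarity."""
    scores = {
        label: cosine_similarity(image_embed, text_embed)
        for label, text_embed in text_embeds_by_label.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Toy embeddings (assumed values, not produced by a real model).
image_embed = [0.1, 2.0, 0.3]
vocabulary = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "truck": [0.0, 0.0, 1.0],
}
best, scores = zero_shot_classify(image_embed, vocabulary)

# Swapping the vocabulary changes the classifier; the image embedding is reused as-is.
smaller_vocabulary = {k: v for k, v in vocabulary.items() if k != "dog"}
best_without_dog, _ = zero_shot_classify(image_embed, smaller_vocabulary)
```

Only the text side changes when the candidate list changes, which is why no re-training is involved.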
2. Split-encoder ONNX export
CLIP/SigLIP are dual-encoder architectures. They fuse only at the cosine-similarity step — the image tower and text tower run independently and do not share activations. We export each as a separate ONNX file:
| Role | Export task | Output |
| --- | --- | --- |
| image-encoder | image-feature-extraction | image_embeds (CLIP WithProjection) / pooler_output (SigLIP) |
| text-encoder | feature-extraction | text_embeds (CLIP WithProjection) / pooler_output (SigLIP) |

Build-time orchestration:
- WinMLModelForZeroShotImageClassification.
- ("clip", "zero-shot-image-classification") and ("siglip", "zero-shot-image-classification").
- forward() runs the two sub-sessions and calculates the cosine similarity.
- ZeroShotImageClassificationPipeline.

3. CLI: passing the two ONNX files
winml eval accepts the composite model as repeated -m role=path pairs.

4. Evaluation
We follow OpenAI CLIP's zero-shot classification methodology on CIFAR-100 (uoft-cs/cifar100, test split). Classification accuracy is the metric used to detect regressions.

Steps:
1. Build one prompt per class label from the template "This is a photo of {label}."
2. Call self.pipe(image, candidate_labels=prompts). The output contains the cosine similarity between the image and each label.
3. Compare the ranked predictions against fine_label. We report top-1 and top-5 via TopKAccuracyMetric.

Models evaluated (CIFAR-100 test, 1000 samples, NPU). "Reported" numbers come from the OpenAI CLIP paper and CLIP_benchmark / open_clip results; our baselines track them to within single-template variance, which indicates that the methodology matches the standard.
All NPU-vs-baseline deltas are within normal quantization-noise territory (±~4pp).
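
The evaluation steps above can be sketched as follows. This is an illustration under stated assumptions, not the evaluator in this PR: `build_prompts` and `top_k_accuracy` are hypothetical helpers, `top_k_accuracy` is a stand-in that does not reproduce TopKAccuracyMetric, and the score rows are made-up values in place of real pipeline output.

```python
def build_prompts(class_names, template="This is a photo of {label}."):
    """Fill the single default template once per class name."""
    return [template.format(label=name) for name in class_names]

def top_k_accuracy(score_rows, true_indices, k):
    """Fraction of samples whose true class index is among the k highest scores."""
    hits = 0
    for scores, truth in zip(score_rows, true_indices):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_indices)

class_names = ["apple", "bear", "clock"]  # toy vocabulary, not CIFAR-100
prompts = build_prompts(class_names)

# One row of per-class scores per image (assumed values).
score_rows = [
    [0.31, 0.22, 0.18],  # true class 0 is ranked first
    [0.20, 0.27, 0.25],  # true class 2 is ranked second
    [0.15, 0.19, 0.30],  # true class 0 is ranked last
]
true_indices = [0, 2, 0]

top1 = top_k_accuracy(score_rows, true_indices, k=1)
top2 = top_k_accuracy(score_rows, true_indices, k=2)
```

Running the same function on the baseline and the quantized scores and subtracting the two accuracies gives the delta that the regression check looks at.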
5. Prompt template
OpenAI's paper curates different prompt ensembles per dataset (e.g. 18 templates for CIFAR-100, "a photo of a {}, a type of food." for Food101, 48 action templates for UCF101). Ensembling recovers 3-7pp of absolute accuracy.

We do not ship this. Our evaluator is for regression detection, not paper reproduction: identical preprocessing between the fp32 and quantized runs is what matters, and a single default template ("This is a photo of {}.") satisfies that.

6. Key files in this changelist
| File | Role in this change |
| --- | --- |
| zero_shot_image_classification.py | Runs the image-encoder + text-encoder ONNX sessions and produces ZeroShotImageClassifierOutput for the HF pipeline. |
| zero_shot_image_classification_evaluator.py | Uses ZeroShotImageClassificationPipeline and reports top-1 / top-5 accuracy. |
| siglip.py | |
| eval.py | Accepts -m role=path pairs, with role-name validation against the composite class's _SUB_MODEL_CONFIG. |