feat: Implement zero-shot-image-classification evaluator by zhenchaoni · Pull Request #380 · microsoft/winml-cli

zhenchaoni · 2026-04-22T07:56:39Z

Zero-Shot Image Classification — Design

1. What the task is

Zero-shot image classification predicts a label for an image without any task-specific fine-tuning. A dual-encoder model (CLIP-family, SigLIP) produces an image embedding and a set of per-class text embeddings, then picks the class whose text embedding has the highest cosine similarity with the image embedding. The class vocabulary is defined at inference time, not at training time, hence zero-shot.

The same model supports any vocabulary. Swap the candidate class list and you have a different classifier; no re-training.
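The selection step described above can be sketched with plain NumPy. This is a minimal illustration, not the code in this PR: the helper names and the toy 3-dimensional embeddings are made up, and real embeddings would come from the image and text towers.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(image_embed, text_embeds, labels):
    """Return the label whose text embedding is closest to the image embedding."""
    img = l2_normalize(np.asarray(image_embed, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_embeds, dtype=np.float64))
    sims = txt @ img  # shape: (num_classes,)
    return labels[int(np.argmax(sims))], sims


# Toy embeddings (hypothetical values, one row per candidate class).
labels = ["cat", "dog", "truck"]
text_embeds = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
image_embed = np.array([0.2, 0.9, 0.1])

best, sims = zero_shot_classify(image_embed, text_embeds, labels)

# Swapping the vocabulary only changes the text side; the image embedding is reused.
best_small, _ = zero_shot_classify(image_embed, text_embeds[1:], labels[1:])
```

Changing the classifier amounts to passing a different `labels` / `text_embeds` pair; nothing about the image side changes.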

2. Split-encoder ONNX export

CLIP/SigLIP are dual-encoder architectures. They fuse only at the cosine-similarity step — the image tower and text tower run independently and do not share activations. We export each as a separate ONNX file:

Role	HF task used for export	Output name
`image-encoder`	`image-feature-extraction`	`image_embeds` (CLIP WithProjection) / `pooler_output` (SigLIP)
`text-encoder`	`feature-extraction`	`text_embeds` (CLIP WithProjection) / `pooler_output` (SigLIP)

Build-time orchestration:

Composite class: WinMLModelForZeroShotImageClassification.
Registered for ("clip", "zero-shot-image-classification") and ("siglip", "zero-shot-image-classification").
forward() runs the two sub-sessions and calculates the cosine similarity.
Drop-in compatible with the HF ZeroShotImageClassificationPipeline.

3. CLI: passing the two ONNX files

winml eval accepts the composite model as repeated -m role=path pairs:

winml eval \
  -m image-encoder=path/to/image_encoder.onnx \
  -m text-encoder=path/to/text_encoder.onnx \
  --model-id openai/clip-vit-base-patch16 \
  --task zero-shot-image-classification \
  --device npu \
  --dataset uoft-cs/cifar100 \
  --split test \
  --samples 1000 \
  --column input_column=img \
  --column label_column=fine_label
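As a rough sketch of how repeated `-m role=path` pairs could be collected and validated, the snippet below uses `argparse` with `action="append"`. It is not the actual `eval.py` code: `ALLOWED_ROLES` is a hypothetical stand-in for the role names held in the composite class's `_SUB_MODEL_CONFIG`, and `parse_model_args` is an illustrative helper.

```python
import argparse

# Hypothetical stand-in for the role names defined by the composite class.
ALLOWED_ROLES = ("image-encoder", "text-encoder")


def parse_model_args(pairs, allowed_roles=ALLOWED_ROLES):
    """Turn ["role=path", ...] into {role: path}, rejecting bad or missing roles."""
    models = {}
    for pair in pairs:
        role, sep, path = pair.partition("=")
        if not sep or not role or not path:
            raise ValueError(f"expected role=path, got {pair!r}")
        if role not in allowed_roles:
            raise ValueError(f"unknown role {role!r}; expected one of {allowed_roles}")
        if role in models:
            raise ValueError(f"role {role!r} given more than once")
        models[role] = path
    missing = [r for r in allowed_roles if r not in models]
    if missing:
        raise ValueError(f"missing sub-models: {missing}")
    return models


parser = argparse.ArgumentParser(prog="eval-sketch")
# action="append" collects every occurrence of -m into one list.
parser.add_argument("-m", dest="models", action="append", default=[],
                    metavar="ROLE=PATH")

args = parser.parse_args([
    "-m", "image-encoder=path/to/image_encoder.onnx",
    "-m", "text-encoder=path/to/text_encoder.onnx",
])
models = parse_model_args(args.models)
```

Validating role names up front means a typo such as `-m img-encoder=...` fails before any ONNX session is created.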

4. Evaluation

We leverage OpenAI CLIP's zero-shot classification methodology on CIFAR-100 (uoft-cs/cifar100, test split). We use classification accuracy as the metric to detect regression.

Steps:

Build prompt list from the 100 CIFAR-100 class names using HF pipeline's default template: "This is a photo of {label}."
Run HF pipeline per image: self.pipe(image, candidate_labels=prompts). The output contains one score per label, derived from the cosine similarity between the image and that label's prompt.
Top-1 predicted label = candidate with highest cosine similarity.
Accuracy = fraction of samples where predicted label == dataset ground-truth fine_label. We report top-1 and top-5 via TopKAccuracyMetric.

  class names ─▶ prompts ─┐
                          ▼
  image ─────────▶ HF pipeline ─▶ top-1 label ─▶ compare ─▶ accuracy
                                                    ▲
                                                    │
                                            ground truth
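The accuracy computation in the last step can be illustrated as follows. This is a sketch of the general top-k accuracy definition, not the repository's `TopKAccuracyMetric`; the function name and the toy score matrix are assumptions made for the example.

```python
import numpy as np


def top_k_accuracy(scores, targets, k):
    """Fraction of samples whose true class index is among the k highest scores."""
    scores = np.asarray(scores, dtype=np.float64)
    targets = np.asarray(targets)
    k = min(k, scores.shape[1])
    # Sort each row in descending score order and keep the first k class indices.
    top_k = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    hits = (top_k == targets[:, None]).any(axis=1)
    return float(hits.mean())


# Toy per-label scores: one row per image, one column per candidate label.
scores = np.array([
    [0.10, 0.70, 0.20],  # truth 1: top-1 hit
    [0.50, 0.30, 0.20],  # truth 1: top-1 miss, top-2 hit
    [0.05, 0.15, 0.80],  # truth 0: miss for both
    [0.60, 0.30, 0.10],  # truth 0: top-1 hit
])
targets = [1, 1, 0, 0]

top1 = top_k_accuracy(scores, targets, k=1)
top2 = top_k_accuracy(scores, targets, k=2)
```

With 100 CIFAR-100 classes the same function would be called with `k=1` and `k=5` on a score matrix of shape (samples, 100).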

Models evaluated (CIFAR-100 test, 1000 samples, NPU). "Reported" numbers come from the OpenAI CLIP paper and CLIP_benchmark / open_clip results — our baselines track them to within single-template variance, which confirms the methodology matches the standard.

Model	HF baseline (fp32, CPU), %	ONNX quantized (NPU), %	Δ (pp)	Reported, %
openai/clip-vit-base-patch32	58.4	60.7	+2.3	~60
openai/clip-vit-base-patch16	62.6	63.4	+0.8	~67
openai/clip-vit-large-patch14	73.6	73.7	+0.1	~77
openai/clip-vit-large-patch14-336	73.4	73.1	−0.3	~78
laion/CLIP-ViT-B-32-laion2B-s34B-b79K	73.0	69.3	−3.7	~75
laion/CLIP-ViT-H-14-laion2B-s32B-b79K	78.7	78.3	−0.4	~82
google/siglip-base-patch16-224	68.8	66.6	−2.2	~73
google/siglip-so400m-patch14-384	80.9	79.7	−1.2	~81

All NPU-vs-baseline deltas (−3.7 to +2.3pp) are within normal quantization-noise territory (±~4pp).
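The delta column and the tolerance check can be reproduced with a few lines of arithmetic. The numbers below are copied from the table; `TOLERANCE_PP = 4.0` is an assumed hard cut-off standing in for the approximate "±~4pp" band, and the helper name is illustrative.

```python
# (model, HF baseline fp32 CPU, ONNX quantized NPU), accuracy in percent.
RESULTS = [
    ("openai/clip-vit-base-patch32", 58.4, 60.7),
    ("openai/clip-vit-base-patch16", 62.6, 63.4),
    ("openai/clip-vit-large-patch14", 73.6, 73.7),
    ("openai/clip-vit-large-patch14-336", 73.4, 73.1),
    ("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", 73.0, 69.3),
    ("laion/CLIP-ViT-H-14-laion2B-s32B-b79K", 78.7, 78.3),
    ("google/siglip-base-patch16-224", 68.8, 66.6),
    ("google/siglip-so400m-patch14-384", 80.9, 79.7),
]

# Assumed threshold; the text only gives an approximate band.
TOLERANCE_PP = 4.0


def compute_deltas(results):
    """Quantized minus baseline accuracy, in percentage points, per model."""
    return {name: round(npu - base, 1) for name, base, npu in results}


deltas = compute_deltas(RESULTS)
worst = max(deltas.values(), key=abs)
within_tolerance = all(abs(d) <= TOLERANCE_PP for d in deltas.values())
```

A regression gate built this way would flag a model once its delta leaves the band; the largest shift in the current table is the LAION B-32 model.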

5. Prompt template

OpenAI's paper curates different prompt ensembles per dataset (e.g. 18 templates for CIFAR-100, "a photo of a {}, a type of food." for Food101, 48 action templates for UCF101). Ensembling recovers 3-7pp of absolute accuracy.

We do not ship this. Our evaluator is for regression detection, not paper reproduction — identical preprocessing between the fp32 and quantized runs is what matters, and a single default template ("This is a photo of {}.") satisfies that.
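For context, prompt ensembling is usually done by averaging the normalized text embeddings of several templates per class and re-normalizing the result. The sketch below shows that idea only; it is not part of this PR. `fake_text_encoder` is a hypothetical deterministic stand-in for the text tower, and the three ensemble templates are illustrative rather than the paper's curated list.

```python
import hashlib

import numpy as np


def fake_text_encoder(prompt, dim=8):
    """Hypothetical stand-in for the text tower: a deterministic pseudo-embedding."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)


def normalize(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def class_embedding(label, templates, encode=fake_text_encoder):
    """Average the unit-length embeddings of all templates, then re-normalize."""
    embeds = np.stack([encode(t.format(label)) for t in templates])
    return normalize(normalize(embeds).mean(axis=0))


single_template = ["This is a photo of {}."]
ensemble_templates = [  # illustrative only
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a drawing of a {}.",
]

single = class_embedding("apple", single_template)
ensembled = class_embedding("apple", ensemble_templates)
```

With a single template the function reduces to the plain normalized prompt embedding, which is the configuration the evaluator uses.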

6. Key files in this changelist

File	Description
`zero_shot_image_classification.py`	New composite model class that orchestrates the split `image-encoder` + `text-encoder` ONNX sessions and produces `ZeroShotImageClassifierOutput` for the HF pipeline.
`zero_shot_image_classification_evaluator.py`	New evaluator that delegates per-sample inference to the HF `ZeroShotImageClassificationPipeline` and reports top-1 / top-5 accuracy.
`siglip.py`	New HF export config for SigLIP, registering the split text / vision sub-encoders with Optimum.
`eval.py`	CLI updated to accept composite models via repeated `-m role=path` pairs, with role-name validation against the composite class's `_SUB_MODEL_CONFIG`.

zhenchaoni requested a review from a team as a code owner April 22, 2026 07:56

zhenchaoni requested review from tezheng and vortex-captain April 22, 2026 08:06

vortex-captain reviewed Apr 22, 2026

Comment thread src/winml/modelkit/models/winml/composite_model.py Outdated

Base automatically changed from reny/multi_model to main April 23, 2026 04:13

Implement zero-shot-image-classification evaluator

7b82bcf

zhenchaoni force-pushed the zhenni/zsimg_eval branch from 1e0b385 to 7b82bcf April 23, 2026 06:43

Fix unit test

364105a