
feat: Implement zero-shot-image-classification evaluator#380

Merged
zhenchaoni merged 5 commits into
mainfrom
zhenni/zsimg_eval
Apr 29, 2026

@zhenchaoni


Zero-Shot Image Classification — Design

1. What the task is

Zero-shot image classification predicts a label for an image without any task-specific fine-tuning. A dual-encoder model (CLIP-family, SigLIP) produces an image embedding and a set of per-class text embeddings, then picks the class whose text embedding has the highest cosine similarity with the image. The class vocabulary is defined at inference time, not at training time — hence zero-shot.

The same model supports any vocabulary. Swap the candidate class list and you have a different classifier; no re-training.
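As a minimal sketch of the idea, assuming the embeddings have already been produced by the two encoders, the classification step reduces to an argmax over cosine similarities. The function names and the toy 3-dimensional vectors below are illustrative and do not come from this PR.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def classify(image_embed: np.ndarray, text_embeds: np.ndarray, labels: list[str]) -> str:
    """Return the label whose text embedding is closest to the image embedding."""
    sims = l2_normalize(text_embeds) @ l2_normalize(image_embed)
    return labels[int(np.argmax(sims))]


# Toy embeddings standing in for real encoder outputs (illustrative values).
image_embed = np.array([0.9, 0.1, 0.0])
text_embeds = np.array([
    [1.0, 0.0, 0.0],  # "cat"
    [0.0, 1.0, 0.0],  # "dog"
    [0.0, 0.0, 1.0],  # "truck"
])

print(classify(image_embed, text_embeds, ["cat", "dog", "truck"]))  # cat
# Swapping the vocabulary changes the classifier without touching the model.
print(classify(image_embed, text_embeds[1:], ["dog", "truck"]))     # dog
```

The second call shows the vocabulary swap: the same image embedding is scored against a different candidate list and yields a different prediction.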

2. Split-encoder ONNX export

CLIP/SigLIP are dual-encoder architectures. They fuse only at the cosine-similarity step — the image tower and text tower run independently and do not share activations. We export each as a separate ONNX file:

| Role | HF task used for export | Output name |
| --- | --- | --- |
| image-encoder | image-feature-extraction | image_embeds (CLIP WithProjection) / pooler_output (SigLIP) |
| text-encoder | feature-extraction | text_embeds (CLIP WithProjection) / pooler_output (SigLIP) |

Build-time orchestration:

  • Composite class: WinMLModelForZeroShotImageClassification.
  • Registered for ("clip", "zero-shot-image-classification") and ("siglip", "zero-shot-image-classification").
  • forward() runs the two sub-sessions and calculates the cosine similarity between their outputs.
  • Drop-in compatible with the HF ZeroShotImageClassificationPipeline.
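The orchestration could look roughly like the sketch below. It is not the WinMLModelForZeroShotImageClassification implementation: the class name, the argument names, and the plain callables standing in for the two ONNX sub-sessions are all assumptions made for illustration, and any learned logit scale or bias that a real model applies is left out.

```python
from typing import Callable

import numpy as np

Encoder = Callable[[np.ndarray], np.ndarray]


class CompositeZeroShotSketch:
    """Hypothetical stand-in for a composite dual-encoder model."""

    def __init__(self, image_encoder: Encoder, text_encoder: Encoder) -> None:
        # In a real build each encoder would wrap its own ONNX session (assumption).
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder

    def forward(self, pixel_values: np.ndarray, text_inputs: np.ndarray) -> np.ndarray:
        img = self.image_encoder(pixel_values)  # (num_images, dim)
        txt = self.text_encoder(text_inputs)    # (num_labels, dim)
        # The towers only meet here: normalize, then take pairwise dot products.
        img = img / np.linalg.norm(img, axis=-1, keepdims=True)
        txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
        return img @ txt.T                      # (num_images, num_labels)


# Identity functions play the role of the encoders in this toy run.
model = CompositeZeroShotSketch(lambda x: x, lambda x: x)
sims = model.forward(
    np.array([[3.0, 4.0]]),               # one "image"
    np.array([[3.0, 4.0], [4.0, -3.0]]),  # two "labels"
)
print(sims.shape)  # (1, 2)
```

Because the two encoders never share activations, the text embeddings for a fixed vocabulary could in principle be computed once and reused across images; whether this PR does so is not stated.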

3. CLI: passing the two ONNX files

winml eval accepts the composite model as repeated -m role=path pairs:

winml eval \
  -m image-encoder=path/to/image_encoder.onnx \
  -m text-encoder=path/to/text_encoder.onnx \
  --model-id openai/clip-vit-base-patch16 \
  --task zero-shot-image-classification \
  --device npu \
  --dataset uoft-cs/cifar100 \
  --split test \
  --samples 1000 \
  --column input_column=img \
  --column label_column=fine_label
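One way to parse and validate repeated role=path pairs is sketched below with argparse. This is an assumption about the general approach, not the code in eval.py: the ALLOWED_ROLES set, the parse_model_args helper, and the error messages are hypothetical, and the real CLI validates roles against the composite class's _SUB_MODEL_CONFIG.

```python
import argparse

# Hypothetical role set; stands in for the keys of _SUB_MODEL_CONFIG.
ALLOWED_ROLES = {"image-encoder", "text-encoder"}


def parse_model_args(pairs: list[str], allowed: set[str]) -> dict[str, str]:
    """Turn ["role=path", ...] into {role: path}, rejecting malformed input."""
    models: dict[str, str] = {}
    for pair in pairs:
        role, sep, path = pair.partition("=")
        if not sep or not role or not path:
            raise ValueError(f"expected role=path, got {pair!r}")
        if role not in allowed:
            raise ValueError(f"unknown role {role!r}; expected one of {sorted(allowed)}")
        if role in models:
            raise ValueError(f"role {role!r} given more than once")
        models[role] = path
    missing = allowed - set(models)
    if missing:
        raise ValueError(f"missing roles: {sorted(missing)}")
    return models


parser = argparse.ArgumentParser(prog="eval-sketch")
# action="append" collects every -m occurrence into one list.
parser.add_argument("-m", dest="models", action="append", default=[], metavar="ROLE=PATH")

args = parser.parse_args([
    "-m", "image-encoder=path/to/image_encoder.onnx",
    "-m", "text-encoder=path/to/text_encoder.onnx",
])
models = parse_model_args(args.models, ALLOWED_ROLES)
print(models)
```

Splitting with partition("=") keeps any later "=" characters inside the path intact.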

4. Evaluation

We follow OpenAI CLIP's zero-shot classification methodology on CIFAR-100 (uoft-cs/cifar100, test split) and use classification accuracy as the metric for detecting regressions.

Steps:

  1. Build prompt list from the 100 CIFAR-100 class names using HF pipeline's default template: "This is a photo of {}."
  2. Run HF pipeline per image: self.pipe(image, candidate_labels=prompts). The output contains one score per candidate, derived from the cosine similarity between the image and that candidate's text embedding.
  3. Top-1 predicted label = candidate with highest cosine similarity.
  4. Accuracy = fraction of samples where predicted label == dataset ground-truth fine_label. We report top-1 and top-5 via TopKAccuracyMetric.
  class names ─▶ prompts ─┐
                          ▼
  image ─────────▶ HF pipeline ─▶ top-1 label ─▶ compare ─▶ accuracy
                                                    ▲
                                                    │
                                            ground truth

Models evaluated (CIFAR-100 test, 1000 samples, NPU). "Reported" numbers come from the OpenAI CLIP paper and CLIP_benchmark / open_clip results. Our baselines sit within about 5pp of them, a gap consistent with using a single template instead of an ensemble, which suggests the methodology matches the standard.

| Model | HF baseline (fp32, CPU) | ONNX quantized (NPU) | Δ (pp) | Reported |
| --- | --- | --- | --- | --- |
| openai/clip-vit-base-patch32 | 58.4 | 60.7 | +2.3 | ~60 |
| openai/clip-vit-base-patch16 | 62.6 | 63.4 | +0.8 | ~67 |
| openai/clip-vit-large-patch14 | 73.6 | 73.7 | +0.1 | ~77 |
| openai/clip-vit-large-patch14-336 | 73.4 | 73.1 | −0.3 | ~78 |
| laion/CLIP-ViT-B-32-laion2B-s34B-b79K | 73.0 | 69.3 | −3.7 | ~75 |
| laion/CLIP-ViT-H-14-laion2B-s32B-b79K | 78.7 | 78.3 | −0.4 | ~82 |
| google/siglip-base-patch16-224 | 68.8 | 66.6 | −2.2 | ~73 |
| google/siglip-so400m-patch14-384 | 80.9 | 79.7 | −1.2 | ~81 |

All NPU-vs-baseline deltas fall within ±4pp (the largest is −3.7pp, for laion/CLIP-ViT-B-32-laion2B-s34B-b79K), which we treat as normal quantization noise.
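A regression check over these numbers could be as small as the sketch below. The accuracy values are copied from the results table; the 4pp tolerance comes from the remark above, and treating it as a hard threshold is an assumption, as are the variable names.

```python
# (model, HF baseline fp32 CPU, ONNX quantized NPU), copied from the results table.
RESULTS = [
    ("openai/clip-vit-base-patch32", 58.4, 60.7),
    ("openai/clip-vit-base-patch16", 62.6, 63.4),
    ("openai/clip-vit-large-patch14", 73.6, 73.7),
    ("openai/clip-vit-large-patch14-336", 73.4, 73.1),
    ("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", 73.0, 69.3),
    ("laion/CLIP-ViT-H-14-laion2B-s32B-b79K", 78.7, 78.3),
    ("google/siglip-base-patch16-224", 68.8, 66.6),
    ("google/siglip-so400m-patch14-384", 80.9, 79.7),
]

TOLERANCE_PP = 4.0  # assumed threshold, in percentage points

# Round to one decimal to match the precision of the reported accuracies.
deltas = {model: round(npu - baseline, 1) for model, baseline, npu in RESULTS}

# Only drops count as regressions; gains over the baseline are ignored.
regressions = [model for model, delta in deltas.items() if delta < -TOLERANCE_PP]

worst = min(deltas, key=deltas.get)
print(worst, deltas[worst])  # laion/CLIP-ViT-B-32-laion2B-s34B-b79K -3.7
print(regressions)           # []
```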

5. Prompt template

OpenAI's paper curates different prompt ensembles per dataset (e.g. 18 templates for CIFAR-100, "a photo of a {}, a type of food." for Food101, 48 action templates for UCF101). Ensembling recovers 3-7pp on absolute accuracy.

We do not ship this. Our evaluator is for regression detection, not paper reproduction — identical preprocessing between the fp32 and quantized runs is what matters, and a single default template ("This is a photo of {}.") satisfies that.
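For reference only, since the evaluator does not do this: prompt ensembling is commonly implemented by embedding every class under every template, averaging the normalized embeddings per class, and renormalizing. The sketch below assumes the per-template text embeddings are already available; the function name and the random stand-in data are illustrative.

```python
import numpy as np


def ensemble_text_embeddings(per_template: np.ndarray) -> np.ndarray:
    """Average unit-length embeddings over templates.

    per_template has shape (num_templates, num_classes, dim);
    the result has shape (num_classes, dim) with unit-length rows.
    """
    unit = per_template / np.linalg.norm(per_template, axis=-1, keepdims=True)
    mean = unit.mean(axis=0)
    return mean / np.linalg.norm(mean, axis=-1, keepdims=True)


# Random vectors stand in for text-encoder outputs: 3 templates, 5 classes, dim 8.
rng = np.random.default_rng(0)
per_template = rng.normal(size=(3, 5, 8))

class_embeds = ensemble_text_embeddings(per_template)
print(class_embeds.shape)  # (5, 8)
```

With a single template the function simply returns the normalized embeddings, which corresponds to the setup the evaluator uses.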

6. Key files in this changelist

| File | Description |
| --- | --- |
| zero_shot_image_classification.py | New composite model class that orchestrates the split image-encoder + text-encoder ONNX sessions and produces ZeroShotImageClassifierOutput for the HF pipeline. |
| zero_shot_image_classification_evaluator.py | New evaluator that delegates per-sample inference to the HF ZeroShotImageClassificationPipeline and reports top-1 / top-5 accuracy. |
| siglip.py | New HF export config for SigLIP, registering the split text / vision sub-encoders with Optimum. |
| eval.py | CLI updated to accept composite models via repeated -m role=path pairs, with role-name validation against the composite class's _SUB_MODEL_CONFIG. |

@zhenchaoni zhenchaoni requested a review from a team as a code owner April 22, 2026 07:56
Comment thread src/winml/modelkit/models/winml/composite_model.py Outdated
Base automatically changed from reny/multi_model to main April 23, 2026 04:13
Comment thread scripts/e2e_eval/run_eval.py Outdated
Comment thread src/winml/modelkit/eval/metrics/top_k_accuracy.py Outdated
Comment thread src/winml/modelkit/models/winml/zero_shot_image_classification.py
Comment thread src/winml/modelkit/models/winml/zero_shot_image_classification.py
Comment thread src/winml/modelkit/models/winml/zero_shot_image_classification.py Outdated
Comment thread src/winml/modelkit/models/winml/zero_shot_image_classification.py Outdated
@zhenchaoni zhenchaoni linked an issue Apr 29, 2026 that may be closed by this pull request
@zhenchaoni zhenchaoni merged commit fc47ad8 into main Apr 29, 2026
9 checks passed
@zhenchaoni zhenchaoni deleted the zhenni/zsimg_eval branch April 29, 2026 07:18
Development

Successfully merging this pull request may close these issues.

Evaluate zero shot image classification
