microsoft · xieofxie · Jun 2, 2026 · May 29, 2026
@@ -0,0 +1,144 @@
+# microsoft/swin-large-patch4-window7-224
+
+End-to-end build + accuracy + latency walkthrough for
+`microsoft/swin-large-patch4-window7-224` (task: `image-classification`)
+on the NPU, using the `timm/mini-imagenet` `test` split as the dataset.
+
+Run all commands from the `ModelKit` repo root.
+
+---
+
+## 1. Build the model on NPU
+
+Building takes two steps: `winml config` generates a build config JSON,
+then `winml build` consumes it. `--precision w8a16` (8-bit weights,
+16-bit activations) is the default NPU precision; the build produces a
+QDQ-quantized ONNX that executes on the NPU.
+
+```powershell
+winml config `
+  -m microsoft/swin-large-patch4-window7-224 `
+  --task image-classification `
+  --device npu `
+  --ep openvino `
+  --precision w8a16 `
+  -o build_config.json
+```
+
+```powershell
+winml build `
+  -c build_config.json `
+  -m microsoft/swin-large-patch4-window7-224 `
+  --device npu `
+  --ep openvino `
+  --use-cache
+```
+
+Artifacts land under
+`~/.cache/winml/artifacts/microsoft_swin-large-patch4-window7-224/` —
+the file to evaluate is `imgcls_*_quantized.onnx`.
+
+---
+
+## 2. Evaluate on NPU with `winml eval`
+
+The `timm/mini-imagenet` dataset is downloaded automatically from the
+HuggingFace Hub by `winml eval` — no separate dataset build step is
+needed.
+
+Pass the ONNX file to `-m` and the HuggingFace model ID to `--model-id`
+(needed for the image processor). `--output` writes a JSON file
+containing the parsed metrics:
+
+```powershell
+winml eval `
+  -m $HOME/.cache/winml/artifacts/microsoft_swin-large-patch4-window7-224/imgcls_<hash>_quantized.onnx `
+  --model-id microsoft/swin-large-patch4-window7-224 `
+  --task image-classification `
+  --device npu `
+  --ep openvino `
+  --dataset timm/mini-imagenet `
+  --split test `
+  --samples 1000 `
+  --output winml_eval_output.json
+```
+
+Replace `<hash>` with the hash in the filename produced by step 1.
+
+The accuracy value is `metrics.accuracy` inside
+`winml_eval_output.json`.
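
To pull that value out in a script, a minimal sketch is shown here. It assumes the file is a JSON object with a nested `metrics` object, as the sentence above implies; the helper names `accuracy_from_eval_json` and `read_accuracy` are hypothetical and not part of `winml`.

```python
from __future__ import annotations

import json
from pathlib import Path


def accuracy_from_eval_json(json_text: str) -> float:
    """Extract metrics.accuracy from the text of a `winml eval --output` file."""
    report = json.loads(json_text)
    # Assumed layout, per the README text: {"metrics": {"accuracy": <float>}}
    return float(report["metrics"]["accuracy"])


def read_accuracy(path: str | Path = "winml_eval_output.json") -> float:
    """Read the report from disk and return its accuracy."""
    text = Path(path).expanduser().read_text(encoding="utf-8")
    return accuracy_from_eval_json(text)
```

Calling `read_accuracy()` from the repo root after step 2 would return the value to record in the results table.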
+
+---
+
+## 3. Measure latency with `winml perf`
+
+`winml perf` benchmarks the quantized ONNX directly using random
+inputs derived from the model's I/O configuration. Point `-m` at the
+same `*_quantized.onnx` produced in step 1. `--warmup` iterations are
+excluded from the statistics; `--iterations` is the measured sample
+count.
+
+```powershell
+winml perf `
+  -m $HOME/.cache/winml/artifacts/microsoft_swin-large-patch4-window7-224/imgcls_<hash>_quantized.onnx `
+  --device npu `
+  --ep openvino `
+  --warmup 10 `
+  --iterations 100 `
+  -o winml_perf_output.json
+```
+
+The output JSON contains `latency_ms` (`mean`, `min`, `max`, `p50`,
+`p90`, `p95`, `p99`, `std`) and `throughput` (`samples_per_sec`,
+`batches_per_sec`). Mean and p50 latency are the headline numbers;
+report them alongside the device and precision used.
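
The headline numbers can be extracted as in the sketch below. It assumes `latency_ms` is a top-level object keyed by the statistic names listed above; the function name `headline_latency` is hypothetical, and the returned key names are chosen only for this example.

```python
from __future__ import annotations

import json


def headline_latency(json_text: str) -> dict[str, float]:
    """Return mean and p50 latency (ms) from the text of a `winml perf` output file."""
    report = json.loads(json_text)
    # Assumed layout: {"latency_ms": {"mean": ..., "p50": ..., ...}, "throughput": {...}}
    latency = report["latency_ms"]
    return {"mean_ms": float(latency["mean"]), "p50_ms": float(latency["p50"])}
```

Typical use would be `headline_latency(Path("winml_perf_output.json").read_text())` after the `winml perf` run above.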
+
+---
+
+## 4. Evaluate the original PyTorch model
+
+`run_pytorch_baseline.py` loads the HuggingFace checkpoint with native
+PyTorch on CPU and emits the same metric so the two runs are directly
+comparable. The last stdout line is a single JSON object:
+`{"metric": "accuracy", "value": <float>, "num_samples": <int>}`.
+
+Pass `--perf-iterations N` (and optionally `--perf-warmup K`, default
+`10`) to also measure PyTorch inference latency. When `N > 0`, the
+script reuses the HuggingFace pipeline on the first dataset sample,
+runs `K` untimed warmup iterations, then `N` timed iterations, and
+emits a latency JSON line on stdout immediately before the metric
+line. The metric line is still the final stdout line.
+
+```powershell
+uv run python scripts/e2e_eval/run_pytorch_baseline.py `
+  --model microsoft/swin-large-patch4-window7-224 `
+  --task image-classification `
+  --device cpu `
+  --num-samples 1000 `
+  --dataset timm/mini-imagenet `
+  --split test `
+  --winml-metric-key accuracy `
+  --perf-warmup 10 `
+  --perf-iterations 100
+```
+
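+The latency JSON line uses `mean_ms` / `min_ms` / `max_ms` / `p50_ms` /
+`p90_ms` / `p95_ms` / `p99_ms` keys, which report the same statistics as
+`mean` / `min` / `max` / `p50` / `p90` / `p95` / `p99` under `latency_ms`
+in the `winml perf` output, so the two runs can be compared directly.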
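
Because the script may print other lines before the two JSON lines, a consumer has to pick them off the end of stdout. The sketch below does that; it assumes the latency line is a flat JSON object containing `mean_ms`, and `parse_baseline_stdout` is a hypothetical helper, not something the repo provides.

```python
from __future__ import annotations

import json


def parse_baseline_stdout(stdout: str) -> tuple[dict, dict | None]:
    """Return (metric, latency) parsed from run_pytorch_baseline.py stdout."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("baseline produced no stdout")

    # The final line is always the metric object:
    # {"metric": "accuracy", "value": <float>, "num_samples": <int>}
    metric = json.loads(lines[-1])

    # The latency line, when present, sits immediately before the metric line.
    latency = None
    if len(lines) >= 2:
        try:
            candidate = json.loads(lines[-2])
        except json.JSONDecodeError:
            candidate = None
        # Assumption: a latency record is a dict that carries a "mean_ms" key.
        if isinstance(candidate, dict) and "mean_ms" in candidate:
            latency = candidate
    return metric, latency
```

When `--perf-iterations` is omitted or `0`, the second element is `None` and only the metric is returned.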
+
+---
+
+## 5. Comparing the results
+
+Accuracy comes from `metrics.accuracy` in `winml_eval_output.json` for
+WinML and from the last stdout line for the PyTorch baseline. Latency
+comes from `latency_ms` in `winml_perf_output.json` for WinML and from
+the latency JSON line on stdout for the PyTorch baseline.
+
+Results on an Intel(R) Core(TM) Ultra 7 258V processor:
+
+| Model | Device | Precision | Accuracy | Mean latency (ms) | p50 latency (ms) | Size (MB) |
+|---|---|---|---|---|---|---|
+| PyTorch | CPU | fp32 | 0.837 | 662.3 | 647.9 | 750 |
+| WinML (ONNX) | OpenVINO NPU | w8a16 (QDQ) | 0.836 | 64.9 | 64.3 | 193 |
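
The two rows can be reduced to relative figures with a few lines of Python. This is only a sketch: the dictionary layout and the `compare` helper are hypothetical, and the numbers are copied from the table above.

```python
from __future__ import annotations


def compare(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Summarize how `candidate` differs from `baseline`."""
    return {
        # Negative means the candidate is less accurate than the baseline.
        "accuracy_delta": candidate["accuracy"] - baseline["accuracy"],
        # Ratios above 1 mean the candidate is faster or smaller.
        "mean_speedup": baseline["mean_ms"] / candidate["mean_ms"],
        "p50_speedup": baseline["p50_ms"] / candidate["p50_ms"],
        "size_ratio": baseline["size_mb"] / candidate["size_mb"],
    }


# Values taken from the results table.
pytorch_cpu = {"accuracy": 0.837, "mean_ms": 662.3, "p50_ms": 647.9, "size_mb": 750}
winml_npu = {"accuracy": 0.836, "mean_ms": 64.9, "p50_ms": 64.3, "size_mb": 193}

summary = compare(pytorch_cpu, winml_npu)
```

With these figures the NPU build has roughly 10x lower mean latency and is about 3.9x smaller, for a 0.001 drop in accuracy. The rows differ in device as well as precision, so the ratios reflect both changes together.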
@@ -0,0 +1,252 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Run one image-classification inference with the WinML-built ONNX.
+
+Mirrors the HuggingFace Swin Transformer usage example
+(https://huggingface.co/docs/transformers/main/en/model_doc/swin) but
+loads the quantized ONNX produced by ``winml build`` (step 1 of the
+README) via :class:`WinMLAutoModel` instead of the original PyTorch
+checkpoint.
+
+The script preprocesses one image, runs inference, prints the top-5
+predicted classes (HF-docs format), and writes an annotated image with
+the top-1 label drawn in the corner so the result is visually
+verifiable.
+
+Usage::
+
+    $artifacts = "$HOME/.cache/winml/artifacts/microsoft_swin-large-patch4-window7-224"
+    uv run python examples/microsoft_swin-large-patch4-window7-224/example.py `
+      --onnx "$artifacts/imgcls_<hash>_quantized.onnx"
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import numpy as np
+import torch
+from PIL import Image, ImageDraw, ImageFont
+from transformers import AutoConfig, AutoImageProcessor
+
+from winml.modelkit import WinMLAutoModel
+
+
+HF_MODEL_ID = "microsoft/swin-large-patch4-window7-224"
+DEFAULT_DATASET = "timm/mini-imagenet"
+DEFAULT_DATASET_SPLIT = "test"
+
+
+def parse_args() -> argparse.Namespace:
+    """Parse command-line arguments."""
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--onnx",
+        required=True,
+        type=Path,
+        help="Path to the quantized ONNX produced by step 1 of the README "
+        "(e.g. imgcls_<hash>_quantized.onnx).",
+    )
+    parser.add_argument(
+        "--device",
+        default="npu",
+        choices=["auto", "npu", "gpu", "cpu"],
+        help="Target device (default: npu).",
+    )
+    parser.add_argument(
+        "--ep",
+        default="openvino",
+        help="Execution provider alias (default: openvino).",
+    )
+    parser.add_argument(
+        "--image",
+        type=Path,
+        default=None,
+        help="Local image path. If omitted, streams the first image from "
+        f"the {DEFAULT_DATASET} {DEFAULT_DATASET_SPLIT} split.",
+    )
+    parser.add_argument(
+        "--top-k",
+        type=int,
+        default=5,
+        help="Number of top predictions to print (default: 5).",
+    )
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=Path("prediction.png"),
+        help="Where to write the annotated image (default: prediction.png).",
+    )
+    return parser.parse_args()
+
+
+def load_image(image_arg: Path | None) -> tuple[Image.Image, str | None]:
+    """Load an image and (when streamed from the eval dataset) its WordNet synset.
+
+    Returns ``(image, true_synset)``. ``true_synset`` is the WordNet ID
+    (e.g. ``"n01532829"``) for the dataset's labelled class, used as the
+    universal bridge between the dataset's class indexing and the model's.
+    ``None`` when the user supplied a custom ``--image``.
+    """
+    if image_arg is not None:
+        return Image.open(image_arg.expanduser()).convert("RGB"), None
+
+    from datasets import load_dataset
+
+    # streaming=True so we only fetch the first sample instead of downloading
+    # the whole split. The ClassLabel feature (and its .names list) is still
+    # available on the streamed dataset, so we can recover the WordNet synset
+    # for the sample's integer label. trust_remote_code=False refuses to run
+    # any dataset-bundled loading script.
+    dataset = load_dataset(
+        DEFAULT_DATASET,
+        split=DEFAULT_DATASET_SPLIT,
+        streaming=True,
+        trust_remote_code=False,
+    )
+    sample = next(iter(dataset))
+
+    image = sample["image"]
+    if not isinstance(image, Image.Image):
+        image = Image.fromarray(np.asarray(image))
+    image = image.convert("RGB")
+
+    label_value = sample.get("label")
+    # IterableDataset.features can be None when the schema is not known up front.
+    label_feature = (dataset.features or {}).get("label")
+    if label_value is None or label_feature is None or not hasattr(label_feature, "names"):
+        return image, None
+    return image, label_feature.names[int(label_value)]
+
+
+def imagenet_synset_to_id() -> dict[str, int]:
+    """Map WordNet synset ID -> ImageNet-1k class id (0-999).
+
+    Uses ``timm.data.ImageNetInfo`` so we don't have to ship the 1000-entry
+    list inline. The mapping is the canonical ImageNet-1k ordering that
+    the model was trained against.
+
+    Requires the optional ``timm`` package (imported lazily here, like
+    ``datasets`` in ``load_image``); raises a clear error if it is missing.
+    """
+    try:
+        from timm.data import ImageNetInfo
+    except ImportError as e:
+        raise ImportError(
+            "imagenet_synset_to_id() requires the 'timm' package. "
+            "Install it with `pip install timm`."
+        ) from e
+
+    info = ImageNetInfo()
+    return {synset: idx for idx, synset in enumerate(info.label_names())}
+
+
+def draw_top_prediction(
+    image: Image.Image,
+    label: str,
+    score: float,
+) -> Image.Image:
+    """Draw the top-1 label + confidence on a copy of ``image``."""
+    annotated = image.copy()
+    draw = ImageDraw.Draw(annotated)
+    try:
+        font = ImageFont.truetype("arial.ttf", size=max(14, annotated.height // 30))
+    except OSError:
+        font = ImageFont.load_default()
+
+    caption = f"{label} ({score:.2f})"
+    tx0, ty0, tx1, ty1 = draw.textbbox((10, 10), caption, font=font)
+    pad = 6
+    draw.rectangle(
+        [(tx0 - pad, ty0 - pad), (tx1 + pad, ty1 + pad)],
+        fill=(0, 0, 0),
+    )
+    draw.text((10, 10), caption, fill=(255, 255, 255), font=font)
+    return annotated
+
+
+def main() -> None:
+    """Load the quantized ONNX, run one inference, print + save the result."""
+    args = parse_args()
+
+    image, true_synset = load_image(args.image)
+    image_processor = AutoImageProcessor.from_pretrained(HF_MODEL_ID)
+
+    # skip_build=True uses the ONNX as-is; it has already been optimized
+    # and quantized by `winml build`. use_cache=False avoids touching the
+    # winml artifact cache for this read-only example.
+    model = WinMLAutoModel.from_pretrained(
+        args.onnx.expanduser(),
+        task="image-classification",
+        device=args.device,
+        ep=args.ep,
+        skip_build=True,
+        use_cache=False,
+    )
+
+    # Match the processor's output size to the ONNX's static input shape so
+    # pixel_values matches (B, C, H, W) exactly.
+    input_shapes = (model.io_config.get("input_shapes") or [[]])[0]
+    # Only applies to 4D image inputs (B, C, H, W); skip for other shapes.
+    if len(input_shapes) == 4:
+        _, _, h, w = input_shapes
+        image_processor.size = {"height": h, "width": w}
+
+    inputs = image_processor(images=image, return_tensors="pt")
+    outputs = model(pixel_values=inputs["pixel_values"])
+
+    # logits: (1, num_classes). softmax → probabilities, then top-k.
+    logits = outputs.logits
+    probs = torch.softmax(logits, dim=-1)[0]
+    top_k = min(args.top_k, probs.numel())
+    top_scores, top_ids = torch.topk(probs, k=top_k)
+
+    # WinML's bare-ONNX path doesn't attach an HF config to the model, so
+    # pull id2label from the HF hub for human-readable label names.
+    id2label = AutoConfig.from_pretrained(HF_MODEL_ID).id2label
+
+    top_ids_list = top_ids.tolist()
+    top_label_names = [
+        id2label.get(label_id, str(label_id)) for label_id in top_ids_list
+    ]
+
+    # Resolve the dataset's WordNet synset to an ImageNet-1k class id so we
+    # can compare against the model's prediction. The dataset (e.g.
+    # timm/mini-imagenet) often uses its own 0..N indexing over a subset of
+    # ImageNet-1k, so the raw integer label from the dataset does NOT match
+    # the model's class id — the synset is the universal bridge.
+    true_label_id: int | None = None
+    if true_synset is not None:
+        synset_to_id = imagenet_synset_to_id()
+        true_label_id = synset_to_id.get(true_synset)
+
+    if true_synset is not None:
+        if true_label_id is not None:
+            true_label_name = id2label.get(true_label_id, str(true_label_id))
+            print(f"True label:  {true_label_name} (synset={true_synset}, id={true_label_id})")
+        else:
+            print(f"True label:  synset={true_synset} (not in ImageNet-1k vocabulary)")
+    else:
+        print("True label:  unknown (custom --image)")
+    print(f"\nTop {top_k} predictions:")
+    for rank, (label, score) in enumerate(
+        zip(top_label_names, top_scores.tolist(), strict=True), start=1,
+    ):
+        print(f"  {rank}. {label} ({score:.4f})")
+
+    if true_label_id is not None:
+        verdict = "PASS" if top_ids_list[0] == true_label_id else "FAIL"
+        print(f"\nVerdict (top-1): {verdict}")
+
+    annotated = draw_top_prediction(image, top_label_names[0], float(top_scores[0].item()))
+    output_path = args.output.expanduser()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    annotated.save(output_path)
+    print(f"\nAnnotated image written to {output_path}")
+
+
+if __name__ == "__main__":
+    main()