feat: add winml serve & winml run with MCP support by DingmaomaoBJTU · Pull Request #354 · microsoft/winml-cli

DingmaomaoBJTU · 2026-04-15T06:17:59Z

Summary

Add a local REST inference server (winml serve) and one-shot CLI runner (winml run) with MCP integration for Claude Desktop.

Design

Architecture — Phased Server

winml serve                                     → Phase 0: CLI-as-API wrapper
winml serve --model microsoft/resnet-50         → Phase 1: warm single-model inference
winml serve --model resnet-50 --idle-timeout 300 → Phase 2: idle auto-unload/reload
winml serve --model resnet-50 --multi           → Phase 3: multi-model slot manager + LRU eviction

All phases share one FastAPI app, selected at startup based on args. Routes depend only on the ModelManager protocol — Phase 1→3 migration is a drop-in swap.
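To illustrate why routes that depend only on a protocol make the Phase 1→3 migration a drop-in swap, here is a minimal sketch. The method name `predict`, the `loader` callable, the `loaded()` helper, and `handle_predict` are assumptions made for illustration; the real ModelManager protocol in serve/manager.py may look different.

```python
from collections import OrderedDict
from typing import Callable, Protocol


class ModelManager(Protocol):
    """Hypothetical minimal protocol; not the actual interface in serve/manager.py."""

    def predict(self, model_id: str, inputs: dict) -> dict: ...


class SingleModelManager:
    """Phase 1 stand-in: keeps exactly one model warm."""

    def __init__(self, model_id: str, loader: Callable[[str], Callable[[dict], dict]]):
        self.model_id = model_id
        self._model = loader(model_id)

    def predict(self, model_id: str, inputs: dict) -> dict:
        if model_id != self.model_id:
            raise KeyError(f"model not loaded: {model_id}")
        return self._model(inputs)


class ModelSlotManager:
    """Phase 3 stand-in: bounded slots with least-recently-used eviction."""

    def __init__(self, loader: Callable[[str], Callable[[dict], dict]], max_slots: int = 2):
        self._loader = loader
        self._max_slots = max_slots
        self._slots = OrderedDict()  # model_id -> loaded model, oldest first

    def predict(self, model_id: str, inputs: dict) -> dict:
        if model_id in self._slots:
            # Mark as most recently used.
            self._slots.move_to_end(model_id)
        else:
            if len(self._slots) >= self._max_slots:
                # Evict the least recently used model.
                self._slots.popitem(last=False)
            self._slots[model_id] = self._loader(model_id)
        return self._slots[model_id](inputs)

    def loaded(self) -> list:
        return list(self._slots)


def handle_predict(manager: ModelManager, model_id: str, inputs: dict) -> dict:
    """A route body that only knows the protocol, not the concrete manager."""
    return manager.predict(model_id, inputs)
```

Because `handle_predict` never inspects the concrete class, passing a `ModelSlotManager` instead of a `SingleModelManager` requires no change to the route in this sketch.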

commands/serve.py  ─── CLI entry point (click)
commands/run.py    ─── one-shot inference (--connect to serve or embedded)
serve/
├── app.py             ─── FastAPI app factory + routes (Phase 1-3)
├── cli_api.py         ─── Phase 0 CLI wrapper
├── manager.py         ─── ModelManager protocol, SingleModelManager, ModelSlotManager
├── schema.py          ─── Pydantic request/response models
└── static/index.html  ─── Demo UI dashboard
inference/
├── engine.py          ─── InferenceEngine (HF Pipeline + ORT backend)
├── tasks.py           ─── TASK_REGISTRY: 30 HF tasks + 5 aliases
├── schema_generator.py─── OpenAI/MCP tool definitions
└── types.py           ─── PredictionResult, Prediction
scripts/mcp_server.py  ─── Standalone MCP server for Claude Desktop

See docs/serve-architecture.md for full architecture diagrams.
See docs/design-input-schema.md for the registry-driven input schema design.

Registry-Driven Input Schema (30 HF Tasks)

All input parsing is driven by TASK_REGISTRY — zero hardcoded task branches. Each task defines user_inputs (what the user provides) and PipelineMapping (how to call the HF pipeline):
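A rough sketch of what registry-driven parsing could look like follows. The field layout of `TaskSpec` and `PipelineMapping`, the three registry entries, and the `build_call` helper are all hypothetical; only the names TASK_REGISTRY, user_inputs, and PipelineMapping come from the description above.

```python
from dataclasses import dataclass


@dataclass
class PipelineMapping:
    """Hypothetical shape: which input is passed positionally, which as kwargs."""
    positional: str = ""
    kwargs: tuple = ()


@dataclass
class TaskSpec:
    user_inputs: dict          # field name -> type tag ("image", "text", "json", ...)
    mapping: PipelineMapping


# Illustrative entries only; the real registry is described as covering 30 tasks.
TASK_REGISTRY = {
    "image-classification": TaskSpec(
        {"image": "image"}, PipelineMapping("image")
    ),
    "zero-shot-image-classification": TaskSpec(
        {"image": "image", "candidate_labels": "json"},
        PipelineMapping("image", ("candidate_labels",)),
    ),
    "question-answering": TaskSpec(
        {"question": "text", "context": "text"},
        PipelineMapping("", ("question", "context")),
    ),
}


def build_call(task: str, inputs: dict):
    """Turn user inputs into (args, kwargs) without any per-task branching."""
    spec = TASK_REGISTRY[task]
    missing = [name for name in spec.user_inputs if name not in inputs]
    if missing:
        raise ValueError(f"missing inputs for {task}: {missing}")
    args = [inputs[spec.mapping.positional]] if spec.mapping.positional else []
    kwargs = {name: inputs[name] for name in spec.mapping.kwargs}
    return args, kwargs
```

In this sketch, adding a task means adding a registry entry; `build_call` itself does not change.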

Category	Tasks	CLI shortcut
Single image (6)	image-classification, object-detection, segmentation, depth, features, image-to-image	`--file`
Single text (9)	text-classification, token-clf, fill-mask, text-gen, text2text, features, summarization, translation, TTS	`--text`
Single audio (2)	audio-classification, ASR	`--file`
Single video (1)	video-classification	`--file`
Image + text (4)	VQA, doc-QA, image-text-to-text, image-to-text	`--file --text`
Image + json (2)	zero-shot-image-clf, zero-shot-object-det	`--file -I candidate_labels='[...]'`
Audio + json (1)	zero-shot-audio-clf	`--file -I candidate_labels='[...]'`
Text + text (1)	question-answering	`-I question=... -I context=...`
Text + json (2)	zero-shot-clf, table-QA	`--text -I candidate_labels/table='...'`
Image pair (1)	keypoint-matching	`-I image_0=@a -I image_1=@b`
Image + spatial (1)	mask-generation (SAM)	`--file [-I input_points='...']`

`winml run` — Embedded + Connect Mode

winml run --model microsoft/resnet-50 --file cat.jpg

By default, winml run performs embedded inference, loading the model in-process. With --connect, it first probes for a running winml serve instance:

Probes localhost:8000/v1/health (0.5s timeout)
If a matching winml serve is running → routes request there (zero model load time)
Otherwise → falls back to embedded inference
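The probe-then-fall-back steps above can be sketched with the standard library alone. The shape of the health response (a JSON object with a `model` key) and the function names are assumptions; the PR does not show the actual payload.

```python
import http.client
import json
import urllib.error
import urllib.request


def probe_server(base_url="http://localhost:8000", model_id=None, timeout=0.5):
    """Return True if /v1/health answers in time and, optionally, reports model_id."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health", timeout=timeout) as resp:
            if resp.status != 200:
                return False
            payload = json.loads(resp.read().decode("utf-8") or "{}")
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed JSON, or a broken HTTP reply.
        return False
    if model_id is None:
        return True
    # Assumption: the health payload names the loaded model under "model".
    return isinstance(payload, dict) and payload.get("model") == model_id


def choose_backend(base_url="http://localhost:8000", model_id=None):
    """Pick 'server' when a matching instance responds, otherwise 'embedded'."""
    return "server" if probe_server(base_url, model_id) else "embedded"
```

The short timeout keeps the fallback path cheap: when nothing is listening, the caller loses at most about half a second before loading the model in-process.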

All input combinations (--file, --text, --input, -P) are forwarded to the server — file uploads go to /v1/predict/file (with extra inputs via the inputs form field), text/named inputs go to /v1/predict (JSON).
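As a hedged illustration of this routing rule, the helper below decides which endpoint a request would go to and what the payload would contain. The function name, the return shape, the `text_field` default, and the `params` key in the JSON body are assumptions; only the endpoint paths and form field names come from the description.

```python
import json


def plan_request(file_path=None, text=None, named_inputs=None, params=None,
                 text_field="text"):
    """Return the endpoint, encoding, and payload for one run request."""
    named_inputs = dict(named_inputs or {})
    params = dict(params or {})

    if file_path is not None:
        # File uploads use multipart form fields; extras travel as JSON strings.
        form = {"file": file_path}
        if text is not None:
            form["text"] = text
        if named_inputs:
            form["inputs"] = json.dumps(named_inputs)
        if params:
            form["params"] = json.dumps(params)
        return {"endpoint": "/v1/predict/file", "kind": "multipart", "payload": form}

    # Text and named inputs go into a JSON body.
    body_inputs = dict(named_inputs)
    if text is not None:
        body_inputs[text_field] = text
    body = {"inputs": body_inputs}
    if params:
        body["params"] = params  # assumption: key name in the JSON body
    return {"endpoint": "/v1/predict", "kind": "json", "payload": body}
```

For example, `plan_request(file_path="cat.jpg", named_inputs={"candidate_labels": ["cat", "dog"]})` selects `/v1/predict/file` and serializes the labels into the `inputs` form field.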

REST API — `/v1/predict/file` Input Support

The file upload endpoint supports structured form fields beyond file and text:

Field	Type	Purpose
`file`	binary	Media file (image/audio/video), max 20 MB
`text`	string	Auto-mapped to sole text field in schema
`inputs`	JSON string	Additional named inputs (e.g., `{"candidate_labels": ["cat","dog"]}`)
`params`	JSON string	Pipeline parameters (e.g., `{"top_k": 5}`)

This enables zero-shot tasks and SAM via file upload:

curl -F "file=@cat.jpg" -F 'inputs={"candidate_labels":["cat","dog"]}' localhost:8000/v1/predict/file
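On the server side, the four form fields have to be merged into one set of named inputs. The sketch below shows one way that could work. The schema representation (field name to type tag), the contents of `BINARY_TYPES`, the interpretation of 20 MB as 20 × 1024 × 1024 bytes, and the helper name are assumptions, not the code in serve/app.py.

```python
import json

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as MiB
BINARY_TYPES = {"image", "audio", "video"}  # hypothetical contents


def merge_form_fields(schema, file_bytes, text=None, inputs=None, params=None):
    """Combine file, text, inputs, and params form fields into (inputs, params)."""
    if len(file_bytes) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds upload limit")

    binary_fields = [n for n, t in schema.items() if t in BINARY_TYPES]
    if len(binary_fields) != 1:
        raise ValueError("file upload needs a schema with exactly one binary input")
    merged = {binary_fields[0]: file_bytes}

    if text is not None:
        text_fields = [n for n, t in schema.items() if t == "text"]
        if len(text_fields) != 1:
            raise ValueError("text can only be auto-mapped to a sole text field")
        merged[text_fields[0]] = text

    extra = json.loads(inputs) if inputs else {}
    if not isinstance(extra, dict):
        raise ValueError("inputs must be a JSON object")
    merged.update(extra)

    pipeline_params = json.loads(params) if params else {}
    if not isinstance(pipeline_params, dict):
        raise ValueError("params must be a JSON object")
    return merged, pipeline_params
```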

Coverage: /v1/predict (JSON) covers 30/30 tasks. /v1/predict/file covers 17/30 (all tasks with exactly 1 binary input).
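The 17/30 figure can be cross-checked against the category table in the input schema section. The task counts below are copied from that table; the number of binary inputs per category is my reading of the category names (for example, an image pair has two binary inputs) and should be treated as an assumption.

```python
# (category, number of tasks, number of binary inputs per task)
CATEGORIES = [
    ("Single image", 6, 1),
    ("Single text", 9, 0),
    ("Single audio", 2, 1),
    ("Single video", 1, 1),
    ("Image + text", 4, 1),
    ("Image + json", 2, 1),
    ("Audio + json", 1, 1),
    ("Text + text", 1, 0),
    ("Text + json", 2, 0),
    ("Image pair", 1, 2),
    ("Image + spatial", 1, 1),
]

total_tasks = sum(count for _, count, _ in CATEGORIES)
file_upload_tasks = sum(count for _, count, binary in CATEGORIES if binary == 1)

print(f"/v1/predict covers {total_tasks}/{total_tasks} tasks")
print(f"/v1/predict/file covers {file_upload_tasks}/{total_tasks} tasks")
```

Under that reading, the excluded tasks are the 12 text-only tasks and the single image-pair task.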

MCP Server

A standalone script (scripts/mcp_server.py) that deliberately avoids importing winml.modelkit, so that loading heavy ML dependencies cannot cause startup timeouts. It queries /v1/models at startup and generates one MCP tool per loaded model.
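A minimal sketch of the one-tool-per-model idea follows. The shape of the /v1/models entries (`model_id` and `task` keys), the per-task input schemas, and the naming rule are assumptions for illustration. The `name`, `description`, and `inputSchema` keys follow the MCP tool definition format.

```python
import re

# Hypothetical per-task JSON Schemas; the real script may derive these differently.
TASK_INPUT_SCHEMAS = {
    "image-classification": {
        "type": "object",
        "properties": {"image_path": {"type": "string"}},
        "required": ["image_path"],
    },
    "text-classification": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}


def tool_name(model_id: str) -> str:
    """Derive an identifier-like tool name from a model id."""
    return re.sub(r"[^a-z0-9]+", "_", model_id.lower()).strip("_")


def tools_from_models(models: list) -> list:
    """Build one MCP tool definition per loaded model with a known task."""
    tools = []
    for entry in models:
        schema = TASK_INPUT_SCHEMAS.get(entry["task"])
        if schema is None:
            continue  # skip tasks this sketch has no schema for
        tools.append({
            "name": tool_name(entry["model_id"]),
            "description": f"Run {entry['task']} with {entry['model_id']}",
            "inputSchema": schema,
        })
    return tools
```

Nothing here imports an ML library, which matches the stated goal of keeping the MCP process lightweight.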

Other Changes

NPU auto-precision: int8 → w8a16 (uint8 weights + uint16 activations)
hub_models.json: expanded catalog with quantization metadata, accuracy drop_pct → delta_display
CLI options: ep_option removed free_text, device_option added SUPPORTED_DEVICES_WITH_AUTO

Documentation

docs/winml-run-task-coverage.md — CLI task coverage table + flowchart
docs/winml-serve-task-coverage.md — REST API task coverage table + flowchart
docs/design-input-schema.md — Registry-driven input schema design (updated)
docs/serve-architecture.md — Architecture diagrams (updated)

Sample Output

`winml run --model microsoft/resnet-50 --file cat.jpg`

Task:    image-classification
Model:   microsoft/resnet-50
Device:  auto

Results:
   1. tabby                          0.4312
   2. tiger_cat                      0.3876
   3. Egyptian_cat                   0.0911

Latency: 45.2ms

REST API

# Image classification (file upload)
curl -X POST localhost:8000/v1/predict/file -F "file=@cat.jpg"

# Zero-shot (file + labels)
curl -X POST localhost:8000/v1/predict/file \
  -F "file=@cat.jpg" \
  -F 'inputs={"candidate_labels":["cat","dog","bird"]}'

# Question answering (JSON)
curl -X POST localhost:8000/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs":{"question":"Who?","context":"Tim Cook is CEO."}}'

# Multi-model management
curl -X POST localhost:8000/v1/models \
  -H "Content-Type: application/json" \
  -d '{"model_id":"ProsusAI/finbert","task":"text-classification"}'

…MCP support

Add a phased local inference server and one-shot CLI runner:
- Phase 0: CLI-as-API wrapper (winml serve with no model arg)
- Phase 1: warm single-model inference (winml serve <model>)
- Phase 2: idle-timeout auto-unload/reload
- Phase 3: multi-model slot manager with LRU eviction (--multi)
- winml run: one-shot inference with auto-connect to running server
- MCP server: standalone Claude Desktop integration (scripts/mcp_server.py)
- Demo UI: interactive web dashboard at /demo
- NPU auto-precision changed from int8 to w8a16
- hub_models.json catalog expanded with quantization metadata

…to qiowu/add_winml_serve

…WinML-ModelKit into qiowu/add_winml_serve

- fix CodeQL warnings: add ... to Protocol methods, reformat empty except comments
- remove cuda from valid EPs (not supported)
- rename --reload to --auto-reload in serve command
- rename --model-url to --server-url in MCP server script
- add --host option to winml run for remote server auto-connect
- move onnxruntime-windowsml to optional extras to avoid conflict with onnxruntime-qnn

…it tests

Bug fixes:
- Fix race condition in predict() task override: use local variables instead of mutating self, so concurrent threads via run_in_executor don't clobber each other's schema/mapping/task state
- Add per-slot asyncio.Lock to ModelSlotManager.borrow() to serialize inference per model (mirrors SingleModelManager pattern)
- Fix XSS in dashboard: add escJs() for inline onclick handlers, enhance escHtml() to escape quotes, escape all user-controllable interpolations
- Fix --connect --text for non-"text" tasks: probe GET /v1/schema to resolve the correct field name (e.g. "question" for QA)
- Fix _print_result mutating caller's dict by stripping mask in-place
- Fix _postprocess_segmentation KeyError on missing "label" key
- Narrow except Exception to specific types in model load endpoint
- Set _model_id for .onnx files in load_schema_only

PR #354 review feedback (timenick):
- Fix test imports to use public API (winml.modelkit.inference) per CLAUDE.md
- Address concurrency, XSS, --connect, and import comments

New tests (56 cases):
- tests/unit/inference/test_engine.py: param discovery, schema override, task override thread-safety, load_schema_only, kwarg filtering
- tests/unit/inference/test_tasks.py: masked_mean_pool, segmentation and sentence-similarity postprocess, registry entries
- tests/unit/inference/test_pipeline.py: tokenizer padding patterns A/B/C, image processor adaptation, multi-modal shape scanning
- tests/unit/commands/test_run_spec.py: segmentation formatting, example command --task flag, _print_result no-mutation
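The per-slot locking fix mentioned in this commit can be illustrated with a small sketch. The class and function names are hypothetical and this is not the code in ModelSlotManager.borrow(). It only shows the general pattern of one asyncio.Lock per model around a blocking call that runs in the default executor.

```python
import asyncio


class SlotLocks:
    """Hypothetical helper: lazily creates one asyncio.Lock per model id."""

    def __init__(self):
        self._locks = {}

    def lock_for(self, model_id: str) -> asyncio.Lock:
        # No await between the check and the insert, so this is safe
        # when called from coroutines on a single event loop.
        if model_id not in self._locks:
            self._locks[model_id] = asyncio.Lock()
        return self._locks[model_id]


async def run_serialized(locks: SlotLocks, model_id: str, fn, *args):
    """Run blocking fn in a worker thread, one call at a time per model."""
    async with locks.lock_for(model_id):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, fn, *args)
```

Requests for different models still overlap, since each model id gets its own lock. Only calls that target the same model are queued behind each other.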

timenick

Import rule violations (CLAUDE.md)

Per the CLAUDE.md source-code rule, imports in src/ must route through the package's __init__.py public API. inference/__init__.py already exports BINARY_TYPES, TASK_REGISTRY, and PredictionResult in __all__, but these spots bypass it and reach into submodules:

src/winml/modelkit/commands/run.py lines 69, 151, 348 — from ..inference.tasks import BINARY_TYPES → from ..inference import BINARY_TYPES
src/winml/modelkit/serve/app.py line 42 — from ..inference.types import PredictionResult → from ..inference import PredictionResult
src/winml/modelkit/serve/app.py lines 722, 777, 801 — from ..inference.tasks import TASK_REGISTRY/BINARY_TYPES → from ..inference import ...
src/winml/modelkit/serve/schema_generator.py line 17 — from ..inference.tasks import BINARY_TYPES, TASK_REGISTRY → from ..inference import ...

run.py:610 already uses from ..inference import InferenceEngine correctly — the newer spots just need to match that pattern.
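As an aside, violations of this kind could be detected mechanically. The sketch below uses the standard `ast` module to flag relative imports that reach past a package's `__init__.py`. The function name and the default package name are assumptions, and this is not a tool that exists in the repository.

```python
import ast


def find_deep_imports(source: str, package: str = "inference") -> list:
    """Return (line, statement) pairs for relative imports into package submodules."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ImportFrom):
            continue
        if node.level == 0 or not node.module:
            continue  # absolute import, or a bare "from . import x"
        parts = node.module.split(".")
        if parts[0] == package and len(parts) > 1:
            names = ", ".join(alias.name for alias in node.names)
            stmt = f"from {'.' * node.level}{node.module} import {names}"
            hits.append((node.lineno, stmt))
    return sorted(hits)
```

Running it on a file containing `from ..inference.tasks import BINARY_TYPES` reports that line, while `from ..inference import InferenceEngine` passes.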

🤖 Generated with Claude Code

DingmaomaoBJTU requested a review from a team as a code owner April 15, 2026 06:18

github-advanced-security AI found potential problems Apr 15, 2026

xieofxie reviewed Apr 15, 2026

Comment thread src/winml/modelkit/commands/run.py Outdated

xieofxie reviewed Apr 15, 2026

Comment thread src/winml/modelkit/commands/serve.py Outdated

xieofxie reviewed Apr 15, 2026

Comment thread src/winml/modelkit/commands/serve.py Outdated

xieofxie reviewed Apr 15, 2026

Comment thread src/winml/modelkit/commands/serve.py

xieofxie reviewed Apr 15, 2026

Comment thread src/winml/modelkit/commands/run.py

xieofxie reviewed Apr 15, 2026

Comment thread src/winml/modelkit/serve/app.py

DingmaomaoBJTU added 6 commits April 16, 2026 15:47

fix comments

d295f28

fix comments and enrich schema support

a625a51

Merge branch 'main' into qiowu/add_winml_serve

702b83e

Merge branch 'main' of https://github.com/microsoft/WinML-ModelKit in…

24386db

…to qiowu/add_winml_serve

fix image processing bug

10dbd68

Merge branch 'qiowu/add_winml_serve' of https://github.com/microsoft/…

5f45dab

…WinML-ModelKit into qiowu/add_winml_serve

DingmaomaoBJTU mentioned this pull request Apr 20, 2026

Refactor: extract common CLI options into shared decorators #369

Open

7 tasks

DingmaomaoBJTU added 2 commits April 20, 2026 14:13

update

4cd2916

merge

dde82a7