feat: add winml serve & winml run with MCP support #354

Merged
DingmaomaoBJTU merged 14 commits into main from qiowu/add_winml_serve on Apr 22, 2026

@DingmaomaoBJTU DingmaomaoBJTU commented Apr 15, 2026

Collaborator

Summary

Add a local REST inference server (winml serve) and one-shot CLI runner (winml run) with MCP integration for Claude Desktop.

Design

Architecture — Phased Server

winml serve                                     → Phase 0: CLI-as-API wrapper
winml serve --model microsoft/resnet-50         → Phase 1: warm single-model inference
winml serve --model resnet-50 --idle-timeout 300 → Phase 2: idle auto-unload/reload
winml serve --model resnet-50 --multi           → Phase 3: multi-model slot manager + LRU eviction

All phases share one FastAPI app, selected at startup based on args. Routes depend only on the ModelManager protocol — Phase 1→3 migration is a drop-in swap.
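
The drop-in swap described above can be illustrated with a structural protocol. This is only a sketch: the class names ModelManager, SingleModelManager, and ModelSlotManager come from this PR, but the method names (loaded_models, predict), their signatures, and the handle_predict function are assumptions made for illustration, not the actual code in serve/manager.py.

```python
import asyncio
from typing import Any, Protocol


class ModelManager(Protocol):
    """Hypothetical minimal surface the routes would depend on."""

    def loaded_models(self) -> list[str]: ...

    async def predict(self, model_id: str, inputs: dict[str, Any]) -> dict[str, Any]: ...


class SingleModelManager:
    """Phase 1 stand-in: exactly one warm model."""

    def __init__(self, model_id: str) -> None:
        self._model_id = model_id

    def loaded_models(self) -> list[str]:
        return [self._model_id]

    async def predict(self, model_id: str, inputs: dict[str, Any]) -> dict[str, Any]:
        # A real manager would call the inference engine here.
        return {"model": model_id, "inputs": inputs, "manager": "single"}


class ModelSlotManager:
    """Phase 3 stand-in: several resident models."""

    def __init__(self, model_ids: list[str]) -> None:
        self._model_ids = list(model_ids)

    def loaded_models(self) -> list[str]:
        return list(self._model_ids)

    async def predict(self, model_id: str, inputs: dict[str, Any]) -> dict[str, Any]:
        return {"model": model_id, "inputs": inputs, "manager": "slots"}


async def handle_predict(
    manager: ModelManager, model_id: str, inputs: dict[str, Any]
) -> dict[str, Any]:
    """Route-like function that knows only the protocol, not the concrete class."""
    if model_id not in manager.loaded_models():
        return {"error": f"model not loaded: {model_id}"}
    return await manager.predict(model_id, inputs)
```

Because handle_predict is typed against the protocol, either manager can be passed in without touching the route code, which is the property the phased design relies on.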

commands/serve.py  ─── CLI entry point (click)
commands/run.py    ─── one-shot inference (--connect to serve or embedded)
serve/
├── app.py             ─── FastAPI app factory + routes (Phase 1-3)
├── cli_api.py         ─── Phase 0 CLI wrapper
├── manager.py         ─── ModelManager protocol, SingleModelManager, ModelSlotManager
├── schema.py          ─── Pydantic request/response models
├── static/index.html  ─── Demo UI dashboard
inference/
├── engine.py          ─── InferenceEngine (HF Pipeline + ORT backend)
├── tasks.py           ─── TASK_REGISTRY: 30 HF tasks + 5 aliases
├── schema_generator.py─── OpenAI/MCP tool definitions
├── types.py           ─── PredictionResult, Prediction
scripts/mcp_server.py  ─── Standalone MCP server for Claude Desktop

See docs/serve-architecture.md for full architecture diagrams.
See docs/design-input-schema.md for the registry-driven input schema design.
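
For readers unfamiliar with LRU eviction as used by the Phase 3 slot manager, here is a minimal sketch of the general technique. The LRUSlots class, its capacity parameter, and the loader callback are hypothetical names chosen for this example; the real ModelSlotManager may track slots differently.

```python
from collections import OrderedDict
from typing import Any, Callable


class LRUSlots:
    """Keeps at most `capacity` models resident; evicts the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[str], Any]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._loader = loader          # hypothetical: loads a model by id
        self._slots = OrderedDict()    # oldest entry first, newest last
        self.evicted = []              # record of evicted ids, for inspection

    def get(self, model_id: str) -> Any:
        if model_id in self._slots:
            # Cache hit: mark as most recently used.
            self._slots.move_to_end(model_id)
            return self._slots[model_id]
        if len(self._slots) >= self._capacity:
            # Cache full: drop the least recently used slot before loading.
            victim, _ = self._slots.popitem(last=False)
            self.evicted.append(victim)
        model = self._loader(model_id)
        self._slots[model_id] = model
        return model

    def resident(self) -> list:
        return list(self._slots)
```

With capacity 2, requesting a, b, a, c in that order evicts b, because a was touched more recently than b when c arrives.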

Registry-Driven Input Schema (30 HF Tasks)

All input parsing is driven by TASK_REGISTRY — zero hardcoded task branches. Each task defines user_inputs (what the user provides) and PipelineMapping (how to call the HF pipeline):

| Category | Tasks | CLI shortcut |
| --- | --- | --- |
| Single image (6) | image-classification, object-detection, segmentation, depth, features, image-to-image | --file |
| Single text (9) | text-classification, token-clf, fill-mask, text-gen, text2text, features, summarization, translation, TTS | --text |
| Single audio (2) | audio-classification, ASR | --file |
| Single video (1) | video-classification | --file |
| Image + text (4) | VQA, doc-QA, image-text-to-text, image-to-text | --file --text |
| Image + json (2) | zero-shot-image-clf, zero-shot-object-det | --file -I candidate_labels='[...]' |
| Audio + json (1) | zero-shot-audio-clf | --file -I candidate_labels='[...]' |
| Text + text (1) | question-answering | -I question=... -I context=... |
| Text + json (2) | zero-shot-clf, table-QA | --text -I candidate_labels/table='...' |
| Image pair (1) | keypoint-matching | -I image_0=@a -I image_1=@b |
| Image + spatial (1) | mask-generation (SAM) | --file [-I input_points='...'] |
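
To make the registry-driven idea concrete, here is a small sketch of how a registry entry could turn user inputs into pipeline call arguments without per-task branches. The names TASK_REGISTRY, user_inputs, and PipelineMapping appear in this PR; the TaskSpec class, the positional/keyword field layout, and build_call are assumptions for illustration and likely differ from inference/tasks.py.

```python
from dataclasses import dataclass


@dataclass
class PipelineMapping:
    """Hypothetical: which inputs go positionally and which as keywords."""
    positional: tuple = ()
    keyword: tuple = ()


@dataclass
class TaskSpec:
    user_inputs: dict          # input name -> type ("image", "text", "json")
    mapping: PipelineMapping


# Three illustrative entries; the real registry covers 30 tasks.
TASK_REGISTRY = {
    "image-classification": TaskSpec(
        {"image": "image"}, PipelineMapping(positional=("image",))
    ),
    "question-answering": TaskSpec(
        {"question": "text", "context": "text"},
        PipelineMapping(keyword=("question", "context")),
    ),
    "zero-shot-image-classification": TaskSpec(
        {"image": "image", "candidate_labels": "json"},
        PipelineMapping(positional=("image",), keyword=("candidate_labels",)),
    ),
}


def build_call(task: str, inputs: dict):
    """Return (args, kwargs) for the pipeline call, driven only by the registry."""
    spec = TASK_REGISTRY[task]
    missing = [name for name in spec.user_inputs if name not in inputs]
    if missing:
        raise ValueError(f"missing inputs for {task}: {missing}")
    args = tuple(inputs[name] for name in spec.mapping.positional)
    kwargs = {name: inputs[name] for name in spec.mapping.keyword}
    return args, kwargs
```

Adding a task under this scheme means adding one registry entry; build_call itself never changes.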

winml run — Embedded + Connect Mode

winml run --model microsoft/resnet-50 --file cat.jpg

By default, runs embedded inference (loads model in-process). With --connect, probes a running winml serve instance first:

  1. Probes localhost:8000/v1/health (0.5s timeout)
  2. If a matching winml serve is running → routes request there (zero model load time)
  3. Otherwise → falls back to embedded inference

All input combinations (--file, --text, --input, -P) are forwarded to the server — file uploads go to /v1/predict/file (with extra inputs via the inputs form field), text/named inputs go to /v1/predict (JSON).
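
The probe-then-fall-back flow can be sketched with the standard library alone. The /v1/health path and the 0.5s timeout are taken from the description above; the shape of the health response (a JSON object with a "models" list) and the function names are assumptions, so treat this as an outline rather than what commands/run.py does.

```python
import http.client
import json
import urllib.error
import urllib.request


def probe_server(base_url: str = "http://localhost:8000", timeout: float = 0.5):
    """Return the parsed /v1/health JSON, or None if no usable server answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, http.client.HTTPException, ValueError):
        # Connection refused, timeout, malformed HTTP, or non-JSON body.
        return None


def choose_mode(model_id: str, health) -> str:
    """Pick 'connect' only when the server reports the requested model.

    Assumes a hypothetical health payload such as {"models": ["microsoft/resnet-50"]}.
    """
    if isinstance(health, dict) and model_id in health.get("models", []):
        return "connect"
    return "embedded"
```

A caller would run choose_mode(model_id, probe_server()) and load the model in-process whenever the result is "embedded", so a missing or mismatched server never blocks the command.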

REST API — /v1/predict/file Input Support

The file upload endpoint supports structured form fields beyond file and text:

| Field | Type | Purpose |
| --- | --- | --- |
| file | binary | Media file (image/audio/video), max 20 MB |
| text | string | Auto-mapped to sole text field in schema |
| inputs | JSON string | Additional named inputs (e.g., {"candidate_labels": ["cat","dog"]}) |
| params | JSON string | Pipeline parameters (e.g., {"top_k": 5}) |

This enables zero-shot tasks and SAM via file upload:

curl -F "file=@cat.jpg" -F 'inputs={"candidate_labels":["cat","dog"]}' localhost:8000/v1/predict/file

Coverage: /v1/predict (JSON) covers 30/30 tasks. /v1/predict/file covers 17/30 (all tasks with exactly 1 binary input).
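
One way the text, inputs, and params form fields might be combined on the server side is sketched below. The field names come from the table above; the merge_form_inputs function, the schema shape (input name mapped to a type string), and the rule that explicit named inputs win over the text shortcut are assumptions for illustration.

```python
import json
from typing import Optional


def merge_form_inputs(
    schema: dict,
    text: Optional[str] = None,
    inputs: Optional[str] = None,
    params: Optional[str] = None,
):
    """Return (named_inputs, pipeline_params) from raw form-field strings.

    `schema` is a hypothetical mapping of input name -> type, for example
    {"image": "image", "question": "text"}. The binary file is handled elsewhere.
    """
    named = json.loads(inputs) if inputs else {}
    pipeline_params = json.loads(params) if params else {}
    if not isinstance(named, dict) or not isinstance(pipeline_params, dict):
        raise ValueError("inputs and params must each be a JSON object")
    if text is not None:
        text_fields = [name for name, kind in schema.items() if kind == "text"]
        if len(text_fields) != 1:
            raise ValueError("text needs exactly one text field in the schema")
        # Assumption: an explicit entry in `inputs` takes priority over `text`.
        named.setdefault(text_fields[0], text)
    return named, pipeline_params
```

For a VQA-style schema, merge_form_inputs({"image": "image", "question": "text"}, text="What is this?") maps the text onto the question field, which mirrors the auto-mapping behavior the table describes.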

MCP Server

Standalone script (scripts/mcp_server.py) that doesn't import winml.modelkit to avoid heavy ML dependency timeouts. Queries /v1/models at startup and generates one MCP tool per loaded model.
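
The one-tool-per-model step could look roughly like the following. MCP tool definitions carry a name, a description, and an inputSchema in JSON Schema form; everything else here is assumed for the sake of the example, including the shape of the /v1/models entries (id, task, inputs), the predict_ name prefix, and the choice to type every property as a string.

```python
import re


def tool_name(model_id: str) -> str:
    """Derive a tool-safe identifier from a model id (hypothetical naming rule)."""
    return "predict_" + re.sub(r"[^a-zA-Z0-9_]", "_", model_id).lower()


def tools_from_models(models: list) -> list:
    """Build one MCP-style tool definition per loaded model.

    Each entry in `models` is assumed to look like
    {"id": "...", "task": "...", "inputs": {"<name>": "<type>"}}.
    """
    tools = []
    for model in models:
        input_names = sorted(model.get("inputs", {}))
        tools.append(
            {
                "name": tool_name(model["id"]),
                "description": f"Run {model['task']} with {model['id']} via winml serve.",
                "inputSchema": {
                    "type": "object",
                    # Simplification: file paths and text are both passed as strings.
                    "properties": {name: {"type": "string"} for name in input_names},
                    "required": input_names,
                },
            }
        )
    return tools
```

Because this needs only the JSON returned by the server, it fits the stated goal of keeping the MCP script free of heavy ML imports.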

Other Changes

  • NPU auto-precision: int8 → w8a16 (uint8 weights + uint16 activations)
  • hub_models.json: expanded catalog with quantization metadata, accuracy drop_pct → delta_display
  • CLI options: ep_option removed free_text, device_option added SUPPORTED_DEVICES_WITH_AUTO

Documentation

Sample Output

winml run --model microsoft/resnet-50 --file cat.jpg

Task:    image-classification
Model:   microsoft/resnet-50
Device:  auto

Results:
   1. tabby                          0.4312
   2. tiger_cat                      0.3876
   3. Egyptian_cat                   0.0911

Latency: 45.2ms

REST API

# Image classification (file upload)
curl -X POST localhost:8000/v1/predict/file -F "file=@cat.jpg"

# Zero-shot (file + labels)
curl -X POST localhost:8000/v1/predict/file \
  -F "file=@cat.jpg" \
  -F 'inputs={"candidate_labels":["cat","dog","bird"]}'

# Question answering (JSON)
curl -X POST localhost:8000/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs":{"question":"Who?","context":"Tim Cook is CEO."}}'

# Multi-model management
curl -X POST localhost:8000/v1/models \
  -H "Content-Type: application/json" \
  -d '{"model_id":"ProsusAI/finbert","task":"text-classification"}'

…MCP support

Add a phased local inference server and one-shot CLI runner:

- Phase 0: CLI-as-API wrapper (winml serve with no model arg)
- Phase 1: warm single-model inference (winml serve <model>)
- Phase 2: idle-timeout auto-unload/reload
- Phase 3: multi-model slot manager with LRU eviction (--multi)
- winml run: one-shot inference with auto-connect to running server
- MCP server: standalone Claude Desktop integration (scripts/mcp_server.py)
- Demo UI: interactive web dashboard at /demo
- NPU auto-precision changed from int8 to w8a16
- hub_models.json catalog expanded with quantization metadata
@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner April 15, 2026 06:18
Comment thread src/winml/modelkit/serve/manager.py Fixed
Comment thread src/winml/modelkit/serve/manager.py Fixed
Comment thread src/winml/modelkit/serve/manager.py Fixed
Comment thread src/winml/modelkit/serve/manager.py Fixed
Comment thread src/winml/modelkit/serve/manager.py Fixed
Comment thread src/winml/modelkit/serve/manager.py Fixed
Comment thread src/winml/modelkit/serve/manager.py Fixed
Comment thread src/winml/modelkit/serve/manager.py Fixed
Comment thread src/winml/modelkit/commands/run.py Outdated
Comment thread src/winml/modelkit/commands/serve.py Outdated
Comment thread src/winml/modelkit/commands/serve.py Outdated
Comment thread src/winml/modelkit/commands/serve.py
Comment thread src/winml/modelkit/commands/run.py
Comment thread src/winml/modelkit/serve/app.py
Comment thread scripts/mcp_server.py Outdated
Comment thread src/winml/modelkit/commands/run.py
Comment thread src/winml/modelkit/commands/serve.py
Comment thread src/winml/modelkit/data/hub_models.json
Comment thread src/winml/modelkit/inference/tasks.py
Comment thread src/winml/modelkit/serve/app.py Outdated
Comment thread src/winml/modelkit/serve/schema.py
Comment thread pyproject.toml
Comment thread pyproject.toml
- fix CodeQL warnings: add ... to Protocol methods, reformat empty except comments
- remove cuda from valid EPs (not supported)
- rename --reload to --auto-reload in serve command
- rename --model-url to --server-url in MCP server script
- add --host option to winml run for remote server auto-connect
- move onnxruntime-windowsml to optional extras to avoid conflict with onnxruntime-qnn
Comment thread src/winml/modelkit/serve/app.py
Comment thread tests/unit/serve/test_predict_file_inputs.py Outdated
Comment thread src/winml/modelkit/commands/run.py Outdated
Comment thread src/winml/modelkit/serve/static/index.html Outdated
…it tests

Bug fixes:
- Fix race condition in predict() task override: use local variables instead
  of mutating self, so concurrent threads via run_in_executor don't clobber
  each other's schema/mapping/task state
- Add per-slot asyncio.Lock to ModelSlotManager.borrow() to serialize
  inference per model (mirrors SingleModelManager pattern)
- Fix XSS in dashboard: add escJs() for inline onclick handlers, enhance
  escHtml() to escape quotes, escape all user-controllable interpolations
- Fix --connect --text for non-"text" tasks: probe GET /v1/schema to resolve
  the correct field name (e.g. "question" for QA)
- Fix _print_result mutating caller's dict by stripping mask in-place
- Fix _postprocess_segmentation KeyError on missing "label" key
- Narrow except Exception to specific types in model load endpoint
- Set _model_id for .onnx files in load_schema_only
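
The per-slot locking idea from the list above can be sketched as follows. Only the use of asyncio.Lock and the borrow() name come from the commit message; the SlotLocks class, its internals, and the demo coroutine are hypothetical and exist only to show that requests for the same model are serialized while different models proceed independently.

```python
import asyncio
from contextlib import asynccontextmanager


class SlotLocks:
    """Hands out one asyncio.Lock per model id."""

    def __init__(self) -> None:
        self._locks = {}

    @asynccontextmanager
    async def borrow(self, model_id: str):
        # Create the lock lazily; no await between check and insert, so this is safe
        # within a single event loop.
        if model_id not in self._locks:
            self._locks[model_id] = asyncio.Lock()
        async with self._locks[model_id]:
            yield model_id


async def demo() -> dict:
    """Run overlapping requests and report peak concurrency per model."""
    slots = SlotLocks()
    active = {"a": 0, "b": 0}
    peak = {"a": 0, "b": 0}

    async def infer(model_id: str) -> None:
        async with slots.borrow(model_id):
            active[model_id] += 1
            peak[model_id] = max(peak[model_id], active[model_id])
            await asyncio.sleep(0.01)  # stand-in for inference work
            active[model_id] -= 1

    await asyncio.gather(infer("a"), infer("a"), infer("a"), infer("b"), infer("b"))
    return peak
```

Running asyncio.run(demo()) should report a peak of 1 for each model, since a second request for the same model waits until the first releases the lock.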

PR #354 review feedback (timenick):
- Fix test imports to use public API (winml.modelkit.inference) per CLAUDE.md
- Address concurrency, XSS, --connect, and import comments

New tests (56 cases):
- tests/unit/inference/test_engine.py: param discovery, schema override,
  task override thread-safety, load_schema_only, kwarg filtering
- tests/unit/inference/test_tasks.py: masked_mean_pool, segmentation and
  sentence-similarity postprocess, registry entries
- tests/unit/inference/test_pipeline.py: tokenizer padding patterns A/B/C,
  image processor adaptation, multi-modal shape scanning
- tests/unit/commands/test_run_spec.py: segmentation formatting, example
  command --task flag, _print_result no-mutation
Comment thread src/winml/modelkit/commands/run.py Fixed
Comment thread tests/unit/inference/test_pipeline.py Fixed
Comment thread tests/unit/inference/test_pipeline.py Fixed
Comment thread tests/unit/inference/test_pipeline.py Fixed
Comment thread tests/unit/inference/test_pipeline.py Fixed
Comment thread tests/unit/inference/test_pipeline.py Fixed
Comment thread tests/unit/inference/test_pipeline.py Fixed
Comment thread tests/unit/inference/test_pipeline.py Fixed

@timenick timenick left a comment

Collaborator

Import rule violations (CLAUDE.md)

Per the CLAUDE.md source-code rule, imports in src/ must route through the package's __init__.py public API. inference/__init__.py already exports BINARY_TYPES, TASK_REGISTRY, and PredictionResult in __all__, but these spots bypass it and reach into submodules:

  • src/winml/modelkit/commands/run.py lines 69, 151, 348: from ..inference.tasks import BINARY_TYPES → from ..inference import BINARY_TYPES
  • src/winml/modelkit/serve/app.py line 42: from ..inference.types import PredictionResult → from ..inference import PredictionResult
  • src/winml/modelkit/serve/app.py lines 722, 777, 801: from ..inference.tasks import TASK_REGISTRY/BINARY_TYPES → from ..inference import ...
  • src/winml/modelkit/serve/schema_generator.py line 17: from ..inference.tasks import BINARY_TYPES, TASK_REGISTRY → from ..inference import ...

run.py:610 already uses from ..inference import InferenceEngine correctly — the newer spots just need to match that pattern.

🤖 Generated with Claude Code

Comment thread src/winml/modelkit/serve/static/index.html
Comment thread src/winml/modelkit/commands/serve.py
Comment thread src/winml/modelkit/commands/serve.py
@DingmaomaoBJTU DingmaomaoBJTU merged commit a3348bc into main Apr 22, 2026
9 checks passed
@DingmaomaoBJTU DingmaomaoBJTU deleted the qiowu/add_winml_serve branch April 22, 2026 02:23
4 participants