feat: add winml serve & winml run with MCP support #354
Merged
…MCP support
Add a phased local inference server and one-shot CLI runner:
- Phase 0: CLI-as-API wrapper (winml serve with no model arg)
- Phase 1: warm single-model inference (winml serve <model>)
- Phase 2: idle-timeout auto-unload/reload
- Phase 3: multi-model slot manager with LRU eviction (--multi)
- winml run: one-shot inference with auto-connect to running server
- MCP server: standalone Claude Desktop integration (scripts/mcp_server.py)
- Demo UI: interactive web dashboard at /demo
- NPU auto-precision changed from int8 to w8a16
- hub_models.json catalog expanded with quantization metadata
xieofxie
reviewed
Apr 15, 2026
…to qiowu/add_winml_serve
…WinML-ModelKit into qiowu/add_winml_serve
7 tasks
xieofxie
reviewed
Apr 20, 2026
- fix CodeQL warnings: add ... to Protocol methods, reformat empty except comments
- remove cuda from valid EPs (not supported)
- rename --reload to --auto-reload in serve command
- rename --model-url to --server-url in MCP server script
- add --host option to winml run for remote server auto-connect
- move onnxruntime-windowsml to optional extras to avoid conflict with onnxruntime-qnn
xieofxie
approved these changes
Apr 21, 2026
timenick
reviewed
Apr 21, 2026
…it tests
Bug fixes:
- Fix race condition in predict() task override: use local variables instead of mutating self, so concurrent threads via run_in_executor don't clobber each other's schema/mapping/task state
- Add per-slot asyncio.Lock to ModelSlotManager.borrow() to serialize inference per model (mirrors SingleModelManager pattern)
- Fix XSS in dashboard: add escJs() for inline onclick handlers, enhance escHtml() to escape quotes, escape all user-controllable interpolations
- Fix --connect --text for non-"text" tasks: probe GET /v1/schema to resolve the correct field name (e.g. "question" for QA)
- Fix _print_result mutating caller's dict by stripping mask in-place
- Fix _postprocess_segmentation KeyError on missing "label" key
- Narrow except Exception to specific types in model load endpoint
- Set _model_id for .onnx files in load_schema_only
PR #354 review feedback (timenick):
- Fix test imports to use public API (winml.modelkit.inference) per CLAUDE.md
- Address concurrency, XSS, --connect, and import comments
New tests (56 cases):
- tests/unit/inference/test_engine.py: param discovery, schema override, task override thread-safety, load_schema_only, kwarg filtering
- tests/unit/inference/test_tasks.py: masked_mean_pool, segmentation and sentence-similarity postprocess, registry entries
- tests/unit/inference/test_pipeline.py: tokenizer padding patterns A/B/C, image processor adaptation, multi-modal shape scanning
- tests/unit/commands/test_run_spec.py: segmentation formatting, example command --task flag, _print_result no-mutation
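The per-slot lock mentioned in the bug fixes above can be pictured with a small sketch. This is not the code from the PR: SlotLocks, fake_infer, and run_pair are hypothetical names, and the sleep stands in for a real inference call. It only illustrates the idea of serializing calls per model while leaving other models free to run.

```python
import asyncio
from collections import defaultdict
from contextlib import asynccontextmanager


class SlotLocks:
    """Hypothetical helper: one asyncio.Lock per model id, created on first use."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)

    @asynccontextmanager
    async def borrow(self, model_id):
        # Callers for the same model queue up here; other models are unaffected.
        async with self._locks[model_id]:
            yield model_id


async def fake_infer(locks, model_id, tag, log):
    async with locks.borrow(model_id):
        log.append(f"{tag}:start")
        await asyncio.sleep(0.01)  # stands in for the real inference call
        log.append(f"{tag}:end")


async def run_pair(model_a, model_b):
    """Run two concurrent calls and return the order of their start/end events."""
    locks, log = SlotLocks(), []
    await asyncio.gather(
        fake_infer(locks, model_a, "a", log),
        fake_infer(locks, model_b, "b", log),
    )
    return log
```

Calling asyncio.run(run_pair("m", "m")) runs the two calls back to back, while asyncio.run(run_pair("m1", "m2")) lets them overlap because each model id has its own lock.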
timenick
reviewed
Apr 22, 2026
timenick
left a comment
Collaborator
Import rule violations (CLAUDE.md)
Per the CLAUDE.md source-code rule, imports in src/ must route through the package's __init__.py public API. inference/__init__.py already exports BINARY_TYPES, TASK_REGISTRY, and PredictionResult in __all__, but these spots bypass it and reach into submodules:
- src/winml/modelkit/commands/run.py lines 69, 151, 348: from ..inference.tasks import BINARY_TYPES → from ..inference import BINARY_TYPES
- src/winml/modelkit/serve/app.py line 42: from ..inference.types import PredictionResult → from ..inference import PredictionResult
- src/winml/modelkit/serve/app.py lines 722, 777, 801: from ..inference.tasks import TASK_REGISTRY/BINARY_TYPES → from ..inference import ...
- src/winml/modelkit/serve/schema_generator.py line 17: from ..inference.tasks import BINARY_TYPES, TASK_REGISTRY → from ..inference import ...
run.py:610 already uses from ..inference import InferenceEngine correctly — the newer spots just need to match that pattern.
🤖 Generated with Claude Code
timenick
approved these changes
Apr 22, 2026
Summary
Add a local REST inference server (winml serve) and one-shot CLI runner (winml run) with MCP integration for Claude Desktop.
Design
Architecture — Phased Server
All phases share one FastAPI app, selected at startup based on args. Routes depend only on the ModelManager protocol, so the Phase 1→3 migration is a drop-in swap.
See docs/serve-architecture.md for full architecture diagrams.
See docs/design-input-schema.md for the registry-driven input schema design.
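To make the "drop-in swap" concrete, here is a minimal sketch of routes coded against a protocol rather than a concrete manager. The class names ModelManager, SingleModelManager, and ModelSlotManager and the borrow() method come from this PR, but the signatures, the string "sessions", the loaded() helper, and the predict() function are assumptions made for illustration, not the actual implementation.

```python
from __future__ import annotations

import asyncio
from collections import OrderedDict
from contextlib import AbstractAsyncContextManager, asynccontextmanager
from typing import AsyncIterator, Protocol


class ModelManager(Protocol):
    """Assumed shape: the only surface a route is allowed to touch."""

    def borrow(self, model_id: str) -> AbstractAsyncContextManager[str]:
        ...


class SingleModelManager:
    """Phase 1 stand-in: one warm model; the requested id is ignored."""

    def __init__(self, model_id: str) -> None:
        self._session = f"session:{model_id}"  # placeholder for a real session

    @asynccontextmanager
    async def borrow(self, model_id: str) -> AsyncIterator[str]:
        yield self._session


class ModelSlotManager:
    """Phase 3 stand-in: bounded slots with least-recently-used eviction."""

    def __init__(self, max_slots: int) -> None:
        self._max_slots = max_slots
        self._slots: OrderedDict[str, str] = OrderedDict()

    def loaded(self) -> list[str]:
        return list(self._slots)

    @asynccontextmanager
    async def borrow(self, model_id: str) -> AsyncIterator[str]:
        if model_id in self._slots:
            self._slots.move_to_end(model_id)  # mark as most recently used
        else:
            if len(self._slots) >= self._max_slots:
                self._slots.popitem(last=False)  # evict the least recently used
            self._slots[model_id] = f"session:{model_id}"
        yield self._slots[model_id]


async def predict(manager: ModelManager, model_id: str, payload: str) -> str:
    """Route-like function written only against the protocol."""
    async with manager.borrow(model_id) as session:
        return f"{session} handled {payload}"
```

Because predict() only calls borrow(), either manager can be passed in without changing the route code, which is the property the design relies on.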
Registry-Driven Input Schema (30 HF Tasks)
All input parsing is driven by TASK_REGISTRY, with zero hardcoded task branches. Each task defines user_inputs (what the user provides) and PipelineMapping (how to call the HF pipeline):
- --file
- --text
- --file
- --file
- --file --text
- --file -I candidate_labels='[...]'
- --file -I candidate_labels='[...]'
- -I question=... -I context=...
- --text -I candidate_labels/table='...'
- -I image_0=@a -I image_1=@b
- --file [-I input_points='...']
winml run: Embedded + Connect Mode
By default, runs embedded inference (loads model in-process). With --connect, probes a running winml serve instance first:
- localhost:8000/v1/health (0.5s timeout)
- winml serve is running → routes request there (zero model load time)
All input combinations (--file, --text, --input, -P) are forwarded to the server: file uploads go to /v1/predict/file (with extra inputs via the inputs form field), and text/named inputs go to /v1/predict (JSON).
REST API: /v1/predict/file Input Support
The file upload endpoint supports structured form fields beyond file and text:
- file
- text
- inputs (example: {"candidate_labels": ["cat","dog"]})
- params (example: {"top_k": 5})
This enables zero-shot tasks and SAM via file upload.
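A zero-shot request through the file endpoint might be assembled as in the sketch below. The endpoint path and the file, text, inputs, and params field names come from the description above. Everything else is an assumption: that inputs and params travel as JSON-encoded strings, that the response is JSON, and the helper names build_form_fields and predict_file, which are hypothetical. The upload itself uses the third-party requests library.

```python
import json
from typing import Any, Dict, Optional


def build_form_fields(
    text: Optional[str] = None,
    inputs: Optional[Dict[str, Any]] = None,
    params: Optional[Dict[str, Any]] = None,
) -> Dict[str, str]:
    """Build the non-file form fields; dict values are JSON-encoded (assumed)."""
    fields: Dict[str, str] = {}
    if text is not None:
        fields["text"] = text
    if inputs:
        fields["inputs"] = json.dumps(inputs)
    if params:
        fields["params"] = json.dumps(params)
    return fields


def predict_file(base_url: str, path: str, **kwargs: Any) -> Any:
    """POST a file plus extra form fields to /v1/predict/file."""
    import requests  # third-party; imported lazily so the helper above needs only stdlib

    with open(path, "rb") as fh:
        resp = requests.post(
            f"{base_url}/v1/predict/file",
            files={"file": fh},
            data=build_form_fields(**kwargs),
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()  # assumes the server answers with JSON
```

With a server running, a call such as predict_file("http://localhost:8000", "cat.jpg", inputs={"candidate_labels": ["cat", "dog"]}, params={"top_k": 5}) would send the image together with the candidate labels in one multipart request.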
Coverage: /v1/predict (JSON) covers 30/30 tasks. /v1/predict/file covers 17/30 (all tasks with exactly 1 binary input).
MCP Server
Standalone script (scripts/mcp_server.py) that doesn't import winml.modelkit, to avoid heavy ML dependency timeouts. It queries /v1/models at startup and generates one MCP tool per loaded model.
Other Changes
- NPU auto-precision: int8 → w8a16 (uint8 weights + uint16 activations)
- drop_pct → delta_display
- ep_option removed
- free_text, device_option added
- SUPPORTED_DEVICES_WITH_AUTO
Documentation
Sample Output
winml run --model microsoft/resnet-50 --file cat.jpg
REST API