fix inference bugs, add cache-aware e2e quality gates#381
Merged
Conversation
…t tests
Bug fixes:
- Fix NER numpy.float32 serialization: add _sanitize_numpy() to convert
numpy scalars to Python types before pydantic/JSON serialization
- Fix cache-key-prefixed build directory loading: _find_build_artifacts()
supports both plain (model.onnx) and prefixed ({key}_model.onnx) layouts
- Fix task-mismatch cache loading: _find_build_artifacts(task=) filters
by manifest task to avoid using wrong ONNX (e.g. feature-extraction
model for text-classification request); fallback to HF rebuild
- Redirect stdout to stderr during engine.load() in CLI to prevent
build-pipeline prints from contaminating --format json output
E2E quality gates (187 tests, all passing):
- Merge test_run_e2e.py + test_run_hub_models_e2e.py into 3-tier
quality gate: Feature (14) + Schema (35) + Inference (35)
- Add test_serve_e2e.py: 4-tier quality gate with 103 tests
- Cache-aware: tests use ~/.cache/winml/artifacts/ when available,
falling back to HF model ID for uncached models
- Fix TestClient fixture: use context manager to trigger FastAPI lifespan
- Move shared test_image/runner fixtures to conftest.py
- Per-file timeout(3600) for e2e tests, global 300s unchanged
Unit tests:
- TestSanitizeNumpy: 6 tests for numpy scalar conversion
- TestFindBuildArtifacts: 7 tests including task filtering
- TestNormalizeNEROutput: verify NER output is JSON-serializable
timenick
reviewed
Apr 23, 2026
timenick
left a comment
Collaborator
There was a problem hiding this comment.
Inline comments on 7 issues flagged against CLAUDE.md / tests/CLAUDE.md. Core fixes (_sanitize_numpy, _find_build_artifacts task filter, HF rebuild fallback) look correct — findings are mostly test-layout duplication plus a couple of behavior nits.
🤖 Generated with Claude Code
- Move shared e2e helpers (_HUB_DATA, _unique_pairs, _PAIRS, _pytest_id, _find_cache_dir, _resolve_model_arg, _SAMPLE_TEXT, _TEXT_BY_FIELD) to conftest.py as single source of truth (comments 1+2) - Remove duplicate test_image fixture from test_serve_e2e.py (comment 1) - Replace pytest.skip with pytest.xfail for missing-input cases so coverage gaps are visible, not hidden (comment 3) - Fix redirect_stdout comment to match unconditional behavior (comment 4) - Document why _extract_json is needed despite redirect_stdout — some C-extension code writes directly to fd 1, bypassing sys.stdout (comment 5) - Raise FileNotFoundError when task=None and multiple different-task variants exist in cache dir, preventing silent wrong-model load (comment 6) - Broaden except to (FileNotFoundError, JSONDecodeError, KeyError) so corrupt manifests trigger HF rebuild fallback (comment 7)
timenick
reviewed
Apr 23, 2026
timenick
left a comment
Collaborator
There was a problem hiding this comment.
Re-review of the feedback commit. All 7 previous comments were addressed substantively (helpers moved to conftest, duplicate fixture removed, xfail with actionable reasons, docstrings, _find_build_artifacts task-ambiguity guard, broadened except). Unit tests pass locally. One new blocker was introduced by moving the helper into conftest.py — see inline.
🤖 Generated with Claude Code
pytest_id in conftest.py triggers pluggy PluginValidationError because pytest treats any conftest function starting with pytest_ as a hook.
ssss141414
pushed a commit
that referenced
this pull request
Apr 27, 2026
## Summary
- Fix NER `numpy.float32` serialization error in serve path —
`_sanitize_numpy()` converts numpy scalars to Python types before
pydantic/JSON serialization
- Fix cache-key-prefixed build directory loading —
`_find_build_artifacts()` supports both plain (`model.onnx`) and
cache-prefixed (`{key}_model.onnx`) layouts, enabling `engine.load()` to
load directly from `~/.cache/winml/artifacts/`
- Fix task-mismatch cache loading — `_find_build_artifacts(task=)`
filters by manifest task, avoiding wrong ONNX model (e.g.
feature-extraction head for text-classification); falls back to HF
rebuild when no match found
- Redirect stdout→stderr during `engine.load()` in CLI to prevent
build-pipeline prints from contaminating `--format json` output
- Merge `test_run_e2e.py` + `test_run_hub_models_e2e.py` into 3-tier
quality gate (Feature 14 + Schema 35 + Inference 35 = 84 tests)
- Add `test_serve_e2e.py` 4-tier quality gate (103 tests)
- Cache-aware e2e: tests use `~/.cache/winml/artifacts/` when available,
per-file `timeout(3600)` for e2e
- 14 new unit tests for `_sanitize_numpy`, `_find_build_artifacts`, NER
output normalization
All 187 e2e tests passing (run 84 + serve 103).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
numpy.float32serialization error in serve path —_sanitize_numpy()converts numpy scalars to Python types before pydantic/JSON serialization_find_build_artifacts()supports both plain (model.onnx) and cache-prefixed ({key}_model.onnx) layouts, enablingengine.load()to load directly from~/.cache/winml/artifacts/_find_build_artifacts(task=)filters by manifest task, avoiding wrong ONNX model (e.g. feature-extraction head for text-classification); falls back to HF rebuild when no match foundengine.load()in CLI to prevent build-pipeline prints from contaminating--format jsonoutputtest_run_e2e.py+test_run_hub_models_e2e.pyinto 3-tier quality gate (Feature 14 + Schema 35 + Inference 35 = 84 tests)test_serve_e2e.py4-tier quality gate (103 tests)~/.cache/winml/artifacts/when available, per-filetimeout(3600)for e2e_sanitize_numpy,_find_build_artifacts, NER output normalizationAll 187 e2e tests passing (run 84 + serve 103).
🤖 Generated with Claude Code