feat(telemetry): Phase 3 — CLI auto-instrumentation, persistent cache, .winml paths by timenick · Pull Request #399 · microsoft/winml-cli

timenick · 2026-04-27T07:28:31Z

Summary

PR 3 of 4 of the ModelKit telemetry rollout. Three concerns landed in
one PR after follow-up discussion:

Phase 3 proper — wire telemetry into the CLI

ActionGroup (subclass of click.Group) wraps each registered
subcommand's invoke to emit ModelKitHeartbeat (once per CLI
invocation), ModelKitAction on completion, and ModelKitError on
unhandled exception. No per-command decoration is required.
LazyGroup in cli.py now extends ActionGroup, so winml is both
lazy-loading and auto-instrumented without any startup-time regression.
cli.py registers Telemetry.shutdown via ctx.call_on_close to flush
the BatchLogRecordProcessor before exit, guarded so it never builds
a fresh singleton on the way out.
Heartbeat emission lives inside the wrapped subcommand invoke so
winml --help, winml --version, and winml <cmd> --help short-circuit
inside Click's parameter parsing and never materialize the Telemetry
singleton.
No --disable-telemetry flag, no env-var override, no winml telemetry
subcommand. Consent is via editing the config file directly.

Close Phase 2 cache-wiring gap

Phase 2 added base64+json encoding helpers and an msvcrt file lock to
utils.py but never wired them up. The result: a network blip during a
CLI invocation lost the entire batch, and the design's "no telemetry data
is lost due to transient network issues" guarantee was unmet.

New telemetry/_cache.py with _PersistentCache (append-or-drain,
locked) at %USERPROFILE%\.winml elemetry\modelkit.cache. Honors
MODELKIT_TELEMETRY_CACHE_DIR env-var override (developer-facing only).
OneCollectorLogExporter now drains the cache on first export
(best-effort, single shot per process), persists the current batch on
POST failure, and re-caches cached envelopes that fail recovery.
Telemetry clears the cache when init resolves to disabled (consent
declined / empty iKey / init crash) so a disabled session never
resends events the user has since opted out of.
tests/unit/telemetry/test_cache_integration.py locks in the
end-to-end design contract — exists specifically to catch the kind of
wiring gap that slipped Phase 2 review.

Unify user-config paths under `.winml`

Phase 2 shipped consent at %USERPROFILE%\.modelkit\config.json. With
the cache also living under %USERPROFILE%\.winml elemetry\, both
move under .winml for a single inspectable directory. No migration
shim — Phase 2 hasn't shipped to end users.

Official build pipeline

.pipelines/modelkit-official-build.yml: build sdist BEFORE iKey
injection (sdist stays clean for PyPI), inject key into
constants.py from the INSTRUMENTATION_KEY secret variable, build
wheel AFTER injection (wheel carries the key), then verify both:
wheel contains a non-empty key, sdist contains the empty placeholder.

Notes

Plan called for cls=ActionGroup directly on the main group; aligned
with the existing LazyGroup (added in MVP v2 for winml --help
startup speed) by making LazyGroup extend ActionGroup.
Heartbeat originally specified inside ActionGroup.invoke; moved
into the wrapped subcommand invoke because ctx.invoked_subcommand
is set by click.Group.invoke after it's called, and putting
heartbeat at the subcommand entry point also correctly skips
cli build --help paths.
Cache wiring is a Phase 2 follow-up that was never assigned to a
later phase; surfaced during this PR's review and pulled in here
rather than left as a known gap.

… on ctx close

… early exit

…evel imports - Move _reset_singleton (autouse, with safe shutdown) and env-setup enabled_telemetry to tests/unit/telemetry/conftest.py - Add running_telemetry fixture for tests that need an eagerly initialized instance - Drop duplicated fixtures from test_telemetry_init.py, test_telemetry_emit.py, test_telemetry_shutdown.py, test_click_group.py - Mirror autouse reset in tests/integration/telemetry/conftest.py - Switch test imports of Telemetry / ActionGroup to package-level form per tests/CLAUDE.md import rules

…shutdown test - cli.py: import the telemetry submodule directly (one alias hop instead of walking parent-package -> submodule). - click_group.py: drop dead defensive None-check in _param; click.Context.params is always a dict. - test_cli_integration: rename test_help_path_does_not_materialize... to test_shutdown_telemetry_does_not_materialize_singleton and call the function directly. The previous version short-circuited inside Click's --help eager option, so call_on_close was never registered and the guard was never actually exercised.

…leanup - Remove inner try/except wrappers around telemetry.log_heartbeat / log_action / log_error in click_group.py — Phase 2's Telemetry methods already swallow exceptions internally and are documented to never raise. The outer wrappers were defense-in-depth duplication. - Drop the now-unused 'import logging' and _LOGGER constant. - Replace 'noqa: D102' on resolve_command with a one-line docstring. - Rename _capture_logger -> _with_mock_logger in test_click_group.py to match the existing _with_mock_provider helper in test_telemetry_shutdown.

…wheel build Adds .pipelines/inject_ikey.py — a small helper that replaces the empty placeholder in src/winml/modelkit/telemetry/constants.py with the real iKey at build time. Reads the key from the INSTRUMENTATION_KEY env var (mapped from the pipeline's secret variable via the script step's `env:` block), fails loudly if the placeholder is missing or the env var is empty, and never logs the key value. Wires the helper into modelkit-official-build.yml as a new step between 'Install build tools' and 'Build sdist and wheel'. Dev installs and source checkouts continue to ship an empty placeholder, so only official wheels carry the key.

A single string-replace in one file doesn't justify a separate Python helper. Move the logic into an inline PowerShell step that matches the existing 'Copy runtime check rules' pattern in the same file. Same fail-fast contract (env empty / placeholder missing throws), same no-key-in-logs property, one fewer file to maintain.

…eel iKey Restructure the official build steps to match the security flow: 1. Build sdist BEFORE iKey injection (with empty placeholder). Sdist exposes source to anyone who downloads it, so it must never carry the real iKey. Lands in $(Agent.TempDirectory) so it is NOT picked up as a published artifact. 2. Inject iKey into constants.py. 3. Build wheel ONLY (--wheel) into ob_outputDirectory. This is the only artifact that gets published. 4. Verify the produced wheel contains a non-empty iKey by reading constants.py directly out of the .whl zip. 5. twine check the wheel. Net effect: the published wheel always carries the iKey when INSTRUMENTATION_KEY is provided to the pipeline; the sdist (if anyone retrieves it from build logs) stays clean.

# Conflicts: # .pipelines/modelkit-official-build.yml

The official build now produces only the wheel — sdist served no purpose beyond a packaging sanity check, and skipping it shaves two build steps and removes the Agent.TempDirectory dance. Source distribution is still possible via dev installs / source checkout.

PyPI distribution needs both wheel and sdist. Restore the sdist build into ob_outputDirectory (so the drop carries it for the release / PyPI publish step). Critical ordering: sdist is built BEFORE iKey injection so the source archive ships with the empty placeholder; wheel is built AFTER injection so it carries the real iKey. Strengthen the verify step to inspect both archives: - wheel must contain a non-empty iKey - sdist must contain the empty placeholder (defensive against any future change that accidentally rebuilds sdist post-inject) Broaden 'twine check' glob to include the .tar.gz.

Phase 2 added base64+json encoding helpers and an msvcrt file lock to utils.py but never wired them up. The result: a network blip during a CLI invocation lost the entire batch, and the design's 'no telemetry data is lost due to transient network issues' guarantee was unmet. This change: - Adds telemetry/_cache.py with _PersistentCache (append-or-drain, locked) and a configurable path. Default location is %USERPROFILE%\.winml\telemetry\modelkit.json. - Honors MODELKIT_TELEMETRY_CACHE_DIR env-var override (developer- facing only; not user-controllable consent). - Wires OneCollectorLogExporter to drain the cache on first export (best-effort, single shot per process), persist current batch on POST failure, and re-cache cached envelopes that fail recovery. - Wires Telemetry singleton init to clear the cache when telemetry resolves to disabled (consent declined / empty iKey / init crash) so a disabled session never resends events the user opted out of. - Adds tests/unit/telemetry/test_cache.py for the cache primitive. - Adds tests/unit/telemetry/test_cache_integration.py for the end-to-end design contract — exists specifically to lock in the integration that slipped through the original Phase 2 review.

Phase 2 shipped consent at %USERPROFILE%\.modelkit\config.json. Now that the cache lives at %USERPROFILE%\.winml\telemetry\modelkit.json, move consent under the same root for a single inspectable directory. No migration logic — Phase 2 hasn't shipped to end users; internal testers will see the first-run prompt one more time. Acceptable trade-off for a clean naming.

… paths Both consent.py and _cache.py had the same fallback pattern: profile = USERPROFILE or HOMEDRIVE+HOMEPATH return Path(profile) / .winml / ... which silently resolves to a CWD-relative path when all three env vars are missing (rare, but possible in stripped service-account containers). - Hoist _resolve_user_home() to utils.py with a None return when no home is resolvable. - consent: _CONFIG_PATH becomes Optional[Path]; _load_config and _write_stored_consent treat None as 'no persistence' instead of crashing or writing to CWD. - _cache: _cache_dir / _cache_file / _PersistentCache._path all become Optional; every public method short-circuits to a no-op when path is None. Tests for the no-home case added. - Stale '.modelkit' docstring in test_cli_integration cleaned up.

The file content is base64-encoded JSON per line, not a JSON document. The .cache extension is honest about that and avoids confusing users who open the file expecting parseable JSON.

DingmaomaoBJTU · 2026-04-28T09:47:53Z

Code review

Found 2 issues:

sys.exit() failures are logged as success=True. In wrapped_invoke, the except Exception block does not catch SystemExit (which inherits from BaseException, not Exception). Commands that call sys.exit(1) — such as analyze.py which uses sys.exit(1) through sys.exit(4) — will have their ModelKitAction event emitted with success=True in the finally block, because the success = False assignment is never reached.

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/src/winml/modelkit/telemetry/click_group.py#L74-L90

_reset_singleton fixture is duplicated across two conftest.py files with nearly identical implementations. tests/CLAUDE.md says "Never: Duplicate fixtures across multiple conftest.py files" and "Put shared fixtures in the narrowest conftest.py that covers all consumers." The fixture should live in a single conftest.py at a common ancestor (e.g. tests/conftest.py).

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/tests/unit/telemetry/conftest.py#L15-L34

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/tests/integration/telemetry/conftest.py#L18-L35

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

SystemExit inherits from BaseException, not Exception, so sys.exit(1) slipped past the wrapped_invoke handler in ActionGroup. The finally block emitted ModelKitAction with success=True for every command that exited non-zero (e.g. analyze using sys.exit(1) through sys.exit(2)). Catch SystemExit explicitly and mark non-zero codes as failure. No log_error — SystemExit is intentional, not a crash.

The autouse singleton-reset fixture was duplicated across tests/unit/telemetry/conftest.py and tests/integration/telemetry/ conftest.py. Per tests/CLAUDE.md, shared fixtures should live in the narrowest conftest.py that covers all consumers — tests/conftest.py. The fixture is a no-op for tests where _INSTANCE is None, so the global autouse adds no real cost to non-telemetry tests but closes the door on accidental singleton leak from any future test that touches Telemetry.

Three test modules still pointed readers to the old `_reset_singleton` name in `tests/unit/telemetry/conftest.py`. Update them to the post-hoist name (`_reset_telemetry_singleton`) and location (`tests/conftest.py`).

`ctx.exit(N)` raises `click.exceptions.Exit`, which inherits from RuntimeError and so is caught by the catch-all `except Exception` handler. That handler unconditionally calls `log_error` and forces `success=False` — wrong on both counts: * `ctx.exit(0)` is a clean intentional exit, not a crash, and was being recorded with success=False and a spurious ModelKitError. * `ctx.exit(N)` non-zero is an intentional failure exit, not a Python crash, and was emitting ModelKitError with a meaningless stack trace. Mirror the SystemExit handler: success reflects the exit code, no log_error. Currently latent — only `run.py` uses `ctx.exit` and `run` is in `_DISABLED_COMMANDS` — but Click documents `ctx.exit` as the recommended exit mechanism, so any future command that adopts it would otherwise re-introduce the bug.

timenick · 2026-04-29T03:24:05Z

Code review

Found 2 issues:

sys.exit() failures are logged as success=True. In wrapped_invoke, the except Exception block does not catch SystemExit (which inherits from BaseException, not Exception). Commands that call sys.exit(1) — such as analyze.py which uses sys.exit(1) through sys.exit(4) — will have their ModelKitAction event emitted with success=True in the finally block, because the success = False assignment is never reached.

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/src/winml/modelkit/telemetry/click_group.py#L74-L90

_reset_singleton fixture is duplicated across two conftest.py files with nearly identical implementations. tests/CLAUDE.md says "Never: Duplicate fixtures across multiple conftest.py files" and "Put shared fixtures in the narrowest conftest.py that covers all consumers." The fixture should live in a single conftest.py at a common ancestor (e.g. tests/conftest.py).

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/tests/unit/telemetry/conftest.py#L15-L34

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/tests/integration/telemetry/conftest.py#L18-L35

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

All addressed

## Summary PR 4 of 4 of the ModelKit telemetry rollout. User-facing documentation only: - **`docs/Privacy.md`** (new): canonical privacy statement — what the three event types (`ModelKitHeartbeat` / `ModelKitAction` / `ModelKitError`) carry, common context attributes, the explicit NOT-collected list, the consent model (first-run prompt + `%USERPROFILE%\.winml\config.json`), CI auto-disable env vars, the failure cache file location, and the dev-install no-op invariant. - **`README.md`**: short Data / Telemetry section summarizing the consent model and linking to `docs/Privacy.md`. Paths in both docs reflect Phase 3 reality (`.winml`, not `.modelkit`) and the persistent failure cache wired in PR #399. The dev-only `MODELKIT_TELEMETRY_CACHE_DIR` override is intentionally omitted from the end-user docs. ## Plan reference - [Phase 4 plan](docs/superpowers/plans/2026-04-17-modelkit-telemetry-phase4.md) - Phase 1 #367 / Phase 2 #371 / Phase 3 #399 — all merged

timenick added 3 commits April 27, 2026 14:32

feat(telemetry): add ActionGroup auto-instrumentation for Click commands

c50d292

feat(telemetry): expose ActionGroup from package __init__

0fd5912

feat(telemetry): wire ActionGroup into LazyGroup + Telemetry.shutdown…

6075672

… on ctx close

timenick requested a review from a team as a code owner April 27, 2026 07:28

github-advanced-security AI found potential problems Apr 27, 2026

View reviewed changes

Comment thread src/winml/modelkit/cli.py Fixed

Comment thread tests/unit/telemetry/test_click_group.py Fixed

Comment thread tests/unit/telemetry/test_click_group.py Fixed

timenick added 15 commits April 27, 2026 15:35

fix(telemetry): annotate intentional empty-except blocks for CodeQL

1afb979

fix(telemetry): guard shutdown against materializing the singleton on…

7d39178

… early exit

Merge remote-tracking branch 'origin/main' into zhiwang/telemetry-phase3

3df6918

Merge remote-tracking branch 'origin/main' into zhiwang/telemetry-phase3

84f56b4

# Conflicts: # .pipelines/modelkit-official-build.yml

Merge branch 'main' into zhiwang/telemetry-phase3

c32836c

timenick changed the title ~~feat(telemetry): Phase 3 — wire ActionGroup auto-instrumentation into the CLI~~ feat(telemetry): Phase 3 — CLI auto-instrumentation, persistent cache, .winml paths Apr 28, 2026

github-advanced-security AI found potential problems Apr 28, 2026

View reviewed changes

Comment thread src/winml/modelkit/telemetry/_cache.py Fixed

timenick added 4 commits April 28, 2026 14:10

fix(telemetry): annotate _PersistentCache.clear empty-except for CodeQL

363c372

rename: telemetry cache file modelkit.json -> modelkit.cache

55bec97

The file content is base64-encoded JSON per line, not a JSON document. The .cache extension is honest about that and avoids confusing users who open the file expecting parseable JSON.

Merge branch 'main' into zhiwang/telemetry-phase3

49b69e5

timenick added 3 commits April 29, 2026 09:15

review: update stale fixture-location comments after hoist

ab05293

Three test modules still pointed readers to the old `_reset_singleton` name in `tests/unit/telemetry/conftest.py`. Update them to the post-hoist name (`_reset_telemetry_singleton`) and location (`tests/conftest.py`).

DingmaomaoBJTU approved these changes Apr 29, 2026

View reviewed changes

timenick merged commit 77d553b into main Apr 29, 2026
9 checks passed

timenick deleted the zhiwang/telemetry-phase3 branch April 29, 2026 03:24

timenick mentioned this pull request Apr 29, 2026

docs(telemetry): Phase 4 — Privacy statement + README section #417

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(telemetry): Phase 3 — CLI auto-instrumentation, persistent cache, .winml paths#399

feat(telemetry): Phase 3 — CLI auto-instrumentation, persistent cache, .winml paths#399
timenick merged 26 commits into
mainfrom
zhiwang/telemetry-phase3

timenick commented Apr 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DingmaomaoBJTU commented Apr 28, 2026

Uh oh!

timenick commented Apr 29, 2026

Code review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

timenick commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Phase 3 proper — wire telemetry into the CLI

Close Phase 2 cache-wiring gap

Unify user-config paths under .winml

Official build pipeline

Notes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DingmaomaoBJTU commented Apr 28, 2026

Code review

Uh oh!

timenick commented Apr 29, 2026

Code review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

timenick commented Apr 27, 2026 •

edited

Loading

Unify user-config paths under `.winml`