Skip to content

feat(telemetry): Phase 3 — CLI auto-instrumentation, persistent cache, .winml paths#399

Merged
timenick merged 26 commits into
mainfrom
zhiwang/telemetry-phase3
Apr 29, 2026
Merged

feat(telemetry): Phase 3 — CLI auto-instrumentation, persistent cache, .winml paths#399
timenick merged 26 commits into
mainfrom
zhiwang/telemetry-phase3

Conversation

@timenick

@timenick timenick commented Apr 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

PR 3 of 4 of the ModelKit telemetry rollout. Three concerns landed in
one PR after follow-up discussion:

Phase 3 proper — wire telemetry into the CLI

  • ActionGroup (subclass of click.Group) wraps each registered
    subcommand's invoke to emit ModelKitHeartbeat (once per CLI
    invocation), ModelKitAction on completion, and ModelKitError on
    unhandled exception. No per-command decoration is required.
  • LazyGroup in cli.py now extends ActionGroup, so winml is both
    lazy-loading and auto-instrumented without any startup-time regression.
  • cli.py registers Telemetry.shutdown via ctx.call_on_close to flush
    the BatchLogRecordProcessor before exit, guarded so it never builds
    a fresh singleton on the way out.
  • Heartbeat emission lives inside the wrapped subcommand invoke so
    winml --help, winml --version, and winml <cmd> --help short-circuit
    inside Click's parameter parsing and never materialize the Telemetry
    singleton.
  • No --disable-telemetry flag, no env-var override, no winml telemetry
    subcommand. Consent is via editing the config file directly.

Close Phase 2 cache-wiring gap

Phase 2 added base64+json encoding helpers and an msvcrt file lock to
utils.py but never wired them up. The result: a network blip during a
CLI invocation lost the entire batch, and the design's "no telemetry data
is lost due to transient network issues" guarantee was unmet.

  • New telemetry/_cache.py with _PersistentCache (append-or-drain,
    locked) at %USERPROFILE%\.winml elemetry\modelkit.cache. Honors
    MODELKIT_TELEMETRY_CACHE_DIR env-var override (developer-facing only).
  • OneCollectorLogExporter now drains the cache on first export
    (best-effort, single shot per process), persists the current batch on
    POST failure, and re-caches cached envelopes that fail recovery.
  • Telemetry clears the cache when init resolves to disabled (consent
    declined / empty iKey / init crash) so a disabled session never
    resends events the user has since opted out of.
  • tests/unit/telemetry/test_cache_integration.py locks in the
    end-to-end design contract — exists specifically to catch the kind of
    wiring gap that slipped Phase 2 review.

Unify user-config paths under .winml

Phase 2 shipped consent at %USERPROFILE%\.modelkit\config.json. With
the cache also living under %USERPROFILE%\.winml elemetry\, both
move under .winml for a single inspectable directory. No migration
shim — Phase 2 hasn't shipped to end users.

Official build pipeline

  • .pipelines/modelkit-official-build.yml: build sdist BEFORE iKey
    injection (sdist stays clean for PyPI), inject key into
    constants.py from the INSTRUMENTATION_KEY secret variable, build
    wheel AFTER injection (wheel carries the key), then verify both:
    wheel contains a non-empty key, sdist contains the empty placeholder.

Notes

  • Plan called for cls=ActionGroup directly on the main group; aligned
    with the existing LazyGroup (added in MVP v2 for winml --help
    startup speed) by making LazyGroup extend ActionGroup.
  • Heartbeat originally specified inside ActionGroup.invoke; moved
    into the wrapped subcommand invoke because ctx.invoked_subcommand
    is set by click.Group.invoke after it's called, and putting
    heartbeat at the subcommand entry point also correctly skips
    cli build --help paths.
  • Cache wiring is a Phase 2 follow-up that was never assigned to a
    later phase; surfaced during this PR's review and pulled in here
    rather than left as a known gap.

@timenick timenick requested a review from a team as a code owner April 27, 2026 07:28
Comment thread src/winml/modelkit/cli.py Fixed
Comment thread tests/unit/telemetry/test_click_group.py Fixed
Comment thread tests/unit/telemetry/test_click_group.py Fixed
timenick added 15 commits April 27, 2026 15:35
…evel imports

- Move _reset_singleton (autouse, with safe shutdown) and env-setup
  enabled_telemetry to tests/unit/telemetry/conftest.py
- Add running_telemetry fixture for tests that need an eagerly
  initialized instance
- Drop duplicated fixtures from test_telemetry_init.py,
  test_telemetry_emit.py, test_telemetry_shutdown.py, test_click_group.py
- Mirror autouse reset in tests/integration/telemetry/conftest.py
- Switch test imports of Telemetry / ActionGroup to package-level form
  per tests/CLAUDE.md import rules
…shutdown test

- cli.py: import the telemetry submodule directly (one alias hop instead
  of walking parent-package -> submodule).
- click_group.py: drop dead defensive None-check in _param;
  click.Context.params is always a dict.
- test_cli_integration: rename test_help_path_does_not_materialize... to
  test_shutdown_telemetry_does_not_materialize_singleton and call the
  function directly. The previous version short-circuited inside Click's
  --help eager option, so call_on_close was never registered and the
  guard was never actually exercised.
…leanup

- Remove inner try/except wrappers around telemetry.log_heartbeat /
  log_action / log_error in click_group.py — Phase 2's Telemetry methods
  already swallow exceptions internally and are documented to never
  raise. The outer wrappers were defense-in-depth duplication.
- Drop the now-unused 'import logging' and _LOGGER constant.
- Replace 'noqa: D102' on resolve_command with a one-line docstring.
- Rename _capture_logger -> _with_mock_logger in test_click_group.py to
  match the existing _with_mock_provider helper in test_telemetry_shutdown.
…wheel build

Adds .pipelines/inject_ikey.py — a small helper that replaces the empty
placeholder in src/winml/modelkit/telemetry/constants.py with the real
iKey at build time. Reads the key from the INSTRUMENTATION_KEY env var
(mapped from the pipeline's secret variable via the script step's
`env:` block), fails loudly if the placeholder is missing or the env
var is empty, and never logs the key value.

Wires the helper into modelkit-official-build.yml as a new step
between 'Install build tools' and 'Build sdist and wheel'. Dev installs
and source checkouts continue to ship an empty placeholder, so only
official wheels carry the key.
A single string-replace in one file doesn't justify a separate Python
helper. Move the logic into an inline PowerShell step that matches the
existing 'Copy runtime check rules' pattern in the same file. Same
fail-fast contract (env empty / placeholder missing throws), same
no-key-in-logs property, one fewer file to maintain.
…eel iKey

Restructure the official build steps to match the security flow:

  1. Build sdist BEFORE iKey injection (with empty placeholder).
     Sdist exposes source to anyone who downloads it, so it must
     never carry the real iKey. Lands in $(Agent.TempDirectory)
     so it is NOT picked up as a published artifact.
  2. Inject iKey into constants.py.
  3. Build wheel ONLY (--wheel) into ob_outputDirectory. This is
     the only artifact that gets published.
  4. Verify the produced wheel contains a non-empty iKey by
     reading constants.py directly out of the .whl zip.
  5. twine check the wheel.

Net effect: the published wheel always carries the iKey when
INSTRUMENTATION_KEY is provided to the pipeline; the sdist
(if anyone retrieves it from build logs) stays clean.
# Conflicts:
#	.pipelines/modelkit-official-build.yml
The official build now produces only the wheel — sdist served no
purpose beyond a packaging sanity check, and skipping it shaves two
build steps and removes the Agent.TempDirectory dance. Source
distribution is still possible via dev installs / source checkout.
PyPI distribution needs both wheel and sdist. Restore the sdist build
into ob_outputDirectory (so the drop carries it for the release / PyPI
publish step). Critical ordering: sdist is built BEFORE iKey injection
so the source archive ships with the empty placeholder; wheel is built
AFTER injection so it carries the real iKey.

Strengthen the verify step to inspect both archives:
  - wheel must contain a non-empty iKey
  - sdist must contain the empty placeholder (defensive against any
    future change that accidentally rebuilds sdist post-inject)

Broaden 'twine check' glob to include the .tar.gz.
Phase 2 added base64+json encoding helpers and an msvcrt file lock to
utils.py but never wired them up. The result: a network blip during a
CLI invocation lost the entire batch, and the design's 'no telemetry
data is lost due to transient network issues' guarantee was unmet.

This change:
  - Adds telemetry/_cache.py with _PersistentCache (append-or-drain,
    locked) and a configurable path. Default location is
    %USERPROFILE%\.winml\telemetry\modelkit.json.
  - Honors MODELKIT_TELEMETRY_CACHE_DIR env-var override (developer-
    facing only; not user-controllable consent).
  - Wires OneCollectorLogExporter to drain the cache on first export
    (best-effort, single shot per process), persist current batch on
    POST failure, and re-cache cached envelopes that fail recovery.
  - Wires Telemetry singleton init to clear the cache when telemetry
    resolves to disabled (consent declined / empty iKey / init crash)
    so a disabled session never resends events the user opted out of.
  - Adds tests/unit/telemetry/test_cache.py for the cache primitive.
  - Adds tests/unit/telemetry/test_cache_integration.py for the
    end-to-end design contract — exists specifically to lock in the
    integration that slipped through the original Phase 2 review.
Phase 2 shipped consent at %USERPROFILE%\.modelkit\config.json. Now
that the cache lives at %USERPROFILE%\.winml\telemetry\modelkit.json,
move consent under the same root for a single inspectable directory.

No migration logic — Phase 2 hasn't shipped to end users; internal
testers will see the first-run prompt one more time. Acceptable
trade-off for a clean naming.
@timenick timenick changed the title feat(telemetry): Phase 3 — wire ActionGroup auto-instrumentation into the CLI feat(telemetry): Phase 3 — CLI auto-instrumentation, persistent cache, .winml paths Apr 28, 2026
Comment thread src/winml/modelkit/telemetry/_cache.py Fixed
… paths

Both consent.py and _cache.py had the same fallback pattern:
  profile = USERPROFILE or HOMEDRIVE+HOMEPATH
  return Path(profile) / .winml / ...
which silently resolves to a CWD-relative path when all three env
vars are missing (rare, but possible in stripped service-account
containers).

  - Hoist _resolve_user_home() to utils.py with a None return when
    no home is resolvable.
  - consent: _CONFIG_PATH becomes Optional[Path]; _load_config and
    _write_stored_consent treat None as 'no persistence' instead of
    crashing or writing to CWD.
  - _cache: _cache_dir / _cache_file / _PersistentCache._path all
    become Optional; every public method short-circuits to a no-op
    when path is None. Tests for the no-home case added.
  - Stale '.modelkit' docstring in test_cli_integration cleaned up.
The file content is base64-encoded JSON per line, not a JSON document.
The .cache extension is honest about that and avoids confusing users
who open the file expecting parseable JSON.
@DingmaomaoBJTU

Copy link
Copy Markdown
Collaborator

Code review

Found 2 issues:

  1. sys.exit() failures are logged as success=True. In wrapped_invoke, the except Exception block does not catch SystemExit (which inherits from BaseException, not Exception). Commands that call sys.exit(1) — such as analyze.py which uses sys.exit(1) through sys.exit(4) — will have their ModelKitAction event emitted with success=True in the finally block, because the success = False assignment is never reached.

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/src/winml/modelkit/telemetry/click_group.py#L74-L90

  1. _reset_singleton fixture is duplicated across two conftest.py files with nearly identical implementations. tests/CLAUDE.md says "Never: Duplicate fixtures across multiple conftest.py files" and "Put shared fixtures in the narrowest conftest.py that covers all consumers." The fixture should live in a single conftest.py at a common ancestor (e.g. tests/conftest.py).

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/tests/unit/telemetry/conftest.py#L15-L34

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/tests/integration/telemetry/conftest.py#L18-L35

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

SystemExit inherits from BaseException, not Exception, so sys.exit(1)
slipped past the wrapped_invoke handler in ActionGroup. The finally
block emitted ModelKitAction with success=True for every command that
exited non-zero (e.g. analyze using sys.exit(1) through sys.exit(2)).

Catch SystemExit explicitly and mark non-zero codes as failure. No
log_error — SystemExit is intentional, not a crash.
The autouse singleton-reset fixture was duplicated across
tests/unit/telemetry/conftest.py and tests/integration/telemetry/
conftest.py. Per tests/CLAUDE.md, shared fixtures should live in the
narrowest conftest.py that covers all consumers — tests/conftest.py.

The fixture is a no-op for tests where _INSTANCE is None, so the
global autouse adds no real cost to non-telemetry tests but closes
the door on accidental singleton leak from any future test that
touches Telemetry.
Three test modules still pointed readers to the old
`_reset_singleton` name in `tests/unit/telemetry/conftest.py`.
Update them to the post-hoist name (`_reset_telemetry_singleton`)
and location (`tests/conftest.py`).
`ctx.exit(N)` raises `click.exceptions.Exit`, which inherits from
RuntimeError and so is caught by the catch-all `except Exception`
handler. That handler unconditionally calls `log_error` and forces
`success=False` — wrong on both counts:

  * `ctx.exit(0)` is a clean intentional exit, not a crash, and was
    being recorded with success=False and a spurious ModelKitError.
  * `ctx.exit(N)` non-zero is an intentional failure exit, not a
    Python crash, and was emitting ModelKitError with a meaningless
    stack trace.

Mirror the SystemExit handler: success reflects the exit code, no
log_error.

Currently latent — only `run.py` uses `ctx.exit` and `run` is in
`_DISABLED_COMMANDS` — but Click documents `ctx.exit` as the
recommended exit mechanism, so any future command that adopts it
would otherwise re-introduce the bug.
@timenick

Copy link
Copy Markdown
Collaborator Author

Code review

Found 2 issues:

  1. sys.exit() failures are logged as success=True. In wrapped_invoke, the except Exception block does not catch SystemExit (which inherits from BaseException, not Exception). Commands that call sys.exit(1) — such as analyze.py which uses sys.exit(1) through sys.exit(4) — will have their ModelKitAction event emitted with success=True in the finally block, because the success = False assignment is never reached.

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/src/winml/modelkit/telemetry/click_group.py#L74-L90

  1. _reset_singleton fixture is duplicated across two conftest.py files with nearly identical implementations. tests/CLAUDE.md says "Never: Duplicate fixtures across multiple conftest.py files" and "Put shared fixtures in the narrowest conftest.py that covers all consumers." The fixture should live in a single conftest.py at a common ancestor (e.g. tests/conftest.py).

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/tests/unit/telemetry/conftest.py#L15-L34

https://github.com/microsoft/WinML-ModelKit/blob/49b69e538c4c15956dd34aeeef17e0bee8534163/tests/integration/telemetry/conftest.py#L18-L35

🤖 Generated with Claude Code

  • If this code review was useful, please react with 👍. Otherwise, react with 👎.

All addressed

@timenick timenick merged commit 77d553b into main Apr 29, 2026
9 checks passed
@timenick timenick deleted the zhiwang/telemetry-phase3 branch April 29, 2026 03:24
timenick added a commit that referenced this pull request Apr 29, 2026
## Summary

PR 4 of 4 of the ModelKit telemetry rollout. User-facing documentation
only:

- **`docs/Privacy.md`** (new): canonical privacy statement — what the
three event types (`ModelKitHeartbeat` / `ModelKitAction` /
`ModelKitError`) carry, common context attributes, the explicit
NOT-collected list, the consent model (first-run prompt +
`%USERPROFILE%\.winml\config.json`), CI auto-disable env vars, the
failure cache file location, and the dev-install no-op invariant.
- **`README.md`**: short Data / Telemetry section summarizing the
consent model and linking to `docs/Privacy.md`.

Paths in both docs reflect Phase 3 reality (`.winml`, not `.modelkit`)
and the persistent failure cache wired in PR #399. The dev-only
`MODELKIT_TELEMETRY_CACHE_DIR` override is intentionally omitted from
the end-user docs.

## Plan reference

- [Phase 4
plan](docs/superpowers/plans/2026-04-17-modelkit-telemetry-phase4.md)
- Phase 1 #367 / Phase 2 #371 / Phase 3 #399 — all merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants