Skip to content

feat(telemetry): Phase 2 — core telemetry logic (consent, scrub, singleton)#371

Merged
timenick merged 13 commits into
mainfrom
zhiwang/telemetry-phase2
Apr 27, 2026
Merged

feat(telemetry): Phase 2 — core telemetry logic (consent, scrub, singleton)#371
timenick merged 13 commits into
mainfrom
zhiwang/telemetry-phase2

Conversation

@timenick

@timenick timenick commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

Summary

Phase 2 of the telemetry rollout. Builds on Phase 1 (library/ + deviceid/, merged in #367) with the ModelKit-specific telemetry layer.

  • constants.py — empty INSTRUMENTATION_KEY placeholder; official builds inject the real iKey before packaging. No runtime / env-var override.
  • utils.py — exception scrubbing pipeline (path trim → 200-char cap → PII regex for email / GUID / IPv4 / IPv6 / long-token), structured traceback extraction (only {file, line, function}), base64 cache encoding, msvcrt exclusive file lock.
  • consent.py — first-run prompt (accept-by-default [Y/n]), JSON config at %USERPROFILE%\.modelkit\config.json with atomic writes that preserve unrelated keys, CI auto-detection via CI / TF_BUILD / GITHUB_ACTIONS / etc. (fail-closed in non-interactive environments).
  • telemetry.py — lazy singleton that wires LoggerProvider + Resource, exposes log_heartbeat / log_action / log_error with per-event allowlist enforcement, self._logger is None guards, defensive try/except on every public method, idempotent shutdown.

Still not wired to the CLIActionGroup auto-instrumentation lands in Phase 3.

Test results

  • uv run pytest tests/unit/telemetry/105 PASSED (39 from Phase 1 + 66 new).
  • uv run pytest tests/ → no regressions.
  • uv run python -c "import winml.modelkit.telemetry" → no side effects (iKey empty → Telemetry.disabled stays True).
  • uvx ruff check src/winml/modelkit/telemetry/ tests/unit/telemetry/ → clean.

Demo — captured wire format

A local-only preview harness (winml.modelkit.telemetry._demo, not shipped) monkey-patches the iKey, points consent at a temp config, and intercepts requests.Session.post so the exporter serializes CS 4.0 envelopes to stdout instead of the network. Output below (device ID + session ID redacted):

First-run consent prompt text
ModelKit can collect anonymous usage data to help improve the product.

What is collected:
  - Command name, duration, success/failure
  - Target device/EP (when the command specifies them)
  - OS, architecture, ModelKit version
  - Unhandled exception types, code locations, and scrubbed error
    messages (paths trimmed, length capped, PII patterns scrubbed)

What is never collected:
  - File paths, model contents, command arguments, credentials

Enable telemetry? [Y/n]:
ModelKitHeartbeat envelope
{
  "ver": "4.0",
  "name": "ModelKitHeartbeat",
  "time": "2026-04-21T07:12:41.603Z",
  "iKey": "o:fake-demo-tenant-token",
  "data": {},
  "ext": {
    "device": {
      "localId": "<redacted-sha256>",
      "authId": "EXISTING",
      "deviceClass": "AMD64"
    },
    "os": {
      "name": "Windows",
      "ver": "10.0.26310",
      "release": "10"
    },
    "app": {
      "ver": "0.0.1",
      "sesId": "<redacted-uuid>",
      "initTs": 1776755561.4647715
    }
  }
}
ModelKitAction envelope — happy path (build on NPU)

Extra kwargs outside the allowlist (e.g. leaked_field) are dropped before emission.

{
  "ver": "4.0",
  "name": "ModelKitAction",
  "time": "2026-04-21T07:12:41.603Z",
  "iKey": "o:fake-demo-tenant-token",
  "data": {
    "invoked_from": "Interactive",
    "action_name": "build",
    "device": "NPU",
    "ep": "QNNExecutionProvider",
    "duration_ms": 4321,
    "success": true
  },
  "ext": { "...": "same as heartbeat" }
}
ModelKitAction envelope — failure path
{
  "ver": "4.0",
  "name": "ModelKitAction",
  "time": "2026-04-21T07:12:41.603Z",
  "iKey": "o:fake-demo-tenant-token",
  "data": {
    "invoked_from": "Interactive",
    "action_name": "quantize",
    "device": null,
    "ep": null,
    "duration_ms": 87,
    "success": false
  },
  "ext": { "...": "same as heartbeat" }
}
ModelKitError envelope — scrubbed message + structured stack

Original exception was FileNotFoundError(r"model file not found: C:\Users\Alice\models\mobilenet.onnx"). Path is outside the package root, so _trim_path falls back to basename. Real ModelKit errors typically only hit the path rule; the defensive regex rules are shown separately below.

{
  "ver": "4.0",
  "name": "ModelKitError",
  "time": "2026-04-21T07:12:41.604Z",
  "iKey": "o:fake-demo-tenant-token",
  "data": {
    "exception_type": "FileNotFoundError",
    "exception_message": "model file not found: mobilenet.onnx",
    "exception_stack": [
      {
        "file": "winml/modelkit/telemetry/_demo.py",
        "line": 128,
        "function": "_run_demo"
      }
    ]
  },
  "ext": { "...": "same as heartbeat" }
}
PII scrubbing pipeline — rule-by-rule examples

Pipeline order: per-token path trim → 200-char length cap → PII regex sweep. Each rule in isolation:

Rule Input Output
Path (inside package) C:\Users\Alice\src\winml\modelkit\commands\build.py winml/modelkit/commands/build.py
Path (outside package) C:\Users\Alice\models\mobilenet.onnx mobilenet.onnx
Email failed to reach alice@example.com failed to reach <scrubbed>
GUID job 12345678-1234-5678-1234-567812345678 failed job <scrubbed> failed
IPv4 connection to 10.0.0.1 refused connection to <scrubbed> refused
IPv4 (invalid octet — preserved) expected version 1.18.999.1 expected version 1.18.999.1
IPv6 bind 2001:db8::1 failed bind <scrubbed> failed
Long token (has digit) Bearer example_token_1234567890abcdefghij Bearer <scrubbed>
Long token (no digit — preserved) WinMLImageFeatureExtractionEvaluator WinMLImageFeatureExtractionEvaluator

The path rule is what typical ModelKit errors hit. Email / IPv4 / IPv6 / GUID / long-token rules are defensive — they catch PII that might leak through from lower-level libraries (stdlib URL errors, Windows API GUIDs, upstream HTTP clients), not patterns ModelKit itself produces.

…leton)

Adds the ModelKit-specific telemetry layer on top of Phase 1's library/
and deviceid/:

- constants.py: empty INSTRUMENTATION_KEY placeholder (official builds inject)
- utils.py: exception scrubbing (path trim / 200-char cap / PII regex),
  structured traceback extraction, base64 cache encoding, msvcrt file lock
- consent.py: accept-by-default first-run prompt, JSON config at
  %USERPROFILE%\.modelkit\config.json, CI auto-detection (fail-closed)
- telemetry.py: lazy singleton wiring LoggerProvider + Resource,
  log_heartbeat / log_action / log_error with per-event allowlist,
  defensive try/except on every public method, idempotent shutdown

Still not wired to the CLI — ActionGroup lands in Phase 3.
@timenick timenick requested a review from a team as a code owner April 21, 2026 03:51
Comment thread src/winml/modelkit/telemetry/consent.py Fixed
Comment thread tests/unit/telemetry/test_utils_stack.py Fixed
- consent.py: add explanatory comment to empty except OSError in
  _write_stored_consent temp-file cleanup path
- test_utils_stack.py: add unreachable explicit return None in
  _raise_chain to avoid mixed implicit/explicit return warning

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thorough Phase 2 implementation. Clean separation of concerns (consent, scrubbing, singleton), defensive error handling throughout, and solid test coverage (66 tests). A few items worth discussing below.

Comment thread src/winml/modelkit/telemetry/utils.py Outdated
Comment thread src/winml/modelkit/telemetry/utils.py Outdated
Comment thread src/winml/modelkit/telemetry/consent.py Outdated
Comment thread src/winml/modelkit/telemetry/telemetry.py
Comment thread src/winml/modelkit/telemetry/telemetry.py
- utils.py: require a digit in the long-token regex so Pythonic class
  names (e.g. WinMLImageFeatureExtractionEvaluator) are not scrubbed
- utils.py: validate IPv4 octets to 0-255; document the remaining
  4-part-version false positive as an accepted trade-off
- consent.py: fall back to expanduser("~") if USERPROFILE is unset so
  the config never silently resolves to a CWD-relative path
- telemetry.py: comment that invoked_from="Script" is forward-compat
  under the current consent rules (non-TTY -> disabled)
- telemetry.py: note in Telemetry docstring that callers own shutdown
  lifecycle; Phase 3 ActionGroup will own it in its teardown path
- tests: regression cases for class-name preservation, digit-gated
  long token, and out-of-range IPv4
…vars

expanduser("~") is stdlib cross-platform — it checks POSIX HOME on
non-Windows. ModelKit is Windows-only, so use USERPROFILE with a fallback
to HOMEDRIVE+HOMEPATH (both native Windows vars) instead.
@timenick

timenick commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks for the review — all 5 points addressed across commits ff41e35 + 7288707:

#3 (long-token regex): Added (?=[A-Za-z0-9_\-]*\d) lookahead so the pattern only matches if there's at least one digit. Pythonic class/function names (no digits) pass through untouched; API keys / JWTs / SHA hashes / base64 blobs still match. Added regression test.

#4 (IPv4 vs version strings): Added octet range validation (0-255) — obvious nonsense like 999.888.777.666 now passes through. The 4-part version trade-off (1.18.0.1) remains: regex alone can't distinguish, and the spec's "conservative over leaky" philosophy tips toward scrubbing. Documented in a comment. Happy to revisit if Privacy pushes back either way.

#5 (USERPROFILE missing): Fall back to HOMEDRIVE + HOMEPATH (both Windows-native). ModelKit is Windows-only so avoiding os.path.expanduser("~"), which would pull in POSIX HOME semantics.

#6 (_invoked_from = "Script" dead path): Added inline comment explaining it's forward-compat — today consent returns "disabled" for non-TTY, so only "Interactive" reaches the wire. Schema stays stable if Phase 3 relaxes that rule.

#7 (no atexit for shutdown): Added a note in Telemetry's class docstring that the caller owns lifecycle and Phase 3's ActionGroup will own the teardown call. Deferring the actual atexit wiring to Phase 3 where it naturally lives.

@timenick timenick requested a review from xieofxie April 21, 2026 08:26
Comment thread src/winml/modelkit/telemetry/consent.py
Comment thread src/winml/modelkit/telemetry/consent.py
Introduces `consent_version` in `%USERPROFILE%\.modelkit\config.json`
so a materially changed notice text can trigger a re-prompt on the
next invocation. A stored version strictly older than `_CONSENT_VERSION`
is treated as unrecorded on read; records without the field are
grandfathered as the current version so existing users are not
re-prompted by the introduction of the field itself.

Addresses xieofxie's review on PR #371.
Comment thread src/winml/modelkit/telemetry/consent.py
The two constants are a pair — version bumps when prompt text's scope
changes — so keep them adjacent and document the coupling.

@xieofxie xieofxie left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trust CC

- Remove Section 7 'Current Violations': all 9 classes already renamed.
- Acronym table: fix HTP example (real classes), drop HF row (not a class prefix).
- Add canonical EP table covering CPU, CUDA, DML, MIGraphX, NvTensorRTRTX,
  OpenVINO, QNN, VitisAI with ORT full names and casing notes.
- Move HF_ constant prefix to a separate 'Other Canonical Identifiers' section.
@timenick timenick force-pushed the zhiwang/telemetry-phase2 branch from 3b12398 to 08ba3af Compare April 24, 2026 06:53
@timenick timenick merged commit 7a66c02 into main Apr 27, 2026
9 checks passed
@timenick timenick deleted the zhiwang/telemetry-phase2 branch April 27, 2026 06:21
timenick added a commit that referenced this pull request Apr 29, 2026
## Summary

PR 4 of 4 of the ModelKit telemetry rollout. User-facing documentation
only:

- **`docs/Privacy.md`** (new): canonical privacy statement — what the
three event types (`ModelKitHeartbeat` / `ModelKitAction` /
`ModelKitError`) carry, common context attributes, the explicit
NOT-collected list, the consent model (first-run prompt +
`%USERPROFILE%\.winml\config.json`), CI auto-disable env vars, the
failure cache file location, and the dev-install no-op invariant.
- **`README.md`**: short Data / Telemetry section summarizing the
consent model and linking to `docs/Privacy.md`.

Paths in both docs reflect Phase 3 reality (`.winml`, not `.modelkit`)
and the persistent failure cache wired in PR #399. The dev-only
`MODELKIT_TELEMETRY_CACHE_DIR` override is intentionally omitted from
the end-user docs.

## Plan reference

- [Phase 4
plan](docs/superpowers/plans/2026-04-17-modelkit-telemetry-phase4.md)
- Phase 1 #367 / Phase 2 #371 / Phase 3 #399 — all merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants