Skip to content

feat(telemetry): Phase 1 — OneCollector library + device ID foundation#367

Merged
timenick merged 22 commits into
mainfrom
zhiwang/telemetry-phase1
Apr 21, 2026
Merged

feat(telemetry): Phase 1 — OneCollector library + device ID foundation#367
timenick merged 22 commits into
mainfrom
zhiwang/telemetry-phase1

Conversation

@timenick

@timenick timenick commented Apr 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

PR 1 of 4 in the ModelKit telemetry rollout. Introduces the transport foundation:

  • telemetry/library/ — a custom opentelemetry-sdk log exporter that speaks Common Schema 4.0 to the OneCollector endpoint.
  • telemetry/deviceid/ — a SHA256-hashed UUID persisted per machine, used as the telemetry device_id.

Emission-free by design: no Telemetry singleton, no consent prompt, no CLI wiring yet. User-facing behavior is unchanged.

What's next

  • PR 2 (blocked on Privacy sign-off of prompt text + event schemas): constants.py + utils.py (exception / PII scrubbing) + consent.py (first-run prompt) + telemetry.py (singleton).
  • PR 3: ActionGroup auto-instrumentation + cli.py integration + --disable-telemetry flag.
  • PR 4: docs/Privacy.md + README telemetry section.

@timenick timenick requested a review from a team as a code owner April 17, 2026 06:12
Comment thread src/winml/modelkit/telemetry/deviceid/_store.py Fixed
Comment thread tests/unit/telemetry/deviceid/conftest.py Fixed

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review

Found 4 issues:

  1. Test import in test_deviceid.py reaches into internal submodule for publicly exported symbol (CLAUDE.md violation)
  2. OneCollectorLogExporter missing from __all__ in library/__init__.py (CLAUDE.md violation)
  3. Test import in test_exporter.py reaches into internal submodule for publicly exported symbol (CLAUDE.md violation)
  4. _MIN_HTTP_TIMEOUT is dead code — max(0.1, 10.0) always evaluates to 10.0

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Comment thread tests/unit/telemetry/deviceid/test_deviceid.py Outdated
Comment thread src/winml/modelkit/telemetry/library/__init__.py Outdated
Comment thread tests/unit/telemetry/library/test_exporter.py Outdated
Comment thread src/winml/modelkit/telemetry/library/exporter.py Outdated
- Export OneCollectorLogExporter in library.__all__
- Switch test_deviceid and test_exporter to package-level imports
- Drop dead _MIN_HTTP_TIMEOUT clamp; use _HTTP_TIMEOUT directly

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous 4 issues all addressed. Found 1 new issue on second pass.

Comment thread tests/unit/telemetry/library/test_factory.py Outdated
Adds a make_provider fixture that tracks created LoggerProviders and
calls shutdown() during teardown. Without this the BatchLogRecordProcessor
leaks a background daemon thread per test invocation.
@timenick timenick requested a review from xieofxie April 21, 2026 02:21
Comment thread src/winml/modelkit/telemetry/deviceid/deviceid.py Outdated
Comment thread src/winml/modelkit/telemetry/deviceid/_store.py
… helper

IdStatus subclasses str so callers can plug status straight into an
OpenTelemetry Resource (str-typed attributes) and the CS 4.0
ext.device.authId slot without a .value conversion.

Dropped get_telemetry_base_dir(): it returned %LOCALAPPDATA%\..\ but
the design places cache/consent under %USERPROFILE%\.modelkit(owned by Phase 2). The helper had no production callers; Phase 2 will
expose the correct path from utils.py/consent.py.
@timenick timenick merged commit 3098e4c into main Apr 21, 2026
9 checks passed
@timenick timenick deleted the zhiwang/telemetry-phase1 branch April 21, 2026 03:16
timenick added a commit that referenced this pull request Apr 27, 2026
…leton) (#371)

## Summary

Phase 2 of the [telemetry
rollout](../blob/main/docs/superpowers/specs/2026-04-17-modelkit-telemetry-design.md).
Builds on Phase 1 (`library/` + `deviceid/`, merged in #367) with the
ModelKit-specific telemetry layer.

- `constants.py` — empty `INSTRUMENTATION_KEY` placeholder; official
builds inject the real iKey before packaging. No runtime / env-var
override.
- `utils.py` — exception scrubbing pipeline (path trim → 200-char cap →
PII regex for email / GUID / IPv4 / IPv6 / long-token), structured
traceback extraction (only `{file, line, function}`), base64 cache
encoding, `msvcrt` exclusive file lock.
- `consent.py` — first-run prompt (accept-by-default `[Y/n]`), JSON
config at `%USERPROFILE%\.modelkit\config.json` with atomic writes that
preserve unrelated keys, CI auto-detection via `CI` / `TF_BUILD` /
`GITHUB_ACTIONS` / etc. (fail-closed in non-interactive environments).
- `telemetry.py` — lazy singleton that wires `LoggerProvider` +
`Resource`, exposes `log_heartbeat` / `log_action` / `log_error` with
per-event allowlist enforcement, `self._logger is None` guards,
defensive `try/except` on every public method, idempotent `shutdown`.

Still **not wired to the CLI** — `ActionGroup` auto-instrumentation
lands in Phase 3.

## Test results

- `uv run pytest tests/unit/telemetry/` → **105 PASSED** (39 from Phase
1 + 66 new).
- `uv run pytest tests/` → no regressions.
- `uv run python -c "import winml.modelkit.telemetry"` → no side effects
(iKey empty → `Telemetry.disabled` stays `True`).
- `uvx ruff check src/winml/modelkit/telemetry/ tests/unit/telemetry/` →
clean.

## Demo — captured wire format

A local-only preview harness (`winml.modelkit.telemetry._demo`, not
shipped) monkey-patches the iKey, points consent at a temp config, and
intercepts `requests.Session.post` so the exporter serializes CS 4.0
envelopes to stdout instead of the network. Output below (device ID +
session ID redacted):

<details>
<summary>First-run consent prompt text</summary>

```
ModelKit can collect anonymous usage data to help improve the product.

What is collected:
  - Command name, duration, success/failure
  - Target device/EP (when the command specifies them)
  - OS, architecture, ModelKit version
  - Unhandled exception types, code locations, and scrubbed error
    messages (paths trimmed, length capped, PII patterns scrubbed)

What is never collected:
  - File paths, model contents, command arguments, credentials

Enable telemetry? [Y/n]:
```

</details>

<details>
<summary>ModelKitHeartbeat envelope</summary>

```json
{
  "ver": "4.0",
  "name": "ModelKitHeartbeat",
  "time": "2026-04-21T07:12:41.603Z",
  "iKey": "o:fake-demo-tenant-token",
  "data": {},
  "ext": {
    "device": {
      "localId": "<redacted-sha256>",
      "authId": "EXISTING",
      "deviceClass": "AMD64"
    },
    "os": {
      "name": "Windows",
      "ver": "10.0.26310",
      "release": "10"
    },
    "app": {
      "ver": "0.0.1",
      "sesId": "<redacted-uuid>",
      "initTs": 1776755561.4647715
    }
  }
}
```

</details>

<details>
<summary>ModelKitAction envelope — happy path (build on NPU)</summary>

Extra kwargs outside the allowlist (e.g. `leaked_field`) are dropped
before emission.

```json
{
  "ver": "4.0",
  "name": "ModelKitAction",
  "time": "2026-04-21T07:12:41.603Z",
  "iKey": "o:fake-demo-tenant-token",
  "data": {
    "invoked_from": "Interactive",
    "action_name": "build",
    "device": "NPU",
    "ep": "QNNExecutionProvider",
    "duration_ms": 4321,
    "success": true
  },
  "ext": { "...": "same as heartbeat" }
}
```

</details>

<details>
<summary>ModelKitAction envelope — failure path</summary>

```json
{
  "ver": "4.0",
  "name": "ModelKitAction",
  "time": "2026-04-21T07:12:41.603Z",
  "iKey": "o:fake-demo-tenant-token",
  "data": {
    "invoked_from": "Interactive",
    "action_name": "quantize",
    "device": null,
    "ep": null,
    "duration_ms": 87,
    "success": false
  },
  "ext": { "...": "same as heartbeat" }
}
```

</details>

<details>
<summary>ModelKitError envelope — scrubbed message + structured
stack</summary>

Original exception was `FileNotFoundError(r"model file not found:
C:\Users\Alice\models\mobilenet.onnx")`. Path is outside the package
root, so `_trim_path` falls back to basename. Real ModelKit errors
typically only hit the path rule; the defensive regex rules are shown
separately below.

```json
{
  "ver": "4.0",
  "name": "ModelKitError",
  "time": "2026-04-21T07:12:41.604Z",
  "iKey": "o:fake-demo-tenant-token",
  "data": {
    "exception_type": "FileNotFoundError",
    "exception_message": "model file not found: mobilenet.onnx",
    "exception_stack": [
      {
        "file": "winml/modelkit/telemetry/_demo.py",
        "line": 128,
        "function": "_run_demo"
      }
    ]
  },
  "ext": { "...": "same as heartbeat" }
}
```

</details>

<details>
<summary>PII scrubbing pipeline — rule-by-rule examples</summary>

Pipeline order: per-token path trim → 200-char length cap → PII regex
sweep. Each rule in isolation:

| Rule | Input | Output |
|---|---|---|
| Path (inside package) |
`C:\Users\Alice\src\winml\modelkit\commands\build.py` |
`winml/modelkit/commands/build.py` |
| Path (outside package) | `C:\Users\Alice\models\mobilenet.onnx` |
`mobilenet.onnx` |
| Email | `failed to reach alice@example.com` | `failed to reach
<scrubbed>` |
| GUID | `job 12345678-1234-5678-1234-567812345678 failed` | `job
<scrubbed> failed` |
| IPv4 | `connection to 10.0.0.1 refused` | `connection to <scrubbed>
refused` |
| IPv4 (invalid octet — preserved) | `expected version 1.18.999.1` |
`expected version 1.18.999.1` |
| IPv6 | `bind 2001:db8::1 failed` | `bind <scrubbed> failed` |
| Long token (has digit) | `Bearer example_token_1234567890abcdefghij` |
`Bearer <scrubbed>` |
| Long token (no digit — preserved) |
`WinMLImageFeatureExtractionEvaluator` |
`WinMLImageFeatureExtractionEvaluator` |

The path rule is what typical ModelKit errors hit. Email / IPv4 / IPv6 /
GUID / long-token rules are defensive — they catch PII that might leak
through from lower-level libraries (stdlib URL errors, Windows API
GUIDs, upstream HTTP clients), not patterns ModelKit itself produces.

</details>
timenick added a commit that referenced this pull request Apr 29, 2026
## Summary

PR 4 of 4 of the ModelKit telemetry rollout. User-facing documentation
only:

- **`docs/Privacy.md`** (new): canonical privacy statement — what the
three event types (`ModelKitHeartbeat` / `ModelKitAction` /
`ModelKitError`) carry, common context attributes, the explicit
NOT-collected list, the consent model (first-run prompt +
`%USERPROFILE%\.winml\config.json`), CI auto-disable env vars, the
failure cache file location, and the dev-install no-op invariant.
- **`README.md`**: short Data / Telemetry section summarizing the
consent model and linking to `docs/Privacy.md`.

Paths in both docs reflect Phase 3 reality (`.winml`, not `.modelkit`)
and the persistent failure cache wired in PR #399. The dev-only
`MODELKIT_TELEMETRY_CACHE_DIR` override is intentionally omitted from
the end-user docs.

## Plan reference

- [Phase 4
plan](docs/superpowers/plans/2026-04-17-modelkit-telemetry-phase4.md)
- Phase 1 #367 / Phase 2 #371 / Phase 3 #399 — all merged
timenick added a commit that referenced this pull request May 7, 2026
…closed

The fixture used ``return`` instead of ``yield`` + teardown, leaking
the underlying ``requests.Session`` (and its connection pool) across
tests. Same pattern flagged in #367 and fixed there for the factory
fixture; applying it here too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants