Skip to content

fix(telemetry): OneCollector iKey wire format and 4xx handling#456

Merged
timenick merged 6 commits into
mainfrom
zhiwang/fix-onecollector-telemetry
May 7, 2026
Merged

fix(telemetry): OneCollector iKey wire format and 4xx handling#456
timenick merged 6 commits into
mainfrom
zhiwang/fix-onecollector-telemetry

Conversation

@timenick

@timenick timenick commented May 7, 2026

Copy link
Copy Markdown
Collaborator

Summary

Three telemetry fixes that came out of debugging a persistent
InvalidTenantToken 401 from OneCollector, plus a small consent-prompt
rebrand.

1. Wire-format bug (the actual cause of the 401)

The exporter was embedding the full instrumentation key in both the
x-apikey HTTP header AND the envelope's iKey field. OneCollector
expects the envelope iKey in o:<tenant_token> form (just the prefix
before the first -); sending the full ikey there gets the request
rejected with Collector-Error: Invalid Tenant Token. This silently
401-ed every event since telemetry shipped — clients don't notice
because all telemetry errors are swallowed.

serialization._envelope_ikey() now derives o:<tenant_token> from the
full ikey; the exporter calls it once at __init__ and embeds the
result in every envelope. x-apikey still carries the full ikey.

2. kill-tokens directive ignored

When OneCollector decides our tenant is misbehaving (e.g. unknown to
the tenant directory), it returns kill-tokens: o:<token>:all plus
kill-duration: N telling the client to back off for N seconds. We
were ignoring both, retrying every batch and every process startup —
exactly the noisy-client behavior the directive exists to suppress.

The exporter now parses both headers, records the kill window, and
short-circuits exports during it. Envelopes for the kill window are
dropped, not cached — caching them just guarantees forever-failure
on the next startup within the window.

3. DEBUG log on non-2xx was unactionable

Previously logged only the status code. Now also includes the
Collector-Error response header and a 200-byte body excerpt — the
two pieces of info OneCollector uses to communicate the actual
rejection reason. Without these, diagnosing a tenant or wire-format
issue required a live probe round-trip.

4. Consent prompt rebrand

ModelKitWinML CLI in the user-visible consent text.
_CONSENT_VERSION is unchanged because no scope changed.

Validation

Verified end-to-end against the live OneCollector ingest with a real
ikey: with the wire-format fix, a single Heartbeat envelope returns
status: 200, body: {"acc":1}. New unit tests cover the helper,
kill-tokens parsing, kill-window enforcement, log content, and a
regression guard that the full ikey never appears in the envelope
iKey field.

timenick added 2 commits May 7, 2026 11:08
The exporter was sending the full instrumentation key in both the
x-apikey HTTP header AND the envelope's iKey field. OneCollector
expects the envelope iKey in "o:<tenant_token>" form (just the prefix
before the first '-'), and rejects the request with InvalidTenantToken
when the full ikey is in there. This silently 401-ed every event
since telemetry shipped.

- serialization._envelope_ikey() derives "o:<tenant_token>" from the
  full ikey; exporter calls it once at __init__ and embeds the result
  in the envelope. x-apikey still gets the full ikey.
- exporter parses kill-tokens / kill-duration response headers and
  short-circuits exports for the duration of a tenant-level backoff.
  Envelopes during the kill window are dropped, not cached -- caching
  them just guarantees forever-failure on the next process startup
  within the window.
- DEBUG log on non-2xx now includes the Collector-Error header and a
  200-byte body excerpt. Without these, diagnosing tenant or wire
  format issues required a live probe round-trip.
- Consent prompt rebrand: "ModelKit" -> "WinML CLI" in the user-
  visible text. _CONSENT_VERSION unchanged because scope is identical.
- Tests cover the new helper, kill-tokens parsing, kill-window
  enforcement, log content, and a regression guard that the full ikey
  never appears in the envelope iKey field.
@timenick timenick requested a review from a team as a code owner May 7, 2026 03:09
- Use ``str.removeprefix("o:")`` instead of slice index ``[2:]`` to
  derive the bare tenant_token, so a future change to the envelope
  iKey prefix can't silently break tenant matching.
- Tighten the ``_killed_until`` comment to a single sentence about
  the in-memory-only limitation; drop the verbose justification
  paragraph.
- Replace ``test_kill_tokens_without_duration_header_is_ignored``
  with a parametrized test that also covers empty / non-positive /
  non-numeric ``kill-duration`` values -- the three other defensive
  early-returns in ``_record_kill_if_present``.
@DingmaomaoBJTU

Copy link
Copy Markdown
Collaborator

Code review

Found 1 issue:

  1. exporter fixture leaks a requests.Session — it uses return instead of yield + shutdown(), so the session's connection pool is never closed. The same pattern was flagged in PR feat(telemetry): Phase 1 — OneCollector library + device ID foundation #367 and fixed for test_factory.py; the new fixture here has the same problem. OneCollectorLogExporter.__init__ creates a requests.Session, and shutdown() exists specifically to close it.

https://github.com/microsoft/WinML-ModelKit/blob/dd1a4f981c143751c10d3808e7b8f515b7fec25f/tests/unit/telemetry/library/test_exporter.py#L40-L53

Fix: change to yield + teardown:

@pytest.fixture
def exporter(tmp_path):
    cache = _PersistentCache(path=tmp_path / "modelkit.cache")
    exp = OneCollectorLogExporter(
        ikey="abc-def",
        endpoint="https://example.invalid/OneCollector/1.0/",
        cache=cache,
    )
    yield exp
    exp.shutdown()

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Comment thread src/winml/modelkit/telemetry/library/exporter.py
…closed

The fixture used ``return`` instead of ``yield`` + teardown, leaking
the underlying ``requests.Session`` (and its connection pool) across
tests. Same pattern flagged in #367 and fixed there for the factory
fixture; applying it here too.
Comment thread src/winml/modelkit/telemetry/library/exporter.py Outdated
Comment thread src/winml/modelkit/telemetry/consent.py
timenick added 2 commits May 7, 2026 12:57
- export(): skip the new-batch POST if the cache-flush POST already
  triggered a kill. Returns SUCCESS (consistent with the top-of-export
  guard) so the BatchLogRecordProcessor doesn't re-queue. Adds a
  regression test for the cache-flush-then-new-batch sequence.
- exporter: drop dead ``self._ikey`` assignment. The full ikey is
  closure-captured via the constructor parameter for the x-apikey
  header, and envelopes use ``self._envelope_ikey``; ``self._ikey``
  was never read anywhere.
- consent.py: module docstring "ModelKit telemetry" -> "WinML CLI
  telemetry" to match the user-visible prompt copy.
@timenick timenick merged commit 58a6bf3 into main May 7, 2026
9 checks passed
@timenick timenick deleted the zhiwang/fix-onecollector-telemetry branch May 7, 2026 06:27
timenick added a commit that referenced this pull request May 15, 2026
Closes #635.

## Diagnosis

The OneCollector backend rejected every batch with
`{"acc":0,"efi":{"InvalidEventFormat":"all"}}`. A live probe against
the production ingest narrowed the cause to two fields in the
envelope that are **not** part of the documented CS 4.0 extension
slots — either field on its own is enough to flip the whole batch
from `acc:1` to `InvalidEventFormat: all`:

| envelope field | CS 4.0 status |

|-----------------------|--------------------------------------------------------------|
| `ext.os.release` | not documented (`ext.os` keys: `name`, `ver`,
`bootId`) |
| `ext.app.initTs` | not documented (`ext.app` keys: `name`, `ver`,
`id`, `iid`, `expId`, `userId`, `sesId`, `env`, `asId`, `locale`, `tz`)
|

`ext.app.initTs` is rejected for every value type (float, int, ISO
8601 string), so this is the **field name** being unknown to the
schema, not a type mismatch.

The previous `acc:1` validation in #456 happened to use a minimal
envelope (no `ext`), so the bad fields never reached the wire. The
existing unit tests asserted the envelope shape but never validated
it against the live backend, so the regression slipped through.

## Fix

- `_resource_to_ext()` no longer maps `os.release → ext.os.release`
  or `initTs → ext.app.initTs`.
- `_build_resource()` no longer puts those attributes in the
  Resource at all; the `_init_ts` instance attribute is dropped too
  (it was only read by the now-removed mapping).
- The "translates resource to ext" unit test is updated to assert
the trimmed envelope, plus a new
`test_export_omits_undocumented_cs40_ext_fields`
  fails fast if anyone re-introduces either mapping.

## Probe log (production ingest, masked)

```
[01 minimal Heartbeat (control)]                              status=200  body={"acc":1}
[02 Heartbeat + full ext (initTs FLOAT)]                      status=400  body={"acc":0,"efi":{"InvalidEventFormat":"all"}}
[03 Heartbeat + full ext (initTs STRING)]                     status=400  body={"acc":0,"efi":{"InvalidEventFormat":"all"}}
[07 Heartbeat + only ext.app.initTs as FLOAT]                 status=400  body={"acc":0,"efi":{"InvalidEventFormat":"all"}}
[14 ext.device only {localId, authId, deviceClass}]           status=200  body={"acc":1}
[15 ext.os only {name, ver, release}]                         status=400  body={"acc":0,"efi":{"InvalidEventFormat":"all"}}
[16 full ext WITHOUT initTs but WITH release]                 status=400  body={"acc":0,"efi":{"InvalidEventFormat":"all"}}
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants