fix(telemetry): OneCollector iKey wire format and 4xx handling#456
Merged
Conversation
The exporter was sending the full instrumentation key in both the x-apikey HTTP header AND the envelope's iKey field. OneCollector expects the envelope iKey in "o:<tenant_token>" form (just the prefix before the first '-'), and rejects the request with InvalidTenantToken when the full ikey is in there. This silently 401-ed every event since telemetry shipped. - serialization._envelope_ikey() derives "o:<tenant_token>" from the full ikey; exporter calls it once at __init__ and embeds the result in the envelope. x-apikey still gets the full ikey. - exporter parses kill-tokens / kill-duration response headers and short-circuits exports for the duration of a tenant-level backoff. Envelopes during the kill window are dropped, not cached -- caching them just guarantees forever-failure on the next process startup within the window. - DEBUG log on non-2xx now includes the Collector-Error header and a 200-byte body excerpt. Without these, diagnosing tenant or wire format issues required a live probe round-trip. - Consent prompt rebrand: "ModelKit" -> "WinML CLI" in the user- visible text. _CONSENT_VERSION unchanged because scope is identical. - Tests cover the new helper, kill-tokens parsing, kill-window enforcement, log content, and a regression guard that the full ikey never appears in the envelope iKey field.
- Use ``str.removeprefix("o:")`` instead of slice index ``[2:]`` to
derive the bare tenant_token, so a future change to the envelope
iKey prefix can't silently break tenant matching.
- Tighten the ``_killed_until`` comment to a single sentence about
the in-memory-only limitation; drop the verbose justification
paragraph.
- Replace ``test_kill_tokens_without_duration_header_is_ignored``
with a parametrized test that also covers empty / non-positive /
non-numeric ``kill-duration`` values -- the three other defensive
early-returns in ``_record_kill_if_present``.
Collaborator
Code reviewFound 1 issue:
Fix: change to @pytest.fixture
def exporter(tmp_path):
cache = _PersistentCache(path=tmp_path / "modelkit.cache")
exp = OneCollectorLogExporter(
ikey="abc-def",
endpoint="https://example.invalid/OneCollector/1.0/",
cache=cache,
)
yield exp
exp.shutdown()🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
…closed The fixture used ``return`` instead of ``yield`` + teardown, leaking the underlying ``requests.Session`` (and its connection pool) across tests. Same pattern flagged in #367 and fixed there for the factory fixture; applying it here too.
- export(): skip the new-batch POST if the cache-flush POST already triggered a kill. Returns SUCCESS (consistent with the top-of-export guard) so the BatchLogRecordProcessor doesn't re-queue. Adds a regression test for the cache-flush-then-new-batch sequence. - exporter: drop dead ``self._ikey`` assignment. The full ikey is closure-captured via the constructor parameter for the x-apikey header, and envelopes use ``self._envelope_ikey``; ``self._ikey`` was never read anywhere. - consent.py: module docstring "ModelKit telemetry" -> "WinML CLI telemetry" to match the user-visible prompt copy.
DingmaomaoBJTU
approved these changes
May 7, 2026
timenick
added a commit
that referenced
this pull request
May 15, 2026
Closes #635. ## Diagnosis The OneCollector backend rejected every batch with `{"acc":0,"efi":{"InvalidEventFormat":"all"}}`. A live probe against the production ingest narrowed the cause to two fields in the envelope that are **not** part of the documented CS 4.0 extension slots — either field on its own is enough to flip the whole batch from `acc:1` to `InvalidEventFormat: all`: | envelope field | CS 4.0 status | |-----------------------|--------------------------------------------------------------| | `ext.os.release` | not documented (`ext.os` keys: `name`, `ver`, `bootId`) | | `ext.app.initTs` | not documented (`ext.app` keys: `name`, `ver`, `id`, `iid`, `expId`, `userId`, `sesId`, `env`, `asId`, `locale`, `tz`) | `ext.app.initTs` is rejected for every value type (float, int, ISO 8601 string), so this is the **field name** being unknown to the schema, not a type mismatch. The previous `acc:1` validation in #456 happened to use a minimal envelope (no `ext`), so the bad fields never reached the wire. The existing unit tests asserted the envelope shape but never validated it against the live backend, so the regression slipped through. ## Fix - `_resource_to_ext()` no longer maps `os.release → ext.os.release` or `initTs → ext.app.initTs`. - `_build_resource()` no longer puts those attributes in the Resource at all; the `_init_ts` instance attribute is dropped too (it was only read by the now-removed mapping). - The "translates resource to ext" unit test is updated to assert the trimmed envelope, plus a new `test_export_omits_undocumented_cs40_ext_fields` fails fast if anyone re-introduces either mapping. ## Probe log (production ingest, masked) ``` [01 minimal Heartbeat (control)] status=200 body={"acc":1} [02 Heartbeat + full ext (initTs FLOAT)] status=400 body={"acc":0,"efi":{"InvalidEventFormat":"all"}} [03 Heartbeat + full ext (initTs STRING)] status=400 body={"acc":0,"efi":{"InvalidEventFormat":"all"}} [07 Heartbeat + only ext.app.initTs as FLOAT] status=400 body={"acc":0,"efi":{"InvalidEventFormat":"all"}} [14 ext.device only {localId, authId, deviceClass}] status=200 body={"acc":1} [15 ext.os only {name, ver, release}] status=400 body={"acc":0,"efi":{"InvalidEventFormat":"all"}} [16 full ext WITHOUT initTs but WITH release] status=400 body={"acc":0,"efi":{"InvalidEventFormat":"all"}} ```
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three telemetry fixes that came out of debugging a persistent
InvalidTenantToken401 from OneCollector, plus a small consent-promptrebrand.
1. Wire-format bug (the actual cause of the 401)
The exporter was embedding the full instrumentation key in both the
x-apikeyHTTP header AND the envelope'siKeyfield. OneCollectorexpects the envelope
iKeyino:<tenant_token>form (just the prefixbefore the first
-); sending the full ikey there gets the requestrejected with
Collector-Error: Invalid Tenant Token. This silently401-ed every event since telemetry shipped — clients don't notice
because all telemetry errors are swallowed.
serialization._envelope_ikey()now deriveso:<tenant_token>from thefull ikey; the exporter calls it once at
__init__and embeds theresult in every envelope.
x-apikeystill carries the full ikey.2.
kill-tokensdirective ignoredWhen OneCollector decides our tenant is misbehaving (e.g. unknown to
the tenant directory), it returns
kill-tokens: o:<token>:allpluskill-duration: Ntelling the client to back off for N seconds. Wewere ignoring both, retrying every batch and every process startup —
exactly the noisy-client behavior the directive exists to suppress.
The exporter now parses both headers, records the kill window, and
short-circuits exports during it. Envelopes for the kill window are
dropped, not cached — caching them just guarantees forever-failure
on the next startup within the window.
3. DEBUG log on non-2xx was unactionable
Previously logged only the status code. Now also includes the
Collector-Errorresponse header and a 200-byte body excerpt — thetwo pieces of info OneCollector uses to communicate the actual
rejection reason. Without these, diagnosing a tenant or wire-format
issue required a live probe round-trip.
4. Consent prompt rebrand
ModelKit→WinML CLIin the user-visible consent text._CONSENT_VERSIONis unchanged because no scope changed.Validation
Verified end-to-end against the live OneCollector ingest with a real
ikey: with the wire-format fix, a single Heartbeat envelope returns
status: 200, body: {"acc":1}. New unit tests cover the helper,kill-tokens parsing, kill-window enforcement, log content, and a
regression guard that the full ikey never appears in the envelope
iKeyfield.