Skip to content

fix(paths-through): use canonical resolved_path instead of naive prefix match — fixes wrong-node attribution (#1352)#1353

Merged
Kpa-clawbot merged 4 commits into
masterfrom
fix/issue-1352
May 25, 2026
Merged

fix(paths-through): use canonical resolved_path instead of naive prefix match — fixes wrong-node attribution (#1352)#1353
Kpa-clawbot merged 4 commits into
masterfrom
fix/issue-1352

Conversation

@Kpa-clawbot
Copy link
Copy Markdown
Owner

Summary

/api/nodes/{pk}/paths (paths-through-node) attributed the same transmission to every prefix-sibling when their hop bytes collided (e.g. 5 nodes with c0… on staging). Querying any of them returned the tx — visible bug per #1352 where Kpa Roof Solar's view included a packet whose actual relay was C0ffee SF.

Root cause

handleNodePaths has two branches:

  1. Canonical resolved_path branch (bug(node-detail): "Paths through this node" includes packets whose canonical resolved_path does NOT include the node — disambiguator anchor-bias inconsistency #1278) — when a tx has a persisted resolved_path, membership is decided from the stored pubkeys. This branch is correct.
  2. Fallback branch — when resolved_path is NULL/missing, the code invoked pm.resolveWithContext(hop, []string{lowerPK}, graph) to re-resolve hops. The hopContext=[lowerPK] anchors the resolver on the queried target, so the tier-2 (geo-proximity) / tier-3 (GPS+observation-count) tiers preferentially pick the target. Every paths-through-X call for any X in the sibling set then resolved the colliding hop to X and counted the tx — wrong-node attribution across the whole sibling set.

Fix

Server-side, query-time only. No DB writes (#1289 read-only invariant preserved). No canonical-branch changes — only the fallback path.

In the fallback branch, accept a biased-resolver match as evidence of target membership only when either:

  • (a) the tx is already pre-confirmed via the resolved_path index hit or SQL INSTR(resolved_path, pubkey) check, or
  • (b) the hop's prefix candidate set is unique (len(pm.m[hop]) <= 1) — no collision, no bias possible.

Multi-candidate prefix hops without independent SQL/index confirmation are now treated as ambiguous and excluded from paths-through. Same rule applied to the unresolvable-hop sub-case (when resolveHop returns nil but the prefix could match the target).

Which canonical resolved_path source is used

This PR does not introduce a new resolved_path source. It piggybacks on what's already in place:

So when canonical data exists, attribution is purely persisted-path driven; when it doesn't, attribution requires either a SQL pubkey hit or a unique prefix candidate. Biased resolution alone is no longer sufficient.

TDD — red, then green

Two new tests in cmd/server/paths_through_collision_1352_test.go:

  1. TestHandleNodePaths_PrefixCollision_1352 — canonical branch (already green via bug(node-detail): "Paths through this node" includes packets whose canonical resolved_path does NOT include the node — disambiguator anchor-bias inconsistency #1278). 3 nodes share c0, tx canonical resolved_path = [B]. Only paths-through-B includes the tx.
  2. TestHandleNodePaths_PrefixCollision_1352_FallbackBranchred before the fix. 3 GPS-having c0 siblings, NULL resolved_path. Before: A=1 B=1 C=1 (wrong-node attribution on all). After: ≤1 attribution.

Mutation: reverting the len(pm.m[hop]) <= 1 guard in routes.go restores the failing red state.

Existing tests preserved:

Acceptance criteria (from #1352)

  • On node detail for Kpa Roof Solar-shape, packet where actual relay is C0ffee SF does NOT appear in paths-through (canonical branch test).
  • On node detail for C0ffee SF-shape, that same packet DOES appear (canonical branch test).
  • Ambiguous fallback case (NULL resolved_path, multi-prefix-collision) attributes to ≤1 node (fallback test).
  • Mutation test: removing the uniqueness guard makes the fallback test fail.

Out of scope

Fixes #1352

bot added 2 commits May 24, 2026 15:53
…ttribution

Two TDD tests covering issue #1352:

1. PrefixCollision_1352 (canonical resolved_path branch): 3 nodes share
   prefix "c0", tx has canonical resolved_path naming B. Only
   paths-through-B must include the tx; A and C must exclude.
   (Already passes thanks to #1278's Option-A canonical handling.)

2. PrefixCollision_1352_FallbackBranch (NULL resolved_path): 3 GPS-having
   prefix siblings, NULL resolved_path. Old fallback uses biased resolver
   with hopContext=[targetPK] — every paths-through-X query attributes the
   tx to X. Asserts ≤1 attribution.

Currently fails on the fallback test: A=1 B=1 C=1 (all three include).
…ibution — #1352

handleNodePaths fallback branch (txs with NULL persisted resolved_path)
biased the hop resolver with hopContext=[lowerPK], anchoring tier-2 geo
and tier-3 GPS tiers on the queried target. When multiple sibling nodes
shared a hop's 1-byte prefix (e.g. 5 nodes with 'c0' on staging), every
paths-through-X query for any X in the sibling set resolved that hop to X
and counted the tx as a path through X. The same tx appeared in
paths-through for ALL collisions — wrong-node attribution.

Fix: in the fallback branch only, accept a biased-resolver match as
evidence of target membership when EITHER (a) the tx is already
pre-confirmed via the resolved_path index or SQL contains-pubkey check,
OR (b) the hop's candidate set is unique (single prefix candidate, no
collision possible). Multi-candidate prefix hops without independent
confirmation are now treated as ambiguous and excluded.

Server is read-only; this is a query-time logic change only. No DB
writes, no schema changes (#1289 invariant preserved).

The canonical resolved_path Option-A branch (#1278) is unchanged: when a
tx has a persisted canonical resolved_path, membership is decided solely
from the persisted pubkeys (no biased re-resolution).

Tests (red-then-green):
  cmd/server/paths_through_collision_1352_test.go
  - PrefixCollision_1352: canonical resolved_path naming B only
    attributes to B (was already correct via #1278)
  - PrefixCollision_1352_FallbackBranch: 3 GPS siblings, NULL
    resolved_path → previously A=1 B=1 C=1, now ≤1 attribution

Fixes #1352
@Kpa-clawbot
Copy link
Copy Markdown
Owner Author

Independent review (round 1) — adversarial

Cold read against gh pr diff only. Diff scope matches PR description (canonical branch untouched, only fallback guard + tests). Logic of the fix is sound; the must-fixes below are quality/coverage gaps, not directional disagreement.

Must-fix

  1. cmd/server/paths_through_collision_1352_test.go:84_ = strings.ToLower is an explicit silence-the-unused-import hack. strings is not used anywhere else in the file; drop the import and the line. Ships dead code into the test binary.
  2. cmd/server/paths_through_collision_1352_test.go (all db.conn.Exec calls, lines ~54-58, 64-65, 144-148, 154-155) — every INSERT discards the error. If a schema change makes an INSERT silently fail (0 rows), the fallback test still passes (includes == 0 ≤ 1) for the wrong reason. Capture err, t.Fatalf on non-nil. The fallback assertion in particular has no liveness check — an empty table satisfies "≤1 attribution."
  3. cmd/server/paths_through_collision_1352_test.go:188-192 — fallback assertion is too weak. includes > 1 accepts the degenerate case where the route returns 0 for everything (e.g. handler errors swallowed, store load returns no candidates). Add an explicit assertion that querying nodeB with a tx whose hop is unambiguously attributable returns 1 (i.e. a positive control transmission in the same test, or b > 0 || (a+b+c) == 0 with-explicit-ambiguous-expected comment). As written, the test is a one-sided lockout that can pass when the route is completely broken.
  4. cmd/server/paths_through_collision_1352_test.go — no test exercises the preconfirmed bypass. The clause preconfirmed || len(pm.m[hop]) <= 1 can have its left operand deleted and both new tests still pass (test 1 takes the canonical branch entirely; test 2 has NULL resolved_path so preconfirmed==false). The preconfirmed || half of the guard is therefore tautological — covered only by older preserved tests (Paths Through: false positives from short-prefix collisions in byPathHop index #929, bug(node-detail): "Paths through this node" includes packets whose canonical resolved_path does NOT include the node — disambiguator anchor-bias inconsistency #1278) which the author asserts are still green but does not include diff-time evidence of. Add a fallback-branch test with a c0-collision tx whose resolved_path index hit makes confirmedByFullKey==true, prove attribution still flows to the right sibling.
  5. cmd/server/routes.go:1700strings.ToLower(hop) is called twice on the same loop iteration (once implicitly through resolveHop's cache key — though that one uses raw hop — and once here). The unresolvable branch correctly hoists lowerHop := strings.ToLower(hop); do the same in the resolved branch (lines 1697-1701) for symmetry/DRY. As-is, the two branches lowercase the hop differently which invites future drift.
  6. cmd/server/routes.go:1684-1685preconfirmed := containsTarget is set once and never mutated; the name reads like a snapshot but the intent ("did SQL/index confirm before we entered the loop") is non-obvious. Either inline it (if (confirmedByFullKey[tx.ID] || confirmedBySQL[tx.ID]) || len(pm.m[lowerHop]) <= 1) or rename to sqlOrIndexConfirmed and add a one-line comment that it is intentionally NOT re-read after a loop iteration sets containsTarget=true.
  7. cmd/server/routes.go:1700, 1713len(pm.m[lowerHop]) <= 1 treats len==0 (hop prefix not in the prefix map at all) the same as len==1 (unique). In the resolved branch that's unreachable (resolved!=nil implies hop ∈ pm.m). In the unresolvable branch it means: "hop prefix is unknown to pm, target's pubkey happens to start with it → attribute." That preserves legacy behavior but is undocumented and surprising under the new guard semantics. Spell out the len==0 case in the comment, or tighten to len == 1.
  8. cmd/server/paths_through_collision_1352_test.go:50-52 — comment says "B has GPS so the OLD biased resolver could have picked B" but immediately admits the bug manifests on A and C, so the GPS-on-B detail is a red herring that distracts from what the test is locking down. Trim or rewrite the paragraph to match what is asserted.
  9. cmd/server/paths_through_collision_1352_test.go:34-65 vs 130-156 — the two tests duplicate setup verbatim (same pubkeys, same INSERT scaffolding, same server/store/router wiring). Extract a helper (setupPrefixCollisionFixture(t, withGPS bool, resolvedPath *string)) to keep the assertions front-and-center. Right now ~80 lines are copy-paste.
  10. PR body claims "Mutation: reverting the len(pm.m[hop]) <= 1 guard … restores the failing red state." No evidence the mutation was actually run for both halves of the OR (preconfirmed || …). Per AGENTS.md TDD rules, the red commit must fail on the assertion when the guard is reverted — but as item 4 notes, preconfirmed is untested, so a partial revert that keeps preconfirmed || would still pass. State clearly which clause's mutation was verified and add the missing test.

Out-of-scope

  1. Frontend "ambiguous (N candidates)" badge (PR body already declares this OOS).
  2. The biased hopContext=[lowerPK] itself is preserved (per bug(analytics): hop disambiguator called with nil context — 4-tier resolver collapses to "first GPS by DB-insertion-order" on distance/longest-hops paths #1197). The fix layers a guard on top rather than removing the bias. Replacing biased resolution with a context-free fallback is architectural — separate issue.
  3. db.conn.Exec swallowed errors are a pattern already present in older tests in this package — file a sweep issue if you want to fix them everywhere.

Verdict

NEEDS-WORK — must-fix count: 10.

@Kpa-clawbot
Copy link
Copy Markdown
Owner Author

Kent Beck Gate (round 1) — TDD + test quality

Verdict: NEEDS-WORK

TDD history: PASS (structural)


TDD history check

  • Commit 1 3fcce50 — tests only (cmd/server/paths_through_collision_1352_test.go). ✅
  • Commit 2 d99c40b — production code only (cmd/server/routes.go). ✅
  • Clean red→green split by file scope.

Caveat: GH Actions only ran on the PR head SHA (commit 2 = success). No CI artifact proves commit 1 was red on an assertion. The PR body's mutation claim (revert len(pm.m[hop]) <= 1 → fallback test fails) is plausible from the diff but not CI-verified. Not blocking — common GHA pattern — but noting.


The Six Questions

a. Mutation: revert len(pm.m[hop]) <= 1 — does fallback test go red?
Yes. With the guard removed, the biased resolver (hopContext=[lowerPK], tier-2 geo) picks the queried target every time. All three GPS siblings would attribute → includes == 3 > 1 → test fails. ✅

b. Smallest test that would have caught the original bug?
The FallbackBranch test is close to minimal: 3 colliding nodes, 1 tx, NULL resolved_path, query each, assert ≤1. Could arguably shrink to 2 siblings (collision still possible) but 3 is fine.

c. Could a WRONG implementation pass these tests? ⚠️ YES — see must-fix #1
A degenerate "always return 0 in the fallback branch" implementation would:

  • pass FallbackBranch (0 ≤ 1 ✓)
  • pass the canonical PrefixCollision_1352 test (canonical branch unchanged)

Nothing in this test file pins the positive fallback case. The weak ≤1 assertion is correct as a regression guard for over-attribution but it lets the pendulum swing all the way the other direction.

d. What edge cases are NOT tested?

  • preconfirmed branch (must-fix setInterval leaks in live.js — timers not cleared on page navigation #2): the (a) clause — confirmedByFullKey || confirmedBySQL — is the OTHER half of the new guard. No test exercises a tx that has a SQL pubkey match and a colliding prefix hop. Removing preconfirmed || from either branch of the new code wouldn't fail any test in this file.
  • Multi-hop path with mixed unique + colliding hops (e.g. ["ab","c0"] where "ab" is unique and "c0" collides).
  • 2-byte vs 4-byte hash_size variants (issue mentions 1-byte; resolver behavior may differ at other prefix lengths).
  • Observer-attributed paths (observer_idx != NULL) — both tests use NULL.
  • Unresolvable hop sub-case (resolveHop returns nil, prefix matches target) — the second guard added in the fix. No test exercises this branch.

e. Test names describe behavior or implementation?
PrefixCollision_1352 describes behavior + issue ref — fine. PrefixCollision_1352_FallbackBranch names the code branch (implementation). Behavior name would be e.g. PrefixCollision_NullResolvedPath_AmbiguousAttribution. Minor smell, not blocking.

f. Setup more complex than test? API smell?
No. 3 INSERTs + 1 tx + 1 obs is appropriate for an HTTP integration test. Setup ≈ assertion in size. No smell.


Must-fix

  1. Strengthen FallbackBranch assertion against the "always exclude" degenerate implcmd/server/paths_through_collision_1352_test.go:185. Current assertion includes > 1 is unidirectional. An implementation that drops every fallback tx unconditionally passes this test while shipping a new bug (lose ALL non-canonical attribution). Add a positive companion case in the same test file: insert a tx with NULL resolved_path and a hop whose prefix has exactly ONE candidate in pm.m (e.g. a unique 4-byte prefix). Assert that node's paths-through returns 1. This pins both endpoints of the guard.

  2. Cover the preconfirmed branchcmd/server/routes.go:1685, 1715. The fix adds preconfirmed := containsTarget and uses it in BOTH the resolved-hop guard and the unresolvable-hop guard. No test in this PR exercises a collision-prefix tx that is confirmedByFullKey or confirmedBySQL. If a future refactor drops preconfirmed || from either site, all paths-through-X queries for SQL-matched txs with colliding prefixes would silently drop to zero attribution — regression invisible to this PR's tests. Add one test: 3 colliding siblings, NULL resolved_path, but seed s.store.byPathHop[targetPK] (or rely on INSTR SQL match by writing the full pubkey into another tx's resolved_path) so confirmedByFullKey[tx.ID] or confirmedBySQL[tx.ID] is true. Assert that target attribution survives (== 1) despite the collision.


Out of scope (not blocking)

  • Test name _FallbackBranch (implementation-flavored).
  • Missing hash_size / multi-hop / observer-only edge case coverage — deserves a separate hardening issue, not this PR.
  • Lack of CI artifact for the red commit — GHA limitation, not author-correctable.

Requesting changes pending must-fix #1 and #2. Both are small additions to the same test file; no production-code touch needed.

— Kent Beck Gate (round 1)

@Kpa-clawbot
Copy link
Copy Markdown
Owner Author

MeshCore Review (round 1)

Independent expert (MeshCore protocol engineer) cold review. Verdict: NEEDS-WORK — fix is protocol-correct, but the diff has improvable items.

Protocol correctness — verified ✅

  • Uniqueness guard semantics. pm.m is keyed by lowercased prefix strings pk[:l] for l ∈ [2..8] plus full pubkey when len(pk) > 8 (store.go:4555-4569). So a 1-byte hash (2 hex chars, e.g. "c0") maps to one entry per node sharing that prefix, regardless of that node's own advert hash_size variants — because each node contributes ALL of its 2..8 prefixes. The guard len(pm.m[hop]) <= 1 therefore correctly reflects "no sibling node shares this prefix in the node cache." For 1-byte hops ("c0") this is the strictest check; for 3-byte hops ("c0ffee") it's looser by design.
  • Cross-hash-size mis-count risk. Not a real concern: nodes are deduped by full pubkey before prefix insertion (one nodeInfo per pubkey, advert hash_size doesn't shard the map). The guard counts distinct nodes, not distinct advert encodings.
  • fetchResolvedPathForTxBest ordering. Returns []*string parallel to path_json of the longest-path observation (resolved_index.go:244-289). Firmware convention path[0]=closest-to-originator is preserved end-to-end. Index alignment with hops via i < rpLen is correct.
  • ADVERT vs non-ADVERT. The fix is in the path-membership branch which iterates path_json hops only. from_node (ADVERT-only originator edge) is not consulted here. No regression.
  • Case canonicalization. confirmResolvedPathContains uses INSTR(LOWER(resolved_path), needle) with needle = "\""+ToLower(pk)+"\"" — quote-bracketed exact-token match within a JSON array string. Safe against pubkey-substring collisions (the surrounding " block partial-prefix matches). resolved_path is written lowercase upstream (fix(#1197): plumb hop-context + observation-count tiebreak to disambiguator #1198/1235); the LOWER() wrapper is belt-and-braces.
  • Observer/edge extraction paths. Untouched. Fix is scoped to handleNodePaths fallback branch only — neighbor_graph and neighbor_persist use pm.m[hop] for their own logic and are unaffected.

Must-fix (5)

  1. cmd/server/paths_through_collision_1352_test.go:106_ = strings.ToLower // silence import if unused above is dead code; the strings import is no longer used anywhere in the file. Remove both the line and the "strings" import.
  2. cmd/server/paths_through_collision_1352_test.go:189-193 (FallbackBranch) — assertion is includes > 1 ("≤1 attribution"), but with len(pm.m["c0"]) == 3 the guard deterministically excludes all three. Strengthen to if a != 0 || b != 0 || c != 0 (expected 0/0/0); the current bound also lets a future regression that flips ONE node through silently. Comment even says "After: ≤1 attribution" while the actual behavior is exactly 0 — tighten the assertion to match.
  3. cmd/server/routes.go:1698-1701 and :1713-1716 — neither new test exercises the resolved == nil (unresolvable-hop) branch's uniqueness guard. Both fallback tests cause resolveHop to return non-nil. Add a third test where the hop prefix appears in pm.m with >1 candidates but resolveHop returns nil (e.g. all candidates filtered by role/eviction) — proves the added len(pm.m[lowerHop]) <= 1 on the nil branch is gating, not dead.
  4. cmd/server/routes.go:1697 vs :1711 — inconsistent lowercasing pattern: resolved branch inlines strings.ToLower(hop) inside the pm.m[...] access while the nil branch hoists lowerHop := strings.ToLower(hop) once. Hoist lowerHop once at the top of the loop body and use it in both arms for symmetry and one less ToLower call per hop per tx.
  5. cmd/server/routes.go:1685preconfirmed := containsTarget is a frozen snapshot of pre-loop confirmation, intentionally NOT updated when containsTarget flips true mid-loop (preventing a unique-hop match from "self-pre-confirming" a downstream ambiguous hop). This is correct but non-obvious and a footgun for future edits. Add a one-line comment: // snapshot: do NOT update inside the loop — pre-confirmation must come from index/SQL, not from same-loop hop matches.

Out-of-scope

gh pr review --request-changes requested per skill rules (5 must-fix items).

@Kpa-clawbot Kpa-clawbot enabled auto-merge (squash) May 25, 2026 05:57
@Kpa-clawbot Kpa-clawbot merged commit de583f9 into master May 25, 2026
3 checks passed
@Kpa-clawbot Kpa-clawbot deleted the fix/issue-1352 branch May 25, 2026 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(paths-through-node): 1-byte prefix collision resolves arbitrarily — attributes packets to wrong node

1 participant