
dash-spv: get_quorum_at_height fails for a just-retired Platform quorum (InvalidQuorum) though the pubkey is resident in quorum_statuses #800

@Claudius-Maginificent

TL;DR

dash-spv's get_quorum_at_height resolves a quorum only through the single active-window masternode list at or below the lookup height. Platform/Drive signs proofs with a signing quorum selected at a lagged height (~4.5 DKG intervals back on devnet), so by the proof's core_chain_locked_height that quorum can already have retired from Core's active set. MasternodeList::apply_diff drops a retired quorum from the list's .quorums, so the active-window lookup misses, even though the quorum's public key is still resident in the engine's insert-only quorum_statuses by-hash index, which the read path never consults. The result is Quorum not found → InvalidQuorum. It is intermittent: most proofs reference an in-window quorum and verify fine; the failure fires only at the retirement edge.

Environment

  • rust-dashcore branch fix/sml-extnetinfo-v3-decode @ 2a68c3819131b71e42df39612e6d82228bd00a82 (head of PR #797, "feat: decode SML ProTx v3 entries"; confirmed against the local checkout).
  • Dash Core 23.1.2 devnet (paloma), protocol 70240.
  • LLMQ type 107 (llmq_devnet_platform / LLMQType::LlmqtypeDevnetPlatform), signing_active_quorum_count = 4, DKG interval 24 → active window = 96 blocks.

Symptom

WARN dash_spv::client::queries: Quorum not found: type 107_Dev-Platform at list height 16596
(requested 16596) with hash 50973dd2ab53091024fc2c8e344c91d07a98281e717ac83b9885607ab6020000
(masternode list exists with 4 quorums of this type)

Downstream this surfaces as ContextProviderError(InvalidQuorum) during proof verification. It is intermittent — at the same synced SPV state, proof verification succeeds for one Drive response and fails 200 ms later for another, the only difference being the quorum hash Drive embedded (see Evidence).

Root cause

Walking the read path on 2a68c38:

  1. dash-spv/src/client/queries.rs:48-107: get_quorum_at_height is the defect site. It calls masternode_lists_around_height(height), takes the before list, then looks up ml.quorums.get(&quorum_type)?.get(&quorum_hash). On a miss it returns SpvError::QuorumLookupError (lines 71-82) and never consults quorum_statuses:

    let (before, _after) = masternode_engine_guard.masternode_lists_around_height(height);
    if let Some(ml) = before {
        match ml.quorums.get(&quorum_type) {
            Some(quorums) => match quorums.get(&quorum_hash) {
                Some(quorum) => return Ok(quorum.clone()),
                None => { /* WARN + Err(QuorumLookupError) — queries.rs:71-82 */ }
            },
            ...
        }
    }
  2. dash/src/sml/masternode_list_engine/helpers.rs:29-40: masternode_lists_around_height picks the single highest list ≤ height via self.masternode_lists.range(..=core_block_height).next_back() (lines 33-34). One list, no fallback to history.

  3. dash/src/sml/masternode_list/apply_diff.rs:70-78 removes a retired quorum from that list's .quorums the instant Core advertises it as deleted:

    // Remove deleted quorums
    for deleted_quorum in diff.deleted_quorums {
        if let Some(quorum_map) = updated_quorums.get_mut(&deleted_quorum.llmq_type) {
            quorum_map.remove(&deleted_quorum.quorum_hash);
            ...
        }
    }

    So a retired quorum is absent from every list whose known_height ≥ its retirement height — including the list ≤ the proof height.

  4. dash/src/sml/masternode_list_engine/mod.rs:265-279: quorum_statuses is, by contrast, an insert-only by-hash index of every quorum public key the engine has ever applied:

    pub quorum_statuses: BTreeMap<
        LLMQType,
        BTreeMap<
            QuorumHash,
            (BTreeSet<CoreBlockHeight>, BLSPublicKey, LLMQEntryVerificationStatus),
        >,
    >,

    It is written on every ingest path (mod.rs:627, :998, :1080, :1240, :1272, :1306, :1374) and never removed on retirement — a grep for remove/retain/prune/truncate/clear against quorum_statuses across both dash/src and dash-spv/src returns nothing.
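The asymmetry described in steps 3 and 4 can be illustrated with a toy model. This is only a sketch: ToyEngine, apply_diff, active_lookup and the array type aliases below are hypothetical stand-ins, not the rust-dashcore types or API, and the model assumes a quorum is recorded in the by-hash index at the moment it is added to a list.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins for the real hash and key types (not the rust-dashcore API).
type QuorumHash = [u8; 32];
type PubKey = [u8; 48];

/// Toy engine: one active-quorum map per list height, plus an insert-only index.
#[derive(Default)]
pub struct ToyEngine {
    /// list height -> quorums active in the list at that height
    pub lists: BTreeMap<u32, BTreeMap<QuorumHash, PubKey>>,
    /// quorum hash -> (heights it was seen at, public key); entries are never removed
    pub statuses: BTreeMap<QuorumHash, (BTreeSet<u32>, PubKey)>,
}

impl ToyEngine {
    /// Build the list at `height` from the previous list: add new quorums, drop deleted ones.
    pub fn apply_diff(
        &mut self,
        height: u32,
        new: &[(QuorumHash, PubKey)],
        deleted: &[QuorumHash],
    ) {
        let mut quorums = self
            .lists
            .range(..height)
            .next_back()
            .map(|(_, q)| q.clone())
            .unwrap_or_default();
        for (hash, key) in new {
            quorums.insert(*hash, *key);
            // Mirror into the by-hash index.
            let entry = self
                .statuses
                .entry(*hash)
                .or_insert_with(|| (BTreeSet::new(), *key));
            entry.0.insert(height);
        }
        for hash in deleted {
            // Retirement removes the quorum from the list only; `statuses` is untouched.
            quorums.remove(hash);
        }
        self.lists.insert(height, quorums);
    }

    /// Active-window lookup: consult the highest list at or below `height`, nothing else.
    pub fn active_lookup(&self, height: u32, hash: &QuorumHash) -> Option<PubKey> {
        self.lists
            .range(..=height)
            .next_back()
            .and_then(|(_, q)| q.get(hash).copied())
    }
}
```

In this model, after a quorum is added at height 100 and deleted at height 196, active_lookup(208, ..) returns None while statuses still holds the key, which is the same shape as the miss reported here.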

The miss condition: a proof carries (quorum_type = 107, quorum_hash = Q, core_chain_locked_height = H). When H ≥ mint(Q) + signing_active_quorum_count × DKG_interval (Q has retired) while Drive legitimately selected Q at the lagged selection height, the active list ≤ H no longer holds Q → get_quorum_at_height errors. Yet quorum_statuses[107][Q].1 still holds Q's BLSPublicKey — the read path simply never looks there.

Why the reference is legitimate (not a node bug)

Platform/Drive signs with the type-107 quorum selected at a lagged height, and the proof carries that quorum's hash. Verification should resolve the signing quorum by hash, regardless of whether the quorum is still in the active set at H. The verifier consumes only the 48-byte public key — height is pure context, not a membership constraint.

Live confirmation on paloma — dash-cli quorum info 107 000002b67a6085983bc87a711e28987ad0914c348e2cfc24100953abd23d9750 (the big-endian form of the logged 50973dd2…b6020000) returns a real, valid quorum:

  • height 16488, type llmq_devnet_platform, quorumIndex 0, 12 valid members
  • quorumPublicKey b1801046775dc6ca7c2b42bc3084b819ccb31712fcc4dea97d973c73261f92359c55a593d898ffa70ba617e4988b72d8
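For anyone cross-checking a logged hash against dash-cli, the conversion is a plain byte reversal of the hex string. A minimal sketch follows; reverse_hex_bytes is a hypothetical helper written for this issue, not an existing rust-dashcore function.

```rust
/// Reverse the byte order of a hex-encoded hash (two hex characters per byte).
/// Returns None if the input has odd length or contains a non-hex character.
pub fn reverse_hex_bytes(hex: &str) -> Option<String> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = String::with_capacity(hex.len());
    // Walk the byte pairs from last to first, keeping each pair's internal order.
    for pair in hex.as_bytes().chunks(2).rev() {
        out.push(pair[0] as char);
        out.push(pair[1] as char);
    }
    Some(out)
}
```

Applied to the logged 50973dd2…b6020000 value, this yields the 000002b6…d23d9750 form passed to dash-cli above.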

The active-4 at the tip and their heights:

| Quorum hash (BE, abbrev.) | Mint height |
| --- | --- |
| 0000056d… | 16536 |
| 000001da… | 16560 |
| 0000030a… | 16584 |
| 00000088… | 16608 |

Heights 16536/16560/16584/16608 → DKG interval 24. The full sequence including the retired one is …16488, 16512, 16536, 16560, 16584, 16608… With active_count = 4, Q (16488) is active in [16488, 16584) and retires at 16584. The proof references it at core_chain_locked_height = 16596, which is 12 blocks after retirement: Q is the most recently retired quorum, one position outside the active window. The selection offset 16596 − 16488 = 108 exceeds the 96-block active window, so the miss condition H ≥ mint(Q) + signing_active_quorum_count × DKG_interval holds.
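The window arithmetic can be written down as a few helper functions. This is a sketch under a simplified model, assumed here for illustration only: a quorum counts as active from its mint height, one quorum is minted per DKG interval, and (for active_mint_heights) mint heights fall on multiples of the interval, as the paloma heights do. All three function names are hypothetical.

```rust
/// First height at which a quorum minted at `mint` has left the active window.
pub fn retirement_height(mint: u32, active_count: u32, dkg_interval: u32) -> u32 {
    mint + active_count * dkg_interval
}

/// The miss condition: H >= mint(Q) + active_count * DKG_interval.
pub fn is_retired_at(mint: u32, height: u32, active_count: u32, dkg_interval: u32) -> bool {
    height >= retirement_height(mint, active_count, dkg_interval)
}

/// Mint heights of the quorums active at `height`, newest first.
/// Assumes `dkg_interval` is non-zero and mint heights are multiples of it.
pub fn active_mint_heights(height: u32, active_count: u32, dkg_interval: u32) -> Vec<u32> {
    let newest = height - height % dkg_interval;
    (0..active_count)
        .filter_map(|i| newest.checked_sub(i * dkg_interval))
        .collect()
}
```

With the paloma parameters (active_count = 4, interval = 24), retirement_height(16488, 4, 24) is 16584, and the active set at 16596 under this model is 16584, 16560, 16536, 16512, which does not include 16488.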

Evidence

  • Same-state success-then-fail, 200 ms apart (fully synced, chain-locked at 16596): proof verification successful at 10:31:16.703, then InvalidQuorum for a different embedded quorum hash at 10:31:16.905. Identical SPV state across those 200 ms rules out staleness — the discriminator is the hash, not height or list freshness.
  • 54 type-107 Quorum not found warnings in a run where the engine's window was frozen at 16596 (all at list height 16596 (requested 16596)); hundreds in another run. A non-advancing window lags every incoming proof, so the edge condition that is rare against a live engine becomes pervasive — same defect, amplified.
  • 4 distinct retired hashes recur across heights (50973dd2, 5eec7acc, 1c0b8f69, b7fc2340), rotating with height in a quorum-aging pattern.

Framing: this is a retirement-edge timing race. It is pervasive when the engine lags, rare-but-real when live — it fires only when a proof's signing quorum has just left the active window.

Deterministic reproduction (hermetic — no devnet)

The retirement asymmetry is directly observable without a live network. Mirror the existing engine_with_lists fixture pattern (dash-spv/src/sync/masternodes/manager.rs:711) and MasternodeList::empty(block_hash, block_height) (dash/src/sml/masternode_list/mod.rs:39). Parameters mirror the devnet cadence and the paloma instance: active_count = 4, interval = 24 → window 96; quorum Q minted at M = 100 (retires at 100 + 4×24 = 196); lookup at H = 208 (≥ 196; 208 − 100 = 108 > 96 — same inequality as paloma's 16488/16584/16596).

let mut engine = MasternodeListEngine::default_for_network(Network::Regtest); // mod.rs:342
let type107 = LLMQType::LlmqtypeDevnetPlatform;

// 1. A list at height 148 (Q still active) holding Q with pubkey PK.
//    Build via MasternodeList::empty(anchor_hash, 148), then insert Q into .quorums[type107].
engine.masternode_lists.insert(148, list_with_quorum(type107, Q, PK));

// 2. A post-retirement list at 208 WITHOUT Q (Q deleted on retirement),
//    .quorums[type107] populated by the then-active 4, excluding Q.
engine.masternode_lists.insert(208, list_without_quorum(type107, /* excludes Q */));

// 3. Mirror Q into the never-pruned by-hash index exactly as apply_diff does
//    (mod.rs:1240/1272/1306). Planted explicitly so the test doesn't depend on diff plumbing.
engine
    .quorum_statuses
    .entry(type107)
    .or_default()
    .insert(Q, (BTreeSet::from([148]), PK, LLMQEntryVerificationStatus::Verified));

// CURRENT behavior — proves the bug deterministically:
assert!(engine
    .masternode_lists_around_height(208).0.unwrap()       // list at 208
    .quorums.get(&type107).unwrap().get(&Q).is_none());   // Q absent from active-window list
assert!(client.get_quorum_at_height(208, type107, Q).await.is_err()); // => Quorum not found (THE BUG)
assert_eq!(engine.quorum_statuses[&type107][&Q].1, PK);   // but the pubkey IS resident by hash

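// AFTER the fix: same state, now resolves. The proposed accessor returns
// Option<(BLSPublicKey, LLMQEntryVerificationStatus)>, so compare the key component only.
assert_eq!(engine.quorum_public_key_by_hash(type107, Q).map(|(pk, _status)| pk), Some(PK)); // new by-hash accessor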

Higher-fidelity variant (optional, guards the diff plumbing): feed a base list then a sequence of MnListDiffs across ≥ 5 cycles, where the cycle-M diff carries Q in new_quorums and the cycle-196 diff carries Q in deleted_quorums. Real binary MnListDiff fixtures already live at dash/tests/data/test_DML_diffs/*.bin and back the existing apply_diff tests; a maintainer can synthesize an analogous devnet fixture.

Suggested fix

  1. dash: add MasternodeListEngine::quorum_public_key_by_hash(&self, llmq_type, quorum_hash) -> Option<(BLSPublicKey, LLMQEntryVerificationStatus)> reading quorum_statuses. (The existing test at mod.rs:1958 already reads quorum_statuses via .get(&type).and_then(|m| m.get(&hash)).map(|(_, _, status)| ...) — same access shape.)
  2. dash-spv: in get_quorum_at_height (client/queries.rs), on the active-list miss (queries.rs:71-82) fall back to the by-hash accessor instead of returning Err; keep QuorumLookupError only when both miss.

This is window-independent by construction: it keys on the hash, not on active-set membership at H, so it resolves the just-retired signing quorum that the active-window lookup cannot. The common in-window path is untouched (the active list hits first). The verifier needs only the 48-byte key, which BLSPublicKey provides directly. Effort ~S. The fallback should prefer Verified entries, or at least surface the verification status to the caller; correctness is ultimately gated by the BLS threshold-signature check against the returned pubkey, so a wrong pubkey fails the proof rather than forging one. Suggest an independent PR off dev (not stacked on #797, since the fix does not depend on the SML-v3 decode work).
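The control flow of the two-step fix could look roughly like the following. This is a self-contained sketch, not the proposed patch: resolve_quorum_key, Resolved, Status and the array type aliases are hypothetical, and plain maps stand in for the active list's .quorums and for quorum_statuses.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real hash and key types.
type QuorumHash = [u8; 32];
type PubKey = [u8; 48];

/// Simplified verification status (stand-in for LLMQEntryVerificationStatus).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Unknown,
    Verified,
}

/// Where the key was found, so callers can tell the fallback path apart.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolved {
    Active(PubKey),
    Retired(PubKey, Status),
}

#[derive(Debug, PartialEq, Eq)]
pub struct QuorumLookupError;

/// Try the active list first; on a miss, fall back to the by-hash index.
/// An error is returned only when both lookups miss.
pub fn resolve_quorum_key(
    active: &BTreeMap<QuorumHash, PubKey>,
    by_hash: &BTreeMap<QuorumHash, (PubKey, Status)>,
    hash: &QuorumHash,
) -> Result<Resolved, QuorumLookupError> {
    if let Some(key) = active.get(hash) {
        return Ok(Resolved::Active(*key));
    }
    match by_hash.get(hash) {
        Some((key, status)) => Ok(Resolved::Retired(*key, *status)),
        None => Err(QuorumLookupError),
    }
}
```

Returning the status alongside the key on the fallback path is one way to let the caller decide whether to accept a non-Verified entry; the real change may prefer a different return shape.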

Related (separate)

Downstream SDKs ban the entire DAPI pool on a single InvalidQuorum, turning one rare retirement-edge miss into a NoAvailableAddresses cascade. That is worth a separate hardening issue — out of scope here.

🤖 Co-authored by Claudius the Magnificent AI Agent
