Steady-state consensus fork after ~11,600 blocks (~8 hours) of clean operation #15

@twobitapps

Steady-state 4-way fork after ~11,600 blocks of clean consensus

Summary

After the full determinism fix pipeline (v1.3.1–v1.3.10), the chain runs in verified 4/4 consensus for ~8 hours (~11,600 blocks), then fragments into 4 independent forks during normal operation — no restarts, no panics, no auto-updates.

Observed

  • Blocks 0–11,600: all 4 validators report identical block hashes at every sampled height (100, 300, 500, …, 11600)
  • Block 11,700: 4 distinct hashes — each validator on its own fork
  • Block 18,000+: still 4 forks, each advancing independently
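
The per-height check behind these observations can be sketched as a count of distinct hashes across validators. This is a minimal illustration, not the tooling actually used: the validator names and hash values are made-up placeholders, and in practice the hashes would be fetched from each validator's RPC.

```go
package main

import "fmt"

// distinctHashes returns how many different block hashes were reported for
// one height. 1 means full agreement; len(hashes) means every validator is
// on its own fork.
func distinctHashes(hashes map[string]string) int {
	seen := make(map[string]struct{})
	for _, h := range hashes {
		seen[h] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Placeholder data (hypothetical): validator name -> block hash.
	samples := map[uint64]map[string]string{
		11600: {"v1": "0xaa", "v2": "0xaa", "v3": "0xaa", "v4": "0xaa"},
		11700: {"v1": "0xb1", "v2": "0xb2", "v3": "0xb3", "v4": "0xb4"},
	}
	for _, height := range []uint64{11600, 11700} {
		fmt.Printf("height %d: %d distinct hash(es)\n", height, distinctHashes(samples[height]))
	}
}
```

A result of 1 corresponds to the 4/4 consensus case, and a result of 4 corresponds to the state seen at block 11,700.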

What this is NOT

  • Not a restart issue (no panics in logs, no systemd restarts, no WaitGroup crash)
  • Not a bootstrap issue (the WaitForSync settle window worked correctly on startup)
  • Not a gas-limit / timestamp / coinbase / beacon issue (all fixed in v1.3.1–v1.3.6)
  • Not an emergency-recovery / catch-up issue (all 0-cert paths removed in v1.3.8)

What this likely IS

A timing-dependent cert-gossip divergence in the Narwhal DAG that causes validators to reach different quorum-support counts for the same leader round, triggering divergent Bullshark view changes. After ~8 hours of operation, accumulated gossip-timing differences push one validator past a view-change threshold that the others have not yet reached. Once a single wave is committed differently, the chain forks permanently, because the block-production layer has no fork-choice reconciliation mechanism.
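
To make the suspected failure mode concrete, here is a toy model of two validators evaluating the same leader round against different local views. Everything in it is an assumption for illustration: the `Cert` type, the function names, and the threshold of 3 (2f+1 with n=4, f=1) are hypothetical and may not match the actual commit or view-change rule in the codebase.

```go
package main

import "fmt"

// Cert is a deliberately simplified certificate (hypothetical type): who
// authored it and whether it links to the leader cert of the round under test.
type Cert struct {
	Author         string
	SupportsLeader bool
}

// supportCount counts the certs in a validator's local view that support
// the leader.
func supportCount(view []Cert) int {
	n := 0
	for _, c := range view {
		if c.SupportsLeader {
			n++
		}
	}
	return n
}

// commitsLeader reports whether the local view reaches the support threshold.
func commitsLeader(view []Cert, threshold int) bool {
	return supportCount(view) >= threshold
}

func main() {
	// Assumed threshold: 2f+1 with n=4, f=1. The real rule may differ.
	const threshold = 3

	// Validator A has received three supporting certs. Validator B is
	// missing one that is still in flight when it evaluates the round.
	viewA := []Cert{{"v1", true}, {"v2", true}, {"v3", true}}
	viewB := []Cert{{"v1", true}, {"v2", true}}

	fmt.Println("A commits leader:", commitsLeader(viewA, threshold))
	fmt.Println("B commits leader:", commitsLeader(viewB, threshold))
}
```

If the decision is taken on whatever has arrived locally at evaluation time, rather than on a view that every validator is guaranteed to share, the two validators can reach opposite outcomes for the same round. Whether the implementation actually decides this way is exactly what the investigation needs to establish.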

Reproducing

  1. Deploy chain-v1.3.10 to a 4-validator testnet with a coordinated restart
  2. Verify 4/4 consensus at blocks 100, 500, and 1000 (should pass)
  3. Wait 8+ hours
  4. Check consensus at blocks 11,000 and 12,000; expect a fork somewhere around 11,600–12,000
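
Because the fork is permanent once it happens, agreement is monotone in height: true below the fork, false from the fork onward. Under that assumption, the first divergent height can be located with a binary search instead of checking every block. The sketch below uses a hypothetical `agree` callback standing in for a real 4/4 hash comparison, and the fork height in `main` is an invented stand-in, not a measured value.

```go
package main

import (
	"fmt"
	"sort"
)

// firstForkHeight returns the lowest height in [lo, hi] at which agree
// reports false, or hi+1 if every height in the range agrees. It assumes
// agree is true for all heights below the fork and false from the fork on.
func firstForkHeight(lo, hi int, agree func(height int) bool) int {
	n := hi - lo + 1
	i := sort.Search(n, func(i int) bool { return !agree(lo + i) })
	return lo + i
}

func main() {
	// Stand-in for a real check that compares block hashes on all four
	// validators. The height 11642 is invented for the example.
	agree := func(height int) bool { return height < 11642 }

	fmt.Println("first divergent height:", firstForkHeight(11600, 11700, agree))
}
```

With a 100-block window this needs about 7 hash comparisons, which helps when each comparison means querying four nodes.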

Investigation needed

  • Add per-round DAG state snapshot logging (cert count, support count, leader, view-change votes) on all 4 validators
  • Replay the exact cert sequence at blocks 11,500–11,700 across all four nodes
  • Identify which cert arrived out of order and which view-change fired asymmetrically
  • Determine if the trigger is wall-clock-dependent (e.g. a remaining time.Now() in the DAG layer) or purely gossip-order-dependent
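
One possible shape for the per-round snapshot and the cross-validator comparison is sketched below. The struct, its field names, and the JSON keys are hypothetical and chosen only to mirror the fields listed above. The comparison assumes both logs are ordered by round and start at the same round.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RoundSnapshot is a hypothetical per-round record of the DAG state that
// each validator would log once per round.
type RoundSnapshot struct {
	Round           uint64 `json:"round"`
	Leader          string `json:"leader"`
	CertCount       int    `json:"cert_count"`
	SupportCount    int    `json:"support_count"`
	ViewChangeVotes int    `json:"view_change_votes"`
}

// firstDivergence returns the first round at which two validators' snapshot
// logs differ. The second result is false if the logs match over their
// shared prefix.
func firstDivergence(a, b []RoundSnapshot) (uint64, bool) {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return a[i].Round, true
		}
	}
	return 0, false
}

func main() {
	// Invented example data for two validators.
	v1 := []RoundSnapshot{
		{Round: 10, Leader: "v2", CertCount: 4, SupportCount: 3},
		{Round: 11, Leader: "v3", CertCount: 4, SupportCount: 3},
	}
	v2 := []RoundSnapshot{
		{Round: 10, Leader: "v2", CertCount: 4, SupportCount: 3},
		{Round: 11, Leader: "v3", CertCount: 3, SupportCount: 2, ViewChangeVotes: 1},
	}

	line, err := json.Marshal(v2[1])
	if err != nil {
		panic(err)
	}
	fmt.Println(string(line)) // one log line per round

	if round, ok := firstDivergence(v1, v2); ok {
		fmt.Println("snapshots first differ at round", round)
	}
}
```

Emitting one JSON line per round keeps the logs easy to diff across the four nodes, and the first differing round points at the cert sequence worth replaying.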

Workaround

A coordinated restart restores consensus immediately. The payment economy and external RPC users are not affected during the fork, since they only see the chain of whichever validator rpc.a1.hyper.space points to.

Context

Full postmortem at docs/POSTMORTEM_2026-04-09_FOUR_WAY_CHAIN_FORK.md.
Releases v1.3.1–v1.3.10 documented at https://github.com/hyperspaceai/agi/releases.
