Commit 455611c

Defer state/block pruning until after block cascade completes (#240)
## Motivation

During the devnet4 run (2026-03-13), all three ethlambda nodes entered an **infinite re-processing loop** at slot ~15276, generating ~3.5GB of logs each and consuming 100% CPU for hours. This PR fixes the root cause by deferring heavy state/block pruning until after a block processing cascade completes, so parent states survive long enough for their children to be processed.

## Root Cause

The infinite loop is caused by **fallback pruning running inside the block processing cascade**, deleting states that pending children still need.

### The three interacting mechanisms

**1. Asymmetric retention creates a state-header gap**

When finalization stalls, fallback pruning keeps only `STATES_TO_KEEP=900` states but `BLOCKS_TO_KEEP=21600` headers. Block headers exist in the DB without their states.

**2. Chain walk reaches protected checkpoints**

When a block arrives with a missing parent, `process_or_pend_block` walks ancestor headers looking for one whose parent has state. Protected checkpoints (justified/finalized) always have state, so the walk can reach blocks thousands of slots behind head.

**3. Mid-cascade pruning deletes just-computed states**

`on_block_core` calls `update_checkpoints` after every block, which runs `prune_old_states`. States for old slots (far behind head) are immediately deleted — even if they were just computed milliseconds ago by the same cascade.

### The loop

```
┌─────────────────────────────────────────────────────────────────┐
▼                                                                 │
1. Chain walk finds block 15266 (parent=4dda, justified)          │
   → parent state exists (protected) → enqueue for processing     │
                                                                  │
2. Cascade processes 15266 → 15269 → ... → 15276                  │
   → states computed and stored                                   │
                                                                  │
3. Each on_block_core calls update_checkpoints                    │
   → fallback pruning runs → states for slots 15266-15276         │
     are IMMEDIATELY deleted (slot < head - 900)                  │
                                                                  │
4. collect_pending_children(15276) finds block 15278              │
   → process_or_pend_block(15278)                                 │
   → has_state(parent=15276) → FALSE (just pruned!)               │
   → stores as pending                                            │
                                                                  │
5. Chain walk for 15278 re-discovers 15266                        │
   → parent 4dda still has state (protected)                      │
   → enqueue 15266 ───────────────────────────────────────────────┘
```

### How it was triggered in devnet4

1. 9 validators, 7 clients. Finalization stalled at slot 15261 due to a fork at slot 15264 (qlean diverged).
2. At ~10:13:40 UTC, qlean's alternate fork blocks arrived at ethlambda via gossip.
3. The chain walk for these blocks traversed ~2000 slots back to the justified checkpoint.
4. The cascade re-processed blocks 15266→15276, but fallback pruning deleted each state immediately.
5. All three ethlambda nodes (validators 6, 7, 8) entered the loop simultaneously.

## Solution

**Defer heavy pruning (states + blocks) until after the block cascade completes.**

### Before (pruning runs per-block, mid-cascade)

```
on_block
└─ while queue:
   └─ process_or_pend_block
      └─ on_block_core
         └─ update_checkpoints
            ├─ write metadata          ← immediate
            ├─ prune_live_chain        ← immediate
            ├─ prune_gossip_signatures ← immediate
            ├─ prune_old_states        ← DELETES PARENT STATES MID-CASCADE
            └─ prune_old_blocks        ← DELETES BLOCK DATA MID-CASCADE
```

### After (pruning deferred to end of cascade)

```
on_block
├─ while queue:
│  └─ process_or_pend_block
│     └─ on_block_core
│        └─ update_checkpoints
│           ├─ write metadata          ← immediate
│           ├─ prune_live_chain        ← immediate (fork choice correctness)
│           ├─ prune_gossip_signatures ← immediate (cheap)
│           └─ (no state/block pruning)
└─ store.prune_old_data()              ← runs ONCE after cascade
```

### Split of `update_checkpoints`

| Operation | Where it runs | Why |
|-----------|--------------|-----|
| Write head/justified/finalized metadata | `update_checkpoints` (per-block) | Checkpoints must be current for fork choice |
| `prune_live_chain` | `update_checkpoints` (per-block) | Affects fork choice traversal |
| `prune_gossip_signatures` | `update_checkpoints` (per-block) | Cheap, correctness-related |
| `prune_attestation_data_by_root` | `update_checkpoints` (per-block) | Cheap, correctness-related |
| `prune_old_states` | **`prune_old_data`** (after cascade) | Heavy, causes infinite loop if mid-cascade |
| `prune_old_blocks` | **`prune_old_data`** (after cascade) | Heavy, coupled with state pruning |

### Why this fixes the loop

With deferred pruning, the devnet4 scenario plays out safely:

1. Cascade processes 15266 → 15269 → ... → 15276 → **states are KEPT** (no pruning mid-cascade)
2. `collect_pending_children(15276)` finds 15278 → `has_state(parent=15276)` → **TRUE** (state still exists)
3. 15278 processes successfully, cascade continues through children
4. Queue empties, `while` loop ends
5. `prune_old_data()` runs once — deletes old states
6. Cascade is already done — no one re-triggers it

### Cross-client validation

We surveyed how other lean consensus clients handle this (Lighthouse, Zeam, Ream, Qlean, Lantern, Grandine). **None of them prune states mid-cascade.** Common patterns:

- **Zeam**: Canonicality-based pruning, only after finalization or after long stalls (14,400 slots). Never during block processing.
- **Ream**: Prunes one state per tick (not during block import).
- **Grandine**: Never prunes states (in-memory forever).
- **Lighthouse**: Background migrator thread, decoupled from block import.

## Changes

- **`crates/storage/src/store.rs`**: Split `update_checkpoints` — extract `prune_old_states`/`prune_old_blocks` into a new `prune_old_data()` method. Lightweight pruning (live chain, signatures, attestation data) stays in `update_checkpoints`.
- **`crates/blockchain/src/lib.rs`**: Call `store.prune_old_data()` once after the `on_block` while loop completes.
- **Tests**: Updated `fallback_pruning_*` tests to call `prune_old_data()` explicitly.

## How to Test

1. `make test` — all 125 tests pass, including 27 fork choice spec tests.
2. Deploy to a devnet with a multi-client setup where finalization stalls and alternate fork blocks arrive.
3. Verify ethlambda nodes do not enter re-processing loops (no repeated "Block imported successfully" for the same slot in the logs).
4. Monitor memory during long finalization stalls — temporary state accumulation during cascades is bounded by cascade size.
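The failure mode and the fix can be reproduced with a toy model. The sketch below is illustrative only: slots stand in for blocks, a `HashSet` of slots stands in for stored states, and the chain walk is modeled as re-enqueueing the ancestor chain from the protected checkpoint. None of these are the real ethlambda types, and `KEEP` merely plays the role of `STATES_TO_KEEP`.

```rust
use std::collections::{HashSet, VecDeque};

// Toy retention window standing in for STATES_TO_KEEP (not the real value).
const KEEP: u64 = 3;

/// Process the ancestor chain (checkpoint, head] as a cascade. A slot can be
/// processed only if its parent slot's state is present. Returns the number of
/// loop iterations needed to drain the queue, or None if the iteration cap is
/// hit (i.e. the cascade never terminates).
fn cascade(checkpoint: u64, head: u64, prune_mid_cascade: bool) -> Option<u32> {
    // Only the protected checkpoint has state initially.
    let mut states: HashSet<u64> = HashSet::from([checkpoint]);
    // Chain walk: enqueue every descendant up to head.
    let mut queue: VecDeque<u64> = (checkpoint + 1..=head).collect();
    for iterations in 1..=10_000u32 {
        let Some(slot) = queue.pop_front() else {
            return Some(iterations); // queue drained: cascade finished
        };
        if states.contains(&(slot - 1)) {
            states.insert(slot); // state computed and stored
            if prune_mid_cascade {
                // Fallback pruning per block: keep only states within KEEP of
                // head, plus the protected checkpoint. This deletes the state
                // we just computed whenever the slot is far behind head.
                states.retain(|s| *s + KEEP > head || *s == checkpoint);
            }
        } else {
            // Parent state missing: the chain walk re-discovers the ancestor
            // chain from the protected checkpoint and enqueues it again.
            queue.extend(checkpoint + 1..=slot);
        }
    }
    None // cap hit: infinite re-processing loop
}
```

With `prune_mid_cascade = true` the queue never drains, because every state behind the retention window is deleted the moment it is computed; with pruning deferred, the same cascade finishes, after which a single `prune_old_data`-style pass is safe.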
1 parent 50527b2 commit 455611c

2 files changed: +34 −34

crates/blockchain/src/lib.rs

Lines changed: 5 additions & 0 deletions

```diff
@@ -299,6 +299,11 @@ impl BlockChainServer {
         while let Some(block) = queue.pop_front() {
             self.process_or_pend_block(block, &mut queue);
         }
+
+        // Prune old states and blocks AFTER the entire cascade completes.
+        // Running this mid-cascade would delete states that pending children
+        // still need, causing re-processing loops when fallback pruning is active.
+        self.store.prune_old_data();
     }
 
     /// Try to process a single block. If its parent state is missing, store it
```

crates/storage/src/store.rs

Lines changed: 29 additions & 34 deletions

```diff
@@ -470,53 +470,41 @@ impl Store {
         batch.put_batch(Table::Metadata, entries).expect("put");
         batch.commit().expect("commit");
 
-        // Prune after successful checkpoint update
+        // Lightweight pruning that should happen immediately on finalization advance:
+        // live chain index, signatures, and attestation data. These are cheap and
+        // affect fork choice correctness (live chain) or attestation processing.
+        // Heavy state/block pruning is deferred to prune_old_data().
         if let Some(finalized) = checkpoints.finalized
             && finalized.slot > old_finalized_slot
         {
             let pruned_chain = self.prune_live_chain(finalized.slot);
-
-            // Prune signatures and attestation data for finalized slots
             let pruned_sigs = self.prune_gossip_signatures(finalized.slot);
             let pruned_att_data = self.prune_attestation_data_by_root(finalized.slot);
-            // Prune old states before blocks: state pruning uses headers for slot lookup
-            let protected_roots = [finalized.root, self.latest_justified().root];
-            let pruned_states = self.prune_old_states(&protected_roots);
-            let pruned_blocks = self.prune_old_blocks(&protected_roots);
-
-            if pruned_chain > 0
-                || pruned_sigs > 0
-                || pruned_att_data > 0
-                || pruned_states > 0
-                || pruned_blocks > 0
-            {
+
+            if pruned_chain > 0 || pruned_sigs > 0 || pruned_att_data > 0 {
                 info!(
                     finalized_slot = finalized.slot,
-                    pruned_chain,
-                    pruned_sigs,
-                    pruned_att_data,
-                    pruned_states,
-                    pruned_blocks,
-                    "Pruned finalized data"
-                );
-            }
-        } else {
-            // Fallback pruning when finalization is stalled.
-            // When finalization doesn't advance, the normal pruning path above never
-            // triggers. Prune old states and blocks on every head update to keep
-            // storage bounded. The prune methods are no-ops when within retention limits.
-            let protected_roots = [self.latest_finalized().root, self.latest_justified().root];
-            let pruned_states = self.prune_old_states(&protected_roots);
-            let pruned_blocks = self.prune_old_blocks(&protected_roots);
-            if pruned_states > 0 || pruned_blocks > 0 {
-                info!(
-                    pruned_states,
-                    pruned_blocks, "Fallback pruning (finalization stalled)"
+                    pruned_chain, pruned_sigs, pruned_att_data, "Pruned finalized data"
                 );
             }
         }
     }
 
+    /// Prune old states and blocks to keep storage bounded.
+    ///
+    /// This is separated from `update_checkpoints` so callers can defer heavy
+    /// pruning until after a batch of blocks has been fully processed. Running
+    /// this mid-cascade would delete states that pending children still need,
+    /// causing infinite re-processing loops when fallback pruning is active.
+    pub fn prune_old_data(&mut self) {
+        let protected_roots = [self.latest_finalized().root, self.latest_justified().root];
+        let pruned_states = self.prune_old_states(&protected_roots);
+        let pruned_blocks = self.prune_old_blocks(&protected_roots);
+        if pruned_states > 0 || pruned_blocks > 0 {
+            info!(pruned_states, pruned_blocks, "Pruned old states and blocks");
+        }
+    }
+
     // ============ Blocks ============
 
     /// Get block data for fork choice: root -> (slot, parent_root).
@@ -1486,6 +1474,12 @@ mod tests {
         let head_root = root(total_states as u64 - 1);
         store.update_checkpoints(ForkCheckpoints::head_only(head_root));
 
+        // update_checkpoints no longer prunes states/blocks inline — the caller
+        // must invoke prune_old_data() separately (after a block cascade completes).
+        assert_eq!(count_entries(backend.as_ref(), Table::States), total_states);
+
+        store.prune_old_data();
+
         // 3005 headers total. Top 3000 by slot are kept in the retention window,
         // leaving 5 candidates. 2 are protected (finalized + justified),
         // so 3 are pruned → 3005 - 3 = 3002 states remaining.
@@ -1530,6 +1524,7 @@ mod tests {
         // Use the last inserted root as head
         let head_root = root(STATES_TO_KEEP as u64 - 1);
         store.update_checkpoints(ForkCheckpoints::head_only(head_root));
+        store.prune_old_data();
 
         // Nothing should be pruned (within retention window)
         assert_eq!(
```