
chore: merge shadow branch fixes and improvements to main #703

Closed
GrapeBaBa wants to merge 24 commits into main from shadow-to-main

Conversation

@GrapeBaBa
Member

Summary

Cherry-pick all non-shadow-specific commits from the shadow branch to main. These are fixes, refactors, and features that accumulated on shadow while it was the active development branch.

Key changes:

  • xev Dynamic API for runtime io_uring/epoll detection (beneficial for all platforms)
  • Store refactor: key by AttestationData (refactor: update store to key by AttestationData #656)
  • Multiple bug fixes: checkpoint sync, subnet subscription, pre-finalized attestations, segfaults, etc.
  • Devnet3 metrics, key-manager consolidation, performance improvements
  • Documentation updates

Not included (shadow-only):

  • rust/patch/quinn-udp/ (Shadow fallback UDP)
  • shadow/ directory (removed, now in lean-quickstart)
  • Intermediate Shadow config commits

After this merges, the shadow branch can be rebased onto main with minimal diff (only quinn-udp patch).

Test plan

  • CI passes on all existing tests
  • Build succeeds on Linux and macOS

zclawz and others added 24 commits April 2, 2026 22:25
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
* docs: add LeanSpec client documentation links in zeam.md (closes #337)

* Add Lean Consensus Book link to documentation

---------

Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: chethack/Chetany <95150398+chetanyb@users.noreply.github.com>
…ted attestation (closes #657) (#659)

Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: lilhammerfun <lilhammerfun@users.noreply.github.com>
* feat: pre-generated test keys for faster CI

- Add keygen command to zeam-tools for generating XMSS key pairs
- Add test-keys submodule (anshalshuklabot/zeam-test-keys) with 32 pre-generated keys
- Update getTestKeyManager to load pre-generated keys from disk (near-instant)
- Falls back to runtime generation if test-keys submodule not initialized
- build.zig: add xmss and types imports to tools target

Each XMSS key takes ~1 min to generate at runtime. Loading 32 pre-generated
keys from SSZ files is near-instant, significantly speeding up CI and tests.
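The load-or-generate strategy described above can be sketched in Python (the file layout, file names, and function signature here are illustrative only, not the actual zeam key-manager code):

```python
from pathlib import Path

def load_or_generate_keys(keys_dir, num_validators, generate_key):
    """Load a pre-generated key from disk where one exists; otherwise fall
    back to (slow) runtime generation for that validator index."""
    keys = []
    for i in range(num_validators):
        path = Path(keys_dir) / f"validator_{i}.ssz"  # hypothetical layout
        if path.exists():
            keys.append(path.read_bytes())    # near-instant disk load
        else:
            keys.append(generate_key(i))      # fallback: ~1 min per XMSS key
    return keys
```

With 32 keys on disk and 40 validators requested, the first 32 come from disk and the remaining 8 are generated at runtime, matching the behaviour described in the follow-up fix.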

* fix: improve pre-generated key loading

- Search upward for test-keys/ directory (handles test runners in subdirs)
- Load pre-generated keys for first 32 validators, generate rest at runtime
- If 40 validators needed: loads 32 from disk + generates 8 at runtime
- Fixes CI failure where keys weren't found

* ci: add submodules: recursive to all checkout steps

Ensures test-keys submodule is available in every CI job.
Fixes CACHE MISS in Dummy prove jobs where pre-generated
keys weren't found.

* fix: add Rust build dependency to tools target

tools now imports @zeam/xmss which requires libhashsig_glue.a.
Without this dependency, the tools binary and tests fail to link
on CI where Rust libs aren't pre-built in the cache.

* fix: use addRustGlueLib for tools target to link Rust static libs

The tools executable was missing proper Rust library linking (libhashsig_glue.a,
libmultisig_glue.a, liblibp2p_glue.a) and platform-specific frameworks (macOS).
Replace manual step.dependOn with addRustGlueLib() which handles object files,
linkLibC, linkSystemLibrary(unwind), and macOS frameworks.

* fix: add Rust build step dependency for tools target

addRustGlueLib only adds object files and link flags but not the step
dependency that ensures Rust libs are built first. Both dependOn AND
addRustGlueLib are needed, matching how all other targets are wired.

* fix: resolve test-keys path via build.zig absolute path

The relative path search for test-keys/hash-sig-keys failed because Zig
test binaries run from cache directories, not the repo root. Fix by
injecting the absolute repo path as a build option (test_keys_path) from
build.zig using pathFromRoot(), and add build_options import to
key-manager module.

All 5 build commands verified locally:
- zig build all
- zig build test
- zig build simtest
- zig build spectest:generate
- zig build spectest

* fix: handle partial preload failures and fix keypair memory leak

Two fixes:
1. Track actually_loaded count instead of assuming all num_preloaded keys
   were inserted. On partial failure (corrupt SSZ, addKeypair error), the
   fallback loop now starts from the correct index, so no validators are
   left without keys.

2. Replace boolean owns_keypairs with num_owned_keypairs counter. Pre-
   generated keys (index < num_owned_keypairs) are freed on deinit while
   cached runtime keys (index >= num_owned_keypairs) are skipped. This
   prevents leaking ~20MB private keys when the mixed ownership path is
   taken.
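The partial-failure handling in fix 1 can be sketched in Python (function and parameter names are hypothetical): the fallback loop resumes from the count of keys actually inserted, not from the assumed preload count.

```python
def preload_keys(num_validators, num_preloaded, load_from_disk, generate):
    """Track how many keys were actually loaded; on partial failure the
    runtime-generation fallback resumes from the correct index, so no
    validator is left without a key."""
    keys = {}
    actually_loaded = 0
    for i in range(min(num_preloaded, num_validators)):
        key = load_from_disk(i)   # may fail (corrupt SSZ, insert error, ...)
        if key is None:
            break
        keys[i] = key
        actually_loaded += 1
    # Resume from actually_loaded, not num_preloaded
    for i in range(actually_loaded, num_validators):
        keys[i] = generate(i)
    return keys
```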

* fix: track per-key ownership to prevent leaks in CLI and mixed paths

Replace index-threshold ownership check with a per-key owned_keys hashmap.
addKeypair() marks keys as owned (freed on deinit), addCachedKeypair()
does not (for borrowed/cached runtime keys).

This fixes the regression where CLI-loaded keys (arbitrary validator
indices) were never freed because num_owned_keypairs stayed 0. Now any
caller using addKeypair() gets correct cleanup regardless of index order.
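The per-key ownership model can be sketched in Python (method names mirror the commit message but the class shape is illustrative, not the Zig API):

```python
class KeyManager:
    """Per-key ownership tracking: keys added via add_keypair are freed on
    deinit; cached/borrowed keys added via add_cached_keypair are not."""
    def __init__(self):
        self.keypairs = {}   # validator_index -> keypair
        self.owned = set()   # indices whose keypairs this manager must free

    def add_keypair(self, index, keypair):
        self.keypairs[index] = keypair
        self.owned.add(index)            # owned: freed on deinit

    def add_cached_keypair(self, index, keypair):
        self.keypairs[index] = keypair   # borrowed: caller keeps ownership

    def deinit(self, free):
        for index, keypair in self.keypairs.items():
            if index in self.owned:
                free(keypair)
```

Unlike the earlier index-threshold scheme, this frees CLI-loaded keys correctly at arbitrary validator indices, since ownership is recorded per key rather than inferred from insertion order.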

* chore: remove plans folder

* style: zig fmt

* chore: update test-keys submodule URL to blockblaz org

---------

Co-authored-by: anshalshuklabot <anshalshuklabot@users.noreply.github.com>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
* fix: consolidate key loading logic into key-manager (closes #661)

Extract a shared loadKeypairFromFiles() helper in the key-manager package
and use it across all callers:

- pkgs/key-manager/src/lib.zig: add public loadKeypairFromFiles(allocator,
  sk_path, pk_path) and expose MAX_SK_SIZE / MAX_PK_SIZE; refactor
  loadPreGeneratedKey to delegate to it
- pkgs/cli/src/node.zig: replace duplicated open/read/fromSsz sequence in
  loadHashSigKeys() with key_manager_lib.loadKeypairFromFiles(); remove
  now-unused xmss and constants imports
- pkgs/cli/src/main.zig: replace duplicated open/read/fromSsz sequence in
  the testsig command with key_manager_lib.loadKeypairFromFiles(); hoist
  key_manager_lib import to file scope and remove inner-scope duplicate

* fix: preserve distinct HashSigSecretKeyMissing/HashSigPublicKeyMissing errors in loadKeypairFromFiles

---------

Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@blockblaz.io>
…671)

* fix: skip STF re-processing for blocks already known to fork choice (closes #669)

* fix: also skip STF re-processing in processCachedDescendants if block already known to fork choice

---------

Co-authored-by: zclawz <zclawz@openclaw.ai>
…Topic format signature (closes #594) (#673)

* fix: replace {any} with {f} for types with format methods, fix GossipTopic format signature (closes #594)

* fix: replace all {any} with {f} for types with custom format methods (closes #594)

---------

Co-authored-by: zclawz <zclawz@blockblaz.io>
 #675) (#676)

* fix: replace {any} with {f} for types with format methods, fix GossipTopic format signature (closes #594)

* fix: replace all {any} with {f} for types with custom format methods (closes #594)

* fix: properly free BeamState on error paths to prevent segfaults (closes #675)

Three memory-safety fixes in chain.zig:

1. onBlock — computedstate block: add errdefer for cpost_state
   - After allocator.create(): errdefer destroy (covers sszClone failure)
   - After sszClone(): errdefer deinit (covers verifySignatures / apply_transition
     failures, e.g. InvalidPostState). LIFO ordering ensures deinit runs before
     destroy, so interior ArrayList fields are freed cleanly.
   - Previously: a partially-mutated cloned state (historical_block_hashes already
     appended by process_slots) was silently leaked, causing UB / segfault.

2. onBlock — function level: add errdefer for post_state when we own it
   - If computedstate succeeds but a later step (forkChoice.onBlock, updateHead,
     InvalidSignatureGroups) returns an error, post_state was previously leaked.

3. BeamChain init — cloned_anchor_state: add errdefer before states.put
   - Same create+sszClone pattern without cleanup on states.put failure.

---------

Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
#678)

* fix: wire attestation-committee-count CLI flag through to ChainOptions

NodeCommand was missing the @"attestation-committee-count" field, so
simargs silently ignored the flag and the value always fell through to
the hardcoded default of 1 in ChainConfig.init.  With every node using
committee_count=1, all validators compute validator_id % 1 = 0 and end
up subscribed to the same single subnet.

Changes:
- Add @"attestation-committee-count": ?u64 = null to NodeCommand so
  simargs parses and exposes the flag.
- Add attestation_committee_count: ?u64 = null to NodeOptions.
- Add attestationCommitteeCountFromYAML() helper that reads
  ATTESTATION_COMMITTEE_COUNT from config.yaml as a fallback.
- In buildStartOptions: CLI flag takes precedence; falls back to
  config.yaml value; leaves null if neither is present (ChainConfig
  still defaults to 1 in that case for backward compat).
- In Node.init: apply options.attestation_committee_count to
  chain_options before constructing ChainConfig so the correct value
  reaches EthLibp2p.attestation_committee_count and the subnet
  assignment logic.
- Add ATTESTATION_COMMITTEE_COUNT: 4 to test fixture config.yaml.
- Add two unit tests for attestationCommitteeCountFromYAML.

* fix: validate attestation-committee-count >= 1, default to 1 with warning
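The precedence and validation rules above (CLI flag wins, config.yaml is the fallback, values below 1 are clamped to 1 with a warning) can be sketched in Python; the function name and warn hook are illustrative:

```python
def resolve_committee_count(cli_value, yaml_config, warn):
    """CLI flag takes precedence over ATTESTATION_COMMITTEE_COUNT in
    config.yaml; None means neither was set (downstream config still
    defaults to 1). Values < 1 are rejected with a warning."""
    value = cli_value if cli_value is not None \
        else yaml_config.get("ATTESTATION_COMMITTEE_COUNT")
    if value is not None and value < 1:
        warn(f"attestation-committee-count must be >= 1, got {value}; using 1")
        return 1
    return value
```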

* build: add minimum_zig_version = "0.15.0" to build.zig.zon

---------

Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@openclaw.ai>
#663) (#666)

Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
* forkchoice: align /lean/v0/fork_choice response with leanSpec

- head: was {"slot", "root"}, now "0x..." flat root string
- safe_target: was {"root": "0x..."}, now "0x..." flat root string
- nodes: add proposer_index field (was missing)

Achieved by:
- Adding proposer_index to ProtoBlock and ProtoNode
- Wiring proposer_index through all ProtoBlock/ProtoNode construction sites
- Updating tree_visualizer.buildForkChoiceJSON serialization

* forkchoice: add buildForkChoiceJSON test and clarifying comments
* fix: improve state recovery logging and add genesis time validation (#481)

- Add explicit success log when state is loaded from DB on restart
- Validate genesis time of loaded state matches chain config; fall back
  to genesis on mismatch with a clear warning
- Improve DB recovery logging: log block root and slot at each step so
  failures are easier to diagnose
- Root cause: lean-quickstart spin-node.sh unconditionally wiped the data
  directory on every restart (fixed separately in lean-quickstart)

* chore: update lean-quickstart submodule to fix data dir wipe on restart

* fix: wipe stale database on genesis time mismatch

When the DB contains state from a different genesis (genesis_time mismatch),
close and delete the RocksDB directory before reopening so the node starts
with a fresh DB instance rather than accumulating stale data.

Requested by @g11tech in #637 (comment)

* fix: add post-wipe genesis time re-check, log and return error if still mismatches

* refactor: remove redundant post-wipe genesis check

Per g11tech's review: the downstream loadLatestFinalizedState call
will handle any inconsistency if the wipe somehow failed. No need
to re-probe immediately after wiping.

* fix: resolve CI failure - apply zig fmt to node.zig (remove trailing newline)

* fix: error out if db wipe fails on genesis time mismatch

* refactor the anchor setup on startup

* fix: wipe db even when no local finalized state found

Per @anshalshukla review: if loadLatestFinalizedState fails (no
finalized state in db), we should still wipe the db for a clean
slate rather than risk leftover data. NotFound errors are ignored
since the db directory may not exist yet on first run.

* build: switch production builds from ReleaseFast to ReleaseSafe

ReleaseFast compiles unreachable statements and failed
bounds/overflow checks to @trap() — a silent illegal instruction
with no output. ReleaseSafe keeps safety checks active (bounds,
overflow, unreachable → panics with a stack trace) while still
applying most optimizations.

This makes crash sites visible in production and CI instead of
silently killing the process, which was the root cause of the
hard-to-diagnose crashes fixed in PR #681.

Affected: Dockerfile, ci.yml (SSE integration build), auto-release.yml
(x86_64 and aarch64 release binaries).

The ReleaseFast in build.zig is for the risc0/zkvm targets and
is left unchanged.

* fix: change ReleaseFast to ReleaseSafe in zkvm build step

* fix: revert zkvm optimize to ReleaseFast (ReleaseSafe breaks riscv32 inline asm)

---------

Co-authored-by: anshalshuklabot <anshalshuklabot@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zeam-bot <zeam-bot@openclaw>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: harkamal <gajinder@zeam.in>
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@blockblaz.com>
* devnet 3 metrics

* move aggregated payloads metrics outside mutex scope

* address review comment
…ogic (closes #682) (#684)

The db.deinit() + deleteTreeAbsolute + Db.open() pattern was duplicated
in two branches of Node.init:

  1. Genesis time mismatch (known stale db) — FileNotFound is an error
  2. No finalized state found (fresh install) — FileNotFound is ignored

Extracted into wipeAndReopenDb(db, allocator, database_path,
logger_config, logger, ignore_not_found) which consolidates the
two paths and makes the FileNotFound distinction explicit via the
bool parameter.

Co-authored-by: zclawz <zclawz@blockblaz.io>
…85 (#687)

Co-authored-by: ZclawZ <zclawz@blockblaz.io>
…ync (#681)

Three bugs caused the node to crash silently (Zig ReleaseFast trap) when
processing a burst of cached blocks while syncing from a checkpoint:

1. clock: `_ = r catch unreachable` triggered a trap on the expected
   `error.Canceled` that xev emits when a timer fires after being
   re-armed. The error is now handled explicitly: `Canceled` is silently
   ignored, any other value panics with a message.

2. chain: `processFinalizationAdvancement` assumed `finalized_roots` is
   non-empty before performing `finalized_roots[1..finalized_roots.len]`
   slice arithmetic. If the fork choice had already been rebased past the
   requested root `getCanonicalViewAndAnalysis` can return an empty slice,
   causing an out-of-bounds panic. The function now returns early with a
   warning instead. The `usize` subtraction for the orphan-count log line
   was also guarded against underflow.

3. node: `processCachedDescendants` did not call `chain.onBlockFollowup`
   after successfully integrating a cached block, unlike the sibling path
   `processBlockByRootChunk`. This meant head/justification/finalization
   events were not emitted and `last_emitted_finalized` was not advanced
   as the node processed a chain of cached blocks, leading to a large
   deferred finalization jump that triggered the panics above.
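Fix 2's guards (early return on an empty root list before the `[1..]` slice, and an underflow-safe subtraction for the orphan-count log) can be sketched in Python; names are hypothetical:

```python
def newly_finalized_roots(finalized_roots, warn):
    """Guard the slice arithmetic: if fork choice was already rebased past
    the requested root, the root list can be empty; return early with a
    warning instead of indexing out of bounds."""
    if not finalized_roots:
        warn("empty finalized roots; fork choice already rebased past root")
        return None
    return finalized_roots[1:]   # safe: list is non-empty

def orphan_count(total_nodes, retained_nodes):
    """Unsigned-subtraction guard for the orphan-count log line."""
    return total_nodes - retained_nodes if total_nodes >= retained_nodes else 0
```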
* fix: subnet subscription for attestations (closes #683)

- Add --subnet-ids CLI flag (comma-separated) to explicitly configure
  which attestation subnets to subscribe and aggregate
- When --subnet-ids is provided, subscribe only to those subnets at
  the libp2p gossip level (overrides auto-computation from validator ids)
- Gate forkchoice import (onGossipAttestation) behind is_aggregator flag:
  gossip attestations are only imported if --is-aggregator is set
- When --subnet-ids is also set, further filter imports to the specified
  subnets only; aggregator without --subnet-ids imports from all subscribed subnets
- Proposer attestations from blocks continue to be imported unconditionally
  (they go through the block processing path, not onGossipAttestation)
- Propagate subnet_ids through NodeOpts -> BeamNode -> ChainOpts -> BeamChain

* fix: rename subnet-ids to aggregate-subnet-ids, subscribe to all subnets

- Rename CLI flag --subnet-ids to --aggregate-subnet-ids per review
- Subscribe to both explicit aggregate-subnet-ids AND validator-derived
  subnets (not exclusive else-if); deduplicate with a seen set
- Fall back to subnet 0 only if no subnets from either source
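The union-with-dedup logic described above can be sketched in Python (the subnet derivation `validator_id % committee_count` follows the earlier commit in this PR; the function shape is illustrative):

```python
def subscription_subnets(explicit_subnet_ids, validator_ids, committee_count):
    """Union of explicitly configured subnets and validator-derived subnets,
    deduplicated with a seen set; fall back to subnet 0 only if both
    sources are empty."""
    derived = [v % committee_count for v in validator_ids]
    seen = set()
    subnets = []
    for sid in list(explicit_subnet_ids) + derived:
        if sid not in seen:
            seen.add(sid)
            subnets.append(sid)
    return subnets or [0]
```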

* fix: use ReleaseFast for risc0ospkg host tool to fix docker-build CI

* fix: validate is-aggregator before aggregate-subnet-ids; scope no-filter import to validator subnet

* fix: gate attestation p2p subscription on is_aggregator

Non-aggregator nodes never import gossip attestations (chain.zig
guards import behind is_aggregator_enabled). Subscribing to attestation
subnets at the libp2p layer when the node will never consume them
wastes network bandwidth — the messages are received, validated, and
then immediately discarded.

Move the is_aggregator guard up to the p2p subscription step so
non-aggregators never join attestation subnet topics at all.

Addresses review comment #685 (chain.zig line 790).

* fix: move subnet filtering to p2p subscription layer, not node import level

Per @anshalshukla review: when no explicit --aggregate-subnet-ids filter is set,
trust the p2p subscription (node.zig already subscribes only to validator-derived
subnets at libp2p level). Removing the redundant computeSubnetId loop in chain.zig
that was rejecting at the node level instead of preventing bandwidth use at the
network layer.

* fix: rename aggregate-subnet-ids to import-subnet-ids, subscribe/import regardless of aggregator flag

- Rename --aggregate-subnet-ids CLI flag to --import-subnet-ids
- Rename aggregation_subnet_ids field to import_subnet_ids in chain.zig
- Update help text: option adds to (not overrides) automatic validator subnet computation
- Subscribe to import_subnet_ids at p2p level regardless of is-aggregator flag
  (proposer nodes also need these attestations to include in blocks)
- Import gossip attestations for import_subnet_ids regardless of is_aggregator_enabled
- Validator-derived subnets and aggregator fallback still gated behind is_aggregator_enabled

Addresses review feedback from @g11tech on PR #685

* correct the implementation

* fix: remove stale subnet_ids from ChainOpts init, fix subscribe_subnet_ids typo

- Remove .subnet_ids from ChainOpts initializer in node.zig (field removed by g11tech)
- Fix typo: self.subscribe_subnet_ids -> self.subscription_subnet_ids in cli/node.zig
- Add allocator.free for subscription_subnet_ids in Node.deinit

* fix: resolve CI failure - pass empty subnet slice to net.run() in simtest call sites

* undo changing network run

* fix the subnet subscription

* fix: implement subscription_subnet_ids computation from validator_ids and aggregation_subnet_ids

* remove redundant compute of validator subnet ids

* fix the edge condition

* fix build

* fix build

---------

Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: harkamal <gajinder@zeam.in>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@zeam.dev>
…#692)

* cli: fix checkpoint sync panic on chunked HTTP responses

When the SSZ state grows beyond ~3 MB the server switches from sending
a Content-Length response to Transfer-Encoding: chunked. The previous
body-reading loop called readSliceShort which internally goes through:

  readSliceShort → readVec → defaultReadVec → contentLengthStream

contentLengthStream accesses reader.state.body_remaining_content_length
but that field is not active for chunked responses (state is 'ready'),
causing a panic:

  thread 1 panic: access of union field 'body_remaining_content_length'
  while field 'ready' is active

Replace the manual request/response loop with client.fetch() using a
std.Io.Writer.Allocating as the response_writer. fetch() calls
response.readerDecompressing() + streamRemaining() which dispatches
through chunkedStream or contentLengthStream correctly based on the
actual transfer encoding used by the server.
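To illustrate why the content-length read path cannot parse a chunked body, here is a minimal Transfer-Encoding: chunked decoder in Python (a from-scratch sketch, not related to the Zig std.http internals):

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body: each chunk is a hex size line,
    CRLF, the chunk data, CRLF; a zero-size chunk terminates the body.
    A reader that assumes a Content-Length byte count has no way to
    recover the payload from this framing."""
    out = b""
    i = 0
    while True:
        j = body.index(b"\r\n", i)
        size = int(body[i:j], 16)      # hex chunk-size line
        if size == 0:
            return out                 # terminating zero-length chunk
        out += body[j + 2 : j + 2 + size]
        i = j + 2 + size + 2           # skip chunk data and trailing CRLF
```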

* types: skip pre-finalized attestations instead of aborting block import

After checkpoint sync, incoming blocks may carry attestations whose target
slots predate the finalized anchor. IsJustifiableSlot correctly identifies
these as non-justifiable, but the `try` was propagating the error fatally,
causing the entire block import to fail.

This creates a cascading gap: block N fails → blocks N+1..M fail (missing
parent) → no epoch-boundary attestations accumulate → justified checkpoint
never advances → forkchoice stays stuck in `initing` indefinitely.

Fix: catch InvalidJustifiableSlot and treat it as `false`. The attestation
is then silently skipped via the existing !is_target_justifiable check,
exactly as all other non-viable attestations (unknown source/target/head,
stale slot, etc.) are handled. The block imports successfully, the chain
catches up, and the node exits the initing state.

Update the test that was asserting the old (buggy) error-propagation
behaviour to instead assert that process_attestations succeeds.
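The catch-and-skip behaviour can be sketched in Python (ValueError stands in for error.InvalidJustifiableSlot; names are illustrative):

```python
def process_attestations(attestations, finalized_slot, is_justifiable_slot):
    """Skip attestations whose target predates the finalized anchor instead
    of aborting the whole block import."""
    imported = []
    for att in attestations:
        try:
            ok = is_justifiable_slot(att["target_slot"], finalized_slot)
        except ValueError:   # stand-in for InvalidJustifiableSlot
            ok = False       # pre-finalized target: silently skip
        if ok:
            imported.append(att)
    return imported          # block import succeeds regardless
```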

* node: fix fc_initing deadlock after checkpoint sync

After checkpoint sync the forkchoice starts in the initing state and
waits for a first justified checkpoint before declaring itself ready.
The status-response sync handler was checking getSyncStatus() and
treating fc_initing the same as synced — doing nothing.  This created
a deadlock: the node never requested blocks from ahead peers because
it was in fc_initing, and it could never leave fc_initing because no
blocks were imported.

Fix the deadlock in two places:

1. Status-response handler: add an explicit fc_initing branch that
   requests the peer's head block when the peer is ahead of our anchor
   slot.  This mirrors the behind_peers branch but uses head_slot for
   the comparison (finalized_slot is not yet meaningful in fc_initing).

2. Periodic sync refresh: every SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8)
   slots, re-send our status to all connected peers when not synced.
   This recovers from the case where all peers were already connected
   before the fix was deployed, so no new connection event fires and
   the status-response handler would never be re-triggered.
* fix: validateAttestationData refactor + validate aggregated attestation (closes #654)

* refactor: update store - key signatures/payloads by AttestationData instead of SignatureKey (closes #636)

Implements leanEthereum/leanSpec#436:
- Replace SignatureKey(validator_id, data_root) -> StoredSignature map with
  AttestationData -> HashMap(ValidatorIndex, StoredSignature) (SignaturesMap)
- Replace SignatureKey -> AggregatedPayloadsList map with
  AttestationData -> AggregatedPayloadsList (AggregatedPayloadsMap)
- Remove attestation_data_by_root reverse-lookup map (no longer needed)
- Remove SignatureKey type (replaced by AttestationData as key)
- Update computeAggregatedSignatures Phase 1 to look up by AttestationData
- Update computeAggregatedSignatures Phase 2 greedy set-cover to use per-data
  candidate list instead of per-validator lookup
- Update pruneStaleAttestationData to iterate gossip_signatures directly and
  prune by target.slot (renamed to prunePayloadMapBySlot)
- Update all call sites: forkchoice.zig, fork_choice_runner.zig, mock.zig,
  block_signatures_testing.zig
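The shape of the refactored store can be sketched in Python, with AttestationData modeled as a hashable `(target_slot, root)` tuple (class and method names are illustrative, not the Zig API):

```python
from collections import defaultdict

class SignatureStore:
    """Outer map keyed by AttestationData, inner map keyed by validator
    index -- replacing the old flat (validator_id, data_root) key and
    making the reverse-lookup map unnecessary."""
    def __init__(self):
        self.gossip_signatures = defaultdict(dict)

    def put(self, attestation_data, validator_index, signature):
        self.gossip_signatures[attestation_data][validator_index] = signature

    def signers_for(self, attestation_data):
        return sorted(self.gossip_signatures.get(attestation_data, {}))

    def prune_by_slot(self, min_target_slot):
        # Iterate groups directly and prune stale ones by target slot
        stale = [d for d in self.gossip_signatures if d[0] < min_target_slot]
        for d in stale:
            del self.gossip_signatures[d]
```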

* fix: resolve CI failure - update tests to use AttestationData as map key after store refactor

* refactor: address PR #656 review comments

- Remove 'Replaces the old...' and 'not per-validator' comments from
  block.zig (three sites); reviewer noted these transition comments are
  not needed in the final code.

- Make GossipSignaturesInnerMap a proper struct wrapping AutoHashMap
  with init/put/get/deinit/iterator/count methods, so callers don't
  need to manage the inner map lifecycle and the initialization pattern
  (getOrPut + init if not found) stays consistent across all sites.

- Remove validator_ids parameter from storeAggregatedPayload — payloads
  are now keyed by AttestationData so the param was unused. Update all
  call sites: chain.zig (×2), forkchoice.zig test helper,
  fork_choice_runner.zig.

- Scope signatures_mutex in storeAggregatedPayload using a {} block
  with defer unlock, releasing the lock as soon as the map write
  completes. Consistent with the scoped-lock pattern used everywhere
  else in the file.

* refactor: eliminate attestation_list from computeAggregatedSignatures; remove validateAttestation wrapper

aggregateCommitteeSignaturesUnlocked / computeAggregatedSignatures:

The old flow built a flat attestation list from gossip_signatures to
pass as the first arg to computeAggregatedSignatures, which then
re-grouped it by AttestationData.  Since gossip_signatures is already
keyed by AttestationData → GossipSignaturesInnerMap this round-trip
was pure overhead.

New flow:
- computeAggregatedSignatures(validators, signatures_map, aggregated_payloads)
  drops the attestations_list parameter entirely.
- Iterates the *union* of signatures_map and aggregated_payloads keys
  so groups that only exist in aggregated_payloads (individual sigs
  pruned but stored proof available) are still processed.
- Phase 1 iterates the inner map directly (validator_id → stored_sig)
  with no intermediate grouping or root-index lookup.
- Before the Phase 2 greedy set-cover, seeds coverage from proof
  participants not already in sigmap_available, so proofs covering
  validators absent from signatures_map are still included.

aggregateCommitteeSignaturesUnlocked (forkchoice.zig):
- Drops the attestations ArrayList build loop entirely.
- Calls computeAggregatedSignatures directly on gossip_signatures.

chain.zig:
- Removes getProposalAttestations() call (was only needed as the
  first arg); aggregation now reads gossip_signatures directly.
- Removes validateAttestation() thin wrapper (review comment
  #r2955207656); callers should use validateAttestationData directly.

All call sites updated: chain.zig, mock.zig, block_signatures_testing.zig.
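The Phase 2 greedy set-cover mentioned above can be sketched in Python (a generic sketch of the technique, not the Zig implementation; proof structure is hypothetical):

```python
def greedy_cover(needed_validators, proofs):
    """Repeatedly pick the stored proof covering the most still-uncovered
    validators; stop when nothing is gained. Returns the chosen proofs and
    any validators left uncovered."""
    needed = set(needed_validators)
    chosen = []
    while needed:
        best = max(proofs,
                   key=lambda p: len(needed & set(p["participants"])),
                   default=None)
        if best is None or not (needed & set(best["participants"])):
            break   # no proof covers anything still needed
        chosen.append(best)
        needed -= set(best["participants"])
    return chosen, needed
```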

* fix: resolve CI failure - apply zig fmt to chain.zig (trailing blank line)

* improve code and simplify aggregation

* optimize block building

* add checks in block production

* refactor: move greedy proposal attestation logic to forkchoice.getProposalAttestations

Move the fixed-point greedy proof selection loop from chain.produceBlock
into forkchoice.getProposalAttestations, which now takes pre_state, slot,
proposer_index, and parent_root as inputs.

This encapsulates the attestation selection strategy inside the fork choice
module, where it belongs. chain.produceBlock now simply calls
forkChoice.getProposalAttestations and uses the returned
AggregatedAttestations + AttestationSignatures.

---------

Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: anshalshukla <shukla.anshal85@gmail.com>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Replace compile-time backend override (forked libxev) with upstream
xev's Dynamic API. On Linux, Dynamic probes io_uring first and falls
back to epoll at runtime — exactly what Shadow needs since it doesn't
support io_uring_setup. On macOS, Dynamic degenerates to static kqueue.

- Switch all `@import("xev")` to `@import("xev").Dynamic`
- Add `detectBackend()` helper with `@hasDecl` guard for single-backend
  platforms (macOS) where `detect()` doesn't exist
- Revert build.zig.zon to upstream libxev (no fork needed)
- Remove build.zig XevBackend option
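The probe-then-fall-back pattern behind the Dynamic API can be sketched in Python (a generic illustration of runtime backend detection, not the xev code; `probes` and its callables are hypothetical):

```python
def detect_backend(probes):
    """Try each candidate backend in preference order (e.g. io_uring first
    on Linux) and fall back to the next when the probe fails -- as it does
    under Shadow, which doesn't support io_uring_setup."""
    for name, probe in probes:
        try:
            probe()          # raises OSError if the kernel/API rejects it
            return name
        except OSError:
            continue
    raise RuntimeError("no supported event-loop backend")
```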
Copilot AI review requested due to automatic review settings April 2, 2026 14:29
@GrapeBaBa
Member Author

Wrong approach — included unrelated commits. Will recreate with only shadow-specific changes.

@GrapeBaBa GrapeBaBa closed this Apr 2, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR cherry-picks a large set of non-shadow-specific fixes/refactors/features from the former shadow development branch onto main, including runtime I/O backend detection, a fork-choice/store refactor around AttestationData keys, and multiple operational/CLI/metrics improvements.

Changes:

  • Refactors gossip-signature + aggregated-payload storage to be keyed by AttestationData (nested per-validator inner maps) and updates aggregation/block-production paths accordingly.
  • Switches xev usage to the Dynamic API with runtime backend detection (io_uring/epoll probing on Linux) and propagates formatting changes ({f}) across logging/fixtures.
  • Adds test key infrastructure (new test-keys submodule, key loading helpers, and a tools keygen command), plus metrics and CLI/database recovery improvements.

Reviewed changes

Copilot reviewed 40 out of 40 changed files in this pull request and generated 5 comments.

File Description
resources/zeam.md Adds links to LeanSpec client documentation.
resources/poseidon.md Documents Poseidon2 SSZ hashing build flag and design notes.
README.md Mentions Poseidon2 option and adds build troubleshooting notes.
pkgs/types/src/state.zig Changes pre-finalized attestation handling (skip vs error) + updates test.
pkgs/types/src/lib.zig Updates public exports to match new attestation/signature aggregation API.
pkgs/types/src/block.zig Introduces SignaturesMap keyed by AttestationData and AggregateInnerMap; updates proto types.
pkgs/types/src/block_signatures_testing.zig Removes legacy aggregation test suite tied to old SignatureKey design.
pkgs/tools/src/main.zig Adds keygen command to generate XMSS test validator keys + manifest.
pkgs/state-transition/src/transition.zig Updates PQ signature verification metrics and invalid/valid counters.
pkgs/state-transition/src/mock.zig Updates mock chain generation to use new signature map + inner-map aggregation.
pkgs/spectest/src/runner/fork_choice_runner.zig Updates fixtures runner formatting and aggregated-payload storage API usage.
pkgs/spectest/src/json_expect.zig Updates fixture error formatting to use {f} formatting.
pkgs/node/src/utils.zig Switches to xev.Dynamic and adds detectBackend() runtime probe.
pkgs/node/src/tree_visualizer.zig Aligns fork-choice JSON output shapes with leanSpec + adds test coverage.
pkgs/node/src/testing.zig Uses xev.Dynamic and calls detectBackend() during test context init.
pkgs/node/src/node.zig Adds subnet subscription controls, sync recovery logic, cached-block followups, and formatting updates.
pkgs/node/src/lib.zig Re-exports detectBackend from utils.
pkgs/node/src/forkchoice.zig Refactors signature/payload store to AttestationData keys; new proposal attestation selection algorithm; pruning/metrics updates.
pkgs/node/src/constants.zig Adds periodic sync status refresh interval constant.
pkgs/node/src/clock.zig Switches to xev.Dynamic and handles timer cancelation cleanly.
pkgs/node/src/chain.zig Updates block production to new proposal attestation API; improves gossip missing-root propagation; memory-safety errdefers.
pkgs/network/src/mock.zig Uses xev.Dynamic, backend detection in tests, and minor loop refactor.
pkgs/network/src/interface.zig Uses xev.Dynamic; updates GossipTopic formatting signature and log format specifiers.
pkgs/network/src/ethlibp2p.zig Uses xev.Dynamic and updates log formatting specifiers.
pkgs/metrics/src/lib.zig Renames/adds PQ signature metrics, adds fork-choice store gauges and committee aggregation histogram; adjusts buckets.
pkgs/key-manager/src/lib.zig Adds pre-generated key loading, ownership tracking, shared key loading helper, and signing metric updates.
pkgs/database/src/rocksdb.zig Improves finalized-state recovery logs and corruption diagnostics.
pkgs/cli/test/fixtures/config.yaml Adds ATTESTATION_COMMITTEE_COUNT to fixture config.
pkgs/cli/src/node.zig Adds DB wipe/reopen helper, checkpoint-vs-db anchor selection logic, subnet/committee-count options parsing, and key loading helper use.
pkgs/cli/src/main.zig Uses xev.Dynamic, calls detectBackend() at startup, and uses shared key loading helper.
pkgs/api/src/events.zig Updates proto block construction to include proposer_index.
pkgs/api/src/event_broadcaster.zig Updates proto block construction to include proposer_index.
Dockerfile Switches release build optimization to ReleaseSafe.
build.zig.zon Sets minimum Zig version to 0.15.0.
build.zig Injects test_keys_path, wires build_options into key-manager, updates tools deps, and forces ReleaseFast for riscv32 targets.
.gitmodules Adds test-keys submodule.
.github/workflows/ci.yml Ensures submodules are checked out; switches build optimization to ReleaseSafe.
.github/workflows/auto-release.yml Switches release builds to ReleaseSafe.

Comment on lines +983 to +989
```zig
var payload_it = self.latest_known_aggregated_payloads.iterator();
while (payload_it.next()) |entry| {
    if (!std.mem.eql(u8, &current_justified_root, &entry.key_ptr.source.root)) continue;
    if (!self.protoArray.indices.contains(entry.key_ptr.head.root)) continue;
    if (processed_att_data.contains(entry.key_ptr.*)) continue;
    try sorted_entries.append(self.allocator, .{ .att_data = entry.key_ptr, .payloads = entry.value_ptr });
}
```
Copilot AI (Apr 2, 2026):

getProposalAttestationsUnlocked iterates over latest_known_aggregated_payloads without taking signatures_mutex, but that map is mutated under signatures_mutex by storeAggregatedPayload / acceptNewAttestationsUnlocked / aggregateCommitteeSignaturesUnlocked. Iterating without the same lock can race with concurrent writers and lead to iterator invalidation / memory corruption. Consider either (a) taking signatures_mutex while collecting the relevant payload entries into a local list (keys + cloned proofs) and then releasing it before the STF loop, or (b) protecting payload maps with the main mutex consistently (writers + readers).
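Option (a) from the comment can be sketched as follows. This is an illustrative shape, not the repo's implementation: `cloneProofs` is a hypothetical helper for deep-copying the payload proofs, and field names follow the snippet above.

```zig
{
    // Snapshot matching entries while holding signatures_mutex;
    // defer guarantees release even if an append fails.
    self.signatures_mutex.lock();
    defer self.signatures_mutex.unlock();

    var payload_it = self.latest_known_aggregated_payloads.iterator();
    while (payload_it.next()) |entry| {
        if (!std.mem.eql(u8, &current_justified_root, &entry.key_ptr.source.root)) continue;
        if (!self.protoArray.indices.contains(entry.key_ptr.head.root)) continue;
        if (processed_att_data.contains(entry.key_ptr.*)) continue;
        // Copy the key and clone the proofs so the snapshot stays
        // valid after the lock is dropped, even if a writer later
        // mutates or rehashes the map.
        try sorted_entries.append(self.allocator, .{
            .att_data = entry.key_ptr.*,
            .payloads = try cloneProofs(self.allocator, entry.value_ptr),
        });
    }
}
// Run the expensive STF loop over sorted_entries without the lock held.
```

The trade-off is an extra allocation per entry in exchange for keeping the lock's critical section short.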
Comment on lines +81 to +87
```zig
pub fn getOrPut(self: *SignaturesMap, key: attestation.AttestationData) !InnerHashMap.GetOrPutResult {
    return self.inner.getOrPut(key);
}

pub fn getPtr(self: *const SignaturesMap, key: attestation.AttestationData) ?*InnerMap {
    return self.inner.getPtr(key);
}
```
Copilot AI (Apr 2, 2026):

SignaturesMap.getPtr takes self: *const SignaturesMap but returns a mutable ?*InnerMap, which breaks const-correctness (callers can mutate inner maps through a const reference). This should either take self: *SignaturesMap or return ?*const InnerMap (and provide a separate mutable accessor when needed).
Comment on lines +89 to +91
```zig
pub fn get(self: *const SignaturesMap, key: attestation.AttestationData) ?InnerMap {
    return self.inner.get(key);
}
```
Copilot AI (Apr 2, 2026):

SignaturesMap.get returns an InnerMap by value (copying the hash map struct). Even if callers treat it read-only, this makes it easy to accidentally mutate/deinit a copy and double-free internal allocations. Prefer exposing only pointer-based accessors (getPtr / getConstPtr) so ownership/lifetime is unambiguous.
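One way to address both accessor comments is to pair an explicitly mutable accessor with a const one and drop the by-value `get` entirely. A sketch against the `SignaturesMap` shown above (`getConstPtr` is an illustrative name, not an existing method):

```zig
// Mutable access requires a mutable map reference.
pub fn getPtr(self: *SignaturesMap, key: attestation.AttestationData) ?*InnerMap {
    return self.inner.getPtr(key);
}

// Read-only access through a const reference; the returned pointer
// cannot be used to mutate or deinit the inner map, so ownership
// stays with SignaturesMap.
pub fn getConstPtr(self: *const SignaturesMap, key: attestation.AttestationData) ?*const InnerMap {
    return self.inner.getPtr(key);
}
```

With only pointer-based accessors exposed, there is no struct copy for a caller to accidentally `deinit`, which removes the double-free hazard the reviewer flags.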
Comment on lines 772 to 785
```diff
@@ -781,9 +781,9 @@ fn processBlockStep(
         validator_ids[i] = @intCast(vi);
     }

-    ctx.fork_choice.storeAggregatedPayload(validator_ids, &aggregated_attestation.data, proof_template, true) catch |err| {
+    ctx.fork_choice.storeAggregatedPayload(&aggregated_attestation.data, proof_template, true) catch |err| {
         std.debug.print(
```
Copilot AI (Apr 2, 2026):

After storeAggregatedPayload was refactored to no longer take validator_ids, this block still allocates/fills validator_ids but never uses it. Zig will treat this as an unused local and fail compilation. Remove the allocation/loop (or use it if still needed).
Comment on lines +151 to +158
```zig
db.deinit();
const rocksdb_path = try std.fmt.allocPrint(allocator, "{s}/rocksdb", .{database_path});
defer allocator.free(rocksdb_path);
std.fs.deleteTreeAbsolute(rocksdb_path) catch |wipe_err| {
    if (!ignore_not_found or wipe_err != error.FileNotFound) {
        logger.err("failed to delete database directory '{s}': {any}", .{ rocksdb_path, wipe_err });
        return wipe_err;
    }
```
Copilot AI (Apr 2, 2026):

wipeAndReopenDb builds rocksdb_path via "{s}/rocksdb" and then calls std.fs.deleteTreeAbsolute(rocksdb_path). If database_path is a relative path (common for CLI flags/defaults), deleteTreeAbsolute will error because the path is not absolute. Use std.fs.cwd().deleteTree(rocksdb_path) for relative paths, or convert to an absolute path first (e.g., realpathAlloc) before calling deleteTreeAbsolute.
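A relative-path-safe version of the wipe, following the suggestion above (a sketch reusing the names from the snippet; assumes `allocator`, `logger`, and `database_path` as in the original):

```zig
db.deinit();
const rocksdb_path = try std.fmt.allocPrint(allocator, "{s}/rocksdb", .{database_path});
defer allocator.free(rocksdb_path);
// Dir.deleteTree resolves relative paths against the working directory
// and treats an already-absent tree as success, so it works for both
// relative and absolute database_path values without the FileNotFound
// special-casing needed around deleteTreeAbsolute.
std.fs.cwd().deleteTree(rocksdb_path) catch |wipe_err| {
    logger.err("failed to delete database directory '{s}': {any}", .{ rocksdb_path, wipe_err });
    return wipe_err;
};
```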