chore: merge shadow branch fixes and improvements to main #703
Conversation
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
* docs: add LeanSpec client documentation links in zeam.md (closes #337)
* Add Lean Consensus Book link to documentation

---------

Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: chethack/Chetany <95150398+chetanyb@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: lilhammerfun <lilhammerfun@users.noreply.github.com>
* feat: pre-generated test keys for faster CI
  - Add keygen command to zeam-tools for generating XMSS key pairs
  - Add test-keys submodule (anshalshuklabot/zeam-test-keys) with 32 pre-generated keys
  - Update getTestKeyManager to load pre-generated keys from disk (near-instant)
  - Falls back to runtime generation if test-keys submodule not initialized
  - build.zig: add xmss and types imports to tools target
  Each XMSS key takes ~1 min to generate at runtime. Loading 32 pre-generated keys from SSZ files is near-instant, significantly speeding up CI and tests.
* fix: improve pre-generated key loading
  - Search upward for test-keys/ directory (handles test runners in subdirs)
  - Load pre-generated keys for first 32 validators, generate rest at runtime
  - If 40 validators needed: loads 32 from disk + generates 8 at runtime
  - Fixes CI failure where keys weren't found
* ci: add submodules: recursive to all checkout steps
  Ensures the test-keys submodule is available in every CI job. Fixes CACHE MISS in Dummy prove jobs where pre-generated keys weren't found.
* fix: add Rust build dependency to tools target
  tools now imports @zeam/xmss which requires libhashsig_glue.a. Without this dependency, the tools binary and tests fail to link on CI where Rust libs aren't pre-built in the cache.
* fix: use addRustGlueLib for tools target to link Rust static libs
  The tools executable was missing proper Rust library linking (libhashsig_glue.a, libmultisig_glue.a, liblibp2p_glue.a) and platform-specific frameworks (macOS). Replace manual step.dependOn with addRustGlueLib(), which handles object files, linkLibC, linkSystemLibrary(unwind), and macOS frameworks.
* fix: add Rust build step dependency for tools target
  addRustGlueLib only adds object files and link flags, not the step dependency that ensures Rust libs are built first. Both dependOn AND addRustGlueLib are needed, matching how all other targets are wired.
* fix: resolve test-keys path via build.zig absolute path
  The relative path search for test-keys/hash-sig-keys failed because Zig test binaries run from cache directories, not the repo root. Fix by injecting the absolute repo path as a build option (test_keys_path) from build.zig using pathFromRoot(), and add a build_options import to the key-manager module.
  All 5 build commands verified locally:
  - zig build all
  - zig build test
  - zig build simtest
  - zig build spectest:generate
  - zig build spectest
* fix: handle partial preload failures and fix keypair memory leak
  Two fixes:
  1. Track actually_loaded count instead of assuming all num_preloaded keys were inserted. On partial failure (corrupt SSZ, addKeypair error), the fallback loop now starts from the correct index, so no validators are left without keys.
  2. Replace boolean owns_keypairs with a num_owned_keypairs counter. Pre-generated keys (index < num_owned_keypairs) are freed on deinit while cached runtime keys (index >= num_owned_keypairs) are skipped. This prevents leaking ~20MB private keys when the mixed ownership path is taken.
* fix: track per-key ownership to prevent leaks in CLI and mixed paths
  Replace the index-threshold ownership check with a per-key owned_keys hashmap. addKeypair() marks keys as owned (freed on deinit), addCachedKeypair() does not (for borrowed/cached runtime keys). This fixes the regression where CLI-loaded keys (arbitrary validator indices) were never freed because num_owned_keypairs stayed 0. Now any caller using addKeypair() gets correct cleanup regardless of index order.
* chore: remove plans folder
* style: zig fmt
* chore: update test-keys submodule URL to blockblaz org

---------

Co-authored-by: anshalshuklabot <anshalshuklabot@users.noreply.github.com>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
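The build-option injection described in the first commit can be sketched as follows. This is illustrative, not the actual zeam build script: the option name (test_keys_path) and import name (build_options) come from the commit message, while the module variable and key directory are assumptions.

```zig
// build.zig (sketch): inject the absolute repo path so test binaries
// running from zig-cache can still locate the pre-generated keys.
const key_opts = b.addOptions();
key_opts.addOption([]const u8, "test_keys_path", b.pathFromRoot("test-keys"));
// Expose it to the key-manager module as @import("build_options").
key_manager_module.addOptions("build_options", key_opts);
```

On the consuming side, `@import("build_options").test_keys_path` then yields an absolute path regardless of the test runner's working directory, which is exactly why the relative upward search was dropped.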
* fix: consolidate key loading logic into key-manager (closes #661)
  Extract a shared loadKeypairFromFiles() helper in the key-manager package and use it across all callers:
  - pkgs/key-manager/src/lib.zig: add public loadKeypairFromFiles(allocator, sk_path, pk_path) and expose MAX_SK_SIZE / MAX_PK_SIZE; refactor loadPreGeneratedKey to delegate to it
  - pkgs/cli/src/node.zig: replace the duplicated open/read/fromSsz sequence in loadHashSigKeys() with key_manager_lib.loadKeypairFromFiles(); remove now-unused xmss and constants imports
  - pkgs/cli/src/main.zig: replace the duplicated open/read/fromSsz sequence in the testsig command with key_manager_lib.loadKeypairFromFiles(); hoist the key_manager_lib import to file scope and remove the inner-scope duplicate
* fix: preserve distinct HashSigSecretKeyMissing/HashSigPublicKeyMissing errors in loadKeypairFromFiles

---------

Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@blockblaz.io>
#675) (#676)

* fix: replace {any} with {f} for types with format methods, fix GossipTopic format signature (closes #594)
* fix: replace all {any} with {f} for types with custom format methods (closes #594)
* fix: properly free BeamState on error paths to prevent segfaults (closes #675)
  Three memory-safety fixes in chain.zig:
  1. onBlock — computedstate block: add errdefer for cpost_state
     - After allocator.create(): errdefer destroy (covers sszClone failure)
     - After sszClone(): errdefer deinit (covers verifySignatures / apply_transition failures, e.g. InvalidPostState). LIFO ordering ensures deinit runs before destroy, so interior ArrayList fields are freed cleanly.
     - Previously: a partially-mutated cloned state (historical_block_hashes already appended by process_slots) was silently leaked, causing UB / segfault.
  2. onBlock — function level: add errdefer for post_state when we own it
     - If computedstate succeeds but a later step (forkChoice.onBlock, updateHead, InvalidSignatureGroups) returns an error, post_state was previously leaked.
  3. BeamChain init — cloned_anchor_state: add errdefer before states.put
     - Same create+sszClone pattern without cleanup on states.put failure.

---------

Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
#678)

* fix: wire attestation-committee-count CLI flag through to ChainOptions
  NodeCommand was missing the @"attestation-committee-count" field, so simargs silently ignored the flag and the value always fell through to the hardcoded default of 1 in ChainConfig.init. With every node using committee_count=1, all validators compute validator_id % 1 = 0 and end up subscribed to the same single subnet.
  Changes:
  - Add @"attestation-committee-count": ?u64 = null to NodeCommand so simargs parses and exposes the flag.
  - Add attestation_committee_count: ?u64 = null to NodeOptions.
  - Add attestationCommitteeCountFromYAML() helper that reads ATTESTATION_COMMITTEE_COUNT from config.yaml as a fallback.
  - In buildStartOptions: CLI flag takes precedence; falls back to the config.yaml value; leaves null if neither is present (ChainConfig still defaults to 1 in that case for backward compat).
  - In Node.init: apply options.attestation_committee_count to chain_options before constructing ChainConfig so the correct value reaches EthLibp2p.attestation_committee_count and the subnet assignment logic.
  - Add ATTESTATION_COMMITTEE_COUNT: 4 to test fixture config.yaml.
  - Add two unit tests for attestationCommitteeCountFromYAML.
* fix: validate attestation-committee-count >= 1, default to 1 with warning
* build: add minimum_zig_version = "0.15.0" to build.zig.zon

---------

Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: zclawz <zclawz@blockblaz.io>
* forkchoice: align /lean/v0/fork_choice response with leanSpec
- head: was {"slot", "root"}, now "0x..." flat root string
- safe_target: was {"root": "0x..."}, now "0x..." flat root string
- nodes: add proposer_index field (was missing)
Achieved by:
- Adding proposer_index to ProtoBlock and ProtoNode
- Wiring proposer_index through all ProtoBlock/ProtoNode construction sites
- Updating tree_visualizer.buildForkChoiceJSON serialization
* forkchoice: add buildForkChoiceJSON test and clarifying comments
* fix: improve state recovery logging and add genesis time validation (#481)
  - Add explicit success log when state is loaded from DB on restart
  - Validate genesis time of loaded state matches chain config; fall back to genesis on mismatch with a clear warning
  - Improve DB recovery logging: log block root and slot at each step so failures are easier to diagnose
  - Root cause: lean-quickstart spin-node.sh unconditionally wiped the data directory on every restart (fixed separately in lean-quickstart)
* chore: update lean-quickstart submodule to fix data dir wipe on restart
* fix: wipe stale database on genesis time mismatch
  When the DB contains state from a different genesis (genesis_time mismatch), close and delete the RocksDB directory before reopening so the node starts with a fresh DB instance rather than accumulating stale data. Requested by @g11tech in #637 (comment)
* fix: add post-wipe genesis time re-check; log and return an error if it still mismatches
* refactor: remove redundant post-wipe genesis check
  Per g11tech's review: the downstream loadLatestFinalizedState call will handle any inconsistency if the wipe somehow failed. No need to re-probe immediately after wiping.
* fix: resolve CI failure - apply zig fmt to node.zig (remove trailing newline)
* fix: error out if db wipe fails on genesis time mismatch
* refactor the anchor setup on startup
* fix: wipe db even when no local finalized state found
  Per @anshalshukla review: if loadLatestFinalizedState fails (no finalized state in db), we should still wipe the db for a clean slate rather than risk leftover data. NotFound errors are ignored since the db directory may not exist yet on first run.
* build: switch production builds from ReleaseFast to ReleaseSafe
  ReleaseFast compiles unreachable statements and failed bounds/overflow checks to @trap() — a silent illegal instruction with no output. ReleaseSafe keeps safety checks active (bounds, overflow, unreachable → panics with a stack trace) while still applying most optimizations. This makes crash sites visible in production and CI instead of silently killing the process, which was the root cause of the hard-to-diagnose crashes fixed in PR #681. Affected: Dockerfile, ci.yml (SSE integration build), auto-release.yml (x86_64 and aarch64 release binaries). The ReleaseFast in build.zig is for the risc0/zkvm targets and is left unchanged.
* fix: change ReleaseFast to ReleaseSafe in zkvm build step
* fix: revert zkvm optimize to ReleaseFast (ReleaseSafe breaks riscv32 inline asm)

---------

Co-authored-by: anshalshuklabot <anshalshuklabot@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zeam-bot <zeam-bot@openclaw>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: harkamal <gajinder@zeam.in>
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@blockblaz.com>
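The operative change for the ReleaseSafe switch is the optimize flag passed to `zig build` in the Dockerfile and workflow files. The exact invocation below is illustrative (the real targets and extra flags may differ), but it shows the trade: one flag turns silent `@trap()` crashes into panics with stack traces, at the cost of keeping runtime safety checks.

```shell
# Before: safety violations compile to @trap() — process dies with no output.
zig build -Doptimize=ReleaseFast

# After: bounds/overflow/unreachable panic with a stack trace in production.
zig build -Doptimize=ReleaseSafe
```

Note the carve-out from the commit message: the riscv32 zkvm guest target stays on ReleaseFast because ReleaseSafe breaks its inline assembly.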
* devnet 3 metrics
* move aggregated payloads metrics outside mutex scope
* address review comment
…ogic (closes #682) (#684)

The db.deinit() + deleteTreeAbsolute + Db.open() pattern was duplicated in two branches of Node.init:
1. Genesis time mismatch (known stale db) — FileNotFound is an error
2. No finalized state found (fresh install) — FileNotFound is ignored

Extracted into wipeAndReopenDb(db, allocator, database_path, logger_config, logger, ignore_not_found), which consolidates the two paths and makes the FileNotFound distinction explicit via the bool parameter.

Co-authored-by: zclawz <zclawz@blockblaz.io>
…85 (#687) Co-authored-by: ZclawZ <zclawz@blockblaz.io>
…ync (#681)

Three bugs caused the node to crash silently (Zig ReleaseFast trap) when processing a burst of cached blocks while syncing from a checkpoint:
1. clock: `_ = r catch unreachable` triggered a trap on the expected `error.Canceled` that xev emits when a timer fires after being re-armed. The error is now handled explicitly: `Canceled` is silently ignored; any other value panics with a message.
2. chain: `processFinalizationAdvancement` assumed `finalized_roots` is non-empty before performing `finalized_roots[1..finalized_roots.len]` slice arithmetic. If the fork choice has already been rebased past the requested root, `getCanonicalViewAndAnalysis` can return an empty slice, causing an out-of-bounds panic. The function now returns early with a warning instead. The `usize` subtraction for the orphan-count log line was also guarded against underflow.
3. node: `processCachedDescendants` did not call `chain.onBlockFollowup` after successfully integrating a cached block, unlike the sibling path `processBlockByRootChunk`. This meant head/justification/finalization events were not emitted and `last_emitted_finalized` was not advanced as the node processed a chain of cached blocks, leading to a large deferred finalization jump that triggered the panics above.
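Fix (1) can be sketched as below. This is not the actual zeam clock code: the callback shape follows libxev's Timer API, and `Clock`/`advanceSlot` are hypothetical names standing in for the real timer userdata and tick handler.

```zig
const std = @import("std");
const xev = @import("xev").Dynamic;

// Sketch: handle the timer result explicitly instead of `catch unreachable`.
fn onTick(
    userdata: ?*Clock,
    loop: *xev.Loop,
    completion: *xev.Completion,
    result: xev.Timer.RunError!void,
) xev.CallbackAction {
    _ = loop;
    _ = completion;
    result catch |err| switch (err) {
        // Expected when the timer is re-armed before it fires — not a crash.
        error.Canceled => return .disarm,
        // Anything else is a real failure: panic with a message instead of
        // the silent @trap() that `catch unreachable` produced in ReleaseFast.
        else => std.debug.panic("clock timer failed: {any}", .{err}),
    };
    userdata.?.advanceSlot();
    return .rearm;
}
```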
* fix: subnet subscription for attestations (closes #683)
  - Add --subnet-ids CLI flag (comma-separated) to explicitly configure which attestation subnets to subscribe and aggregate
  - When --subnet-ids is provided, subscribe only to those subnets at the libp2p gossip level (overrides auto-computation from validator ids)
  - Gate forkchoice import (onGossipAttestation) behind the is_aggregator flag: gossip attestations are only imported if --is-aggregator is set
  - When --subnet-ids is also set, further filter imports to the specified subnets only; an aggregator without --subnet-ids imports from all subscribed subnets
  - Proposer attestations from blocks continue to be imported unconditionally (they go through the block processing path, not onGossipAttestation)
  - Propagate subnet_ids through NodeOpts -> BeamNode -> ChainOpts -> BeamChain
* fix: rename subnet-ids to aggregate-subnet-ids, subscribe to all subnets
  - Rename CLI flag --subnet-ids to --aggregate-subnet-ids per review
  - Subscribe to both explicit aggregate-subnet-ids AND validator-derived subnets (not an exclusive else-if); deduplicate with a seen set
  - Fall back to subnet 0 only if no subnets come from either source
* fix: use ReleaseFast for risc0ospkg host tool to fix docker-build CI
* fix: validate is-aggregator before aggregate-subnet-ids; scope no-filter import to validator subnet
* fix: gate attestation p2p subscription on is_aggregator
  Non-aggregator nodes never import gossip attestations (chain.zig guards import behind is_aggregator_enabled). Subscribing to attestation subnets at the libp2p layer when the node will never consume them wastes network bandwidth — the messages are received, validated, and then immediately discarded. Move the is_aggregator guard up to the p2p subscription step so non-aggregators never join attestation subnet topics at all. Addresses review comment #685 (chain.zig line 790).
* fix: move subnet filtering to the p2p subscription layer, not the node import level
  Per @anshalshukla review: when no explicit --aggregate-subnet-ids filter is set, trust the p2p subscription (node.zig already subscribes only to validator-derived subnets at the libp2p level). Remove the redundant computeSubnetId loop in chain.zig that was rejecting at the node level instead of preventing bandwidth use at the network layer.
* fix: rename aggregate-subnet-ids to import-subnet-ids, subscribe/import regardless of aggregator flag
  - Rename the --aggregate-subnet-ids CLI flag to --import-subnet-ids
  - Rename the aggregation_subnet_ids field to import_subnet_ids in chain.zig
  - Update help text: the option adds to (not overrides) automatic validator subnet computation
  - Subscribe to import_subnet_ids at the p2p level regardless of the is-aggregator flag (proposer nodes also need these attestations to include in blocks)
  - Import gossip attestations for import_subnet_ids regardless of is_aggregator_enabled
  - Validator-derived subnets and the aggregator fallback are still gated behind is_aggregator_enabled
  Addresses review feedback from @g11tech on PR #685
* correct the implementation
* fix: remove stale subnet_ids from ChainOpts init, fix subscribe_subnet_ids typo
  - Remove .subnet_ids from the ChainOpts initializer in node.zig (field removed by g11tech)
  - Fix typo: self.subscribe_subnet_ids -> self.subscription_subnet_ids in cli/node.zig
  - Add allocator.free for subscription_subnet_ids in Node.deinit
* fix: resolve CI failure - pass empty subnet slice to net.run() in simtest call sites
* undo changing network run
* fix the subnet subscription
* fix: implement subscription_subnet_ids computation from validator_ids and aggregation_subnet_ids
* remove redundant compute of validator subnet ids
* fix the edge condition
* fix build
* fix build

---------

Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: harkamal <gajinder@zeam.in>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: zclawz <zclawz@zeam.dev>
…#692)

* cli: fix checkpoint sync panic on chunked HTTP responses
  When the SSZ state grows beyond ~3 MB the server switches from sending a Content-Length response to Transfer-Encoding: chunked. The previous body-reading loop called readSliceShort, which internally goes through:
  readSliceShort → readVec → defaultReadVec → contentLengthStream
  contentLengthStream accesses reader.state.body_remaining_content_length, but that field is not active for chunked responses (state is 'ready'), causing a panic:
  thread 1 panic: access of union field 'body_remaining_content_length' while field 'ready' is active
  Replace the manual request/response loop with client.fetch() using a std.Io.Writer.Allocating as the response_writer. fetch() calls response.readerDecompressing() + streamRemaining(), which dispatches through chunkedStream or contentLengthStream correctly based on the actual transfer encoding used by the server.
* types: skip pre-finalized attestations instead of aborting block import
  After checkpoint sync, incoming blocks may carry attestations whose target slots predate the finalized anchor. IsJustifiableSlot correctly identifies these as non-justifiable, but the `try` was propagating the error fatally, causing the entire block import to fail. This creates a cascading gap: block N fails → blocks N+1..M fail (missing parent) → no epoch-boundary attestations accumulate → justified checkpoint never advances → forkchoice stays stuck in `initing` indefinitely.
  Fix: catch InvalidJustifiableSlot and treat it as `false`. The attestation is then silently skipped via the existing !is_target_justifiable check, exactly as all other non-viable attestations (unknown source/target/head, stale slot, etc.) are handled. The block imports successfully, the chain catches up, and the node exits the initing state.
  Update the test that was asserting the old (buggy) error-propagation behaviour to instead assert that process_attestations succeeds.
* node: fix fc_initing deadlock after checkpoint sync
  After checkpoint sync the forkchoice starts in the initing state and waits for a first justified checkpoint before declaring itself ready. The status-response sync handler was checking getSyncStatus() and treating fc_initing the same as synced — doing nothing. This created a deadlock: the node never requested blocks from ahead peers because it was in fc_initing, and it could never leave fc_initing because no blocks were imported.
  Fix the deadlock in two places:
  1. Status-response handler: add an explicit fc_initing branch that requests the peer's head block when the peer is ahead of our anchor slot. This mirrors the behind_peers branch but uses head_slot for the comparison (finalized_slot is not yet meaningful in fc_initing).
  2. Periodic sync refresh: every SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8) slots, re-send our status to all connected peers when not synced. This recovers from the case where all peers were already connected before the fix was deployed, so no new connection event fires and the status-response handler would never be re-triggered.
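The chunked-response fix from the first commit in this entry can be sketched as follows. This is a sketch against the Zig 0.15-era std.http API; the variable names (`checkpoint_url`, `allocator`) and exact field shapes are assumptions, not quotes from the zeam source.

```zig
// Sketch: let fetch() drive body reading so the correct stream
// (chunkedStream vs contentLengthStream) is chosen from the server's
// actual Transfer-Encoding, avoiding the union-field panic.
var client: std.http.Client = .{ .allocator = allocator };
defer client.deinit();

var body: std.Io.Writer.Allocating = .init(allocator);
defer body.deinit();

const result = try client.fetch(.{
    .location = .{ .url = checkpoint_url },
    .response_writer = &body.writer,
});
if (result.status != .ok) return error.CheckpointSyncFailed;

// body now holds the full SSZ state regardless of transfer encoding.
const state_bytes = body.written();
```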
* fix: validateAttestationData refactor + validate aggregated attestation (closes #654)
* refactor: update store - key signatures/payloads by AttestationData instead of SignatureKey (closes #636)
  Implements leanEthereum/leanSpec#436:
  - Replace the SignatureKey(validator_id, data_root) -> StoredSignature map with AttestationData -> HashMap(ValidatorIndex, StoredSignature) (SignaturesMap)
  - Replace the SignatureKey -> AggregatedPayloadsList map with AttestationData -> AggregatedPayloadsList (AggregatedPayloadsMap)
  - Remove the attestation_data_by_root reverse-lookup map (no longer needed)
  - Remove the SignatureKey type (replaced by AttestationData as key)
  - Update computeAggregatedSignatures Phase 1 to look up by AttestationData
  - Update computeAggregatedSignatures Phase 2 greedy set-cover to use a per-data candidate list instead of a per-validator lookup
  - Update pruneStaleAttestationData to iterate gossip_signatures directly and prune by target.slot (renamed to prunePayloadMapBySlot)
  - Update all call sites: forkchoice.zig, fork_choice_runner.zig, mock.zig, block_signatures_testing.zig
* fix: resolve CI failure - update tests to use AttestationData as map key after store refactor
* refactor: address PR #656 review comments
  - Remove 'Replaces the old...' and 'not per-validator' comments from block.zig (three sites); the reviewer noted these transition comments are not needed in the final code.
  - Make GossipSignaturesInnerMap a proper struct wrapping AutoHashMap with init/put/get/deinit/iterator/count methods, so callers don't need to manage the inner map lifecycle and the initialization pattern (getOrPut + init if not found) stays consistent across all sites.
  - Remove the validator_ids parameter from storeAggregatedPayload — payloads are now keyed by AttestationData so the param was unused. Update all call sites: chain.zig (×2), forkchoice.zig test helper, fork_choice_runner.zig.
  - Scope signatures_mutex in storeAggregatedPayload using a {} block with defer unlock, releasing the lock as soon as the map write completes. Consistent with the scoped-lock pattern used everywhere else in the file.
* refactor: eliminate attestation_list from computeAggregatedSignatures; remove validateAttestation wrapper
  aggregateCommitteeSignaturesUnlocked / computeAggregatedSignatures: the old flow built a flat attestation list from gossip_signatures to pass as the first arg to computeAggregatedSignatures, which then re-grouped it by AttestationData. Since gossip_signatures is already keyed by AttestationData → GossipSignaturesInnerMap, this round-trip was pure overhead. New flow:
  - computeAggregatedSignatures(validators, signatures_map, aggregated_payloads) drops the attestations_list parameter entirely.
  - Iterates the *union* of signatures_map and aggregated_payloads keys so groups that only exist in aggregated_payloads (individual sigs pruned but stored proof available) are still processed.
  - Phase 1 iterates the inner map directly (validator_id → stored_sig) with no intermediate grouping or root-index lookup.
  - Before the Phase 2 greedy set-cover, seeds from proof participants not already in sigmap_available, so proofs covering validators absent from signatures_map are still included.
  aggregateCommitteeSignaturesUnlocked (forkchoice.zig):
  - Drops the attestations ArrayList build loop entirely.
  - Calls computeAggregatedSignatures directly on gossip_signatures.
  chain.zig:
  - Removes the getProposalAttestations() call (it was only needed as the first arg); aggregation now reads gossip_signatures directly.
  - Removes the validateAttestation() thin wrapper (review comment #r2955207656); callers should use validateAttestationData directly.
  All call sites updated: chain.zig, mock.zig, block_signatures_testing.zig.
* fix: resolve CI failure - apply zig fmt to chain.zig (trailing blank line)
* improve code and simplify aggregation
* optimize block building
* add checks in block production
* refactor: move greedy proposal attestation logic to forkchoice.getProposalAttestations
  Move the fixed-point greedy proof selection loop from chain.produceBlock into forkchoice.getProposalAttestations, which now takes pre_state, slot, proposer_index, and parent_root as inputs. This encapsulates the attestation selection strategy inside the fork choice module, where it belongs. chain.produceBlock now simply calls forkChoice.getProposalAttestations and uses the returned AggregatedAttestations + AttestationSignatures.

---------

Co-authored-by: zclawz <zclawz@blockblaz.io>
Co-authored-by: zclawz <zclawz@openclaw.ai>
Co-authored-by: Anshal Shukla <53994948+anshalshukla@users.noreply.github.com>
Co-authored-by: anshalshukla <shukla.anshal85@gmail.com>
Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Replace compile-time backend override (forked libxev) with upstream
xev's Dynamic API. On Linux, Dynamic probes io_uring first and falls
back to epoll at runtime — exactly what Shadow needs since it doesn't
support io_uring_setup. On macOS, Dynamic degenerates to static kqueue.
- Switch all `@import("xev")` to `@import("xev").Dynamic`
- Add `detectBackend()` helper with `@hasDecl` guard for single-backend
platforms (macOS) where `detect()` doesn't exist
- Revert build.zig.zon to upstream libxev (no fork needed)
- Remove build.zig XevBackend option
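The detectBackend() helper described in the bullets above can be sketched like this. The shape is inferred from the description (not copied from the zeam source): upstream libxev's Dynamic API exposes detect() only on platforms with multiple backends, so @hasDecl guards the call on single-backend platforms such as macOS.

```zig
const xev = @import("xev").Dynamic;

/// Sketch: pick the best available event-loop backend at runtime.
/// On Linux, Dynamic.detect() probes io_uring and falls back to epoll
/// (which is what Shadow needs, since it lacks io_uring_setup).
/// On macOS, Dynamic degenerates to static kqueue and detect() does not
/// exist, so the @hasDecl guard makes this a no-op there.
pub fn detectBackend() !void {
    if (comptime @hasDecl(xev, "detect")) {
        try xev.detect();
    }
}
```

Call sites (node startup, test contexts) would invoke this once before constructing any xev.Loop, so all loops agree on the probed backend.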
Wrong approach — included unrelated commits. Will recreate with only shadow-specific changes.
Pull request overview
This PR cherry-picks a large set of non-shadow-specific fixes/refactors/features from the former shadow development branch onto main, including runtime I/O backend detection, a fork-choice/store refactor around AttestationData keys, and multiple operational/CLI/metrics improvements.
Changes:
- Refactors gossip-signature + aggregated-payload storage to be keyed by AttestationData (nested per-validator inner maps) and updates aggregation/block-production paths accordingly.
- Switches xev usage to the Dynamic API with runtime backend detection (io_uring/epoll probing on Linux) and propagates formatting changes ({f}) across logging/fixtures.
- Adds test key infrastructure (new test-keys submodule, key loading helpers, and a tools keygen command), plus metrics and CLI/database recovery improvements.
Reviewed changes
Copilot reviewed 40 out of 40 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| resources/zeam.md | Adds links to LeanSpec client documentation. |
| resources/poseidon.md | Documents Poseidon2 SSZ hashing build flag and design notes. |
| README.md | Mentions Poseidon2 option and adds build troubleshooting notes. |
| pkgs/types/src/state.zig | Changes pre-finalized attestation handling (skip vs error) + updates test. |
| pkgs/types/src/lib.zig | Updates public exports to match new attestation/signature aggregation API. |
| pkgs/types/src/block.zig | Introduces SignaturesMap keyed by AttestationData and AggregateInnerMap; updates proto types. |
| pkgs/types/src/block_signatures_testing.zig | Removes legacy aggregation test suite tied to old SignatureKey design. |
| pkgs/tools/src/main.zig | Adds keygen command to generate XMSS test validator keys + manifest. |
| pkgs/state-transition/src/transition.zig | Updates PQ signature verification metrics and invalid/valid counters. |
| pkgs/state-transition/src/mock.zig | Updates mock chain generation to use new signature map + inner-map aggregation. |
| pkgs/spectest/src/runner/fork_choice_runner.zig | Updates fixtures runner formatting and aggregated-payload storage API usage. |
| pkgs/spectest/src/json_expect.zig | Updates fixture error formatting to use {f} formatting. |
| pkgs/node/src/utils.zig | Switches to xev.Dynamic and adds detectBackend() runtime probe. |
| pkgs/node/src/tree_visualizer.zig | Aligns fork-choice JSON output shapes with leanSpec + adds test coverage. |
| pkgs/node/src/testing.zig | Uses xev.Dynamic and calls detectBackend() during test context init. |
| pkgs/node/src/node.zig | Adds subnet subscription controls, sync recovery logic, cached-block followups, and formatting updates. |
| pkgs/node/src/lib.zig | Re-exports detectBackend from utils. |
| pkgs/node/src/forkchoice.zig | Refactors signature/payload store to AttestationData keys; new proposal attestation selection algorithm; pruning/metrics updates. |
| pkgs/node/src/constants.zig | Adds periodic sync status refresh interval constant. |
| pkgs/node/src/clock.zig | Switches to xev.Dynamic and handles timer cancelation cleanly. |
| pkgs/node/src/chain.zig | Updates block production to new proposal attestation API; improves gossip missing-root propagation; memory-safety errdefers. |
| pkgs/network/src/mock.zig | Uses xev.Dynamic, backend detection in tests, and minor loop refactor. |
| pkgs/network/src/interface.zig | Uses xev.Dynamic; updates GossipTopic formatting signature and log format specifiers. |
| pkgs/network/src/ethlibp2p.zig | Uses xev.Dynamic and updates log formatting specifiers. |
| pkgs/metrics/src/lib.zig | Renames/adds PQ signature metrics, adds fork-choice store gauges and committee aggregation histogram; adjusts buckets. |
| pkgs/key-manager/src/lib.zig | Adds pre-generated key loading, ownership tracking, shared key loading helper, and signing metric updates. |
| pkgs/database/src/rocksdb.zig | Improves finalized-state recovery logs and corruption diagnostics. |
| pkgs/cli/test/fixtures/config.yaml | Adds ATTESTATION_COMMITTEE_COUNT to fixture config. |
| pkgs/cli/src/node.zig | Adds DB wipe/reopen helper, checkpoint-vs-db anchor selection logic, subnet/committee-count options parsing, and key loading helper use. |
| pkgs/cli/src/main.zig | Uses xev.Dynamic, calls detectBackend() at startup, and uses shared key loading helper. |
| pkgs/api/src/events.zig | Updates proto block construction to include proposer_index. |
| pkgs/api/src/event_broadcaster.zig | Updates proto block construction to include proposer_index. |
| Dockerfile | Switches release build optimization to ReleaseSafe. |
| build.zig.zon | Sets minimum Zig version to 0.15.0. |
| build.zig | Injects test_keys_path, wires build_options into key-manager, updates tools deps, and forces ReleaseFast for riscv32 targets. |
| .gitmodules | Adds test-keys submodule. |
| .github/workflows/ci.yml | Ensures submodules are checked out; switches build optimization to ReleaseSafe. |
| .github/workflows/auto-release.yml | Switches release builds to ReleaseSafe. |
```zig
var payload_it = self.latest_known_aggregated_payloads.iterator();
while (payload_it.next()) |entry| {
    if (!std.mem.eql(u8, &current_justified_root, &entry.key_ptr.source.root)) continue;
    if (!self.protoArray.indices.contains(entry.key_ptr.head.root)) continue;
    if (processed_att_data.contains(entry.key_ptr.*)) continue;
    try sorted_entries.append(self.allocator, .{ .att_data = entry.key_ptr, .payloads = entry.value_ptr });
}
```
`getProposalAttestationsUnlocked` iterates over `latest_known_aggregated_payloads` without taking `signatures_mutex`, but that map is mutated under `signatures_mutex` by `storeAggregatedPayload` / `acceptNewAttestationsUnlocked` / `aggregateCommitteeSignaturesUnlocked`. Iterating without the same lock can race with concurrent writers, leading to iterator invalidation or memory corruption. Consider either (a) taking `signatures_mutex` while collecting the relevant payload entries into a local list (keys + cloned proofs) and releasing it before the STF loop, or (b) protecting the payload maps with the main mutex consistently for both writers and readers.
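One possible shape for option (a), assuming the field names from the snippet above plus a hypothetical `local_entries` list and a `clone` helper on the payload map (not shown in the diff):

```zig
// Sketch: snapshot the matching entries under signatures_mutex, then run the
// STF loop over the local copy with the lock released. Field names follow the
// snippet above; `local_entries` and `clone` are assumptions for illustration.
self.signatures_mutex.lock();
var payload_it = self.latest_known_aggregated_payloads.iterator();
while (payload_it.next()) |entry| {
    if (!std.mem.eql(u8, &current_justified_root, &entry.key_ptr.source.root)) continue;
    if (!self.protoArray.indices.contains(entry.key_ptr.head.root)) continue;
    if (processed_att_data.contains(entry.key_ptr.*)) continue;
    // Copy the key and clone the proofs so the entries stay valid after unlock,
    // even if a concurrent writer rehashes or frees the map's storage.
    try local_entries.append(self.allocator, .{
        .att_data = entry.key_ptr.*,
        .payloads = try entry.value_ptr.clone(self.allocator),
    });
}
self.signatures_mutex.unlock();
// ... run the STF loop over local_entries without holding signatures_mutex ...
```

The cost is one extra copy per matching entry, but the lock is held only for the collection pass rather than across the whole (potentially slow) STF loop.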
```zig
pub fn getOrPut(self: *SignaturesMap, key: attestation.AttestationData) !InnerHashMap.GetOrPutResult {
    return self.inner.getOrPut(key);
}

pub fn getPtr(self: *const SignaturesMap, key: attestation.AttestationData) ?*InnerMap {
    return self.inner.getPtr(key);
}
```
`SignaturesMap.getPtr` takes `self: *const SignaturesMap` but returns a mutable `?*InnerMap`, which breaks const-correctness: callers can mutate the inner maps through a const reference. It should either take `self: *SignaturesMap` or return `?*const InnerMap`, with a separate mutable accessor provided where needed.
```zig
pub fn get(self: *const SignaturesMap, key: attestation.AttestationData) ?InnerMap {
    return self.inner.get(key);
}
```
SignaturesMap.get returns an InnerMap by value (copying the hash map struct). Even if callers treat it read-only, this makes it easy to accidentally mutate/deinit a copy and double-free internal allocations. Prefer exposing only pointer-based accessors (getPtr / getConstPtr) so ownership/lifetime is unambiguous.
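A sketch of the pointer-only accessor pair suggested in the two comments above, assuming the `SignaturesMap` wrapper shown in the diff (`getConstPtr` is a hypothetical name):

```zig
// Mutable access requires a mutable receiver.
pub fn getPtr(self: *SignaturesMap, key: attestation.AttestationData) ?*InnerMap {
    return self.inner.getPtr(key);
}

// Read-only access through a const receiver returns a const pointer;
// ?*InnerMap coerces to ?*const InnerMap, so no cast is needed.
pub fn getConstPtr(self: *const SignaturesMap, key: attestation.AttestationData) ?*const InnerMap {
    return self.inner.getPtr(key);
}
```

With only pointer-based accessors exposed, callers can never accidentally `deinit` a by-value copy of the inner map, so ownership of its internal allocations stays unambiguous.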
```diff
@@ -781,9 +781,9 @@ fn processBlockStep(
     validator_ids[i] = @intCast(vi);
 }

-ctx.fork_choice.storeAggregatedPayload(validator_ids, &aggregated_attestation.data, proof_template, true) catch |err| {
+ctx.fork_choice.storeAggregatedPayload(&aggregated_attestation.data, proof_template, true) catch |err| {
     std.debug.print(
```
After `storeAggregatedPayload` was refactored to no longer take `validator_ids`, this block still allocates and fills `validator_ids` but never uses it. Zig treats unused locals as compile errors, so this will fail compilation. Remove the allocation and fill loop, or keep using the value if it is still needed elsewhere.
```zig
db.deinit();
const rocksdb_path = try std.fmt.allocPrint(allocator, "{s}/rocksdb", .{database_path});
defer allocator.free(rocksdb_path);
std.fs.deleteTreeAbsolute(rocksdb_path) catch |wipe_err| {
    if (!ignore_not_found or wipe_err != error.FileNotFound) {
        logger.err("failed to delete database directory '{s}': {any}", .{ rocksdb_path, wipe_err });
        return wipe_err;
    }
};
```
`wipeAndReopenDb` builds `rocksdb_path` via `"{s}/rocksdb"` and then calls `std.fs.deleteTreeAbsolute(rocksdb_path)`. If `database_path` is a relative path (common for CLI flags and defaults), `deleteTreeAbsolute` will error because the path is not absolute. Use `std.fs.cwd().deleteTree(rocksdb_path)` for relative paths, or convert to an absolute path first (e.g., via `realpathAlloc`) before calling `deleteTreeAbsolute`.
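One way the wipe could handle both cases, reusing the names from the snippet above (a sketch, not the PR's actual fix):

```zig
// Sketch: branch on whether the joined path is absolute, so relative
// database_path values (typical for CLI defaults) still work.
const rocksdb_path = try std.fmt.allocPrint(allocator, "{s}/rocksdb", .{database_path});
defer allocator.free(rocksdb_path);
const delete_result = if (std.fs.path.isAbsolute(rocksdb_path))
    std.fs.deleteTreeAbsolute(rocksdb_path)
else
    std.fs.cwd().deleteTree(rocksdb_path);
delete_result catch |wipe_err| {
    if (!ignore_not_found or wipe_err != error.FileNotFound) {
        logger.err("failed to delete database directory '{s}': {any}", .{ rocksdb_path, wipe_err });
        return wipe_err;
    }
};
```

Branching on `std.fs.path.isAbsolute` avoids an extra allocation and syscall compared to canonicalizing every path up front, at the cost of two code paths.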
Summary

Cherry-pick all non-shadow-specific commits from the `shadow` branch to `main`. These are fixes, refactors, and features that accumulated on `shadow` while it was the active development branch.

Key changes:

Not included (shadow-only):
- `rust/patch/quinn-udp/` (Shadow fallback UDP)
- `shadow/` directory (removed, now in lean-quickstart)

After this merges, the `shadow` branch can be rebased onto `main` with minimal diff (only the quinn-udp patch).

Test plan