
New concurrent AggregatingAttestationPool (V2) #9297


Closed

Conversation

tbenr
Contributor

@tbenr tbenr commented Mar 28, 2025

New AggregatingAttestationPoolV2 and MatchingDataAttestationGroupV2 classes implement the new pool.

The --Xaggregating-attestation-pool-v2-enabled CLI flag switches it on (disabled by default).

Code duplication is intentionally left in place so that the two implementations can drift independently over time.

Unit tests run against both implementations (sketched below).

fixes #9291
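
For illustration of the "unit tests run against both implementations" point, a JUnit 5 parameterized test is one common way to drive the same cases through both pools; every name below (test class, stand-in interface, factory method) is an assumption for this sketch, not the PR's actual test code.

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class AttestationPoolBothImplementationsTest {

  // Stand-in for the shared pool contract; real tests would construct the actual
  // AggregatingAttestationPool (V1) or AggregatingAttestationPoolV2 with their dependencies.
  interface PoolUnderTest {
    int getSize();
  }

  @ParameterizedTest
  @ValueSource(booleans = {false, true})
  void everyCaseRunsAgainstBothImplementations(final boolean useV2) {
    final PoolUnderTest pool = createPool(useV2);
    // shared assertions over the common behaviour go here
    assertTrue(pool.getSize() >= 0);
  }

  private PoolUnderTest createPool(final boolean useV2) {
    // placeholder: the real factory would return the V2 pool when useV2 is true, otherwise V1
    return () -> 0;
  }
}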

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

  • I thought about adding a changelog entry, and added one if I deemed necessary.

@tbenr tbenr marked this pull request as draft March 28, 2025 12:31
@tbenr tbenr force-pushed the concurrent-aggregatingAttestationPool branch from 2f12a02 to 3023f99 Compare March 28, 2025 13:27
@tbenr tbenr force-pushed the concurrent-aggregatingAttestationPool branch from bd7a49e to 1af17ee Compare March 28, 2025 18:53
@tbenr tbenr force-pushed the concurrent-aggregatingAttestationPool branch from 1af17ee to 838a7b0 Compare March 28, 2025 19:32
improved removeAttestationsPriorToSlot
@tbenr tbenr changed the title Pool v2 New concurrent AggregatingAttestationPool (V2) Mar 29, 2025
@tbenr tbenr marked this pull request as ready for review March 29, 2025 14:26
* specific language governing permissions and limitations under the License.
*/

package tech.pegasys.teku.statetransition.attestation;
Contributor

@rolfyone rolfyone Apr 6, 2025

Maybe we should just use the same class name in different packages?
Not set on this, but to demonstrate:
statetransition.attestation.v1
statetransition.attestation.v2
Maybe cache.v1 and cache.v2 or something... just the concept...
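
For illustration only (hypothetical file locations, not necessarily what the PR ended up doing), the suggestion amounts to one class name repeated across version packages:

// .../statetransition/attestation/v1/AggregatingAttestationPool.java
package tech.pegasys.teku.statetransition.attestation.v1;

public class AggregatingAttestationPool {
  // existing (V1) implementation, unchanged
}

// .../statetransition/attestation/v2/AggregatingAttestationPool.java
package tech.pegasys.teku.statetransition.attestation.v2;

public class AggregatingAttestationPool {
  // new concurrent (V2) implementation
}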

Contributor Author

Nice idea... this way we can see that V1 is essentially unchanged.

@rolfyone rolfyone mentioned this pull request Apr 6, 2025
// attestation is not from the current or previous epoch
// this is really an edge case because the current or previous epoch is at least 31 slots
// and the attestation is only valid for 64 slots, so it may be epoch-2 but not beyond.
final UInt64 attestationEpochStartSlot = miscHelpers.computeStartSlotAtEpoch(attestationEpoch);
Contributor

I've been thinking about this a little... I think if we only ever use the start slot of the current epoch or the start slot of the previous epoch, that's probably safest (largely because of the limited cache).
We've got two options; I think the simplest is:

  • like above, if the attestation is from the current or previous epoch, use bestState
  • otherwise use the first slot of the previous epoch (which is generally still cached)

If we can easily log cache misses here, it may be a good thing...

I'd probably avoid using the start slot of the attestation epoch because that may be two epochs behind (per the comment) and would almost certainly require a regeneration from... probably the justified state...
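
A minimal sketch of that suggestion, assuming Teku's UInt64 and MiscHelpers types and a hypothetical helper name (not the PR's actual code):

import tech.pegasys.teku.infrastructure.unsigned.UInt64;
import tech.pegasys.teku.spec.logic.common.helpers.MiscHelpers;

class LookupSlotSelector {
  // Choose the slot whose cached state we use when processing an attestation.
  static UInt64 selectLookupSlot(
      final MiscHelpers miscHelpers, final UInt64 currentSlot, final UInt64 attestationEpoch) {
    final UInt64 currentEpoch = miscHelpers.computeEpochAtSlot(currentSlot);
    if (attestationEpoch.isGreaterThanOrEqualTo(currentEpoch.minusMinZero(1))) {
      // attestation from the current or previous epoch: the best/head state is fine
      return currentSlot;
    }
    // older attestation: clamp to the first slot of the previous epoch, which is
    // generally still cached, rather than the attestation's own epoch start slot
    return miscHelpers.computeStartSlotAtEpoch(currentEpoch.minusMinZero(1));
  }
}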

Comment on lines +233 to +256
// Prune based on maximum size if needed
int currentSize = getSize();
if (currentSize > maximumAttestationCount) {
  // Keep removing oldest slots until size is acceptable or only one slot remains
  while (dataHashBySlot.size() > 1 && currentSize > maximumAttestationCount) {
    LOG.trace(
        "V2 Attestation cache at {} exceeds {}. Pruning...",
        currentSize,
        maximumAttestationCount);
    final UInt64 oldestSlot = dataHashBySlot.firstKey();
    // Remove slot immediately following the oldest to ensure we always keep at least one slot
    removeAttestationsPriorToSlot(oldestSlot.plus(1));
    final int newSize = getSize();
    // Break if removal failed to change size or get oldest key (edge case for concurrent
    // modification)
    if (newSize == currentSize || oldestSlot.equals(dataHashBySlot.firstKey())) {
      LOG.warn(
          "V2 Failed to prune oldest slot {}, possibly due to concurrent access or no removable attestations. Skipping further pruning this cycle.",
          oldestSlot);
      break;
    }
    currentSize = newSize;
  }
}
Contributor

I do think we need some lock logic around the cleanup... I think I had a cleanup function in my draft that used a lock here...

The reasoning is really the degenerate scenario where two onSlot calls run concurrently because of a long prune, and hilarity ensues.

I would suggest breaking this cleanup logic out and having a lock where, if it's already running, we just return:

final AtomicBoolean isCleanupRunning = new AtomicBoolean(false);
...
  void cleanupCache(final Optional<UInt64> maybeSlot) {
    // one cleanup at a time can run
    if (!isCleanupRunning.compareAndSet(false, true)) {
      return;
    }

    try {
      if (maybeSlot.isEmpty()) {
        while (dataHashBySlot.size() > 1 && size.get() > maximumAttestationCount) {
          LOG.trace("Attestation cache at {} exceeds {}, ", size.get(), maximumAttestationCount);
          removeAttestationsPriorToSlot(dataHashBySlot.firstKey().plus(1));
        }
      } else {
        removeAttestationsPriorToSlot(maybeSlot.get());
      }
    } finally {
      isCleanupRunning.set(false);
    }
  }

Something like this concept...

I'd prefer to have a cache that's too large for a period than a broken cache or a bunch of onSlot calls taking a long time.

Contributor Author

So you are thinking about that because you think two onSlot calls could run concurrently?
I actually don't think that can happen because, IIRC, the event channel always queues pending events and executes them in sequence. I'll double check.

Contributor

Yeah, so removeAttestationsPriorToSlot is actually a dangerous function; it can't run from multiple sources concurrently...
In v1 it ran in add and in onSlot, and now it just runs in onSlot... that's good, as long as onSlot is not able to run more than once at a time.
The safer thing is to have a cacheCleanup that's behind a lock and just skips if it's already running. It still probably needs a "don't call this anywhere else because..." note on removeAttestationsPriorToSlot.
This could be done in onSlot, just using the same lock concept and exiting if the cache cleanup is already running (so isCleanupRunning could easily live inside onSlot rather than its own function, and it would be equivalent).
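
A rough sketch of that "same lock concept inside onSlot" variant, reusing the field and method names from the earlier snippet (illustrative only, not the PR's actual code; ATTESTATION_RETENTION_SLOTS is a placeholder for however far back the pool keeps attestations):

// requires java.util.concurrent.atomic.AtomicBoolean
private final AtomicBoolean isCleanupRunning = new AtomicBoolean(false);

public void onSlot(final UInt64 slot) {
  // only one cleanup pass may run at a time; if one is still in progress, skip this slot
  if (!isCleanupRunning.compareAndSet(false, true)) {
    return;
  }
  try {
    // normal per-slot expiry
    removeAttestationsPriorToSlot(slot.minusMinZero(ATTESTATION_RETENTION_SLOTS));
    // then size-based pruning, one oldest slot at a time, as in the snippet above
    while (dataHashBySlot.size() > 1 && getSize() > maximumAttestationCount) {
      removeAttestationsPriorToSlot(dataHashBySlot.firstKey().plus(1));
    }
  } finally {
    isCleanupRunning.set(false);
  }
}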

Contributor

Actually, I think more precisely the looping-and-reducing concept is the fun bit; we want to avoid two of them running at the same time, as in the worst case we'd flush out more than we want.

@tbenr
Contributor Author

tbenr commented May 12, 2025

Closing in favour of a new wave of PRs

@tbenr tbenr closed this May 12, 2025

Successfully merging this pull request may close these issues.

Aggregation production is slow (for BNs serving a high number of vals)