8369048: GenShen: Defer ShenFreeSet::available() during rebuild #27612

kdnilsen · 2025-10-02T17:58:48Z

This code introduces a new rebuild-freeset lock for purposes of coordinating the freeset rebuild activities and queries as to memory available for allocation in the mutator partition.

This addresses a problem that results if available memory is probed while we are rebuilding the freeset.

Rather than using the existing global heap lock to synchronize these activities, a new more narrowly scoped lock is introduced. This allows the available memory to be probed even when other activities hold the global heap lock for reasons other than rebuilding the freeset, such as when they are allocating memory. It is known that the global heap lock is heavily contended for certain workloads, and using this new lock avoids adding to contention for the global heap lock.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8369048: GenShen: Defer ShenFreeSet::available() during rebuild (Enhancement - P4)

Reviewers

Y. Srinivas Ramakrishna (@ysramakrishna - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27612/head:pull/27612
$ git checkout pull/27612

Update a local copy of the PR:
$ git checkout pull/27612
$ git pull https://git.openjdk.org/jdk.git pull/27612/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27612

View PR using the GUI difftool:
$ git pr show -t 27612

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27612.diff

Using Webrev

Link to Webrev Comment

kdnilsen · 2025-10-02T17:59:19Z

Will identify this PR as draft until I complete performance and correctness tests.

bridgekeeper · 2025-10-02T18:00:12Z

👋 Welcome back kdnilsen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-10-02T18:00:18Z

@kdnilsen This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8369048: GenShen: Defer ShenFreeSet::available() during rebuild

Reviewed-by: ysr

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 30 new commits pushed to the master branch:

7c169c9: 8365426: [macos26] Graphics2D tests fail on new macOS 26
bc928c8: 5038439: Warning message for literal shift amounts outside the canonical domain
7aff8e1: 8371319: java.lang.reflect.Method#equals doesn't short-circuit with same instances
... and 27 more: https://git.openjdk.org/jdk/compare/bfc048aba6391d52c07d9a5146466b47d2f6fed8...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2025-10-02T18:01:12Z

@kdnilsen The following labels will be automatically applied to this pull request:

hotspot-gc
shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

kdnilsen · 2025-10-06T17:07:59Z

I have these results from running Extremem tests on commit 99d0175

I am going to try an experiment with a different approach. I will remove the synchronization lock and instead will cause the implementation of freeset rebuild to not update available() until after it is done with its work. I think this may address the same problem with less run-time overhead.

kdnilsen · 2025-10-07T14:53:50Z

On the same workload, here are the results of the experiment (rather than locking to prevent fetch of available during rebuild, we continue to return the value of available at the start of rebuild until rebuild finishes):

General observations are that:

CPU utilization increased for both GenShen and Shen.
Number of completed GCs increased for GenShen but decreased for Shen
Shen degenerated GCs increased
GenShen P50 latency increased, but p95, p99, and p99.9 latencies decreased. Higher latencies all increased for GenShen.
Shen latencies are worse at all percentiles.

Qualitatively, what would we expect? If we return an old value for available() during freeset rebuild, we are usually causing triggering heuristics to believe there is less memory available than is actually available. This may cause us to trigger GC more aggressively. This bears out for GenShen, but not for Shen.

With GenShen, the critical conflict occurs when old marking has completed, and we rebuild the free set following old marking in order to recycle immediate old garbage and to set aside old-collector reserves which will be required for anticipated mixed evacuation GC cycles that will immediately follow. While this is happening, the Shenandoah regulator thread is trying to decide whether it should interrupt old GC in order to perform an "urgent" young GC cycle. And sometimes, the regulator thread's inquiry as to how much memory is available sees a bogus (not just stale, but out of thin air) value because the freeset is under construction at the time of its inquiry. Preventing this bogus value is the point of this PR.

This situation does not generally happen with traditional Shenandoah. Traditional Shenandoah only queries the available() during times when GC is idle. (There are plans to change this, to allow the freeset to be rebuilt more asynchronously, so we are testing this coordination mechanism out for both GenShen and Shen.). A plausible explanation for the observed impact on Shen is that the absence of synchronization allows Shen to see more stale values of available(), even when we are not conflicting with concurrent freeset rebuilds. Specifically, if we gnawing away on available memory, probing available() every ms, the triggering heuristic may see the same value of available() for three consecutive probes. Not recognizing that memory has been consumed, it will delay triggering of the next GC cycle, resulting in fewer concurrent GCs with the "unsynchronized" solution. Besides resulting in fewer GC cycles, the late triggers also allow us to get closer to total depletion of the allocatable memory pool, which explains an increase in Shenandoah degenerated cycles.

Presumably, GenShen is also vulnerable to this possibility. But the benefit of eliminating out-of-thin-air available values for GenShen seems to outweigh the risk of occasional stale values that cause late triggers.

kdnilsen · 2025-10-07T15:00:38Z

For further context, here are CI pipeline performance summaries for the initial synchronized solution:

   Control: openjdk-master-aarch64
Experiment: synchronize-available-with-rebuild-gh-aarch64

Genshen
-------------------------------------------------------------------------------------------------------
+45.80% specjbb2015/trigger_failure p=0.00542
  Control:    365.562   (+/-158.45  )        109
  Test:       533.000   (+/-200.37  )         10

+28.53% scimark.lu.large/concurrent_update_refs_young p=0.00020
  Control:      5.608ms (+/-  1.91ms)         34
  Test:         7.208ms (+/-107.48us)          2

+24.44% specjbb2015/concurrent_update_refs_degen_young p=0.00563
  Control:    804.287ms (+/-330.68ms)         41
  Test:         1.001s  (+/-101.83ms)          8

and for the "unsynchronized" solution:

   Control: openjdk-master-aarch64
Experiment: synchronize-available-with-rebuild-gh-aarch64

Genshen
-------------------------------------------------------------------------------------------------------
+51.82% hyperalloc_a2048_o4096/finish_mark_degen_young p=0.00771
  Control:     82.769ms (+/- 66.46ms)         66
  Test:       125.658ms (+/- 78.91ms)         43

The p values for all of these measures are a bit high, based on limited samples of relevant data. The unsynchronized data result is combined with previous measurements taken from the synchronized experiments.

kdnilsen · 2025-10-07T15:04:42Z

One other somewhat subjective observation is that the synchronized solution experienced many more "timeout" failures on the CI pipeline than the unsynchronized solution. These timeout failures correlate with stress workloads that exercise the JVM in abnormal/extreme ways. Under these stresses, the unsynchronized mechanism seems to be a bit more robust.

kdnilsen · 2025-10-07T15:08:39Z

I'm inclined to prefer the synchronized solution so will revert my most recent three commits.

This reverts commit 717e7da.

…ebuild" This reverts commit 1a5e483.

This reverts commit bec73da.

mlbridge · 2025-10-07T15:24:23Z

Webrevs

earthling-amzn

If I understand correctly, the general issue is that available is not accurate when the freeset is being rebuilt. There are three solutions tested:

Existing code, returns a sentinel value (SIZE_MAX) during freeset rebuilt
Return the last known value of available during the rebuild (this appears to cause more aggressive heuristics)
Block threads that call available during rebuild

As it stands, only the regulator thread (which is evaluating heuristics for genshen) will call available during a free set rebuild (though this may change in the future). With the first solution, it seems we would have the heuristics believe there is much more memory available than there actually is. This would risk the heuristic not triggering when it should?

It makes sense that option 2 would trigger more GCs than option 1, but it seems the risk of triggering too late would be lower here. Option 3 might also delay triggering, but at least the heuristic would base the trigger decision on an accurate accounting of available memory.

If we go with the third option, I think we should move the lock management into the freeset and not have to change existing callers.

void ShenandoahFreeSet::prepare_to_rebuild(...) {
  _lock.lock();
  // do preparation
  // ...
}

void ShenandoahFreeSet::finish_rebuild(...) {
  // finish rebuild
  // ...
  _lock.unlock();
}

Could we also now remove the sentinel value (FreeSetUnderConstruction)?

ysramakrishna

Left a few comments. (In particular of what looks like a deadlock possibility in debug builds.)

ysramakrishna · 2025-10-31T00:23:18Z

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp

  ShenandoahHeap* const _heap;
  ShenandoahRegionPartitions _partitions;

+  // This locks the rebuild process (in combination with the global heap lock)


Explain the role of this & the global heap lock vis-a-vis the rebuild process.

Also may be call it _rebuild_lock, rather than just _lock.

Am changing the name. I will add discussion of the rank ordering of locks here as well.

ysramakrishna · 2025-10-31T00:23:53Z

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp

  ShenandoahFreeSet(ShenandoahHeap* heap, size_t max_regions);

+
+  ShenandoahRebuildLock* lock() {


rebuild_lock() instead?

Good suggestion. Making this change.

ysramakrishna · 2025-10-31T00:30:10Z

src/hotspot/share/gc/shenandoah/shenandoahFullGC.cpp

-
+    size_t young_cset_regions, old_cset_regions, first_old, last_old, num_old;
+    ShenandoahFreeSet* free_set = heap->free_set();
+    ShenandoahRebuildLocker rebuild_locker(free_set->lock());


Should you not create a scope around lines 1158 to line 1167, since you don't want to hold the rebuild lock as soon as the rebuild is done (i.e. immediately following finish_rebuild())?

May be it doesn't matter, since no one else is running during a full gc who needs to query available()?

I'll tighten up the context for the rebuild lock. I was thinking that set_mark_incomplete() and clear_cancelled_gc() would be "fast enough" that it wouldn't matter to hold the rebuild_lock this much longer, but I agree it is better to release the lock as soon as possible.

ysramakrishna · 2025-10-31T00:44:11Z

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp

  // Return available_in assuming caller does not hold the heap lock.  In production builds, available is
  // returned without acquiring the lock.  In debug builds, the global heap lock is acquired in order to
  // enforce a consistency assert.


Can the comment be simplified to:

// Return bytes `available` in the given `partition` // while holding the `rebuild_lock`.

Don't say anything about the heap lock in the API comment. Rather, in the part that is ifdef ASSERT where you take the heap lock (line ~244), say:

// Acquire the heap lock to get a consistent // snapshot to check assert.

As I write this, I realize that in the most general case where two threads may call these API's independently in a fastdebug build, you could theoretically get into a deadlock because they attempted to acquire locks in different orders (this possibility exists -- statically -- only in the fastdebug builds).

The general MuteLocker machinery has ranked mutexes to avoid such situations through static ranking and checks while acquiring locks (in debug builds as a way of potentially catching such situations and flagging them).

With such ranking though this code would assert because the locks are acquired in different order between here and elsewhere.

In product builds you are fine because the rebuild lock acts as a "leaf lock" (in hotspot parlance). But there seems to be a definite possibility of deadlock in debug builds if/when the rebuild is attempted by one thread while another checks available and attempts to acquire the heap lock to check the assertion. You could solve it by acquiring the heap lock before calling the work method where the assertion check is done.

However, I'd be much more comfortable if we used some form of lock rank framework, unless it was utterly impossible to do so for some reason. (Here it was easy to spot the lock order inversion because it was in the code. Of course, if a debug build deadlocked you would also figure out the same, but having lock ordering gives you a quick and easy way to verify if there's potential for trouble.)

Not sure of the history of ShenandoahLock or why the parallel infra to MutexLocker was introduced (perhaps for allowing some performance/tunability), but might be worthwhile to see if we want to build lock rank checks in for robustness/maintainability.

I'm coming back to this PR after working on others. Thanks for your comments.

This is a good catch. I know better than to do that! Sorry.

My intention was to rank-order the locks. Whenever multiple locks are held, it should be in this order:
first acquire the global heap lock
In nested context, acquire the rebuild_lock

Any thread that only acquires the global heap lock or only acquires the rebuild_lock will not deadlock.

Multiple threads that acquire both locks will not deadlock because they acquire in the same order.

The code you identified was definitely a problem because we were acquiring the two lock in the wrong order. I'm going to remove that assert and the lock associated with it.

I've updated this comment to clarify the refined intent.

openjdk · 2025-11-12T00:55:29Z

@kdnilsen this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout synchronize-available-with-rebuild
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

…with-rebuild

ysramakrishna · 2025-11-15T00:19:11Z

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp

-  inline size_t available() const { return _partitions.available_in_not_locked(ShenandoahFreeSetPartitionId::Mutator); }
+  inline size_t available() {
+    shenandoah_assert_not_heaplocked();
+    ShenandoahRebuildLocker locker(rebuild_lock());


May be motivate in a brief comment why we need the rebuild lock in this API, but not around the other APIs such as capacity() and used()?

ysramakrishna

This looks good to me. Curious if any performance delta was noted in fresh measurements following this final shape of fix.

Add support for freeset rebuild lock

56291b4

kdnilsen marked this pull request as draft October 2, 2025 17:58

kdnilsen changed the title ~~8369048: GenShen: Defer ShenFreeSet::available() during rebuildAdd support for freeset rebuild lock~~ 8369048: GenShen: Defer ShenFreeSet::available() during rebuild Oct 2, 2025

openjdk bot added hotspot-gc [email protected] shenandoah [email protected] labels Oct 2, 2025

kdnilsen added 3 commits October 2, 2025 18:06

fix whitespace

0944fa8

Treat traditional Shenandoah same as Generational Shenandoah

623b0f0

reduce gratuitous diffs from tip

99d0175

kdnilsen added 3 commits October 6, 2025 21:33

revert introduction of RebuildLock

bec73da

available() returns previous value if called during freeset rebuild

1a5e483

remove obsolete assertion

717e7da

kdnilsen added 3 commits October 7, 2025 15:14

Revert "remove obsolete assertion"

78739b5

This reverts commit 717e7da.

Revert "available() returns previous value if called during freeset r…

d9a8a51

…ebuild" This reverts commit 1a5e483.

Revert "revert introduction of RebuildLock"

a689839

This reverts commit bec73da.

kdnilsen marked this pull request as ready for review October 7, 2025 15:19

openjdk bot added the rfr Pull request is ready for review label Oct 7, 2025

earthling-amzn suggested changes Oct 17, 2025

View reviewed changes

ysramakrishna reviewed Oct 31, 2025

View reviewed changes

kdnilsen added 2 commits November 11, 2025 22:06

Remove ShenandoahFreeSet::FreeSetUnderConstruction sentinel value

9fa1e57

Tighten up context for holding rebuild_lock

36a71c0

kdnilsen added 4 commits November 11, 2025 22:17

Rename rebuild_lock()

3ff8f80

Hide rebuild_lock inside prepare_to_rebuild and finish_rebuild

9eb270f

Add documentation for _rebuild_lock

26d180a

update comment

091e23b

openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Nov 12, 2025

Merge remote-tracking branch 'jdk/master' into synchronize-available-…

8462a29

…with-rebuild

openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Nov 13, 2025

ysramakrishna reviewed Nov 15, 2025

View reviewed changes

ysramakrishna approved these changes Nov 15, 2025

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Nov 15, 2025

		ShenandoahFreeSet(ShenandoahHeap* heap, size_t max_regions);


		ShenandoahRebuildLock* lock() {

8369048: GenShen: Defer ShenFreeSet::available() during rebuild #27612

Are you sure you want to change the base?

8369048: GenShen: Defer ShenFreeSet::available() during rebuild #27612

Uh oh!

Conversation

kdnilsen commented Oct 2, 2025 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Reviewers

Reviewing

Uh oh!

kdnilsen commented Oct 2, 2025

Uh oh!

bridgekeeper bot commented Oct 2, 2025

Uh oh!

openjdk bot commented Oct 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openjdk bot commented Oct 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kdnilsen commented Oct 6, 2025

Uh oh!

kdnilsen commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kdnilsen commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kdnilsen commented Oct 7, 2025

Uh oh!

kdnilsen commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mlbridge bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

earthling-amzn left a comment

Choose a reason for hiding this comment

Uh oh!

ysramakrishna left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openjdk bot commented Nov 12, 2025

Uh oh!

ysramakrishna Nov 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ysramakrishna left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

kdnilsen commented Oct 2, 2025 •

edited by openjdk bot

Loading

openjdk bot commented Oct 2, 2025 •

edited

Loading

openjdk bot commented Oct 2, 2025 •

edited

Loading

kdnilsen commented Oct 7, 2025 •

edited

Loading

kdnilsen commented Oct 7, 2025 •

edited

Loading

kdnilsen commented Oct 7, 2025 •

edited

Loading

mlbridge bot commented Oct 7, 2025 •

edited

Loading

ysramakrishna Nov 15, 2025 •

edited

Loading