-
Notifications
You must be signed in to change notification settings - Fork 999
runtime (gc_blocks.go): make sweep branchless #5104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
We could probably squeeze more performance out of this by making the state masks bigger, but we would need popcnt on the target machine for that to really work. |
04360ab to
3f822df
Compare
|
The AVR tests are broken due to an unrelated issue I will need to fix in a separate PR first #5111 |
|
@niaow can you please rebase this branch now? |
3f822df to
0deb0d2
Compare
|
I'll try to review this today. |
|
This PR is once again ready for rebase @niaow |
0deb0d2 to
9ce056c
Compare
|
I am going to move the counters in here now that #5105 is merged. I was going to do that in a separate PR but I think it is simpler to just do it here. |
9ce056c to
3ddb44f
Compare
|
Finished moving the counters. |
This is looking good! @dgryski waiting on your feedback... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the garbage collector's sweep phase by replacing the branching loop-based implementation with a branchless bit manipulation algorithm. The key innovation is deinterleaving the 2-bit block state representation (previously stored as sequential 2-bit values) into separate low and high nibbles within each byte. This enables efficient parallel processing of 4 blocks at once using bitwise operations, eliminating branches and improving performance by ~5% in the problematic go/format benchmark.
Key Changes:
- Deinterleaved block state bit layout to enable branchless algorithms (low nibble for one bit per block, high nibble for another)
- Rewrote
sweep()function using bit manipulation to process entire state bytes at once without branches - Refactored
ReadMemStats()to calculate statistics on-demand by counting live blocks instead of maintaining global counters - Removed helper functions (
markFree,unmark) and global counters (gcTotalBlocks,gcFrees,gcFreedBlocks) that are no longer needed
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/runtime/gc_blocks.go | Core changes: deinterleaved block state bit layout, branchless sweep implementation, on-demand stats calculation in ReadMemStats, removed obsolete counters and helper functions |
| builder/sizes_test.go | Updated expected binary sizes for three microcontroller targets reflecting code size reductions from the optimization (96-144 bytes saved per target) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
3ddb44f to
849f309
Compare
dgryski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buildFreeRanges LGTM. I just would like some more comments/tests for the bit manipulation pieces in sweep().
src/runtime/gc_blocks.go
Outdated
| // Seperate blocks by type. | ||
| high := stateByte >> blocksPerStateByte | ||
| markedHeads := stateByte & high | ||
| unmarkedHeads := (stateByte & blockStateEach) &^ high |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love comments explaining this with an example. (Indeed, needed to double-check which of high or low had the head mark...)
Also, I feel like the & blockStateEach should be semantically after the stateByte &^ high, but the parens put it first.
| // Adding 1 to a run of bits will clear the run. | ||
| // Add 1 to the next bit in the tails mask after a freed head to clear the corresponding tails. | ||
| // Carry the overflow between state bytes. | ||
| tailClear := tails + (unmarkedHeads << 1) + carry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example here please. Maybe put some of these bit manips into functions which could have their own unit tests? (Yeah, I'm still having trouble convincing myself this is correct in all cases.)
Instead of looping over each block, we can use bit hacks to operate on an entire state byte. I deinterleaved the state bits in order to enable these tricks. Sweep used to count free/freed allocations/blocks. I managed to move/remove all of these counters: - The free space is now calculated in buildFreeRanges by adding the range lengths. - ReadMemStats counts freed objects by subtracting live objects from allocated objects. - gcFreedBlocks was never necessary because MemStats.HeapAlloc is the same as MemStats.HeapInUse.
849f309 to
bfb5b2b
Compare
|
I tried to make this a bit clearer, see if this looks okay. |
dgryski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the expanded comments!
Instead of looping over each block, we can use bit hacks to operate on an entire state byte. This deinterleaves the state bits in order to enable these tricks.
Performance in the problematic
go/formatbenchmark: