runtime (gc_blocks.go): make sweep branchless #5104

niaow · 2025-11-30T20:08:05Z

Instead of looping over each block, we can use bit hacks to operate on an entire state byte. This deinterleaves the state bits in order to enable these tricks.
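To make the layout change concrete, here is a minimal sketch of what a deinterleaved state byte could look like. It assumes the usual 2-bit encoding (free = 00, head = 01, tail = 10, mark = 11), with the low bit of each of the four blocks stored in the low nibble and the high bit in the high nibble. The names `getState`, `setState`, and the `state*` constants are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// Hypothetical names; the 2-bit values are an assumption about the encoding.
const (
	blocksPerStateByte = 4

	stateFree uint8 = 0b00
	stateHead uint8 = 0b01
	stateTail uint8 = 0b10
	stateMark uint8 = 0b11
)

// getState reads the 2-bit state of block i (0..3) from a deinterleaved byte:
// bit i holds the low state bit, bit i+4 holds the high state bit.
func getState(stateByte uint8, i uint) uint8 {
	low := (stateByte >> i) & 1
	high := (stateByte >> (i + blocksPerStateByte)) & 1
	return high<<1 | low
}

// setState returns stateByte with block i (0..3) set to state (0..3).
func setState(stateByte uint8, i uint, state uint8) uint8 {
	stateByte &^= uint8(0x11) << i                         // clear both bits of block i
	stateByte |= (state & 1) << i                          // low bit goes to the low nibble
	stateByte |= (state >> 1) << (i + blocksPerStateByte) // high bit goes to the high nibble
	return stateByte
}

func main() {
	var b uint8
	b = setState(b, 0, stateHead)
	b = setState(b, 1, stateTail)
	b = setState(b, 3, stateMark)
	fmt.Printf("%08b\n", b) // prints 10101001
	for i := uint(0); i < blocksPerStateByte; i++ {
		fmt.Println(i, getState(b, i))
	}
}
```

With this layout, one shift by `blocksPerStateByte` lines up the high bits of all four blocks with their low bits, which is what allows a whole byte to be handled with a few bitwise operations.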

Performance in the problematic go/format benchmark:

                    │ conservative.txt │    conservative-branchless.txt     │              boehm.txt              │
                    │      sec/op      │   sec/op     vs base               │   sec/op     vs base                │
Format/array1-10000        30.46m ± 2%   28.93m ± 2%  -5.01% (p=0.004 n=20)   22.13m ± 5%  -27.35% (p=0.000 n=20)

                    │ conservative.txt │     conservative-branchless.txt     │              boehm.txt               │
                    │       B/s        │     B/s       vs base               │     B/s       vs base                │
Format/array1-10000       2.027Mi ± 2%   2.136Mi ± 3%  +5.41% (p=0.004 n=20)   2.789Mi ± 5%  +37.65% (p=0.000 n=20)

                    │ conservative.txt │  conservative-branchless.txt   │              boehm.txt               │
                    │       B/op       │     B/op      vs base          │     B/op      vs base                │
Format/array1-10000       4.663Mi ± 0%   4.663Mi ± 0%  ~ (p=1.000 n=20)   6.979Mi ± 0%  +49.68% (p=0.000 n=20)

                    │ conservative.txt │  conservative-branchless.txt  │             boehm.txt              │
                    │    allocs/op     │  allocs/op   vs base          │ allocs/op  vs base                 │
Format/array1-10000        204.3k ± 0%   204.3k ± 0%  ~ (p=1.000 n=20)   0.0k ± 0%  -100.00% (p=0.000 n=20)

niaow · 2025-11-30T20:09:49Z

We could probably squeeze more performance out of this by making the state masks bigger, but we would need popcnt on the target machine for that to really work.

deadprogram · 2025-12-05T10:23:08Z

Please resolve merge conflicts now @niaow since #5102 was merged.

niaow · 2025-12-05T17:30:01Z

The AVR tests are broken due to an unrelated issue, which I will need to fix first in a separate PR (#5111).

deadprogram · 2025-12-05T20:05:42Z

@niaow can you please rebase this branch now?

dgryski · 2025-12-09T22:31:39Z

I'll try to review this today.

deadprogram · 2025-12-10T07:49:23Z

This PR is once again ready for rebase @niaow

niaow · 2025-12-10T17:01:00Z

I am going to move the counters in here now that #5105 is merged. I was going to do that in a separate PR but I think it is simpler to just do it here.

niaow · 2025-12-10T18:48:02Z

Finished moving the counters.
Updated perf numbers:

                    │ conservative.txt │        conservative-new.txt         │              boehm.txt              │
                    │      sec/op      │   sec/op     vs base                │   sec/op     vs base                │
Format/array1-10000        23.79m ± 1%   21.36m ± 3%  -10.21% (p=0.000 n=20)   20.80m ± 2%  -12.57% (p=0.000 n=20)

                    │ conservative.txt │         conservative-new.txt         │              boehm.txt               │
                    │       B/s        │     B/s       vs base                │     B/s       vs base                │
Format/array1-10000       2.594Mi ± 1%   2.890Mi ± 4%  +11.40% (p=0.000 n=20)   2.971Mi ± 2%  +14.52% (p=0.000 n=20)

deadprogram · 2025-12-10T19:21:53Z

This is looking good!

@dgryski waiting on your feedback...

Copilot

Pull request overview

This PR optimizes the garbage collector's sweep phase by replacing the branching loop-based implementation with a branchless bit manipulation algorithm. The key innovation is deinterleaving the 2-bit block state representation (previously stored as sequential 2-bit values) into separate low and high nibbles within each byte. This enables efficient parallel processing of 4 blocks at once using bitwise operations, eliminating branches and improving performance by ~5% in the problematic go/format benchmark.

Key Changes:

- Deinterleaved block state bit layout to enable branchless algorithms (low nibble for one bit per block, high nibble for another)
- Rewrote sweep() function using bit manipulation to process entire state bytes at once without branches
- Refactored ReadMemStats() to calculate statistics on-demand by counting live blocks instead of maintaining global counters
- Removed helper functions (markFree, unmark) and global counters (gcTotalBlocks, gcFrees, gcFreedBlocks) that are no longer needed

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
src/runtime/gc_blocks.go	Core changes: deinterleaved block state bit layout, branchless sweep implementation, on-demand stats calculation in ReadMemStats, removed obsolete counters and helper functions
builder/sizes_test.go	Updated expected binary sizes for three microcontroller targets reflecting code size reductions from the optimization (96-144 bytes saved per target)

src/runtime/gc_blocks.go

dgryski

buildFreeRanges LGTM. I just would like some more comments/tests for the bit manipulation pieces in sweep().

dgryski · 2025-12-10T19:52:07Z

src/runtime/gc_blocks.go

+		// Separate blocks by type.
+		high := stateByte >> blocksPerStateByte
+		markedHeads := stateByte & high
+		unmarkedHeads := (stateByte & blockStateEach) &^ high


I would love comments explaining this with an example. (Indeed, needed to double-check which of high or low had the head mark...)

Also, I feel like the & blockStateEach should be semantically after the stateByte &^ high, but the parens put it first.

dgryski · 2025-12-10T20:01:30Z

src/runtime/gc_blocks.go

+		// Adding 1 to a run of bits will clear the run.
+		// Add 1 to the next bit in the tails mask after a freed head to clear the corresponding tails.
+		// Carry the overflow between state bytes.
+		tailClear := tails + (unmarkedHeads << 1) + carry


Example here please. Maybe put some of these bit manips into functions which could have their own unit tests? (Yeah, I'm still having trouble convincing myself this is correct in all cases.)
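As an illustration of the carry trick, the sketch below wraps the addition in a small testable function. It is an assumption about how the result is consumed: `clearFreedTails` is a hypothetical helper, and the `tails & tailClear` step and the carry extraction are guesses at the surrounding code rather than lines from the PR. Masks are 4 bits wide, with bit i describing block i.

```go
package main

import "fmt"

const blocksPerStateByte = 4

// clearFreedTails returns the tails that survive the sweep and the carry into
// the next state byte. Adding 1 at the bit just above a freed head ripples
// through the run of tail bits that follows it, zeroing the run and leaving a
// single 1 just past its end. That stray 1 always lands on a non-tail
// position, so AND-ing with the original tails mask removes it.
func clearFreedTails(tails, unmarkedHeads, carry uint8) (kept, carryOut uint8) {
	tailClear := tails + (unmarkedHeads << 1) + carry
	kept = tails & tailClear                     // tails of marked heads are untouched
	carryOut = tailClear >> blocksPerStateByte // run (or head in block 3) continues in the next byte
	return kept, carryOut
}

func main() {
	// Byte 0, blocks 0..3: marked head, tail, unmarked head, tail.
	kept, carry := clearFreedTails(0b1010, 0b0100, 0)
	fmt.Printf("kept=%04b carry=%d\n", kept, carry) // prints kept=0010 carry=1

	// Byte 1, blocks 0..3: tail, tail, free, free (rest of the freed object).
	kept, carry = clearFreedTails(0b0011, 0b0000, carry)
	fmt.Printf("kept=%04b carry=%d\n", kept, carry) // prints kept=0000 carry=0
}
```

A function of this shape can be checked exhaustively against a simple per-block loop, which is one way to gain confidence that the trick holds in all cases.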

Instead of looping over each block, we can use bit hacks to operate on an entire state byte. I deinterleaved the state bits in order to enable these tricks.

Sweep used to count free/freed allocations/blocks. I managed to move/remove all of these counters:

- The free space is now calculated in buildFreeRanges by adding the range lengths.
- ReadMemStats counts freed objects by subtracting live objects from allocated objects.
- gcFreedBlocks was never necessary because MemStats.HeapAlloc is the same as MemStats.HeapInUse.
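The subtraction approach for freed objects can be sketched as follows. This is a simplified illustration, not the PR's `ReadMemStats`: it assumes every live object has exactly one head block, that head and mark states both have their low bit set in the low nibble, and that a running total of allocations is available. `countLiveObjects`, `freedObjects`, and `totalMallocs` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/bits"
)

const blockStateEach uint8 = 0x0F // assumed: low state bit of each block

// countLiveObjects counts head blocks (marked or unmarked) across all state
// bytes. Each live object contributes exactly one set bit in a low nibble.
func countLiveObjects(stateBytes []uint8) uint64 {
	var live uint64
	for _, b := range stateBytes {
		live += uint64(bits.OnesCount8(b & blockStateEach))
	}
	return live
}

// freedObjects derives the number of frees on demand instead of keeping a
// counter that sweep has to update.
func freedObjects(totalMallocs uint64, stateBytes []uint8) uint64 {
	return totalMallocs - countLiveObjects(stateBytes)
}

func main() {
	heap := []uint8{0b1010_1001, 0b0000_0001} // 2 heads in the first byte, 1 in the second
	fmt.Println(countLiveObjects(heap)) // prints 3
	fmt.Println(freedObjects(10, heap)) // prints 7
}
```

The trade-off is that the statistics call does a pass over the state bytes, while sweep itself no longer has to maintain any counters.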

niaow · 2025-12-10T22:51:37Z

I tried to make this a bit clearer, see if this looks okay.

dgryski

LGTM. Thanks for the expanded comments!

niaow mentioned this pull request Dec 2, 2025

runtime (gc_blocks.go): use best-fit allocation #5105

Merged

niaow force-pushed the branchless-sweep branch from 04360ab to 3f822df Compare December 5, 2025 17:29

niaow force-pushed the branchless-sweep branch from 3f822df to 0deb0d2 Compare December 5, 2025 20:27

deadprogram requested a review from dgryski December 6, 2025 08:13

niaow force-pushed the branchless-sweep branch from 0deb0d2 to 9ce056c Compare December 10, 2025 08:06

niaow force-pushed the branchless-sweep branch from 9ce056c to 3ddb44f Compare December 10, 2025 18:28

dkegel-fastly requested a review from Copilot December 10, 2025 19:23

Copilot started reviewing on behalf of dkegel-fastly December 10, 2025 19:23

Copilot AI reviewed Dec 10, 2025

niaow force-pushed the branchless-sweep branch from 3ddb44f to 849f309 Compare December 10, 2025 19:53

dgryski requested changes Dec 10, 2025

niaow force-pushed the branchless-sweep branch from 849f309 to bfb5b2b Compare December 10, 2025 22:51

dgryski approved these changes Dec 10, 2025

niaow merged commit d01d0bb into tinygo-org:dev Dec 10, 2025
19 checks passed
