@niaow
Member

@niaow niaow commented Nov 30, 2025

Instead of looping over each block, we can use bit hacks to operate on an entire state byte. This deinterleaves the state bits in order to enable these tricks.
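To make the layout change concrete, here is a minimal sketch. It assumes the old layout packed each block's two state bits side by side (block `i` at bits `2i` and `2i+1`), and that the new layout puts one bit per block in the low nibble and the other in the high nibble. The helper name `deinterleave` is hypothetical and is not part of this PR; it only illustrates how the same four block states map between the two layouts.

```go
package main

import "fmt"

// deinterleave converts a state byte from the assumed old layout
// (block i in bits 2i..2i+1) to the assumed new layout
// (low state bit of block i in bit i, high state bit in bit i+4).
// Hypothetical helper for illustration only.
func deinterleave(old uint8) uint8 {
	var out uint8
	for i := uint(0); i < 4; i++ {
		pair := (old >> (2 * i)) & 0b11
		out |= (pair & 1) << i        // low state bit goes to the low nibble
		out |= (pair >> 1) << (i + 4) // high state bit goes to the high nibble
	}
	return out
}

func main() {
	// Blocks 0..3: head (01), tail (10), mark (11), free (00).
	old := uint8(0b00_11_10_01)
	fmt.Printf("%08b -> %08b\n", old, deinterleave(old))
	// prints "00111001 -> 01100101"
}
```

With the bits grouped this way, a single shift by four lines up every block's high bit with its low bit, so one AND or AND-NOT classifies all four blocks at once.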

Performance in the problematic go/format benchmark:

                    │ conservative.txt │    conservative-branchless.txt     │              boehm.txt              │
                    │      sec/op      │   sec/op     vs base               │   sec/op     vs base                │
Format/array1-10000        30.46m ± 2%   28.93m ± 2%  -5.01% (p=0.004 n=20)   22.13m ± 5%  -27.35% (p=0.000 n=20)

                    │ conservative.txt │     conservative-branchless.txt     │              boehm.txt               │
                    │       B/s        │     B/s       vs base               │     B/s       vs base                │
Format/array1-10000       2.027Mi ± 2%   2.136Mi ± 3%  +5.41% (p=0.004 n=20)   2.789Mi ± 5%  +37.65% (p=0.000 n=20)

                    │ conservative.txt │  conservative-branchless.txt   │              boehm.txt               │
                    │       B/op       │     B/op      vs base          │     B/op      vs base                │
Format/array1-10000       4.663Mi ± 0%   4.663Mi ± 0%  ~ (p=1.000 n=20)   6.979Mi ± 0%  +49.68% (p=0.000 n=20)

                    │ conservative.txt │  conservative-branchless.txt  │             boehm.txt              │
                    │    allocs/op     │  allocs/op   vs base          │ allocs/op  vs base                 │
Format/array1-10000        204.3k ± 0%   204.3k ± 0%  ~ (p=1.000 n=20)   0.0k ± 0%  -100.00% (p=0.000 n=20)

@niaow
Member Author

niaow commented Nov 30, 2025

We could probably squeeze more performance out of this by making the state masks bigger, but we would need popcnt on the target machine for that to really work.

@deadprogram
Member

Please resolve merge conflicts now @niaow since #5102 was merged.

@niaow
Member Author

niaow commented Dec 5, 2025

The AVR tests are broken due to an unrelated issue that I will need to fix in a separate PR first (#5111).

@deadprogram
Member

@niaow can you please rebase this branch now?

@deadprogram deadprogram requested a review from dgryski December 6, 2025 08:13
@dgryski
Member

dgryski commented Dec 9, 2025

I'll try to review this today.

@deadprogram
Member

This PR is once again ready for rebase @niaow

@niaow
Member Author

niaow commented Dec 10, 2025

I am going to move the counters in here now that #5105 is merged. I was going to do that in a separate PR but I think it is simpler to just do it here.

@niaow
Member Author

niaow commented Dec 10, 2025

Finished moving the counters.
Updated perf numbers:

                    │ conservative.txt │        conservative-new.txt         │              boehm.txt              │
                    │      sec/op      │   sec/op     vs base                │   sec/op     vs base                │
Format/array1-10000        23.79m ± 1%   21.36m ± 3%  -10.21% (p=0.000 n=20)   20.80m ± 2%  -12.57% (p=0.000 n=20)

                    │ conservative.txt │         conservative-new.txt         │              boehm.txt               │
                    │       B/s        │     B/s       vs base                │     B/s       vs base                │
Format/array1-10000       2.594Mi ± 1%   2.890Mi ± 4%  +11.40% (p=0.000 n=20)   2.971Mi ± 2%  +14.52% (p=0.000 n=20)

@deadprogram
Member

This is looking good!

@dgryski waiting on your feedback...

Copilot AI left a comment

Pull request overview

This PR optimizes the garbage collector's sweep phase by replacing the branching loop-based implementation with a branchless bit manipulation algorithm. The key innovation is deinterleaving the 2-bit block state representation (previously stored as sequential 2-bit values) into separate low and high nibbles within each byte. This enables efficient parallel processing of 4 blocks at once using bitwise operations, eliminating branches and improving performance by ~5% in the problematic go/format benchmark.

Key Changes:

  • Deinterleaved block state bit layout to enable branchless algorithms (low nibble for one bit per block, high nibble for another)
  • Rewrote sweep() function using bit manipulation to process entire state bytes at once without branches
  • Refactored ReadMemStats() to calculate statistics on-demand by counting live blocks instead of maintaining global counters
  • Removed helper functions (markFree, unmark) and global counters (gcTotalBlocks, gcFrees, gcFreedBlocks) that are no longer needed

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File | Description
src/runtime/gc_blocks.go | Core changes: deinterleaved block state bit layout, branchless sweep implementation, on-demand stats calculation in ReadMemStats, removed obsolete counters and helper functions
builder/sizes_test.go | Updated expected binary sizes for three microcontroller targets reflecting code size reductions from the optimization (96-144 bytes saved per target)

Member

@dgryski dgryski left a comment

buildFreeRanges LGTM. I just would like some more comments/tests for the bit manipulation pieces in sweep().

// Separate blocks by type.
high := stateByte >> blocksPerStateByte
markedHeads := stateByte & high
unmarkedHeads := (stateByte & blockStateEach) &^ high

I would love comments explaining this with an example. (Indeed, I needed to double-check which of high or low had the head mark...)

Also, I feel like the & blockStateEach should be semantically after the stateByte &^ high, but the parens put it first.

// Adding 1 to a run of bits will clear the run.
// Add 1 to the next bit in the tails mask after a freed head to clear the corresponding tails.
// Carry the overflow between state bytes.
tailClear := tails + (unmarkedHeads << 1) + carry

Example here please. Maybe put some of these bit manips into functions which could have their own unit tests? (Yeah, I'm still having trouble convincing myself this is correct in all cases.)

Instead of looping over each block, we can use bit hacks to operate on an entire state byte.
I deinterleaved the state bits in order to enable these tricks.

Sweep used to count free/freed allocations/blocks.
I managed to move/remove all of these counters:
- The free space is now calculated in buildFreeRanges by adding the range lengths.
- ReadMemStats counts freed objects by subtracting live objects from allocated objects.
- gcFreedBlocks was never necessary because MemStats.HeapAlloc is the same as MemStats.HeapInUse.
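The counter changes above can be sketched as follows. This is a simplified illustration rather than the runtime's implementation: the `freeRange` type, the function names, and the assumption that each live object has exactly one head bit in the low nibble of a state byte are all hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// headMask selects the assumed head bits: one per block, in the low nibble.
const headMask = 0x0F

// liveObjects counts objects by counting head bits in every state byte.
func liveObjects(states []uint8) uint64 {
	var n uint64
	for _, s := range states {
		n += uint64(bits.OnesCount8(s & headMask))
	}
	return n
}

// freeRange is a hypothetical stand-in for a run of free blocks.
type freeRange struct {
	start, length uintptr
}

// totalFree sums range lengths, so no separate free-block counter is needed.
func totalFree(ranges []freeRange) uintptr {
	var sum uintptr
	for _, r := range ranges {
		sum += r.length
	}
	return sum
}

func main() {
	states := []uint8{0x84, 0x21} // one head in each byte
	mallocs := uint64(5)          // objects allocated so far (example value)
	live := liveObjects(states)
	fmt.Println("frees:", mallocs-live) // prints "frees: 3"

	ranges := []freeRange{{start: 0, length: 2}, {start: 5, length: 3}}
	fmt.Println("free blocks:", totalFree(ranges)) // prints "free blocks: 5"
}
```

Deriving these values on demand moves the cost out of the sweep loop and into the calls that actually need the statistics.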
@niaow
Member Author

niaow commented Dec 10, 2025

I tried to make this a bit clearer, see if this looks okay.

Member

@dgryski dgryski left a comment

LGTM. Thanks for the expanded comments!

@niaow niaow merged commit d01d0bb into tinygo-org:dev Dec 10, 2025
19 checks passed