
[stdlib] Optimize compile times by preventing huge loop unrolling when filling inline arrays. #4046

Open

msaelices wants to merge 6 commits into main from inlinearray-compiler-slowness
Conversation

msaelices (Contributor):

The InlineArray(fill=...) constructor initializes all the pointee items in the array with a compile-time loop.

This makes the compiler really slow when the size of the inline array is greater than ~2k, and it even emits a warning when the size exceeds 64k:

[screenshot: compiler warning emitted for the >64k case]

IMO, it isn't useful to emit 64k init_pointee_copy calls when we can unroll in batches, keeping the compiler fast while the actual runtime stays just as fast.
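
For reference, the problematic pattern is roughly the following (a simplified sketch, not the stdlib's literal code; `ptr` stands in for the array's element pointer):

```mojo
# One unrolled iteration per element: the compiler materializes
# `size` separate init_pointee_copy calls at compile time.
@parameter
for i in range(size):
    (ptr + i).init_pointee_copy(fill)
```

With size = 65536 that means 64k emitted copies, which is what the batching in this PR avoids.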

owenhilyard (Contributor):

Could we make unroll_threshold into a parameter that defaults to something a bit more reasonable, like 64? That way people can customize it, since unrolling 1000 iterations is already icache pollution.

         @parameter
-        for i in range(size):
+        for i in range(unrolled):
owenhilyard (Contributor):

We probably want to unroll the inner loop, not the outer one.
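
That is, roughly this shape (a hedged sketch of the suggested structure, not the PR's exact diff; `ptr` again stands in for the element pointer):

```mojo
# The outer loop over batches runs at runtime...
for batch in range(unrolled):
    # ...and only the batch-sized body is unrolled at compile time,
    # so the emitted code stays bounded by unroll_threshold.
    @parameter
    for i in range(unroll_threshold):
        (ptr + batch * unroll_threshold + i).init_pointee_copy(fill)
# (A remainder loop for the trailing size % unroll_threshold
# elements is omitted here.)
```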


@@ -140,8 +140,21 @@ struct InlineArray[
         _inline_array_construction_checks[size]()
         __mlir_op.`lit.ownership.mark_initialized`(__get_mvalue_as_litref(self))

+        alias unroll_threshold = 1000
+        alias unrolled = size // unroll_threshold
owenhilyard (Contributor):

math.align_down makes intent more clear.
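
For example (a sketch of that suggestion; assuming `math.align_down(value, alignment)` rounds `value` down to the nearest multiple of `alignment`):

```mojo
from math import align_down

alias aligned_size = align_down(size, unroll_threshold)
# Batches cover [0, aligned_size); the tail [aligned_size, size)
# still needs a small remainder loop.
```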


msaelices force-pushed the inlinearray-compiler-slowness branch from 57c599e to edadb98 on March 3, 2025 at 20:28
msaelices (Contributor, Author):

> Could we make unroll_threshold into a parameter that defaults to something a bit more reasonable, like 64? That way people can customize it, since unrolling 1000 iterations is already icache pollution.

Done: msaelices@edadb98

msaelices requested a review from owenhilyard on March 3, 2025 at 20:29
msaelices (Contributor, Author):

@owenhilyard Thanks for the review. Could you please take another look?

@@ -131,7 +132,7 @@ struct InlineArray[

     @always_inline
     @implicit
-    fn __init__(out self, fill: Self.ElementType):
+    fn __init__[batch_size: Int = 100](out self, fill: Self.ElementType):
owenhilyard (Contributor):

I think this should be a power of 2 in order to keep the chunks aligned well. 32 or 64 is good enough to not bloat the icache but still be fast, especially since processing the loop variable will happen in parallel with the vector ops on any superscalar processor, which is most of them at this point.

owenhilyard (Contributor):

If it's an inline array of bytes, then 100 will need to do some really odd things with instructions, especially on AVX512 since it will move 64, then 32, then 4.
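
Concretely, assuming one-byte elements on an AVX512 target: a 100-element batch lowers to a 64-byte store, then a 32-byte store, then a 4-byte store (64 + 32 + 4 = 100), whereas a power-of-two batch such as 64 or 128 maps onto whole full-width stores.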


msaelices requested a review from owenhilyard on March 3, 2025 at 21:48