Fix corruption due to lock sharding issues by centralizing locking#5838
Open
ngoldbaum wants to merge 2 commits into
Open
Fix corruption due to lock sharding issues by centralizing locking#5838ngoldbaum wants to merge 2 commits into
ngoldbaum wants to merge 2 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #5836.
I used both Claude and Codex to work on this. I typed this PR description by hand.
Summary
Fixes the specific lock sharding issue described in #5836 as well as other related issues caused by the current locking strategy allowing shared overlapping calls of different kernels. The locks are function-local and so appear as different mutex objects in each separate compilation unit.
I fixed that by moving the locking to its own compilation unit with a new internal helper functions for serializing calls. This also centralizes the platform-specific locking logic into the new file.
This has the net effect of reducing throughput for workloads that make overlapping external calls into different threaded kernels, because those calls are now serialized consistently instead of using per-kernel lock state. I think the existing behavior is a bug rather than an intentional design choice but I wanted to raise that front and center.
It also fixes several other issues, more or less as a consequence of making the above change systematically:
parallel_section_left. Now there's only one version of this variable so there are far fewer OpenMP threads when one mixes kernels.Testing
To verify the correctness of the fix, I added a new multithreaded stress test based on the reproducer in #5836. I also enabled multithreaded stress testing for msys2 on a Windows host to test the Windows threading model.
Additionally there is a thread sanitizer test run. I manually verified that the new tests trigger validation errors and/or TSan race reports. I also verified the new mixed DGEMM stress test (no TSan) fails with incorrect results on an unpatched develop build and passes with this change.
Right now there's only TSan testing for
OPENBLAS_NUM_THREADS=2and the pthreads backend. TSan detected data races withOPENBLAS_NUM_THREADS=4and also with the OpenMP backend. I am intentionally leaving the TSan CI as-is in this PR. We'll need to look at other issues before setting up more thorough TSan CI.