[BOLT] Add sanity check for frozen llvm-bolt #487
Some patches can cause the llvm-bolt binary to hang, which stalls or fails the test pipeline.
Add a simple sanity check that runs:
```
llvm-bolt --version
```
with a 30-second timeout. If the command does not complete in time, flunk the build.
Also set maxTime for nfc-check-validation and reduce the number of lit workers for in-tree tests.
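For illustration only, the check described above could be sketched as a standalone Python helper. This is not the Buildbot step added by this PR: the function name `binary_responds` is hypothetical, and treating a non-zero exit code as a failure is an extra assumption beyond the timeout behaviour described here.

```python
import subprocess


def binary_responds(cmd, timeout_s=30.0):
    """Return True if `cmd` exits with status 0 within `timeout_s` seconds.

    Hypothetical helper: a hang (timeout), a missing binary, or a
    non-zero exit status are all reported as False.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a frozen
        # process is not left running after the check.
        return False
    except OSError:
        # The binary does not exist or is not executable.
        return False
    return result.returncode == 0
```

A caller would invoke it as `binary_responds(["llvm-bolt", "--version"], timeout_s=30)` and fail the build when it returns `False`, so a frozen binary is caught before the lit tests start.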
It looks like Buildbot stalls on a particular patch that freezes [...]
I limited the lit workers for [...]
With this patch, I detect hangs early and bail out cleanly.
TL;DR: This is not an incremental build failure. It is caused by leftover llvm-project wrapper scripts from the previous NFC-Mode logic, which I should have communicated and cleaned up.
A) Why this happens:
NFC-Mode runs tests only when the [...]
How NFC-Mode is set up:
On disk:
Reproducing the problem
(1) Buildbot runs llvm/llvm-project#146148, which modifies the python wrapper script.
(2) After PR (1), buildbot runs any patch that does not modify llvm-bolt, such as llvm/llvm-project#145812.
Under the updated ninja flow, this leaves the following on disk:
This creates an infinite recursion. It may happen when two consecutive merged PRs make no code changes (i.e., a wrapper llvm-bolt passes unscathed from the first PR to the second).
B) Proposed solution:
Since BOLTBuilder no longer requires a wrapper for [...]
This change would consistently avoid the above scenario. I tested it on an internal buildbot with a manual local edit of #146209 and running [...]
C) Is this PR needed?
If we remove the wrapper, this patch becomes optional. It can serve as a fail-safe if [...]
I have no knowledge of the previous setup, but I can say that if someone committed a patch that made any use of clang loop forever, every existing bot would time out on the thousands of clang tests. It might hit the overall test step timeout, but might not if it printed something once in a while.
My point with that is: this NFC idea sounds nice, but it's adding a lot of complexity. Even if it's not as complex as I think, the fact that I think it is, is at least cognitive complexity. (But I have literally just read this PR and your explanation, so it definitely could be a skill issue on my part.)
So not having it would be no worse than the other bots. You can decide if that's a good or a bad thing.
Is Bolt particularly prone to these timeouts? It's always theoretically possible with clang, but it never happened often enough to try to prevent it.
If bolt experts are cool with this or the other patch, great. If undecided, maybe you can explain more about what you're doing and I'll try to give an outsider's perspective on the implementation.
@paschalis-mpeis and I discussed this, and I now better understand the sequence of events here.
This is not a check we do on the other bots for clang etc., but that's just because we've never had anyone (intentionally or otherwise) cause a widespread problem. If they do in future, maybe we will go down this road too. And if no one ever makes bolt stall, then great! We just spend a few cycles just in case.
Hey David, Thanks for your review. Please let me know of any further comments.
Looks good to me. I leave the deciding vote to @aaupov as the domain expert.
This would print a list of the lit tests that were run to the output.