[BOLT] Add sanity check for frozen llvm-bolt #487

paschalis-mpeis · 2025-06-28T09:18:34Z

Some patches can cause the llvm-bolt binary to hang, which stalls or fails the test pipeline.

Add a simple sanity check that runs:

llvm-bolt --version

with a 30-second timeout. If the command does not complete in time, flunk the build.

Also, sets maxTime for nfc-check-validation and reduce the number of lit workers for in-tree tests.

Some patches can cause the llvm-bolt binary to hang, which stalls or fails the test pipeline. Add a simple sanity check that runs: ``` llvm-bolt --version ``` with a 30-second timeout. If the command does not complete in time, flunk the build. Set maxTime for nfc-check-validation and reduce the number of lit workers for in-tree tests.

paschalis-mpeis · 2025-06-28T09:30:36Z

It looks like Buildbot stalls on a particular patch that freezes llvm-bolt, causing the worker process to panic.
This appears to be some incremental build inconsistency? From your experience, how often are such cases observed?

I limited the lit workers for nfc-check-bolt to two, but both this step, as well the original nfc-check-large-bolt, still struggle. At best, they take TIMEOUT/WORKERS to complete.

With this patch, I detect hangs early and bail out cleanly.
Any further advice would be appreciated.

paschalis-mpeis · 2025-06-28T18:37:40Z

TL;DR: Not an incremental build failure, but leftover llvm-project wrapper scripts from the previous NFC-Mode logic that I should have communicated and cleaned up.

A) Why this happens:

NFC-Mode runs tests only when the llvm-bolt binary changes between revisions, or when relevant files (under bolt/) are modified.

How NFC-Mode is set up:

build-bolt step: compiles the llvm-bolt binary.
nfc-check-setup step:
- renames the current llvm-bolt binary to llvm-bolt.new.
- Builds the previous SHA into llvm-bolt.old.
- Creates a helper python wrapper llvm-bolt that redirects to llvm-bolt.new

On disk:

llvm-bolt → llvm-bolt.new (python wrapper)
llvm-bolt.new (current SHA binary)
llvm-bolt.old (previous SHA binary)

nfc-check-bolt step: builds test utilities and runs tests, with check-bolt, which relinks llvm-bolt and overwrites the wrapper.

Reproducing the problem

(1) Buildbot runs llvm/llvm-project#146148 which modifies the python wrapper script

(2) After PR (1), buildbot runs any patch that does not modify llvm-bolt, like: llvm/llvm-project#145812

Under the updated ninja flow, it results on disk:

llvm-bolt → llvm-bolt.new (python wrapper)
llvm-bolt.new → itself (python wrapper recursion) 🔁
llvm-bolt.old (previous SHA binary)

This creates an infinite recursion. May happen when two consecutive merged PRs do no code changes (i.e., a wrapper llvm-bolt passes unscathed from first to second PR).

B) Proposed solution:

Since BOLTBuilder no longer requires a wrapper for llvm-bolt, we can retire that logic with #146209. @aaupov, do you agree?

This change would consistently avoid the above scenario. I tested it on an internal buildbot with a manual local edit of #146209 and running git update-index --assume-unchanged bolt/utils/nfc-check-setup.py. Repeated on the staging buildbot; works OK.

C) Is this PR needed?

If we remove the wrapper, this patch becomes optional. It can serve as a fail-safe if llvm-bolt loops (for some reason). Tuning niceness for vCPU is not a bad idea, and the extra step won't hurt. I can still increase the lit workers though.

DavidSpickett · 2025-06-30T10:47:23Z

I have no knowledge of the previous setup, but I can say that if someone committed a patch that made any use of clang loop forever, every existing bot would timeout the thousands of clang tests. Might hit the overall test step timeout, but might not if it printed something once in a while.

My point with that is - this NFC idea sounds nice but it's adding a lot of complexity. Even if it's not as complex as I think, the fact that I think it is, is at least cognitive complexity.

(but I have literally just read this PR and your explanation, so definitely could be a skill issue on my part)

So not having it would be no worse than the other bots. You can decide if that's a good or a bad thing.

Is Bolt particularly prone to these timeouts? It's always theoretically possible with clang but never happened enough to try to prevent it.

If bolt experts are cool with this or the other patch, great. If undecided maybe you can explain more what you're doing and I'll try to give an outsider's perspective on the implementation.

DavidSpickett · 2025-06-30T14:12:02Z

Myself and @paschalis-mpeis discussed this and I understand better the sequence of events here.

Bolt has an NFC mode that allows the builder to skip running expensive external tests.
- This is very cool and worth keeping IMO.
Even if the bolt binary did not change, if the change set has changes to Bolt tests, they should be re-run at least.
ninja <the target for doing that> may try to re-link or rebuild the bolt binary, except that it's not a binary, it's a script to implement this NFC mode, so that script would get replaced.
This situation occurred recently and motivated this PR to mitigate the impact when that does happen.
In future, @paschalis-mpeis wants to improve the NFC mode to avoid this situation completely.
If the NFC mode no longer has this flaw, this PR is not needed to prevent this specific situation.
However it would mitigate someone causing bolt to stall by actually changing bolt.

This is not a check we do on the other bots for clang etc., but that's just because we've never had anyone (intentionally or otherwise) cause a widespread problem. If they do in future, maybe we will go down this road too.

And if no one ever makes bolt stall, then great! We just spend a few cycles in case.

zorg/buildbot/builders/BOLTBuilder.py

paschalis-mpeis · 2025-06-30T15:39:36Z

Hey David,

Thanks for your review. Please let me know of any further comments.

DavidSpickett · 2025-06-30T15:52:28Z

Looks good to me, I leave the deciding vote to @aaupov as the domain expert.

This would produce a list of the ran lit tests on the output

paschalis-mpeis marked this pull request as ready for review June 28, 2025 09:19

paschalis-mpeis requested review from aaupov, gkistanova and DavidSpickett June 28, 2025 09:19

Use nice when running tests

352259d

DavidSpickett reviewed Jun 30, 2025

View reviewed changes

zorg/buildbot/builders/BOLTBuilder.py Show resolved Hide resolved

zorg/buildbot/builders/BOLTBuilder.py Show resolved Hide resolved

zorg/buildbot/builders/BOLTBuilder.py Outdated Show resolved Hide resolved

zorg/buildbot/builders/BOLTBuilder.py Outdated Show resolved Hide resolved

Addressing reviewers

9f1fea2

paschalis-mpeis requested a review from DavidSpickett June 30, 2025 15:39

paschalis-mpeis added 2 commits July 2, 2025 17:21

Add more comments in code.

8e2ded2

Drop -s from llvm-lit

114db12

This would produce a list of the ran lit tests on the output

aaupov approved these changes Jul 2, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BOLT] Add sanity check for frozen llvm-bolt #487

[BOLT] Add sanity check for frozen llvm-bolt #487

paschalis-mpeis commented Jun 28, 2025 •

edited

Loading

Uh oh!

paschalis-mpeis commented Jun 28, 2025 •

edited

Loading

Uh oh!

paschalis-mpeis commented Jun 28, 2025 •

edited

Loading

Uh oh!

DavidSpickett commented Jun 30, 2025

Uh oh!

DavidSpickett commented Jun 30, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

paschalis-mpeis commented Jun 30, 2025 •

edited

Loading

Uh oh!

DavidSpickett commented Jun 30, 2025

Uh oh!

Uh oh!

[BOLT] Add sanity check for frozen llvm-bolt #487

Are you sure you want to change the base?

[BOLT] Add sanity check for frozen llvm-bolt #487

Conversation

paschalis-mpeis commented Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

paschalis-mpeis commented Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

paschalis-mpeis commented Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

A) Why this happens:

How NFC-Mode is set up:

Reproducing the problem

B) Proposed solution:

C) Is this PR needed?

Uh oh!

DavidSpickett commented Jun 30, 2025

Uh oh!

DavidSpickett commented Jun 30, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

paschalis-mpeis commented Jun 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DavidSpickett commented Jun 30, 2025

Uh oh!

Uh oh!

paschalis-mpeis commented Jun 28, 2025 •

edited

Loading

paschalis-mpeis commented Jun 28, 2025 •

edited

Loading

paschalis-mpeis commented Jun 28, 2025 •

edited

Loading

paschalis-mpeis commented Jun 30, 2025 •

edited

Loading