Skip to content

[BOLT] Add sanity check for frozen llvm-bolt #487

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

paschalis-mpeis
Copy link
Member

@paschalis-mpeis paschalis-mpeis commented Jun 28, 2025

Some patches can cause the llvm-bolt binary to hang, which stalls or fails the test pipeline.

Add a simple sanity check that runs:

llvm-bolt --version

with a 30-second timeout. If the command does not complete in time, flunk the build.

Also, sets maxTime for nfc-check-validation and reduce the number of lit workers for in-tree tests.

Some patches can cause the llvm-bolt binary to hang, which stalls or
fails the test pipeline.

Add a simple sanity check that runs:
```
llvm-bolt --version
```

with a 30-second timeout. If the command does not complete in time,
flunk the build.

Set maxTime for nfc-check-validation and reduce the number of lit
workers for in-tree tests.
@paschalis-mpeis paschalis-mpeis marked this pull request as ready for review June 28, 2025 09:19
@paschalis-mpeis
Copy link
Member Author

paschalis-mpeis commented Jun 28, 2025

It looks like Buildbot stalls on a particular patch that freezes llvm-bolt, causing the worker process to panic.
This appears to be some incremental build inconsistency? From your experience, how often are such cases observed?

I limited the lit workers for nfc-check-bolt to two, but both this step, as well the original nfc-check-large-bolt, still struggle. At best, they take TIMEOUT/WORKERS to complete.

With this patch, I detect hangs early and bail out cleanly.
Any further advice would be appreciated.

@paschalis-mpeis
Copy link
Member Author

paschalis-mpeis commented Jun 28, 2025

TL;DR: Not an incremental build failure, but leftover llvm-project wrapper scripts from the previous NFC-Mode logic that I should have communicated and cleaned up.

A) Why this happens:

NFC-Mode runs tests only when the llvm-bolt binary changes between revisions, or when relevant files (under bolt/) are modified.

How NFC-Mode is set up:

  1. build-bolt step: compiles the llvm-bolt binary.

  2. nfc-check-setup step:

    • renames the current llvm-bolt binary to llvm-bolt.new.
    • Builds the previous SHA into llvm-bolt.old.
    • Creates a helper python wrapper llvm-bolt that redirects to llvm-bolt.new

On disk:

  • llvm-boltllvm-bolt.new (python wrapper)
  • llvm-bolt.new (current SHA binary)
  • llvm-bolt.old (previous SHA binary)
  1. nfc-check-bolt step: builds test utilities and runs tests, with check-bolt, which relinks llvm-bolt and overwrites the wrapper.

Reproducing the problem

(1) Buildbot runs llvm/llvm-project#146148 which modifies the python wrapper script

(2) After PR (1), buildbot runs any patch that does not modify llvm-bolt, like: llvm/llvm-project#145812

Under the updated ninja flow, it results on disk:

  • llvm-boltllvm-bolt.new (python wrapper)
  • llvm-bolt.new → itself (python wrapper recursion) 🔁
  • llvm-bolt.old (previous SHA binary)

This creates an infinite recursion. May happen when two consecutive merged PRs do no code changes (i.e., a wrapper llvm-bolt passes unscathed from first to second PR).

B) Proposed solution:

Since BOLTBuilder no longer requires a wrapper for llvm-bolt, we can retire that logic with #146209. @aaupov, do you agree?

This change would consistently avoid the above scenario. I tested it on an internal buildbot with a manual local edit of #146209 and running git update-index --assume-unchanged bolt/utils/nfc-check-setup.py. Repeated on the staging buildbot; works OK.

C) Is this PR needed?

If we remove the wrapper, this patch becomes optional. It can serve as a fail-safe if llvm-bolt loops (for some reason). Tuning niceness for vCPU is not a bad idea, and the extra step won't hurt. I can still increase the lit workers though.

@DavidSpickett
Copy link
Contributor

I have no knowledge of the previous setup, but I can say that if someone committed a patch that made any use of clang loop forever, every existing bot would timeout the thousands of clang tests. Might hit the overall test step timeout, but might not if it printed something once in a while.

My point with that is - this NFC idea sounds nice but it's adding a lot of complexity. Even if it's not as complex as I think, the fact that I think it is, is at least cognitive complexity.

(but I have literally just read this PR and your explanation, so definitely could be a skill issue on my part)

So not having it would be no worse than the other bots. You can decide if that's a good or a bad thing.

Is Bolt particularly prone to these timeouts? It's always theoretically possible with clang but never happened enough to try to prevent it.

If bolt experts are cool with this or the other patch, great. If undecided maybe you can explain more what you're doing and I'll try to give an outsider's perspective on the implementation.

@DavidSpickett
Copy link
Contributor

Myself and @paschalis-mpeis discussed this and I understand better the sequence of events here.

  • Bolt has an NFC mode that allows the builder to skip running expensive external tests.
    • This is very cool and worth keeping IMO.
  • Even if the bolt binary did not change, if the change set has changes to Bolt tests, they should be re-run at least.
  • ninja <the target for doing that> may try to re-link or rebuild the bolt binary, except that it's not a binary, it's a script to implement this NFC mode, so that script would get replaced.
  • This situation occurred recently and motivated this PR to mitigate the impact when that does happen.
  • In future, @paschalis-mpeis wants to improve the NFC mode to avoid this situation completely.
  • If the NFC mode no longer has this flaw, this PR is not needed to prevent this specific situation.
  • However it would mitigate someone causing bolt to stall by actually changing bolt.

This is not a check we do on the other bots for clang etc., but that's just because we've never had anyone (intentionally or otherwise) cause a widespread problem. If they do in future, maybe we will go down this road too.

And if no one ever makes bolt stall, then great! We just spend a few cycles in case.

@paschalis-mpeis
Copy link
Member Author

paschalis-mpeis commented Jun 30, 2025

Hey David,

Thanks for your review. Please let me know of any further comments.

@DavidSpickett
Copy link
Contributor

Looks good to me, I leave the deciding vote to @aaupov as the domain expert.

This would produce a list of the ran lit tests on the output
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants