Skip to content

bootstrap: retry cargo invocations if stderr contains a known pattern #134472

Open
@marcoieni

Description

@marcoieni
Member

Why

Our CI auto builds sometimes fail for known reasons that are not related to the PRs we are trying to merge.

Most of the times, these errors are hard to understand and fix (or can't be fixed at all), decreasing the success rate of the auto builds for several weeks or months.
The impact is that we loose days of parallel compute time and hours of maintainers time that need to analyze the error message and reschedule the PRs in the merge queue.

Feature

We want to list the the stderr of the known issues we are aware of in the config.toml file that bootstrap uses. These patterns can be expressed as regex.
We want bootstrap to retry cargo invocations up to two times if stderr matches one of the listed patterns.

This would help reduce the failure rate of our CI because it would significantly reduce the percentage of jobs failing due to spurious errors.

The error messages need to be precise enough to avoid retrying cargo invocations over genuine problems.

Known error patterns can be found here. Not all of them can be listed.

As a start, we could just have 1 stderr string in the list (this one doesn't need to be a regex):

Questions

  • Is config.toml the right place to put the known stderr patterns? In Zulip, Jieyou proposed introducing another file: retry-patterns.toml. I'll leave it to the bootstrap team to decide.
  • Which format do we use to write the stderr patterns in the config.toml file? For example, it can be an array of strings. It could also be an "object" if we want to customize how many times to retry per error message. I'll leave it to the boostrap team to decide.
  • how do we make sure these patterns are present in the config.toml used for CI? I'm not familiar with how the config.toml for the CI is generated.

Zulip links

  • idea proposed here
  • agreement reached here

Activity

added
A-bootstrap-configArea: bootstrap `config.toml` and the config system
T-bootstrapRelevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap)
on Dec 18, 2024
added
needs-triageThis issue may need triage. Remove it if it has been sufficiently triaged.
on Dec 18, 2024
added
E-hardCall for participation: Hard difficulty. Experience needed to fix: A lot.
C-enhancementCategory: An issue proposing an enhancement or a PR with one.
and removed
needs-triageThis issue may need triage. Remove it if it has been sufficiently triaged.
on Dec 18, 2024
jieyouxu

jieyouxu commented on Dec 18, 2024

@jieyouxu
Member

Is config.toml the right place to put the known stderr patterns? In Zulip, Jieyou proposed introducing another file: retry-patterns.toml. I'll leave it to the bootstrap team to decide.

FTR config.toml is usually for local user config (CI sometimes writes their own config.toml) and that is specifically .gitignore'd, but I haven't thought too hard about the pros and cons of including it in the same config file.

jieyouxu

jieyouxu commented on Dec 18, 2024

@jieyouxu
Member

Notably, if we decide to pursue this approach, we need to be precise enough in the error patterns to not retry genuine failures.

marcoieni

marcoieni commented on Dec 18, 2024

@marcoieni
MemberAuthor

Ok. Let us know when the bootstrap team is able to answer the questions listed, so that somebody can proceed with the implementation.
I could have some capacity to work on this in the week of 6 Jan for example 👍

jieyouxu

jieyouxu commented on Dec 18, 2024

@jieyouxu
Member

cc @rust-lang/bootstrap: what do we think about this?

Kobzol

Kobzol commented on Dec 18, 2024

@Kobzol
Contributor

This sounds reasonable, it would definitely be useful.

Note that implementing this directly in bootstrap might be tricky. We sometimes don't capture stderr of cargo (in fact I think that in most situations we don't capture it?), but rather redirect it to stderr of the bootstrap. Furthermore, we would probably need to do this on a per-Cargo invocation, and we invoke Cargo on many places. So to avoid doing these restarts on a lot of source locations inside bootstrap, which would be quite crazy, it would be better to somehow thread it through this, which would ideally do the restart automatically inside, without the rest of bootstrap having to think about it.

I wouldn't be opposed against doing this outside of bootstrap, tbh, by just wrapping the top-level run build CI step in a loop that would do rm -rf build && run.sh and then restart the loop N times if the generated stderr contains the invalid substrings. More specifically, I'm not sure if I agree with this comment, given the state of bootstrap's codebase 😆 But this has to be tried out.

Anyway, if we do it inside bootstrap, we can either put it into config.toml or add a CLI flag, that's probably not super important. We can modify config.toml on CI using the --set argument, we already do that on several places.

Btw, it looks like the linked issue is not a spurious failure, but rather a known issue (with a known fix). A better candidate might be the MSVC locking failures.

onur-ozkan

onur-ozkan commented on Dec 18, 2024

@onur-ozkan
Member

This could be useful only for CI, right? I think this should be done outside of the bootstrap using a wrapper script that handles this magic. It doesn't seem necessary to include it in the bootstrap as it's unrelated to its core purpose.

Also, how do we plan to handle error reports? If we silently skip these errors, we might forget to fix them which could increase build time for certain runners.

clubby789

clubby789 commented on Dec 18, 2024

@clubby789
Contributor

What if we just have the existing failure/RLA messages with a note that the step is being retried?

onur-ozkan

onur-ozkan commented on Dec 19, 2024

@onur-ozkan
Member

What if we just have the existing failure/RLA messages with a note that the step is being retried?

Yeah, that should be enough.

marcoieni

marcoieni commented on Dec 19, 2024

@marcoieni
MemberAuthor

If we do this outside of bootstrap can we retry at the same level of granularity?

E.g. CI, could retry ./x.py --stage 2 test, but that would take a lot of time.

Instead, bootstrap knows the cargo command that failed, so it can retry only that command, which is faster, right?

jieyouxu

jieyouxu commented on Dec 19, 2024

@jieyouxu
Member

If we do this outside of bootstrap can we retry at the same level of granularity?

E.g. CI, could retry ./x.py --stage 2 test, but that would take a lot of time.

Instead, bootstrap knows the cargo command that failed, so it can retry only that command, which is faster, right?

If we don't invalidate the build cache (e.g. do it in the same job invocation), then bootstrap (via cargo) does not need to rebuild things that have already successfully built (modulo incompatible-rustflags bugs) (at the crate-level granularity I believe).

marcoieni

marcoieni commented on Dec 19, 2024

@marcoieni
MemberAuthor

Right! So the cost is not 2x for sure. But we would need to re-run all the other tests that already passed, right? We could encounter the same error in another test that previously passed potentially. 🤔

Of course the final decision is yours btw and retrying outside of bootstrap might already be a good starting point. I'm just giving my 2 cents 👍

Kobzol

Kobzol commented on Dec 19, 2024

@Kobzol
Contributor

It will be probably quite faster due to the cache already being on disk, and also some tests will probably get skipped if the compiler didn't get rebuilt.

But even if it took the full CI time again, I don't think that's too bad. The important thing is that it would be automated, so people don't have to retry all the time, and that it would save CI resources, because only a single job would be restarted.

What might be problematic regarding longer CI time is the timeout, which is currently probably too short for a retry mechanism that reruns the full CI.

jieyouxu

jieyouxu commented on Dec 19, 2024

@jieyouxu
Member

But we would need to re-run all the other tests that already passed, right?

Not quite either! For compiletest-managed tests under tests/*, if the previous results were successful (or ignored), then the test will not be rerun if itself and its dependencies have not changed (modulo test dependency check bugs). Let's call this "lazy-rerun" property (I don't recall what a better term is on top of my head).

I believe other cargo test-managed tests (crate tests, tool tests) have similar lazy-rerun properties.

(Of course, they can be forced to rerun via --force-rerun.)

We could encounter the same error in another test that previously passed potentially.

That's indicative of a flaky test, which is going to be problematic anyway (in the previous lazy-rerun scheme, this violates the assumption that a test is expected to be reproducible).

Of course the final decision is yours btw and retrying outside of bootstrap might already be a good starting point. I'm just giving my 2 cents

@marcoieni to be clear, I'm not against giving it a shot inside bootstrap. I have not tried to implement this myself, so you may find it to be more tricky than expected / run into other issues that we have not and could not have predicted at this present time. You may also want to also give the non-bootstrap approach a rough experiment -- to compare which approach seems more maintainable/reliable. FWIW, I'm happy to review and discuss a spurious build retry implementation if you want to give it a shot. Without any concrete experiments, all we can do is roughly speculate what problems either approach may have.

marcoieni

marcoieni commented on Dec 19, 2024

@marcoieni
MemberAuthor

Ok, thanks everyone!

It seems that the safest approach for now is experimenting in CI retrying ./x.py test if we see some known patterns 👍

Than we could move this mechanism to bootstrap itself if we think that there are some major benefits.

added
T-infraRelevant to the infrastructure team, which will review and decide on the PR/issue.
A-CIArea: Our Github Actions CI
on Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-CIArea: Our Github Actions CIA-bootstrap-configArea: bootstrap `config.toml` and the config systemC-enhancementCategory: An issue proposing an enhancement or a PR with one.E-hardCall for participation: Hard difficulty. Experience needed to fix: A lot.T-bootstrapRelevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap)T-infraRelevant to the infrastructure team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @Kobzol@marcoieni@clubby789@jieyouxu@onur-ozkan

        Issue actions

          bootstrap: retry `cargo` invocations if stderr contains a known pattern · Issue #134472 · rust-lang/rust