### Description

Usually I develop on the `stable` channel and wanted to see how my project performs on the upcoming `beta` or `nightly` channels. I saw performance regressions of 30-80% across the board in native and Wasm targets. Those regressions were confirmed by our benchmarking CI, as can be seen via the link.
### Was it the LLVM 15 Update?

I conducted a bisect and found that the change happened between `nightly-2022-08-12` and `nightly-2022-08-13`. After some research I saw that Rust updated from LLVM 14 to LLVM 15 in exactly this time period: #99464. Other commits merged in this period did not look as suspicious to me.
### Past Regressions

Unfortunately this is also not the first time we have seen such massive regressions ...

It is extremely hard to craft a minimal code snippet out of `wasmi` since it is a heavily optimized codebase with lots of interdependencies. Unfortunately the `wasmi` project is incredibly performance critical to us. Even a 10-15% performance regression is a disaster for us, let alone the 30-80% we just saw ...
### Hint for Minimal Code Example

I have one major suspicion: due to missing guaranteed tail calls in Rust, we rely heavily on a non-guaranteed optimization for our loop-switch based interpreter hot path that pulls the dispatch jumps into the `match` arms, which results in code very similar to what a threaded-code interpreter would produce. The code that depends on this particular optimization can be found here.

This suspicion is underlined by the fact that especially non call-intensive workloads show the most regressions in the linked benchmarks. This implies to me that the regressions have something to do with instruction dispatch.
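For readers unfamiliar with the pattern, here is a hedged, minimal sketch of a loop-`match` interpreter hot path (all names and the instruction set are invented for illustration; wasmi's real dispatch is far more involved). The optimization in question duplicates the dispatch jump into each arm, producing threaded-code-like machine code:

```rust
// Minimal loop-`match` interpreter sketch (hypothetical instruction set).
#[derive(Copy, Clone)]
enum Inst {
    Push(i64),
    Add,
    Halt,
}

fn execute(code: &[Inst]) -> i64 {
    let mut stack = Vec::new();
    let mut pc = 0;
    loop {
        // LLVM may (or may not) replicate this dispatch jump into each
        // arm, yielding threaded-code-like output. The regressions
        // discussed here appear when it stops doing so.
        match code[pc] {
            Inst::Push(v) => {
                stack.push(v);
                pc += 1;
            }
            Inst::Add => {
                let rhs = stack.pop().unwrap();
                let lhs = stack.pop().unwrap();
                stack.push(lhs + rhs);
                pc += 1;
            }
            Inst::Halt => return stack.pop().unwrap(),
        }
    }
}

fn main() {
    let code = [Inst::Push(2), Inst::Push(40), Inst::Add, Inst::Halt];
    println!("{}", execute(&code)); // prints 42
}
```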
Potential Future Solutions
- The Rust compiler could add a few benchmarks concerning those
loop-switch
optimizations to its set of benchmarks so that future LLVM updates won't invalidate those optimizations. I am not sure how viable this approach is to the Rust compiler developers though. Also this only works if we find all the fragile parts that cause these regressions. - Ideally Rust offered abstractions that allow to develop efficient interpreters in Rust without relying on Rust/LLVM optimizations: for example guaranteed tail calls.
### Reproduce

The current `stable` Rust channel is the following:

```
stable-x86_64-unknown-linux-gnu (default)
rustc 1.64.0 (a55dd71d5 2022-09-19)
```

In order to reproduce these benchmarks, do the following:

```sh
git clone git@github.com:paritytech/wasmi.git
cd wasmi
git checkout 21e12da67a765c8c8b8a62595d2c9d21e1fa2ef6
rustup toolchain install nightly-2022-08-12
rustup toolchain install nightly-2022-08-13
git submodule update --init --recursive
cargo +stable bench --bench benches execute -- --save-baseline stable
cargo +nightly-2022-08-12 bench --bench benches execute -- --baseline stable
cargo +nightly-2022-08-13 bench --bench benches execute -- --baseline stable
```
**the8472** commented on Oct 12, 2022

That commit doesn't compile.
**apiraino** commented on Oct 13, 2022

Meanwhile I'll nominate this issue for T-compiler discussion; I think it's a wider topic that benefits from comments from the team. WG-prioritization discussion on Zulip.

@rustbot label i-compiler-nominated
**Robbepop** commented on Oct 13, 2022

@the8472 I am sorry, I forgot to mention that you need to initialize the `git` submodules before running the benchmarks. Execute `git submodule update --init --recursive` before running the benchmarks. I will update my post above to include this information.
**Robbepop** commented on Oct 15, 2022

I don't know if this is related, but today I removed 4 unnecessary instructions from the `wasmi` interpreter. My expectation was that the removal shouldn't change performance at all. However, there were massive regressions again. This time I used the `cargo-asm` tool to analyze the `Executor::execute` function I linked earlier in this thread, just to see massive differences between the `master` branch and the PR branch:

- `master`: https://gist.github.com/Robbepop/cde5a25f00b78259a11170a6614aca90
- `master` + `nightly`: https://gist.github.com/Robbepop/a498308cd53c7e75c55f8342786f667d

The diff between `master` on `stable` Rust and `master` on `nightly` Rust: https://gist.github.com/Robbepop/88660d17ec1ede77562732bb68670c8c
So the culprit indeed seems to be that Rust + LLVM can easily fail to properly optimize this function into the threaded-code style branching pattern when the stars are misaligned.
### Reproduce

The PR that I created today can be found here: wasmi-labs/wasmi#518

In order to make `cargo-asm` able to display the function I had to insert the following code:

And also add the following to the `Cargo.toml` of the workspace, since `cargo-asm` does not yet support the `--profile` argument and we want to optimize with `codegen-units = 1` and `lto = "fat"`.

Install the `cargo-asm` tool via: `cargo install cargo-show-asm`

Run the `cargo-asm` tool via: `cargo asm -p wasmi --lib --release execute_dummy > execute_dummy.asm`

We need to pipe the output into a file since it is quite large.
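The exact `Cargo.toml` additions are not reproduced in this thread; as a hedged sketch based only on the settings mentioned above (`codegen-units = 1` and `lto = "fat"`), the workspace override would look roughly like:

```toml
# Sketch of a release-profile override; the actual section the author
# used is not shown in the thread, these keys are from the description.
[profile.release]
codegen-units = 1
lto = "fat"
```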
**the8472** commented on Oct 15, 2022

This indeed does look like the difference between tail calls vs. dispatching from a loop with computed jumps.

Hottest instruction in the fast version; note the `jmp *%rax`.

Hottest instruction in the slow version; note all the jumps back to 24c0 and the incoming jumps at 24c5.
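To make the contrast concrete, here is a hedged sketch (hypothetical names, not wasmi's code) of dispatch through a table of function pointers. The indirect call at the bottom of the loop compiles to an indirect jump/call akin to the `jmp *%rax` seen in the fast version, whereas plain loop dispatch funnels every iteration through one shared branch point:

```rust
// Trampoline-style dispatch sketch: each opcode handler does its work,
// and the driver loop performs an indirect call through a handler table.
// This only approximates threaded code; true threaded code would `jmp`
// directly from handler to handler.
struct Vm {
    pc: usize,
    acc: i64,
    code: Vec<(usize, i64)>, // (handler index, immediate operand)
}

type Handler = fn(&mut Vm) -> bool; // returns false to halt

fn op_add(vm: &mut Vm) -> bool {
    vm.acc += vm.code[vm.pc].1;
    vm.pc += 1;
    true
}

fn op_mul(vm: &mut Vm) -> bool {
    vm.acc *= vm.code[vm.pc].1;
    vm.pc += 1;
    true
}

fn op_halt(_vm: &mut Vm) -> bool {
    false
}

const HANDLERS: [Handler; 3] = [op_add, op_mul, op_halt];

fn run(vm: &mut Vm) -> i64 {
    loop {
        let idx = vm.code[vm.pc].0;
        // Indirect call through the table: the machine code here is an
        // indirect jump/call, similar to the `jmp *%rax` noted above.
        if !HANDLERS[idx](vm) {
            break;
        }
    }
    vm.acc
}

fn main() {
    let mut vm = Vm {
        pc: 0,
        acc: 1,
        code: vec![(0, 41), (1, 2), (2, 0)], // add 41, mul 2, halt
    };
    println!("{}", run(&mut vm)); // prints 84
}
```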
**apiraino** commented on Oct 22, 2022

WG-prioritization assigning priority (Zulip discussion).

Discussed by T-compiler (notes); a smaller reproducible example would probably help.

As mentioned in the opening comment, a bisection seems to point at the LLVM 15 upgrade (#99464), specifically between `nightly-2022-08-12` and `nightly-2022-08-13`.

@rustbot label -I-prioritize +P-high E-needs-mcve -I-compiler-nominated
**pacak** commented on Oct 27, 2022

Right, that's mostly for other people who might be following the steps :)
**Robbepop** commented on Feb 13, 2023

Any news on this?

I have the strong feeling that I encountered this performance regression bug again today in this `wasmi` PR: wasmi-labs/wasmi#676

The PR does not change the number of instructions but merely changes the implementation of a handful of the over one hundred instructions that are part of the big `match` expression. However, this unexpectedly leads to massive performance regressions of up to 200%. Note that benchmarks are also affected that do not execute the changed instructions.

I used `cargo-show-asm` to display the differences between the `master` branch and the PR:

- `master`: https://gist.github.com/Robbepop/0048641202d77628cb948d84dc4d736d#file-master-asm

edit: I was able to fix the performance regression. The idea was that it is maybe important for the optimizer that all `match` arms end in the same set of instructions. This is the commit: wasmi-labs/wasmi@325bdf1 Note that this commit doesn't change semantics; it simply moves a terminator instruction from a closure into the enclosing scope.

^ I will keep this as a reminder to myself for future regressions.
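A hedged sketch of that kind of transformation (hypothetical code, not the actual wasmi commit): instead of each arm ending in its own trailing instructions, each arm only computes a value and the shared tail runs the same instructions for every arm:

```rust
// Hypothetical sketch of making all `match` arms share the same
// trailing instructions; the opcodes and state are invented.
fn dispatch_shared_tail(op: u8, acc: &mut i64, pc: &mut usize) {
    // Each arm only computes the new accumulator value ...
    let next = match op {
        0 => *acc + 1,
        1 => *acc * 2,
        _ => *acc,
    };
    // ... and the terminator instructions are hoisted into the shared
    // enclosing scope, so every arm ends in the same instruction set.
    *acc = next;
    *pc += 1;
}

fn main() {
    let (mut acc, mut pc) = (20i64, 0usize);
    dispatch_shared_tail(0, &mut acc, &mut pc); // acc = 21
    dispatch_shared_tail(1, &mut acc, &mut pc); // acc = 42
    println!("acc = {acc}, pc = {pc}"); // prints "acc = 42, pc = 2"
}
```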
**workingjubilee** commented on Mar 3, 2023

@Robbepop Regarding the "one potential solution" header you mentioned a while back: Rust has for some time reserved the keyword `become` for exactly this reason; it has been imagined that it will be used for explicit tail calls.

So I believe the thing you are talking about is not really controversial. It is mostly awaiting someone assembling a draft RFC, thinking through all the implications, presenting it to the community (and especially T-lang), and implementing it.

Mind, the same person need not accomplish all of these steps by far, not even "thinking through all the implications" alone. Indeed, if one were to make this happen, they would probably be best off starting by talking to the people working on MIR opts. If explicit tail calls can be handled before code hits LLVM, that simplifies a lot of things (though maybe it makes other things more complex; that sort of thing happens).
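For context, a pseudocode sketch of how the reserved `become` keyword has been imagined to be used for explicit tail calls; this does not compile on stable Rust, and the exact syntax and semantics are what the RFC process would pin down:

```rust
// Pseudocode: `become` replaces `return` and guarantees that the
// callee reuses the caller's stack frame, so this mutual recursion
// runs in constant stack space regardless of `n`.
fn is_even(n: u64) -> bool {
    if n == 0 { true } else { become is_odd(n - 1) }
}

fn is_odd(n: u64) -> bool {
    if n == 0 { false } else { become is_even(n - 1) }
}
```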
**Robbepop** commented on Mar 5, 2023

@workingjubilee Thank you for your response!

I think I see where you are heading with your reply ...

The guaranteed tail call proposal has a very long history in Rust. The first mention that I am aware of is this issue by Graydon Hoare himself in 2011: #217. It turned out back then that the ecosystem around Rust was not yet ready, especially LLVM.

Since then there have been many proposals to add tail calls to Rust, among others several pre-RFCs and base implementations for `rustc`. A small summary of the issues: rust-lang/rfcs#1888 (comment)

Following the endless discussions in all those threads, I never felt that this feature in particular received a lot of support from the core Rust team. The technical objections usually were open questions about borrows/drops, which were resolved by rust-lang/rfcs#2691 (comment) but never received a proper follow-up response, as well as a major open question about the calling ABI that needs to be adjusted for tail calls, from what I understood.

Furthermore, tail calls were frequently and incorrectly perceived as an "elegant" feature for language enthusiasts, oftentimes ignoring the fact that they solve niche problems that cannot be solved with any other language feature available in Rust. Therefore tail call proposals were usually handled as very low priority feature requests.

This gave me personally the feeling that support is missing from exactly the group of people whose support a potential external contributor urgently needs. Writing yet another pre-RFC, a third base implementation for `rustc`, or another feature proposal issue didn't seem like a good idea to me given the history of this feature. What is needed is commitment and support by the language team in order for someone like me to step up.

I am very open to ideas.
**pnkfelix** commented on Jun 30, 2023

Discussed in the T-compiler P-high review.

At this point I am interpreting this issue as a request for some form of proper tail call (PTC) support. In particular, I interpret the issue author's comments as saying that they are willing to adapt their code in order to get the tail-call-elimination guarantees they want.

I too want some kind of PTC (I make no secret of my Scheme background). But given the many issues that the project has, I also want to make sure we properly prioritize them. In this case, this issue strikes me as a P-medium feature request, not a P-high regression. Therefore I am downgrading it to P-medium.

@rustbot label: -P-high +P-medium
**nikic** commented on Jun 30, 2023

FYI, there is some ongoing work on implementing tail call support; see #112788 for the tracking issue and rust-lang/rfcs#3407 for recent activity on the RFC.
**Robbepop** commented on Jun 30, 2023

@pnkfelix Indeed, this issue can be closed once Rust tail calls have been merged, as I expect that to fix the underlying issue, given that it provides the control and stack growth guarantees stated in the current MVP design.