Closed
Description
In the following code f
and g
have these lines swapped:
let item = SpecificDisplayItem::PopStackingContext;
clip.unwrap();
Unwrapping later causes f
to have an additional memcpy.
#[inline(never)]
pub fn f(clip: Option<&bool>) {
let item = SpecificDisplayItem::PopStackingContext;
clip.unwrap();
do_item(&DI {
item,
});
}
#[inline(never)]
pub fn g(clip: Option<&bool>) {
clip.unwrap();
let item = SpecificDisplayItem::PopStackingContext;
do_item(&DI {
item,
});
}
pub enum SpecificDisplayItem {
PopStackingContext,
Other([f64; 22]),
}
struct DI {
item: SpecificDisplayItem,
}
fn do_item(di: &DI) { unsafe { ext(di) } }
extern {
fn ext(di: &DI);
}
Compiles to:
example::f:
sub rsp, 360
test rdi, rdi
je .LBB0_1
mov qword ptr [rsp], 0
lea rdi, [rsp + 8]
lea rsi, [rsp + 184]
mov edx, 176
call memcpy@PLT
mov rdi, rsp
call ext@PLT
add rsp, 360
ret
.LBB0_1:
lea rdi, [rip + .Lbyte_str.2]
call core::panicking::panic@PLT
ud2
example::g:
sub rsp, 184
test rdi, rdi
je .LBB1_1
mov qword ptr [rsp], 0
mov rdi, rsp
call ext@PLT
add rsp, 184
ret
.LBB1_1:
lea rdi, [rip + .Lbyte_str.2]
call core::panicking::panic@PLT
ud2
Ideally, f
and g
should compile to the same thing.
Metadata
Metadata
Assignees
Labels
Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.Area: MIR optimizationsFixed by the Named Return Value Opt. (NRVO)Category: An issue proposing an enhancement or a PR with one.Issue: Problems and improvements with respect to performance of generated code.Relevant to the compiler team, which will review and decide on the PR/issue.
Type
Projects
Milestone
Relationships
Development
No branches or pull requests
Activity
Improve code quality for push_new_empty_item.
Auto merge of #3341 - jrmuizel:empty-item, r=gw3583
lqd commentedon Nov 23, 2018
cc @rust-lang/wg-codegen
eddyb commentedon Nov 23, 2018
IIRC @pcwalton found that LLVM only elides copies in the same basic block (and
unwrap
can panic, so it terminates a basic block, splitting the code before and after into separate basic blocks).nox commentedon Nov 23, 2018
Yes, cf. @jrmuizel's PR on Webrender:
nikic commentedon Dec 1, 2018
There were some attempts to make memcpyopt work across BBs, but they were reverted due to regressions, see https://bugs.llvm.org/show_bug.cgi?id=35519.
dotdash commentedon Jun 26, 2020
It might be worth noting that the memcpy here is especially pointless because it copies uninitialized memory. When the memcpy optimizations fold a memset into a memcpy it checks whether the memcpy copies more memory than what has been memset, and if so and the remainder is uninitialized, the memcpy for the uninitialized part is dropped.
In this case here, SROA splits the alloca for the
item
, into the discriminant part, and 176 uninitialized bytes. But since it runs without memory dependency analysis, is has no easy way to drop the memcpy for the uninitialized part. And before the MemCpy pass can kill the memcpy, the inliner comes and breaks the code into multiple basic blocks. So for this constellation, it would help to run the MemCpy pass before the inliner, but I have no real idea what tradeoff that makes.Another option would be to add an optimization that drops memcpys from uninitialized sources to a pass that does cross-bb memory dependence analysis anyway, but a quick look didn't reveal any obvious place to do that. Given the rather common use of enums like this, where only some variants have a payload, that might be worth a try though.
oli-obk commentedon Jun 26, 2020
This may get resolved with #72632 by doing the necessary work on MIR.
dotdash commentedon Jun 26, 2020
The approach from #72632 breaks if you assign the same source to multiple destinations, because there's no simple chain that can be reduced to a single destination. I think you need to do copy-propagation (replacing the destination with the source, instead of the other way around) to handle that.
The following doesn't get properly optimized by #72632, but is handled by the memcpy pass being run before the inliner. That approach of course also doesn't handle all the cases, thus the proposal for the optimization to catch copies from uninitialized memory.
In fact #72632 even stops the patched (MemCpyOpt before Inliner) LLVM from optimizing this version, because SROA can no longer split the alloca and so there's no memcpy that copies only uninitialized memory. For the modified
f
function #72632 still produces better code than nightly, but if you apply the same change tog
, then nightly produces a properly optimized version, while dest-prop causes the lifetimes of the two DI instances to overlap, forcing double stack usage and a memcpy.Edit: I'm not trying to criticize the work that went into #72632. I didn't notice the comment on the PR (and didn't realize it had progressed so far) when I started to look into this. After being made aware of it, I wanted to see that it works for myself and got confused by the way that optimization pass works because I (for some reason) always assumed it would do copy propagation and was just named weirdly, so I tried to figure out why it does things the way it does, and noticed that it breaks this modified example.
jrmuizel commentedon Mar 12, 2021
Fixed by #82806