Description
In #80290, some people raised concerns about the quality of the code that `ptr::write` compiles to in debug builds. Given that, to my knowledge, reads are much more common than writes, I would think that one should be much more concerned with the code that `ptr::read` compiles to -- and currently, there are going to be quite a few function calls in there, so without inlining, that code will be pretty slow.

`ptr::read` could be improved with techniques similar to what I did for `ptr::write` (call intrinsics directly, and inline everything else by hand). This would result in (something like) the following implementation (EDIT: see below for why this is wrong):
```rust
pub const unsafe fn read<T>(src: *const T) -> T {
    // We are calling the intrinsics directly to avoid function calls in the generated code
    // as `intrinsics::copy_nonoverlapping` is a wrapper function.
    extern "rust-intrinsic" {
        fn copy_nonoverlapping<T>(src: *const T, dst: *mut T, count: usize);
    }
    // For the same reason, we also side-step `mem::MaybeUninit` and use a custom `union` directly.
    #[repr(C)]
    union MaybeUninit<T> {
        init: mem::ManuallyDrop<T>,
        uninit: (),
    }
    let mut tmp: MaybeUninit<T> = MaybeUninit { uninit: () };
    // SAFETY: the caller must guarantee that `src` is valid for reads.
    // `src` cannot overlap `tmp` because `tmp` was just allocated on
    // the stack as a separate allocated object.
    //
    // `MaybeUninit` is repr(C), so we can assume `init` is at offset 0, justifying the pointer
    // casts.
    //
    // Finally, since we just wrote a valid value into `tmp`, it is guaranteed
    // to be properly initialized.
    unsafe {
        copy_nonoverlapping(src, &mut tmp as *mut _ as *mut T, 1);
        mem::transmute_copy(&tmp.init)
    }
}
```
However, here we have the extra difficulty that `read` is (unstably) a `const fn`, so the above implementation is rejected. `&tmp.init` can be replaced by `&mut tmp.init` and that works (or we wait for a bootstrap bump so we can make use of #80418), but `transmute_copy` is non-`const`, so there's still more work to be done. (`transmute` does not work since the compiler does not recognize that `T` and `ManuallyDrop<T>` have the same size.)

I will stop here, but if someone else strongly cares about `ptr::read` performance/codesize in debug builds, feel free to pick this up and drive it to completion.
Activity
RalfJung commented on Jan 18, 2021
Also Cc #80377
RalfJung commented on Jan 18, 2021
Actually, `transmute_copy` is implemented using `read`, so scratch that plan. ^^

I don't think there is a way to do this without calling `ManuallyDrop::into_inner`.
nagisa commented on Jan 18, 2021

If all of `ManuallyDrop`, `MaybeUninit` and `ptr::read` are defined in the same crate, the fields of the wrappers could be made `pub` to `ptr::read`.
RalfJung commented on Jan 19, 2021

Yeah, that would work... but we'd have to make them public to the entire `ptr` module I think, so there is a significant risk of this being misused down the road. IOW, this is a bit too hacky for my taste. ^^

therealprof commented on Jan 19, 2021
Any chance of implementing those lowest-level functions exclusively with intrinsics? It seems non-ideal to rely on a (nonexistent in debug mode) optimizer to turn this into decent binary code.
I would argue that the most fundamental building blocks to low-level applications should never cause unnecessary overhead, even in debug mode.
RalfJung commented on Jan 19, 2021
Intrinsics carry a significant implementation cost in various layers of the compiler (typechecking, CTFE/Miri, codegen), so IMO we should keep the number of intrinsics to the absolute minimum necessary.
(The "nonexistent in debug mode" part is not entirely true, by the way; some MIR simplification passes do run even in debug mode.)
therealprof commented on Jan 19, 2021
Fair enough. I still think `write`, `read`, `write_volatile` and `read_volatile` qualify as essentials for a systems programming language. If we can get efficient implementations for the mentioned functions another way, that would be great.
RalfJung commented on Jan 19, 2021
That might be a good enough justification to make the fields of `MaybeUninit` and `ManuallyDrop` "public enough" to avoid function calls in `read`.

OTOH, the volatile operations have to be intrinsics anyway, so potentially much of the code could be shared between `read_volatile` and a `read` intrinsic.

alecmocatta commented on Jan 19, 2021
I may be off the mark here, but might we see net perf gains with certain cheap wrapper functions (e.g. `ptr::copy_nonoverlapping`) just being marked `#[inline(always)]`? Might be worth testing before expending effort on manual inlining.

bjorn3 commented on Jan 19, 2021
`#[inline(always)]` won't help for cg_clif until the MIR inliner is enabled by default. It also requires the LLVM inliner to do more work, thus causing debug builds to become a bit slower.

RalfJung commented on Jan 19, 2021
I thought `inline(always)` only affects release builds? If it also affects debug builds, that might still be better overall than some of the more hacky alternatives.

bjorn3 commented on Jan 19, 2021
`#[inline(always)]` is also enabled when optimizations are disabled, except when `-C no-prepopulate-passes` is used.

For the old pass manager:
rust/compiler/rustc_codegen_llvm/src/back/write.rs
Line 1047 in 4e208f6
For the new pass manager:
rust/compiler/rustc_llvm/llvm-wrapper/PassWrapper.cpp
Line 911 in 2ba7ca2