
Improve ptr::read code in debug builds #81163

Closed

Description

RalfJung (Member)

In #80290, some people raised concerns about the quality of the code that ptr::write compiles to in debug builds. Given that, to my knowledge, reads are much more common than writes, I would think that one should be much more concerned with the code that ptr::read compiles to -- and currently, there's going to be quite a few function calls in there, so without inlining, that code will be pretty slow.

ptr::read could be improved with techniques similar to what I did for ptr::write (call intrinsics directly, and inline everything else by hand). This would result in (something like) the following implementation: (EDIT see below for why this is wrong)

```rust
pub const unsafe fn read<T>(src: *const T) -> T {
    // We are calling the intrinsics directly to avoid function calls in the generated code
    // as `intrinsics::copy_nonoverlapping` is a wrapper function.
    extern "rust-intrinsic" {
        fn copy_nonoverlapping<T>(src: *const T, dst: *mut T, count: usize);
    }

    // For the same reason, we also side-step `mem::MaybeUninit` and use a custom `union` directly.
    #[repr(C)]
    union MaybeUninit<T> {
        init: mem::ManuallyDrop<T>,
        uninit: (),
    }

    let mut tmp: MaybeUninit<T> = MaybeUninit { uninit: () };
    // SAFETY: the caller must guarantee that `src` is valid for reads.
    // `src` cannot overlap `tmp` because `tmp` was just allocated on
    // the stack as a separate allocated object.
    //
    // `MaybeUninit` is repr(C), so we can assume `init` is at offset 0, justifying the pointer
    // casts.
    //
    // Finally, since we just wrote a valid value into `tmp`, it is guaranteed
    // to be properly initialized.
    unsafe {
        copy_nonoverlapping(src, &mut tmp as *mut _ as *mut T, 1);
        mem::transmute_copy(&tmp.init)
    }
}
```

However, here we have the extra difficulty that read is (unstably) a const fn, so the above implementation is rejected. &tmp.init can be replaced by &mut tmp.init and that works (or we wait for a bootstrap bump so we can make use of #80418), but transmute_copy is non-const, so there's still more work to be done. (transmute does not work since the compiler does not recognize that T and ManuallyDrop<T> have the same size.)

I will stop here, but if someone else strongly cares about ptr::read performance/codesize in debug builds, feel free to pick this up and drive it to completion.

Cc @bjorn3 @therealprof @usbalbin

Activity

RalfJung (Member, Author) commented on Jan 18, 2021

Also Cc #80377

RalfJung (Member, Author) commented on Jan 18, 2021

Actually, transmute_copy is implemented using read, so scratch that plan.^^
I don't think there is a way to do this without calling ManuallyDrop::into_inner.

Labels added on Jan 18, 2021: T-libs (Relevant to the library team, which will review and decide on the PR/issue), A-raw-pointers (Area: raw pointers, MaybeUninit, NonNull), I-slow (Issue: Problems and improvements with respect to performance of generated code)
nagisa (Member) commented on Jan 18, 2021

> I don't think there is a way to do this without calling ManuallyDrop::into_inner.

If all of ManuallyDrop, MaybeUninit and ptr::read are defined in the same crate, the fields of the wrappers could be made pub to ptr::read.

RalfJung (Member, Author) commented on Jan 19, 2021

Yeah, that would work... but we'd have to make them public to the entire ptr module I think, so there is a significant risk of this being misused down the road. IOW, this is a bit too hacky for my taste.^^

therealprof (Contributor) commented on Jan 19, 2021

Any chance of implementing those lowest-level functions exclusively with intrinsics? It seems non-ideal to rely on a (non-existent in debug mode) optimizer to turn this into decent binary code.

I would argue that the most fundamental building blocks to low-level applications should never cause unnecessary overhead, even in debug mode.

RalfJung (Member, Author) commented on Jan 19, 2021

Intrinsics carry a significant implementation cost in various layers of the compiler (typechecking, CTFE/Miri, codegen), so IMO we should keep the number of intrinsics to the absolute minimum necessary.

> It seems non-ideal to rely on a (non-existent in debug mode) optimizer to turn this into decent binary code.

This is not entirely true; some MIR simplification passes do run even in debug mode.

therealprof (Contributor) commented on Jan 19, 2021

> Intrinsics carry a significant implementation cost in various layers of the compiler (typechecking, CTFE/Miri, codegen), so IMO we should keep the number of intrinsics to the absolute minimum necessary.

Fair enough. I still think write, read, write_volatile and read_volatile qualify as essentials for a systems programming language.

> This is not entirely true; some MIR simplification passes do run even in debug mode.

If we can get efficient implementations for the mentioned functions another way that would be great.

RalfJung (Member, Author) commented on Jan 19, 2021

> I still think write, read, write_volatile and read_volatile qualify as essentials for a systems programming language.

That might be a good enough justification to make the fields of MaybeUninit and ManuallyDrop "public enough" to avoid function calls in read.

OTOH, the volatile operations have to be intrinsics anyway, so potentially much of the code could be shared between read_volatile and a read intrinsic.

alecmocatta (Contributor) commented on Jan 19, 2021

I may be off the mark here, but might we see net perf gains with certain cheap wrapper functions (e.g. ptr::copy_nonoverlapping) just being marked #[inline(always)]? Might be worth testing before expending effort on manual inlining.

bjorn3 (Member) commented on Jan 19, 2021

#[inline(always)] won't help for cg_clif until the MIR inliner is enabled by default. It also requires the LLVM inliner to do more work, thus causing debug builds to become a bit slower.

RalfJung (Member, Author) commented on Jan 19, 2021

I thought inline(always) only affected release builds? If it also affects debug builds, that might still be better overall than some of the more hacky alternatives.

bjorn3 (Member) commented on Jan 19, 2021

#[inline(always)] is also honored when optimizations are disabled, except when -Cno-prepopulate-passes is used.

For the old pass manager:

```cpp
llvm::LLVMRustAddAlwaysInlinePass(builder, config.emit_lifetime_markers);
```

For the new pass manager:

```cpp
MPM.addPass(AlwaysInlinerPass(EmitLifetimeMarkers));
```

(29 remaining comments not shown.)


Metadata

Labels: A-mir-opt (Area: MIR optimizations), A-mir-opt-inlining (Area: MIR inlining), A-raw-pointers (Area: raw pointers, MaybeUninit, NonNull), I-slow (Issue: Problems and improvements with respect to performance of generated code), T-libs (Relevant to the library team, which will review and decide on the PR/issue)

Participants: @RalfJung, @oli-obk, @jrmuizel, @nagisa, @wesleywiser