-
Notifications
You must be signed in to change notification settings - Fork 13.6k
Closed
Labels
A-codegenArea: Code generationArea: Code generationA-thread-localsArea: Thread local storage (TLS)Area: Thread local storage (TLS)C-enhancementCategory: An issue proposing an enhancement or a PR with one.Category: An issue proposing an enhancement or a PR with one.I-slowIssue: Problems and improvements with respect to performance of generated code.Issue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.Relevant to the compiler team, which will review and decide on the PR/issue.
Description
Proposed solution
Right now access to thread locals defined by thread_local!
aren't inlined across crates, causing performance problems that wouldn't otherwise be seen within one crate. This can probably be solved with a few new minor language features:
- First, the
#[inline]
annotation could be processed onstatic
variables. If the variable does not have any internal mutability, then the definition can be inlined into other LLVM modules and tagged withavailable_externally
. That means that the contents are available for optimization, but if you're taking the address it's available elsewhere. - Second, when we inline the contents of a static across modules, function pointers should also be chased and inlined if they're tagged with
#[inline]
.
Those two pieces I believe should provide enough inlining opportunities to ensure that accesses are as fast when done from external crates as they are done with internal crates.
Original description
This hurts performance for the locks in std::sync
, as they call std::rt::unwind::panicking()
(which just reads a thread-local). For uncontended locks the cost is quite significant.
There are two problems:
std::rt::unwind::panicking()
isn't marked inline. This is trivial to solve.- Accessing a
thread_local!
goes through function pointers, which LLVM fails to see through. These are the__getit
functions inlibstd/thread/local.rs
. Consider these two files:
// tls.rs
use std::cell::Cell;
thread_local! { static FOO: Cell<bool> = Cell::new(false) }
#[inline]
pub fn get_foo() -> bool {
FOO.with(|s| s.get())
}
// other.rs
extern crate tls;
#[inline(never)]
fn call_foo() -> bool {
tls::get_foo()
}
call_foo
gets the following IR with everything compiled with full optimization. Note the call through a function pointer:
define internal fastcc zeroext i1 @_ZN8call_foo20h55fb2c2ba5981f51PaaE() unnamed_addr #3 {
entry-block:
%0 = load %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 0), align 8
%1 = tail call dereferenceable(4) %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %0()
%2 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 0
%3 = load i8* %2, align 1, !range !13
%cond.i.i = icmp eq i8 %3, 1
br i1 %cond.i.i, label %match_case.i.i, label %match_else.i.i
match_else.i.i: ; preds = %entry-block
%4 = load i8 ()** getelementptr inbounds (%"1.std::thread::local::LocalKey<core::cell::Cell<bool>>"* @_ZN3FOO20hefc7cdadc988b7defaaE, i64 0, i32 1), align 8
%5 = tail call i8 %4()
%6 = zext i8 %5 to i16
%7 = shl nuw i16 %6, 8
%8 = or i16 %7, 1
%9 = bitcast %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1 to i16*
store i16 %8, i16* %9, align 1
%10 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2, i64 0
%11 = bitcast i8* %10 to %"2.core::cell::Cell<bool>"*
br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit
match_case.i.i: ; preds = %entry-block
%12 = getelementptr inbounds %"1.std::thread::local::imp::Key<core::cell::UnsafeCell<core::option::Option<core::cell::Cell<bool>>>>"* %1, i64 0, i32 0, i32 0, i32 0, i32 2
%13 = bitcast [1 x i8]* %12 to %"2.core::cell::Cell<bool>"*
br label %_ZN7get_foo20h73631102cae2c5b4lbaE.exit
_ZN7get_foo20h73631102cae2c5b4lbaE.exit: ; preds = %match_else.i.i, %match_case.i.i
%.0.i.i = phi %"2.core::cell::Cell<bool>"* [ %13, %match_case.i.i ], [ %11, %match_else.i.i ]
%.0.idx.i.i = getelementptr %"2.core::cell::Cell<bool>"* %.0.i.i, i64 0, i32 0, i32 0
%.0.idx.val.i.i = load i8* %.0.idx.i.i, align 1
%14 = icmp ne i8 %.0.idx.val.i.i, 0
ret i1 %14
}
Metadata
Metadata
Assignees
Labels
A-codegenArea: Code generationArea: Code generationA-thread-localsArea: Thread local storage (TLS)Area: Thread local storage (TLS)C-enhancementCategory: An issue proposing an enhancement or a PR with one.Category: An issue proposing an enhancement or a PR with one.I-slowIssue: Problems and improvements with respect to performance of generated code.Issue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.Relevant to the compiler team, which will review and decide on the PR/issue.