Description
#59693 is a nice speed-up for rustc, reducing instruction counts by as much as 12%. #59693 (comment) shows that approximately half the speedup is from avoiding TLS lookups.
So I thought: what else is using TLS lookups? I did some profiling and found that syntax_pos::GLOBALS accounts for most of it. It has three pieces: symbol_interner, hygiene_data, and span_interner. I did some profiling of the places where they are accessed via GLOBALS::with:
rustc:
791545069 counts:
( 1) 499029030 (63.0%, 63.0%): symbol_interner
( 2) 181386140 (22.9%, 86.0%): hygiene_data
( 3) 109861627 (13.9%, 99.8%): span_interner
ripgrep:
5455319 counts:
( 1) 2819190 (51.7%, 51.7%): symbol_interner
( 2) 2015746 (37.0%, 88.6%): hygiene_data
( 3) 599975 (11.0%, 99.6%): span_interner
style-servo:
79839701 counts:
( 1) 36436621 (45.6%, 45.6%): hygiene_data
( 2) 31539114 (39.5%, 85.1%): symbol_interner
( 3) 11562409 (14.5%, 99.6%): span_interner
webrender:
27006839 counts:
( 1) 11021232 (40.8%, 40.8%): hygiene_data
( 2) 9218693 (34.1%, 74.9%): symbol_interner
( 3) 6707365 (24.8%, 99.8%): span_interner
These measurements are from a rustc that didn't have #59693's change applied, which avoids almost all of the span_interner accesses. And those accesses were only 11.0-24.8% of the syntax_pos::GLOBALS accesses. In other words, if we could eliminate most or all of the hygiene_data and symbol_interner accesses, we'd get even bigger wins than what we saw in #59693.
I admit that I don't understand how syntax_pos::GLOBALS works, or why a TLS reference is needed for a global value.
One possible idea is to increase the size of Symbol from 4 bytes to 8 bytes, and then store short symbols (7 bytes or less) inline. Some preliminary profiling suggests this could capture roughly half of the symbols. hygiene_data is a harder nut to crack, being a more complicated structure.
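To make the inline-symbol idea concrete, here is a minimal sketch of what an 8-byte Symbol could look like. Everything in it is an assumption for illustration: the byte layout, the INLINE_FLAG constant, and the method names try_inline, from_index, and as_inline_str are hypothetical and are not rustc's actual representation.

```rust
/// Hypothetical 8-byte symbol (illustrative layout, not rustc's).
/// Byte 0 holds a flag plus the length for inline symbols;
/// bytes 1..8 hold up to 7 bytes of UTF-8 text.
/// Non-inline symbols keep a u32 interner index in bytes 4..8.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Symbol([u8; 8]);

/// High bit of byte 0 marks an inline symbol.
const INLINE_FLAG: u8 = 0x80;

impl Symbol {
    /// Packs `s` inline when it fits in 7 bytes; returns `None` otherwise.
    pub fn try_inline(s: &str) -> Option<Symbol> {
        let bytes = s.as_bytes();
        if bytes.len() > 7 {
            return None;
        }
        let mut raw = [0u8; 8];
        raw[0] = INLINE_FLAG | bytes.len() as u8;
        raw[1..1 + bytes.len()].copy_from_slice(bytes);
        Some(Symbol(raw))
    }

    /// Wraps an interner index; resolving it to a string would still
    /// require a lookup in the interner (the slow path, not shown).
    pub fn from_index(index: u32) -> Symbol {
        let mut raw = [0u8; 8];
        raw[4..8].copy_from_slice(&index.to_le_bytes());
        Symbol(raw)
    }

    /// Returns the text without touching any interner, if stored inline.
    pub fn as_inline_str(&self) -> Option<&str> {
        if self.0[0] & INLINE_FLAG == 0 {
            return None;
        }
        let len = (self.0[0] & 0x7f) as usize;
        std::str::from_utf8(&self.0[1..1 + len]).ok()
    }
}
```

With a layout like this, comparing two inline symbols or converting one to a string needs no TLS access at all; only symbols longer than 7 bytes fall back to the interner.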
cc @rust-lang/wg-compiler-performance
Activity
eddyb commented on Apr 5, 2019
Could these be sped up by using #[thread_local] directly, and maybe static linking instead of dynamic linking?
cc @alexcrichton
If you're asking why it's not a plain static, that's because those should pretty much never be used, as rustc supports multiple instances per process (and e.g. rustdoc uses that to compile doc tests).
Zoxc commented on Apr 5, 2019
#59655 allows you to compare symbols against a predefined list of symbols without doing a TLS lookup and a string comparison. That will hopefully help some.
I'm also working on a PR which removes Symbol usage from symbol names (which tend to be unique and don't benefit from interning).
I'd also like to replace the u32 in Symbol with &'tcx str, but that would require adding a lifetime to the AST.
eddyb commented on Apr 5, 2019
I think arena-allocating the AST is the way forward anyway, so I wouldn't mind the lifetime tbh.
mati865 commented on Apr 5, 2019
Is it related to #25088?
eddyb commented on Apr 5, 2019
@mati865 We can figure that out by trying to use a #[thread_local] static GLOBALS: Cell<Option<...>> = Cell::new(None); directly.
alexcrichton commented on Apr 5, 2019
While #[thread_local] can be used to test performance, AFAIK it still doesn't work on MSVC. We do in fact already know that dynamic linking has a hit on performance wrt instruction counts. As to whether that's PLT lookups vs thread local lookups I'm not sure. (I'm hoping to revive that once I get access to Windows again)
nnethercote commented on Apr 5, 2019
I'm asking why a global data structure requires TLS to access it... global data structures and TLS seem entirely orthogonal and incompatible to me. Clearly I'm missing something. What does "multiple instances per process" mean -- instances of what?
Mark-Simulacrum commented on Apr 6, 2019
Rustdoc will use rustc_driver and a set of other APIs to essentially attempt to call rustc as if it was a function. That spawns a thread (or more, with the parallel compiler enabled); each of those threads receives its own copy of these proto-globals; that means that they aren't necessarily global in the standard sense -- more so rustc-local.
eddyb commented on Apr 6, 2019
@nnethercote All "globals" in rustc are "thread-local globals" - as in, they're "global" in the sense of "accessible from a function with no arguments" but scoped to a thread.
And by "rustc supports multiple instances" I meant "multiple instances of itself", i.e. multiple rustc invocations, running concurrently, on disjoint threads, but not interfering with each other.
(But @Mark-Simulacrum explained it better anyway)
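As a rough illustration of "global but scoped to a thread", the sketch below uses the standard thread_local! macro. The NAMES table and the intern and run_two_invocations functions are hypothetical stand-ins, not rustc code; the point is only that a function with no context argument can reach the state, while two threads never see each other's data.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Hypothetical stand-in for an interner: one independent table per thread.
    // A RefCell suffices because no other thread can reach this instance.
    static NAMES: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

/// Interns `name` in the current thread's table and returns its index.
/// No context argument is needed, which is what makes it feel "global".
fn intern(name: &str) -> usize {
    NAMES.with(|names| {
        let mut names = names.borrow_mut();
        if let Some(i) = names.iter().position(|n| n == name) {
            return i;
        }
        names.push(name.to_string());
        names.len() - 1
    })
}

/// Runs two independent "invocations" on separate threads and returns
/// the index each one assigned to "bar".
fn run_two_invocations() -> (usize, usize) {
    // First invocation interns "foo" before "bar".
    let a = thread::spawn(|| {
        intern("foo");
        intern("bar")
    });
    // Second invocation only ever sees "bar".
    let b = thread::spawn(|| intern("bar"));
    (a.join().unwrap(), b.join().unwrap())
}
```

The two threads assign different indices to the same string because each has its own table, which is the non-interference property described above. A plain static would instead force both invocations to share one table behind a lock.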
petrochenkov commented on Apr 6, 2019
cc #59749 (Measure upper limit for performance of 32 bit Span)
The same thing can be measured for the symbol interner as well, I guess, to estimate the impact.
nnethercote commented on Apr 6, 2019
So "thread" doesn't actually mean OS thread, but a rustc invocation that contains one or more OS threads, depending on whether rustc is serial or parallel. And GLOBALS isn't properly global, but only global w.r.t. a single rustc invocation.
These names are... well... I now feel more justified about my prior confusion. I've seen the word "session" used in the code, does that match "rustc invocation" as I've used it above?
I still don't understand how, in a parallel rustc, multiple OS threads can access the same TLS. Does each OS thread end up with a reference to the single mutex-protected quasi-global?
How important is the ability to run multiple rustc invocations? @eddyb said it's used for "rustdoc uses that to compile doc tests". Is it used for anything else?
Mark-Simulacrum commented on Apr 7, 2019
The threads do correspond to OS threads. However, my understanding is that GLOBALS is Session-like (just available earlier in the compilation session). I believe that's your understanding as well.
Yes, sessions are rustc "invocation" specific.
Yes, the TLS just contains a pointer to the actual "global."
My understanding is that doc tests would be considerably slower if we didn't have this in-process multi-invocationy style of building tests. I don't think it's used for anything else, necessarily, beyond perhaps unit tests in a few compiler tests.
I think historically the scoped TLS in the compiler has been used as an implicit context for things like Span, TyCtxt, etc. where there's some associated state that we don't currently thread through manually. I think it's possible that over time we could migrate away from TLS and towards other methods of threading the state through (and/or true globals via e.g. lazy_static) but I am unsure if that's feasible. I think historically it's not really been viable to completely remove (we use it too much, and it may be better than the alternative).
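The answer that "the TLS just contains a pointer to the actual global" can be sketched as follows. This is only an approximation under stated assumptions: it uses an Arc stored in a thread_local! slot rather than the scoped TLS the compiler uses, and the Globals struct, with_globals, and run_invocation are hypothetical names chosen for the example.

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical per-invocation state shared by all worker threads.
struct Globals {
    symbols: Mutex<Vec<String>>,
}

thread_local! {
    // Each OS thread has its own slot, but the slot only holds a pointer.
    static GLOBALS: RefCell<Option<Arc<Globals>>> = RefCell::new(None);
}

/// Accesses the shared state through the current thread's TLS slot.
fn with_globals<R>(f: impl FnOnce(&Globals) -> R) -> R {
    GLOBALS.with(|slot| {
        let slot = slot.borrow();
        let globals = slot.as_ref().expect("GLOBALS not set on this thread");
        f(globals)
    })
}

/// One "invocation": creates the shared state once, hands a pointer to
/// every worker thread, and returns how many symbols were recorded.
fn run_invocation(workers: usize) -> usize {
    let globals = Arc::new(Globals { symbols: Mutex::new(Vec::new()) });
    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let globals = Arc::clone(&globals);
            thread::spawn(move || {
                // Install the pointer in this thread's slot.
                GLOBALS.with(|slot| *slot.borrow_mut() = Some(globals));
                // From here on, code reaches the state with no arguments.
                with_globals(|g| g.symbols.lock().unwrap().push(format!("sym{}", i)));
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let count = globals.symbols.lock().unwrap().len();
    count
}
```

All workers of one invocation write into the same mutex-protected vector, while a second call to run_invocation gets a fresh Globals, so concurrent invocations stay isolated.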
eddyb commented on Apr 7, 2019
We certainly do not consider "true globals" a reasonable limitation for "rustc as a library" (not to mention they'd need locks in cases where today we can use Cell/RefCell), and likely RLS would be impacted too (at least before we add multi-crate sessions to rustc).
Ideally we'd move to some language-integrated "implicit contexts" but that is nowhere near on the horizon.
nnethercote commented on Apr 29, 2019
A problem with the current Symbol implementation is that if you want to convert a Symbol to a string -- which is common -- you have to access the array within Interner, which involves TLS.
I tried changing Symbol so that instead of an index into the interner, it just held a raw (thin) pointer to the string's chars, which avoids this problem. (This required putting the string length in the arena next to the chars. A fat pointer would have made Symbol 16 bytes, which is much too big.) And I also made the interner truly global (using lazy_static) and immortal -- symbols added are never removed. This makes it simpler because there's no subtle reasoning about lifetimes like the current implementation. (And the distinction between InternedString and LocalInternedString might not be necessary.)
I got it working, but unfortunately it was a clear slowdown of a few percent.