
TLS lookups in libsyntax_pos are expensive #59718

Closed
Description by @nnethercote (Contributor)

#59693 is a nice speed-up for rustc, reducing instruction counts by as much as 12%. A comment on #59693 shows that approximately half of the speedup comes from avoiding TLS lookups.

So I thought: what else is using TLS lookups? I did some profiling and found that syntax_pos::GLOBALS accounts for most of them. It has three pieces: symbol_interner, hygiene_data, and span_interner. Here are the counts of accesses to each one via GLOBALS::with:

rustc:
791545069 counts:
(  1) 499029030 (63.0%, 63.0%):     symbol_interner
(  2) 181386140 (22.9%, 86.0%):     hygiene_data
(  3) 109861627 (13.9%, 99.8%):     span_interner

ripgrep:
5455319 counts:
(  1)  2819190 (51.7%, 51.7%):     symbol_interner
(  2)  2015746 (37.0%, 88.6%):     hygiene_data
(  3)   599975 (11.0%, 99.6%):     span_interner

style-servo:
79839701 counts:
(  1) 36436621 (45.6%, 45.6%):     hygiene_data
(  2) 31539114 (39.5%, 85.1%):     symbol_interner
(  3) 11562409 (14.5%, 99.6%):     span_interner

webrender:
27006839 counts:
(  1) 11021232 (40.8%, 40.8%):     hygiene_data
(  2)  9218693 (34.1%, 74.9%):     symbol_interner
(  3)  6707365 (24.8%, 99.8%):     span_interner

These measurements are from a rustc that didn't have #59693's change applied, which avoids almost all of the span_interner accesses. And those accesses were only 11.0-24.8% of the syntax_pos::GLOBALS accesses. In other words, if we could eliminate most or all of the hygiene_data and symbol_interner accesses, we'd get even bigger wins than what we saw in #59693.

I admit that I don't understand how syntax_pos::GLOBALS works, or why a TLS reference is needed for a global value.

One possible idea is to increase the size of Symbol from 4 bytes to 8 bytes, and then store short symbols (7 bytes or less) inline. Some preliminary profiling suggests this could capture roughly half of the symbols. hygiene_data is a harder nut to crack, being a more complicated structure.
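A minimal sketch of the inline-symbol idea follows. The comment does not specify an encoding, so this one is assumed: `Symbol` becomes an 8-byte array. For strings of up to 7 bytes, byte 7 holds a flag bit plus the length and the characters sit in the remaining bytes. Longer strings store a `u32` interner index in bytes 0..4 and leave byte 7 zero. `Interner`, `as_inline_str` and `as_str` are illustrative names, not rustc APIs. The point is that resolving an inline symbol never touches the interner table, which in rustc sits behind the `GLOBALS` TLS lookup.

```rust
use std::collections::HashMap;

/// Set in byte 7 when the characters are stored inline
/// (hypothetical encoding; byte 7 is zero for interned symbols).
const INLINE_FLAG: u8 = 0x80;

/// Hypothetical 8-byte Symbol: strings of up to 7 bytes live inline,
/// longer ones are a u32 index (bytes 0..4) into an `Interner`.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct Symbol([u8; 8]);

#[derive(Default)]
pub struct Interner {
    names: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    pub fn intern(&mut self, s: &str) -> Symbol {
        let bytes = s.as_bytes();
        let mut buf = [0u8; 8];
        if bytes.len() <= 7 {
            // Short string: pack it inline; the table is never consulted.
            buf[..bytes.len()].copy_from_slice(bytes);
            buf[7] = INLINE_FLAG | bytes.len() as u8;
            return Symbol(buf);
        }
        let idx = if let Some(&idx) = self.names.get(s) {
            idx
        } else {
            let idx = self.strings.len() as u32;
            self.strings.push(s.to_owned());
            self.names.insert(s.to_owned(), idx);
            idx
        };
        buf[..4].copy_from_slice(&idx.to_le_bytes());
        Symbol(buf)
    }
}

impl Symbol {
    /// Returns the characters if they are stored inline, else None.
    pub fn as_inline_str(&self) -> Option<&str> {
        if (self.0[7] & INLINE_FLAG) == 0 {
            return None;
        }
        let len = (self.0[7] & !INLINE_FLAG) as usize;
        // Cannot fail: the bytes were copied from a valid &str.
        std::str::from_utf8(&self.0[..len]).ok()
    }

    /// Only interned (long) symbols need the table, i.e. the part that
    /// rustc reaches through the GLOBALS TLS lookup.
    pub fn as_str<'a>(&'a self, interner: &'a Interner) -> &'a str {
        match self.as_inline_str() {
            Some(s) => s,
            None => {
                let mut idx = [0u8; 4];
                idx.copy_from_slice(&self.0[..4]);
                interner.strings[u32::from_le_bytes(idx) as usize].as_str()
            }
        }
    }
}
```

Equality stays a plain 8-byte comparison, because an inline symbol and an interned one always differ in byte 7. The cost is doubling `Symbol` from 4 to 8 bytes, which may grow any structure that stores one.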

cc @rust-lang/wg-compiler-performance

Activity

The I-compiletime label (Issue: Problems and improvements with respect to compile times.) was added on Apr 5, 2019.

eddyb (Member) commented on Apr 5, 2019

Could these be sped up by using #[thread_local] directly, and maybe static linking instead of dynamic linking?
cc @alexcrichton

> I admit that I don't understand how syntax_pos::GLOBALS works, why the TLS reference is needed for a global value.

If you're asking why it's not a plain static, that's because those should pretty much never be used, as rustc supports multiple instances per process (and e.g. rustdoc uses that to compile doc tests).

Zoxc (Contributor) commented on Apr 5, 2019

#59655 allows you to compare symbols against a predefined list of symbols without doing a TLS lookup and a string comparison. That will hopefully help some.
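To illustrate the idea (this is not the code from #59655), here is a sketch assuming an interner that is pre-filled from a fixed list, so each predefined string gets a known index that can be written down as a constant. `PREDEFINED`, the `sym` module and `is_fn_keyword` are hypothetical names. Checking whether a symbol is `fn` then becomes a `u32` comparison rather than fetching the string through TLS and comparing bytes.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct Symbol(u32);

/// Hypothetical predefined strings. The interner is pre-filled in this
/// order, so entry i always gets index i and the constants below match.
const PREDEFINED: &[&str] = &["as", "fn", "let", "mut"];

/// Compile-time constants for the predefined symbols.
pub mod sym {
    use super::Symbol;
    pub const AS: Symbol = Symbol(0);
    pub const FN: Symbol = Symbol(1);
    pub const LET: Symbol = Symbol(2);
    pub const MUT: Symbol = Symbol(3);
}

pub struct Interner {
    names: HashMap<String, Symbol>,
    strings: Vec<String>,
}

impl Interner {
    /// Creates an interner whose first entries are the predefined symbols.
    pub fn prefilled() -> Self {
        let mut interner = Interner { names: HashMap::new(), strings: Vec::new() };
        for s in PREDEFINED {
            interner.intern(s);
        }
        interner
    }

    pub fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&symbol) = self.names.get(s) {
            return symbol;
        }
        let symbol = Symbol(self.strings.len() as u32);
        self.strings.push(s.to_owned());
        self.names.insert(s.to_owned(), symbol);
        symbol
    }

    /// Getting the text back still needs the interner (in rustc, via TLS).
    pub fn as_str(&self, symbol: Symbol) -> &str {
        self.strings[symbol.0 as usize].as_str()
    }
}

/// A plain u32 comparison: no interner access and no string comparison.
pub fn is_fn_keyword(s: Symbol) -> bool {
    s == sym::FN
}
```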

I'm also working on a PR that removes Symbol usage from symbol names (which tend to be unique and don't benefit from interning).

I'd also like to replace the u32 in Symbol with &'tcx str, but that would require adding a lifetime to the AST.

eddyb (Member) commented on Apr 5, 2019

I think arena-allocating the AST is the way forward anyway, so I wouldn't mind the lifetime tbh.

mati865 (Member) commented on Apr 5, 2019

Is it related to #25088?

eddyb (Member) commented on Apr 5, 2019

@mati865 We can figure out by trying to use a #[thread_local] static GLOBALS: Cell<Option<...>> = Cell::new(None); directly.

alexcrichton (Member) commented on Apr 5, 2019

While #[thread_local] can be used to test performance, AFAIK it still doesn't work on MSVC. We do already know that dynamic linking hurts performance in terms of instruction counts, but I'm not sure whether that's due to PLT lookups or thread-local lookups. (I'm hoping to revive that once I get access to Windows again.)

nnethercote (Contributor, Author) commented on Apr 5, 2019

> If you're asking why it's not a plain static, that's because those should pretty much never be used, as rustc supports multiple instances per process (and e.g. rustdoc uses that to compile doc tests).

I'm asking why a global data structure requires TLS to access it... global data structures and TLS seem entirely orthogonal and incompatible to me. Clearly I'm missing something. What does "multiple instances per process" mean -- instances of what?

Mark-Simulacrum (Member) commented on Apr 6, 2019

Rustdoc uses rustc_driver and a set of other APIs to essentially call rustc as if it were a function. That spawns a thread (or more, with the parallel compiler enabled); each of those threads receives its own copy of these proto-globals, which means they aren't global in the standard sense but rather rustc-local.

eddyb (Member) commented on Apr 6, 2019

@nnethercote All "globals" in rustc are "thread-local globals" - as in, they're "global" in the sense of "accessible from a function with no arguments" but scoped to a thread.

And by "rustc supports multiple instances" I meant "multiple instances of itself", i.e. multiple rustc invocations running concurrently on disjoint threads without interfering with each other.
(But @Mark-Simulacrum explained it better anyway)
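The "thread-local global" pattern can be sketched with the standard library alone, roughly in the spirit of the `scoped-tls` crate's `ScopedKey::set`/`with` (which this sketch does not use). The TLS slot holds only a pointer: `set_globals` installs it for the duration of a closure, and every `with_globals` call pays for a TLS lookup. `Globals`, `set_globals`, `with_globals`, `intern` and `run_invocation` are hypothetical names; `run_invocation` stands in for one rustc invocation on its own thread, so two invocations get independent interners.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;
use std::thread;

/// Hypothetical per-invocation state standing in for syntax_pos::Globals.
#[derive(Default)]
pub struct Globals {
    symbols: RefCell<HashMap<String, u32>>,
}

thread_local! {
    /// The TLS slot holds only a pointer; null means "not set".
    static GLOBALS: Cell<*const Globals> = Cell::new(std::ptr::null());
}

/// Makes `globals` reachable through `with_globals` while `f` runs.
pub fn set_globals<R>(globals: &Globals, f: impl FnOnce() -> R) -> R {
    // Restores the previous pointer on exit, even if `f` panics.
    struct Reset(*const Globals);
    impl Drop for Reset {
        fn drop(&mut self) {
            GLOBALS.with(|slot| slot.set(self.0));
        }
    }
    let prev = GLOBALS.with(|slot| slot.replace(globals as *const Globals));
    let _reset = Reset(prev);
    f()
}

/// Every call performs a TLS lookup: the cost this issue is about.
pub fn with_globals<R>(f: impl FnOnce(&Globals) -> R) -> R {
    GLOBALS.with(|slot| {
        let ptr = slot.get();
        assert!(!ptr.is_null(), "GLOBALS not set on this thread");
        // SAFETY: `set_globals` keeps the referent alive while the pointer is installed.
        f(unsafe { &*ptr })
    })
}

/// Hypothetical interning helper that goes through the thread-local globals.
pub fn intern(s: &str) -> u32 {
    with_globals(|g| {
        let mut map = g.symbols.borrow_mut();
        let next = map.len() as u32;
        let id = *map.entry(s.to_owned()).or_insert(next);
        id
    })
}

/// Simulates one compiler invocation on its own thread, with its own Globals.
pub fn run_invocation(names: &'static [&'static str]) -> Vec<u32> {
    thread::spawn(move || {
        let globals = Globals::default();
        set_globals(&globals, || names.iter().map(|s| intern(s)).collect::<Vec<u32>>())
    })
    .join()
    .unwrap()
}
```

Because each invocation's `Globals` is only reachable through its own thread's slot, it can use `RefCell` rather than a lock.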

petrochenkov (Contributor) commented on Apr 6, 2019

cc #59749 (Measure upper limit for performance of 32 bit Span)

The same thing can be measured for the symbol interner as well, I guess, to estimate the impact.

nnethercote (Contributor, Author) commented on Apr 6, 2019

> That spawns a thread (or more, with the parallel compiler enabled)

So "thread" doesn't actually mean OS thread, but a rustc invocation that contains one or more OS threads, depending on whether rustc is serial or parallel. And GLOBALS isn't properly global, but only global w.r.t. a single rustc invocation.

These names are... well... I now feel more justified about my prior confusion. I've seen the word "session" used in the code, does that match "rustc invocation" as I've used it above?

I still don't understand how, in a parallel rustc, multiple OS threads can access the same TLS. Does each OS thread end up with a reference to the single mutex-protected quasi-global?

How important is the ability to run multiple rustc invocations? @eddyb said it's used for "rustdoc uses that to compile doc tests". Is it used for anything else?

Mark-Simulacrum (Member) commented on Apr 7, 2019

The threads do correspond to OS threads. However, my understanding is that GLOBALS is Session-like (just available earlier in the compilation session). I believe that's your understanding as well.

Yes, sessions are rustc "invocation" specific.

> I still don't understand how, in a parallel rustc, multiple OS threads can access the same TLS. Does each OS thread end up with a reference to the single mutex-protected quasi-global?

Yes, the TLS just contains a pointer to the actual "global."
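A sketch of that arrangement, assuming one session with several worker threads: each worker sets its own TLS slot to point at the same `Globals`, which therefore has to be `Sync` (a `Mutex` here, where a single-threaded session could use `RefCell`). `Globals`, `intern` and `run_session` are hypothetical names, and `std::thread::scope` stands in for however the parallel compiler actually spawns and joins its threads.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

/// Hypothetical session-wide state. Several threads share it, so it must
/// be Sync: a Mutex here, where a single-threaded session could use RefCell.
#[derive(Default)]
pub struct Globals {
    symbols: Mutex<HashMap<String, u32>>,
}

thread_local! {
    /// Each worker's TLS slot holds a pointer to the one shared Globals.
    static GLOBALS: Cell<*const Globals> = Cell::new(std::ptr::null());
}

/// Interns `s` in whatever Globals this thread's TLS slot points at.
fn intern(s: &str) -> u32 {
    GLOBALS.with(|slot| {
        let ptr = slot.get();
        assert!(!ptr.is_null(), "GLOBALS not set on this thread");
        // SAFETY: `run_session` keeps `globals` alive until every worker has joined.
        let globals = unsafe { &*ptr };
        let mut map = globals.symbols.lock().unwrap();
        let next = map.len() as u32;
        let id = *map.entry(s.to_owned()).or_insert(next);
        id
    })
}

/// Runs one session with `workers` threads that all point at the same Globals.
pub fn run_session(workers: usize) -> Vec<u32> {
    let globals = Globals::default();
    thread::scope(|scope| {
        let handles: Vec<_> = (0..workers)
            .map(|_| {
                let globals = &globals;
                scope.spawn(move || {
                    GLOBALS.with(|slot| slot.set(globals as *const Globals));
                    intern("shared_name")
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Every worker that interns `"shared_name"` gets the same index, since they all lock the same map through their own per-thread pointers.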

> How important is the ability to run multiple rustc invocations? @eddyb said it's used for "rustdoc uses that to compile doc tests". Is it used for anything else?

My understanding is that doc tests would be considerably slower if we didn't have this in-process multi-invocationy style of building tests. I don't think it's used for anything else, necessarily, beyond perhaps unit tests in a few compiler tests.

I think historically the scoped TLS in the compiler has been used as an implicit context for things like Span, TyCtxt, etc. where there's some associated state that we don't currently thread through manually. I think it's possible that over time we could migrate away from TLS and towards other methods of threading the state through (and/or true globals via e.g. lazy_static) but I am unsure if that's feasible. I think historically it's not really been viable to completely remove (we use it too much, and it may be better than the alternative).

eddyb (Member) commented on Apr 7, 2019

We certainly do not consider "true globals" a reasonable limitation for "rustc as a library" (not to mention they'd need locks in cases where today we can use Cell/RefCell), and likely RLS would be impacted too (at least before we add multi-crate sessions to rustc).

Ideally we'd move to some language-integrated "implicit contexts" but that is nowhere near on the horizon.

nnethercote (Contributor, Author) commented on Apr 29, 2019

A problem with the current Symbol implementation is that if you want to convert a Symbol to a string -- which is common -- you have to access the array within Interner, which involves TLS.

I tried changing Symbol so that instead of an index into the interner, it just held a raw (thin) pointer to the string's chars, which avoids this problem. (This required putting the string length in the arena next to the chars. A fat pointer would have made Symbol 16 bytes, which is much too big.) And I also made the interner truly global (using lazy_static) and immortal -- symbols added are never removed. This makes it simpler because there's no subtle reasoning about lifetimes like the current implementation. (And the distinction between InternedString and LocalInternedString might not be necessary.)

I got it working, but unfortunately it was a clear slowdown of a few percent.
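A minimal sketch of the thin-pointer design described in this comment. The buffer layout `[u32 length, little-endian][UTF-8 bytes]`, the use of `Box::leak` for immortality, and `std::sync::OnceLock` in place of `lazy_static` are illustrative choices, not details taken from the experiment. `Symbol::as_str` reads the length and characters straight through the pointer, with no interner or TLS access; interning, on the other hand, now takes a process-wide lock.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Hypothetical Symbol: a thin pointer (one word) to a leaked buffer laid
/// out as [length: u32, little-endian][UTF-8 bytes].
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct Symbol(*const u8);

// SAFETY: a Symbol only ever points at immutable memory that is never freed.
unsafe impl Send for Symbol {}
unsafe impl Sync for Symbol {}

/// A truly global, immortal interner (OnceLock used here in place of lazy_static).
fn interner() -> &'static Mutex<HashMap<&'static str, Symbol>> {
    static INTERNER: OnceLock<Mutex<HashMap<&'static str, Symbol>>> = OnceLock::new();
    INTERNER.get_or_init(|| Mutex::new(HashMap::new()))
}

impl Symbol {
    pub fn intern(s: &str) -> Symbol {
        let mut map = interner().lock().unwrap();
        if let Some(&symbol) = map.get(s) {
            return symbol;
        }
        assert!(s.len() <= u32::MAX as usize, "symbol too long");
        // Store the length right before the characters, then leak the
        // buffer so it lives for the rest of the process.
        let mut buf = Vec::with_capacity(4 + s.len());
        buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
        buf.extend_from_slice(s.as_bytes());
        let leaked: &'static [u8] = Box::leak(buf.into_boxed_slice());
        let symbol = Symbol(leaked.as_ptr());
        // The map key borrows the same leaked bytes, so the text is stored once.
        let key = std::str::from_utf8(&leaked[4..]).unwrap();
        map.insert(key, symbol);
        symbol
    }

    /// Reads the length and characters through the pointer: no interner
    /// access and no TLS lookup.
    pub fn as_str(self) -> &'static str {
        // SAFETY: `self.0` came from `intern`, so it points at a live,
        // immutable [len][bytes] buffer holding valid UTF-8.
        unsafe {
            let mut len_bytes = [0u8; 4];
            std::ptr::copy_nonoverlapping(self.0, len_bytes.as_mut_ptr(), 4);
            let len = u32::from_le_bytes(len_bytes) as usize;
            let bytes = std::slice::from_raw_parts(self.0.add(4), len);
            std::str::from_utf8_unchecked(bytes)
        }
    }
}
```

This only shows the shape of the idea. As the comment reports, the real experiment was a few percent slower, so the sketch is not a recommendation.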


Metadata

Assignees: No one assigned
Labels: A-thread-locals (Area: Thread local storage (TLS)), C-enhancement (Category: An issue proposing an enhancement or a PR with one.), I-compiletime (Issue: Problems and improvements with respect to compile times.), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
Participants: @Zoxc, @alexcrichton, @eddyb, @mati865, @jonas-schievink
