Parser::Lexer#advance method is too big for JIT compilers #871
Maybe we could experiment by doing that as post-processing, e.g. wrapping

    when 1
      some
      logic
    when 2

in a lambda like

    when 1
      -> {
        some
        logic
      }.call
    when 2

And possibly only do so when there is more than some threshold of code for a semantic action.
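For illustration, a minimal runnable sketch of that rewrite on a made-up semantic action (the names `_acts`, `emit_int` and the token handling are hypothetical, not the real generated identifiers):

```ruby
# A made-up stand-in for one branch of the generated dispatch.
def emit_int(text)
  puts "tINTEGER #{text}"
end

_acts = 1
value = 42

# Before: the action body sits directly inside the when branch, so it all
# counts towards the size of the single #advance method.
case _acts
when 1
  base = 10
  emit_int(value.to_s(base))
when 2
  # other action...
end

# After post-processing: the same body wrapped in an immediately-called
# lambda, so the branch itself stays tiny and the body could potentially be
# compiled as a separate unit.
case _acts
when 1
  -> {
    base = 10
    emit_int(value.to_s(base))
  }.call
when 2
  # other action...
end
```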
I don't know, last time I checked it (~2y ago) it worked but I had similar number of LOCs.
No.
True, but it will also make it slightly slower on MRI, right?
Not sure. A long time ago I made this - it's a port of the lexer to C, to make Opal faster (mostly based on the PR you mentioned). I last updated it ~3 years ago; basically it's no longer maintained. Is it possible to check how it behaves on TruffleRuby? On C and register allocation optimisation - I think if two locals do not co-exist at the same time, the compiler can assign both of them to the same register.
IIRC the plan is to release it in a year or two (cc @kddnewton) so I don't know if it's ok for you.
It could, we'd have to measure it to know. Probably not much if it's extracted methods. We could also potentially generate two lexer files, especially with the automated lambda approach in #871 (comment) that'd make more sense. So we'd use the current state on CRuby but the "lambda-ified" versions for TruffleRuby, JRuby, etc.
Ah yes I remembered something like that but couldn't find it anymore. I'll try it.
Right, that's correct. It's probably much harder to analyze if it's
The parser side of things in the new project is pretty far off, but the lexer side of things is actually quite a bit closer. Maybe it's worth investigating using the shared library just for the lexing bit?
No, I think this won't work. In MRI the parser tells the lexer what to do in multiple places, i.e. you can't tokenize without performing reduce actions in the parser 😄. What that means is that the state of the parser is tightly coupled with the state of the lexer, and any other lexer (with a different internal state) won't match the parser (and the other way around), so only this (or a very similar) lexer can work with this parser. Yeah, this is one of the reasons why
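A contrived sketch of that kind of coupling (the class, flag and token names are invented for illustration, not the actual parser/lexer API): a reduce action in the parser flips a lexer flag, and the very next token depends on it, so tokens cannot be produced up front.

```ruby
# Hypothetical illustration: the lexer's behaviour depends on state that
# only the parser knows how to update mid-parse.
class TinyLexer
  attr_accessor :in_command_args   # flag a parser reduce action would toggle

  def initialize
    @in_command_args = false
  end

  def next_token(str)
    # The same characters lex differently depending on parser-driven state.
    if str == "do"
      @in_command_args ? :kDO_BLOCK : :kDO
    else
      :tIDENTIFIER
    end
  end
end

lexer = TinyLexer.new
p lexer.next_token("do")        # => :kDO
lexer.in_command_args = true    # set by the parser while reducing a rule
p lexer.next_token("do")        # => :kDO_BLOCK
```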
This issue applies to Opal likewise, when the compiler is run in Node. V8 cannot optimize, because the method is too big.
On that subject, https://sillycross.github.io/2022/11/22/2022-11-22/ makes a good case for C compilers not being able to optimize big switch/case statements well. I think these huge lexing/parsing functions/methods are a problem regardless of the language. When written in C and compiled AOT at least it compiles, but it's probably suboptimal (still much better than interpreted, of course).
lib-ruby-parser also uses Bison and its main function consists of 8513 LOC and 775 case/switch branches, but it's very performant (~600K LOC/s), so I'm pretty sure it can potentially be optimised. Could you test it please? Or is it incompatible with TruffleRuby in terms of C APIs? Anyway, I don't see any other solution, so I guess it's worth trying.
BTW, I noticed in Ripper, the lexing function seems to be I tried
I'll try lib-ruby-parser. The gem is only distributed as precompiled and that can't be used on TruffleRuby, so I'll try to build it from the repo. I think a potential solution here is to extract actions into lambdas like in #871 (comment), or do the same in TruffleRuby's parser/translator based on some heuristic. I haven't attempted that yet though, and it seems easier to prototype it with post-processing than changing the TruffleRuby translator for that.
I don't think it's an easy task. The library is written in Rust, the C bindings are a wrapper around the Rust lib (and it's pre-compiled for major platforms in the Releases section), the Ruby bindings are a wrapper around the C bindings, so... you'll have to re-compile the whole chain of libraries into TruffleRuby's format.
Many actions in the lexer depend on local variables that are part of this giant switch, so it's possible only for a small subset of trivial actions. IIRC any Ragel DSL (like
I gave it a quick try now. The main hurdle is the precompiled
But it gives me
I don't think that's an issue because lambdas can simply read/write outer variables, as if the lambda was not there.
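A quick runnable reminder of that property: a lambda defined inside a method closes over the method's locals, so reads and writes inside it behave as if the wrapping were not there (variable and method names below are just for the demo).

```ruby
def count_tokens
  pos    = 0      # locals of the enclosing method
  tokens = []

  advance = -> {
    pos += 1                 # writes the outer local
    tokens << "tok#{pos}"    # reads and appends to the outer local
  }

  3.times { advance.call }
  [pos, tokens]
end

p count_tokens  # => [3, ["tok1", "tok2", "tok3"]]
```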
Ah, as I understand it, you want to define these lambdas in the

Do you know how much should be extracted to make this method suitable for JIT? There are several big actions that are emitted 16 times 😄 Also, I've just realised that Ragel supports embedding one FSM into another; maybe then it will generate code for them separately. I can't google it right now, but I think I've seen it somewhere.
This also impacts JRuby, although to a lesser extent. MRI
JRuby
Indeed the
This "instruction count 25005 exceeds threshold of 1000" is an artificial limit in JRuby, but even if I bump it up to 26000 is still fails because then the JVM bytecode produced is too large. We set the limit this low because we know the JVM itself has limits on how large a method it's willing to JIT to native code. Years ago I experimented with pulling out I also experimented with using a tree of case/when, simply chunking the method into pieces. That can probably make it small enough but it requires post-processing the ragel output. This issue may be more relevant today with CRuby starting to gain JIT capabilities, since this giant method may also confound yjit. |
Right. Even though these lambdas would be part of the
Yeah, but since that could be an automated process we could have the same lexer as today for CRuby/non-JIT and the lambda-ized (or similar) lexer for the rest, so two files. Another way to make the lexer method even smaller would be to have it as some kind of table dispatch, and so have an Array or Hash of Integer state -> lambda. That means we would need to declare all those lambdas in some method and declare the shared variables there. That's clearly more automated rewriting though.
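A minimal sketch of that table-dispatch idea under the stated assumptions (shared variables declared once, each state handled by a lambda stored in an Array indexed by the state number; all names are invented for the demo):

```ruby
def build_dispatch_table
  pos    = 0     # shared variables all the state lambdas close over
  tokens = []

  table = []
  table[1] = -> { pos += 1; tokens << :tIDENTIFIER }
  table[2] = -> { pos += 2; tokens << :tINTEGER }

  reader = -> { [pos, tokens] }   # just so the demo can inspect the state
  [table, reader]
end

table, reader = build_dispatch_table
table[1].call
table[2].call
p reader.call  # => [3, [:tIDENTIFIER, :tINTEGER]]
```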
I've tried to extract

However it doesn't make any difference on JRuby. I've also tried to extract part of the lexer into a separate file using
@iliabylich Nice, could you share a branch or PR with that change? I'd like to take a closer look at whether it compiles and in which tier, and also try it on bigger files.
@iliabylich I've tried it (with
If I increase the limit from 150000 (default) to 300000 it does compile, but the compilation takes ~20 seconds which is a big part of why such limits exist:
Which command line/script are you using? I guess you're measuring 15 parses at once or so like Charles above?
So there is a clear speedup once it gets compiled in Tier 1 from ~2.5s to ~0.4s and not much difference in Tier 2 (because it doesn't inline anything anyway since it's so big).
In any case, your PR is helpful, I suggest merging it.
I ran this (on
With this diff:

diff --git a/Rakefile b/Rakefile
index 74af484..c9e0391 100644
--- a/Rakefile
+++ b/Rakefile
@@ -152,7 +152,7 @@ task :changelog do
end
rule '.rb' => '.rl' do |t|
- sh "ragel -F1 -R #{t.source} -o #{t.name}"
+ sh "ragel -F0 -R #{t.source} -o #{t.name}"
end
rule '.rb' => '.y' do |t|
diff --git a/lib/parser/lexer.rl b/lib/parser/lexer.rl
index cea2fc6..3ac32e8 100644
--- a/lib/parser/lexer.rl
+++ b/lib/parser/lexer.rl
@@ -272,6 +272,7 @@ class Parser::Lexer
_lex_to_state_actions = klass.send :_lex_to_state_actions
_lex_from_state_actions = klass.send :_lex_from_state_actions
_lex_eof_trans = klass.send :_lex_eof_trans
+ _lex_actions = klass.send :_lex_actions
pe = @source_pts.size + 2
p, eof = @p, pe

It still doesn't fit in the default limit, but the size is already much smaller.
AST -36%, Calls -18%, IR (after) -24%, CodeSize -21%.

Also it fits in --engine.MaximumGraalGraphSize=200000, unlike F1 which needs --engine.MaximumGraalGraphSize=250000 for the
I think that's a big enough improvement to actually generate two lexer files.

CRuby 3.1 seems slightly slower with F0:
I'll try -T0 too.
It seems to compile a bit faster but unsure if significant.
CRuby:
Diff for the record:

diff --git a/Rakefile b/Rakefile
index 74af484..9455dce 100644
--- a/Rakefile
+++ b/Rakefile
@@ -152,7 +152,7 @@ task :changelog do
end
rule '.rb' => '.rl' do |t|
- sh "ragel -F1 -R #{t.source} -o #{t.name}"
+ sh "ragel -T0 -R #{t.source} -o #{t.name}"
end
rule '.rb' => '.y' do |t|
diff --git a/lib/parser/lexer.rl b/lib/parser/lexer.rl
index cea2fc6..d0d8db6 100644
--- a/lib/parser/lexer.rl
+++ b/lib/parser/lexer.rl
@@ -264,7 +264,7 @@ class Parser::Lexer
# Ugly, but dependent on Ragel output. Consider refactoring it somehow.
klass = self.class
_lex_trans_keys = klass.send :_lex_trans_keys
- _lex_key_spans = klass.send :_lex_key_spans
+ # _lex_key_spans = klass.send :_lex_key_spans
_lex_index_offsets = klass.send :_lex_index_offsets
_lex_indicies = klass.send :_lex_indicies
_lex_trans_targs = klass.send :_lex_trans_targs
@@ -272,6 +272,10 @@ class Parser::Lexer
_lex_to_state_actions = klass.send :_lex_to_state_actions
_lex_from_state_actions = klass.send :_lex_from_state_actions
_lex_eof_trans = klass.send :_lex_eof_trans
+ _lex_actions = klass.send :_lex_actions
+ _lex_key_offsets = klass.send :_lex_key_offsets
+ _lex_single_lengths = klass.send :_lex_single_lengths
+ _lex_range_lengths = klass.send :_lex_range_lengths
pe = @source_pts.size + 2
p, eof = @p, pe

So
* F0 generates a much smaller Parser::Lexer#advance method which is much better for JITs.
* See whitequark#871
Running on TruffleRuby JVM CE dev, with F0 and
With the same limit and F0 it's 10x slower (e.g. 2.463123072999224). With a high limit and F1 it's basically the same speed but F0 is much faster to compile:
The F0 patch is slower on JRuby. Did anyone test that before merging?
The F0 patch just merged degrades performance on JRuby and does not appear to help TruffleRuby except with experimental options: #897. I'm not sure why this was merged so quickly without more testing. It penalizes JRuby performance by almost 40% and only helps non-experimental TruffleRuby by about 3%.
There was a lot of testing, I already spent almost a day on it. But indeed I forgot to test on JRuby. See #898.
FWIW, I wonder if Ragel is a good choice for

So let's keep trimming
I wonder if Ragel's Ruby generator is good for anything, to be honest. It suffers from the same problems as the Java generator: giant methods with no gotos, so it is a big ugly state machine inside a case/when or switch. The same problems plague the JRuby json library, which uses Ragel to generate part of the Java implementation. If it generated bytecode with gotos, it would be far more efficient, but it generates a giant .java method instead.
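For a sense of that problem, here is roughly the shape a goto-free generator is forced into in Ruby (a deliberately tiny sketch, not actual Ragel output): every transition goes back through one loop plus one big case on the current state, so the whole machine lives in a single method.

```ruby
# Simplified sketch of a goto-free state machine: one method, one loop,
# one case/when per state. Real generated output is far larger and
# table-driven, but has the same "everything in a single method" shape.
def run_fsm(input)
  state = :start
  count = 0
  input.each_char do |c|
    case state
    when :start
      state = :digits if c =~ /\d/
    when :digits
      unless c =~ /\d/
        count += 1          # finished one run of digits
        state = :start
      end
    end
  end
  count += 1 if state == :digits  # run ending at end of input
  count
end

p run_fsm("a12b345")  # => 2 runs of digits
```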
I remeasured today, with the latest

So the numbers from #871 (comment) look 3-4x faster now.
So about 3.5x as fast as CRuby, pretty cool! Trying latest master 357fefa:
About the same performance, and now it fits in MaximumGraalGraphSize=180000, progress! Trying iliabylich/extract-actions 5b2135e:
The performance is about the same, maybe a tiny bit slower but unsure if that's significant. So I think using
Performance-wise

On the bright side, 5b2135e manages to reduce the graph size by ~20000 (and 180k - 20k almost fits in the default 150k!), the AST by 2090 nodes and the CodeSize by 51425 bytes (Tier 2). So I think there are two conclusions:
I personally think it looks nice to pass

Also, from this comment, I will try without having any block inside #advance to see the effect on MaximumGraalGraphSize and performance. Note I won't be on this computer until January, so I won't be able to measure until then.
So when moving all blocks outside of Lexer#advance in #902, I see this:
That seems faster for performance (0.062 vs 0.065 for 357fefa), which is a good start.

Update: after fixing #902 to pass the CI, here are the numbers, very similar:
* This gives a big compilation size and time benefit, because the local variables of Lexer#advance are no longer captured by those blocks (they are captured even if unused, due to Ruby semantics and `Proc#binding`).
* See whitequark#871 (comment)
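A small demonstration of why that matters: any block or lambda created inside a method keeps the whole set of enclosing locals reachable through `Proc#binding`, even if it never mentions them, so a block inside #advance pins every one of that method's many locals (the names below are just stand-ins).

```ruby
def with_block
  _lex_trans_keys = Array.new(10_000, 0)  # stand-in for a big, unused local
  callback = -> { :done }                 # the block never mentions it...
  callback
end

cb = with_block
# ...but it stays reachable through the block's binding:
p cb.binding.local_variable_get(:_lex_trans_keys).size  # => 10000
```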
I've extracted A LOT in #901 and it's still not enough? I'm not sure we are even able to reach the minimum threshold you mentioned: there are still a few actions that could be extracted, but they are either small and repeated < 5 times, or mid-size and repeated only once. I'm not going to merge it yet, so no rush.
I am unsure, maybe it is already enough, the above is just an estimate.
At this point I lean toward agreeing with @iliabylich that the gains from these changes may not warrant the additional code complexity. It's clear that even with the most aggressive outlining of code we cannot get the method body small enough for TruffleRuby or JRuby to compile out of the box.

On TruffleRuby, experimental options can be passed to force the method to compile, but I assume those limits were put in place to prevent an explosion of graph size and native code across the runtime. In JRuby, we can force it to try to compile to JVM bytecode, but the resulting bytecode size is still far too large for the JVM's classfile format (64k of bytecode). Even if we could get the size under the 64k limit, we'd end up in the same place as TruffleRuby, with the HotSpot JIT refusing to native-compile such a large method.

The plan going forward for JRuby, as it always has been, is to enable JIT-level "chunking" of methods. We have some prototype code floating around that can automatically break a large method into smaller virtual methods, but work remains to make it generally useful. When combined with homogeneous "switch" optimizations for "case/when" we might be able to get those virtual method chunks to JIT individually in JRuby, and then the JVM can inline and native-compile the hot paths as needed.

I think the PRs that pull out obvious code duplication should remain in place, so long as they do not affect CRuby performance. The other more exotic changes, such as moving local state variables into instance variables, probably do not help anyone enough to warrant the complexity.
I think we could try experimenting with extracting parts of the Lexer into sub-lexers. For example, string lexing could be moved to a separate class that is called if the main lexer's state is

Many other things (keywords, numeric literals, etc.) could be moved if this approach doesn't make it significantly slower (I really hope it doesn't). It's a major refactoring, so I need some time. I'll ping you once there's something to try.
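A hypothetical sketch of that delegation (class, state, and token names invented for illustration, not the real lexer): the main lexer only checks which sub-lexer owns the current state and forwards to it, so each sub-lexer's dispatch method stays small enough to compile on its own.

```ruby
# Hypothetical sub-lexer split: strings handled by their own class with
# its own (much smaller) state machine.
class StringSubLexer
  def advance(source, pos)
    # ...real string-literal scanning would live here...
    [:tSTRING, source[pos]]
  end
end

class MainLexer
  STRING_STATES = [:expr_string].freeze

  def initialize
    @string_lexer = StringSubLexer.new
    @state = :expr_beg
  end

  def advance(source, pos)
    if STRING_STATES.include?(@state)
      @string_lexer.advance(source, pos)   # delegate, keeping #advance small
    else
      [:tIDENTIFIER, source[pos]]
    end
  end
end

lexer = MainLexer.new
p lexer.advance("abc", 0)  # => [:tIDENTIFIER, "a"]
```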
@iliabylich Thank you for your continued work on this! Yes, that seems like a good approach; basically you'll be manually "chunking" these lexers into smaller components that might be able to compile on their own. @enebo and I are standing by to help.

I also want to make it clear: we very much do want to be able to native-compile this code, but we're so far away from that goal at this point that I think it's worth reevaluating our approach. I outlined a ton of code, reducing instruction counts by a large percentage. Those changes are probably worth keeping, since they just turn a few common blocks of code into utility methods, but the later changes started to have diminishing returns (and we're still miles away from native-compiling this monster on JRuby/JVM).
I think for TruffleRuby we're getting pretty close to the default limit, and then we are able to compile the

Of course anything to make it even smaller is good for warmup time, but for TruffleRuby the only requirement to JIT it is that it fits in those 150k nodes.
I tried #905, it looks great. Before: 667d60c
After: 5b16cd0
So it fits in 153k graph nodes with that PR! (vs 180k before) Also, as we can see, the performance difference is huge when it's compiled vs not: it's 38 times as fast when compiled!
@eregon Thanks, I'll try to get this PR merged this week. Another thing I wanted to try is moving keyword handling from compile time (i.e. Ragel rules) to runtime, by merging it with a generic identifier rule and comparing against known keywords on the fly. I hope it will reduce the number of branches in the biggest case/when by a lot.
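A sketch of that keyword-at-runtime idea under the stated assumption (one generic identifier rule, then a hash lookup to classify the text; the helper name is hypothetical, though the token symbols follow the usual parser-style naming):

```ruby
# Instead of one Ragel rule (and one case/when branch) per keyword,
# lex everything that looks like an identifier and classify it afterwards.
KEYWORDS = {
  "def"   => :kDEF,
  "end"   => :kEND,
  "class" => :kCLASS,
  "if"    => :kIF
}.freeze

def classify_identifier(text)
  KEYWORDS.fetch(text, :tIDENTIFIER)
end

p classify_identifier("def")  # => :kDEF
p classify_identifier("foo")  # => :tIDENTIFIER
```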
I've merged #905; now I'll try to move keyword handling to runtime, hopefully it won't hurt performance.
I have been investigating why parser is so slow on TruffleRuby. I had a suspicion it's a too-big generated method, and indeed that's now clearly the case from this analysis: oracle/truffleruby#2290 (comment)

This has been found before, and notably I found #524 from a quick search and #248 (comment), #248 (comment).

To summarize, the Parser::Lexer#advance method is currently 13525 lines long, which is humongous. I would think most if not all Ruby JITs cannot compile such a big method (or if they do, they will compile it in a very suboptimal manner). And so that method needs to be run in the interpreter, and that's typically slower the more advanced the JIT is, due to more profiling and other reasons.

Hence parser is slow on anything but CRuby: on MJIT it hangs, on YJIT it's a bit faster, like 11% (YJIT seems to compile a few blocks of advance, not all). I used ruby -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 10.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'.

So on TruffleRuby we can see it tried to compile it, saw it was way too big and bailed out, and then executed the advance method all the time in the interpreter. It's easy to reproduce, for example in the parser repo:

(The Graph Size: 150001 just means it aborted after reaching the limit.) Interestingly, it does compile when lexing very small and simple files, so maybe it's not so much above the limit of what TruffleRuby/Graal can compile.

Of course, the lexer.rb file is generated by Ragel from lexer.rl. One idea would be to use different Ragel flags to produce smaller method(s). I tried all options from the --help with Ragel 6.10 to see if we can make the method smaller, here are my notes:

So the smallest is F0 (and T0), but that's still 6094 lines, which still sounds like a lot for a single method. But maybe it helps enough for JITs.

Some points:

For some reason only F1 works, F0/T0/T1 fail as noted above. Does anyone know why? Maybe that's a bug in Ragel? EDIT: I've found It's time to sizzle #248 (comment)

There is a BlockNode in Truffle which enables automatically splitting huge methods into smaller ones; this is not yet supported on TruffleRuby but would be worth a try. I think it depends on the structure of the code whether this is able to split Parser::Lexer#advance into smaller chunks (e.g., it wouldn't work with a huge case/when spanning most of the method, which looks like it is the case here :/). And it will be compiled less optimally than if it was reasonably-sized methods in the first place.

Have there been other experiments in this area to reduce the size of Parser::Lexer#advance? Extracting larger semantic actions into methods/lambdas would definitely help. (It's time to sizzle #248 (comment))

Maybe generating C or Java code would be better for this kind of stuff (although Java also has a size limit to compile, and C compilers can have pretty bad register allocation with huge functions)?

I guess one possibility is of course to wait for the new Ruby parser by @kddnewton, as that will have the lexer and parser written in C (and probably not generated with Ragel, so much more reasonable function sizes). And possibly we could use it for this gem for recent Ruby versions.