-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove MinimizationOperations #17268
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17268 +/- ##
============================================
- Coverage 72.40% 72.39% -0.01%
+ Complexity 65554 65488 -66
============================================
Files 5292 5291 -1
Lines 304493 304323 -170
Branches 44218 44176 -42
============================================
- Hits 220463 220316 -147
+ Misses 65975 65951 -24
- Partials 18055 18056 +1 ☔ View full report in Codecov by Sentry. |
@@ -147,6 +146,8 @@ private static CharacterRunAutomaton buildCharacterRunAutomaton(Collection<Syste | |||
Optional<Automaton> automaton = descriptors.stream() | |||
.map(descriptor -> Regex.simpleMatchToAutomaton(descriptor.getIndexPattern())) | |||
.reduce(Operations::union); | |||
return new CharacterRunAutomaton(MinimizationOperations.minimize(automaton.orElse(Automata.makeEmpty()), Integer.MAX_VALUE)); | |||
return new CharacterRunAutomaton( | |||
Operations.determinize(automaton.orElse(Automata.makeEmpty()), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some of these, it looks like we're
- parsing wildcard-type expressions with this simpleMatchToAutomaton,
- determinizing them (Operations.determinize) presumably just for the next step.
- tableizing them (CharRunAutomaton) for fastest execution.
This is a lot of investment, that makes sense if you are going to match gazillions of terms in the term dictionary.
But if you are just matching a couple things, maybe consider less heavy options?
determinize()
, but don't tableize, and then match strings withOperations.run()
- tableize, but don't
determinize()
by usingNFARunAutomaton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, for these cases it probably makes the most sense to run them as NFAs. Almost 100% of the time, they're just going to be prefix matches (or exact matches, or unions of exact matches) anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one area i'd say to take caution would be: multiple threads. So this is just a note, something to think about, as it requires some thought.
NFARunAutomaton is mutable and determinizes pieces "as-it-goes" and caches its work. This is different from existing RunAutomaton which is simple immutable arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh -- dang. Thanks a lot for the word of caution!
This SystemIndices
is a singleton initialized on startup and gets called from all over the place. For example, in a _bulk
request, each index name gets checked against the registered system index name patterns. Since we regularly process a bunch of bulk requests at once, the NFARunAutomaton would be a no-go.
It sounds like the Operations.run()
approach is the one that won't blow up.
Of course, given that it's a one-time cost to determinize and tableize on startup, maybe that's the better choice if it shaves even a bit of effort at runtime?
Edit: Same deal with the ReindexValidator
. It's a singleton created on node startup, so I think I'm inclined to just eat the setup overhead to save on runtime cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well nothing there will blow up. but tableizing the thing does have a memory cost. Maybe just keep it the same, for now.
I need to followup and investigate how much performance these RunAutomatons even buy us anymore vs Operations.run(). The (non-tableized) automata representation has become much more efficient, thanks to @mikemccand.
Originally "Automaton" was a typical inefficient java representation, and "RunAutomaton" would tableize it into efficient array, which gives you e.g. cache locality and all that. But now Automaton has that locality, it is represented by two arrays: states[]
and transitions[]
.
So it is possible the entire RunAutomaton is just a waste of code and RAM and that we could nuke it: needs investigation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for removing this: it is a killer.
A lot of these will just come out minimal-from-the-getgo with lucene, but minimizing is never necessary, and its very costly. Just look at how terrifying the code you deleted is :)
Thanks @msfroh , this one is solely on me: I brought it into 3.x line to unlock the Lucene 10 migration, thanks for cleaning things up! |
❌ Gradle check result for 0eb2b5a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
This class was removed from Lucene because it is no longer needed. It was copied into the OpenSearch code base because we were using it and we didn't want our code to break. In fact, the correct choice would have been to remove the references to `MinimizationOperations.minimize()` and either replace them with calls to `Operations.determinize()` or simply drop them altogether because the automaton is deterministic. Signed-off-by: Michael Froh <[email protected]>
Signed-off-by: Michael Froh <[email protected]>
0eb2b5a
to
bad6e8b
Compare
Description
This class was removed from Lucene because it is no longer needed. It was copied into the OpenSearch code base because we were using it and we didn't want our code to break.
In fact, the correct choice would have been to remove the references to
MinimizationOperations.minimize()
and either replace them with calls toOperations.determinize()
or simply drop them altogether because the automaton is deterministic.Related Issues
Resolves #16975
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.