Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove MinimizationOperations #17268

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

msfroh
Copy link
Collaborator

@msfroh msfroh commented Feb 6, 2025

Description

This class was removed from Lucene because it is no longer needed. It was copied into the OpenSearch code base because we were using it and we didn't want our code to break.

In fact, the correct choice would have been to remove the references to
MinimizationOperations.minimize() and either replace them with calls to Operations.determinize() or simply drop them altogether because the automaton is deterministic.

Related Issues

Resolves #16975

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added bug Something isn't working Search:Resiliency labels Feb 6, 2025
Copy link
Contributor

github-actions bot commented Feb 6, 2025

✅ Gradle check result for f011da3: SUCCESS

Copy link

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 69.23077% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.39%. Comparing base (77e4112) to head (bad6e8b).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...nsearch/common/lucene/search/AutomatonQueries.java 62.50% 2 Missing and 1 partial ⚠️
...a/org/opensearch/index/mapper/TextFieldMapper.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17268      +/-   ##
============================================
- Coverage     72.40%   72.39%   -0.01%     
+ Complexity    65554    65488      -66     
============================================
  Files          5292     5291       -1     
  Lines        304493   304323     -170     
  Branches      44218    44176      -42     
============================================
- Hits         220463   220316     -147     
+ Misses        65975    65951      -24     
- Partials      18055    18056       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -147,6 +146,8 @@ private static CharacterRunAutomaton buildCharacterRunAutomaton(Collection<Syste
Optional<Automaton> automaton = descriptors.stream()
.map(descriptor -> Regex.simpleMatchToAutomaton(descriptor.getIndexPattern()))
.reduce(Operations::union);
return new CharacterRunAutomaton(MinimizationOperations.minimize(automaton.orElse(Automata.makeEmpty()), Integer.MAX_VALUE));
return new CharacterRunAutomaton(
Operations.determinize(automaton.orElse(Automata.makeEmpty()), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some of these, it looks like we're

  1. parsing wildcard-type expressions with this simpleMatchToAutomaton,
  2. determinizing them (Operations.determinize) presumably just for the next step.
  3. tableizing them (CharRunAutomaton) for fastest execution.

This is a lot of investment, that makes sense if you are going to match gazillions of terms in the term dictionary.

But if you are just matching a couple things, maybe consider less heavy options?

  1. determinize(), but don't tableize, and then match strings with Operations.run()
  2. tableize, but don't determinize() by using NFARunAutomaton

Copy link
Collaborator Author

@msfroh msfroh Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, for these cases it probably makes the most sense to run them as NFAs. Almost 100% of the time, they're just going to be prefix matches (or exact matches, or unions of exact matches) anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one area i'd say to take caution would be: multiple threads. So this is just a note, something to think about, as it requires some thought.

NFARunAutomaton is mutable and determinizes pieces "as-it-goes" and caches its work. This is different from existing RunAutomaton which is simple immutable arrays.

Copy link
Collaborator Author

@msfroh msfroh Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh -- dang. Thanks a lot for the word of caution!

This SystemIndices is a singleton initialized on startup and gets called from all over the place. For example, in a _bulk request, each index name gets checked against the registered system index name patterns. Since we regularly process a bunch of bulk requests at once, the NFARunAutomaton would be a no-go.

It sounds like the Operations.run() approach is the one that won't blow up.

Of course, given that it's a one-time cost to determinize and tableize on startup, maybe that's the better choice if it shaves even a bit of effort at runtime?

Edit: Same deal with the ReindexValidator. It's a singleton created on node startup, so I think I'm inclined to just eat the setup overhead to save on runtime cost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well nothing there will blow up. but tableizing the thing does have a memory cost. Maybe just keep it the same, for now.

I need to followup and investigate how much performance these RunAutomatons even buy us anymore vs Operations.run(). The (non-tableized) automata representation has become much more efficient, thanks to @mikemccand.

Originally "Automaton" was a typical inefficient java representation, and "RunAutomaton" would tableize it into efficient array, which gives you e.g. cache locality and all that. But now Automaton has that locality, it is represented by two arrays: states[] and transitions[].

So it is possible the entire RunAutomaton is just a waste of code and RAM and that we could nuke it: needs investigation.

Copy link
Contributor

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing this: it is a killer.

A lot of these will just come out minimal-from-the-getgo with lucene, but minimizing is never necessary, and its very costly. Just look at how terrifying the code you deleted is :)

@reta
Copy link
Collaborator

reta commented Feb 6, 2025

Thanks @msfroh , this one is solely on me: I brought it into 3.x line to unlock the Lucene 10 migration, thanks for cleaning things up!

Copy link
Contributor

github-actions bot commented Feb 6, 2025

❌ Gradle check result for 0eb2b5a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

This class was removed from Lucene because it is no longer needed. It
was copied into the OpenSearch code base because we were using it and
we didn't want our code to break. In fact, the correct choice would
have been to remove the references to
`MinimizationOperations.minimize()` and either replace them with
calls to `Operations.determinize()` or simply drop them altogether
because the automaton is deterministic.

Signed-off-by: Michael Froh <[email protected]>
Signed-off-by: Michael Froh <[email protected]>
@msfroh msfroh force-pushed the remove_minimize_operations branch from 0eb2b5a to bad6e8b Compare February 7, 2025 00:14
Copy link
Contributor

github-actions bot commented Feb 7, 2025

✅ Gradle check result for bad6e8b: SUCCESS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search:Resiliency
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Can (easily) run out of memory with case-insensitive term queries
3 participants