Remove MinimizationOperations #17268

msfroh · 2025-02-06T00:26:05Z

Description

This class was removed from Lucene because it is no longer needed. It was copied into the OpenSearch code base because we were using it and we didn't want our code to break.

In fact, the correct choice would have been to remove the references to
MinimizationOperations.minimize() and either replace them with calls to Operations.determinize() or simply drop them altogether because the automaton is deterministic.

Related Issues

Resolves #16975

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2025-02-06T01:18:53Z

✅ Gradle check result for f011da3: SUCCESS

codecov · 2025-02-06T01:19:25Z

Codecov Report

Attention: Patch coverage is 69.23077% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.39%. Comparing base (77e4112) to head (bad6e8b).
Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
...nsearch/common/lucene/search/AutomatonQueries.java	62.50%	2 Missing and 1 partial ⚠️
...a/org/opensearch/index/mapper/TextFieldMapper.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #17268      +/-   ##
============================================
- Coverage     72.40%   72.39%   -0.01%     
+ Complexity    65554    65488      -66     
============================================
  Files          5292     5291       -1     
  Lines        304493   304323     -170     
  Branches      44218    44176      -42     
============================================
- Hits         220463   220316     -147     
+ Misses        65975    65951      -24     
- Partials      18055    18056       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rmuir · 2025-02-06T06:09:01Z

server/src/main/java/org/opensearch/indices/SystemIndices.java

@@ -147,6 +146,8 @@ private static CharacterRunAutomaton buildCharacterRunAutomaton(Collection<Syste
        Optional<Automaton> automaton = descriptors.stream()
            .map(descriptor -> Regex.simpleMatchToAutomaton(descriptor.getIndexPattern()))
            .reduce(Operations::union);
-        return new CharacterRunAutomaton(MinimizationOperations.minimize(automaton.orElse(Automata.makeEmpty()), Integer.MAX_VALUE));
+        return new CharacterRunAutomaton(
+            Operations.determinize(automaton.orElse(Automata.makeEmpty()), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT)


For some of these, it looks like we're

parsing wildcard-type expressions with this simpleMatchToAutomaton,

determinizing them (Operations.determinize) presumably just for the next step.

tableizing them (CharRunAutomaton) for fastest execution.

This is a lot of investment, that makes sense if you are going to match gazillions of terms in the term dictionary.

But if you are just matching a couple things, maybe consider less heavy options?

determinize(), but don't tableize, and then match strings with Operations.run()

tableize, but don't determinize() by using NFARunAutomaton

Yeah, for these cases it probably makes the most sense to run them as NFAs. Almost 100% of the time, they're just going to be prefix matches (or exact matches, or unions of exact matches) anyway.

The one area i'd say to take caution would be: multiple threads. So this is just a note, something to think about, as it requires some thought.

NFARunAutomaton is mutable and determinizes pieces "as-it-goes" and caches its work. This is different from existing RunAutomaton which is simple immutable arrays.

Oh -- dang. Thanks a lot for the word of caution!

This SystemIndices is a singleton initialized on startup and gets called from all over the place. For example, in a _bulk request, each index name gets checked against the registered system index name patterns. Since we regularly process a bunch of bulk requests at once, the NFARunAutomaton would be a no-go.

It sounds like the Operations.run() approach is the one that won't blow up.

Of course, given that it's a one-time cost to determinize and tableize on startup, maybe that's the better choice if it shaves even a bit of effort at runtime?

Edit: Same deal with the ReindexValidator. It's a singleton created on node startup, so I think I'm inclined to just eat the setup overhead to save on runtime cost.

Well nothing there will blow up. but tableizing the thing does have a memory cost. Maybe just keep it the same, for now.

I need to followup and investigate how much performance these RunAutomatons even buy us anymore vs Operations.run(). The (non-tableized) automata representation has become much more efficient, thanks to @mikemccand.

Originally "Automaton" was a typical inefficient java representation, and "RunAutomaton" would tableize it into efficient array, which gives you e.g. cache locality and all that. But now Automaton has that locality, it is represented by two arrays: states[] and transitions[].

So it is possible the entire RunAutomaton is just a waste of code and RAM and that we could nuke it: needs investigation.

rmuir

Thanks for removing this: it is a killer.

A lot of these will just come out minimal-from-the-getgo with lucene, but minimizing is never necessary, and its very costly. Just look at how terrifying the code you deleted is :)

reta · 2025-02-06T13:13:11Z

Thanks @msfroh , this one is solely on me: I brought it into 3.x line to unlock the Lucene 10 migration, thanks for cleaning things up!

github-actions · 2025-02-06T23:15:50Z

❌ Gradle check result for 0eb2b5a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

This class was removed from Lucene because it is no longer needed. It was copied into the OpenSearch code base because we were using it and we didn't want our code to break. In fact, the correct choice would have been to remove the references to `MinimizationOperations.minimize()` and either replace them with calls to `Operations.determinize()` or simply drop them altogether because the automaton is deterministic. Signed-off-by: Michael Froh <[email protected]>

Signed-off-by: Michael Froh <[email protected]>

github-actions · 2025-02-07T01:09:04Z

✅ Gradle check result for bad6e8b: SUCCESS

github-actions bot added bug Something isn't working Search:Resiliency labels Feb 6, 2025

rmuir reviewed Feb 6, 2025

View reviewed changes

rmuir approved these changes Feb 6, 2025

View reviewed changes

msfroh added 2 commits February 6, 2025 16:13

Add changelog entry

bad6e8b

Signed-off-by: Michael Froh <[email protected]>

msfroh force-pushed the remove_minimize_operations branch from 0eb2b5a to bad6e8b Compare February 7, 2025 00:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove MinimizationOperations #17268

Remove MinimizationOperations #17268

msfroh commented Feb 6, 2025

github-actions bot commented Feb 6, 2025

codecov bot commented Feb 6, 2025 •

edited

Loading

rmuir Feb 6, 2025

msfroh Feb 6, 2025 •

edited

Loading

rmuir Feb 6, 2025

msfroh Feb 6, 2025 •

edited

Loading

rmuir Feb 7, 2025

rmuir left a comment

reta commented Feb 6, 2025

github-actions bot commented Feb 6, 2025

github-actions bot commented Feb 7, 2025

Remove MinimizationOperations #17268

Are you sure you want to change the base?

Remove MinimizationOperations #17268

Conversation

msfroh commented Feb 6, 2025

Description

Related Issues

Check List

github-actions bot commented Feb 6, 2025

codecov bot commented Feb 6, 2025 • edited Loading

Codecov Report

rmuir Feb 6, 2025

Choose a reason for hiding this comment

msfroh Feb 6, 2025 • edited Loading

Choose a reason for hiding this comment

rmuir Feb 6, 2025

Choose a reason for hiding this comment

msfroh Feb 6, 2025 • edited Loading

Choose a reason for hiding this comment

rmuir Feb 7, 2025

Choose a reason for hiding this comment

rmuir left a comment

Choose a reason for hiding this comment

reta commented Feb 6, 2025

github-actions bot commented Feb 6, 2025

github-actions bot commented Feb 7, 2025

codecov bot commented Feb 6, 2025 •

edited

Loading

msfroh Feb 6, 2025 •

edited

Loading

msfroh Feb 6, 2025 •

edited

Loading