Refine row count retrieval to skip redundant Size() scans #605
Issue #, if available:
#600
Description of changes:
This PR optimizes the use of Size() metrics in AnalysisRunner to avoid redundant
Spark scans when obtaining the row count for grouping analyzers.
Specifically:
Adds a conditional check to include Size(None) only when:
Introduces a helper method actuallyNeedsScanning to detect whether all required analyzer states are already available from aggregateWith. If so, a new scan can be skipped entirely.
Extracts the row count from the Size() metric only if it was actually included in the scanning analyzers, preventing unnecessary computations.
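For reviewers, here is a rough sketch of the two ideas above: the scan-skipping check and the guarded row count lookup. It is not the code in this PR. The Analyzer, Size and Grouping types below are simplified, hypothetical stand-ins rather than the real library classes, the parameter names are assumptions, and the real actuallyNeedsScanning may have a different signature.

```scala
// Hypothetical sketch only: simplified stand-ins, not the real analyzer classes.
object RowCountSketch {
  sealed trait Analyzer
  case object Size extends Analyzer                              // stands in for Size(None)
  final case class Grouping(columns: Seq[String]) extends Analyzer

  // Assumed shape of the helper: a scan is needed only if at least one
  // required analyzer has no state already available (e.g. via aggregateWith).
  def actuallyNeedsScanning(
      required: Seq[Analyzer],
      availableStates: Set[Analyzer]): Boolean =
    required.exists(a => !availableStates.contains(a))

  // Read the row count from the Size metric only when Size was part of the
  // scanning analyzers; otherwise report that no row count was computed.
  def rowCount(
      scanned: Seq[Analyzer],
      metrics: Map[Analyzer, Double]): Option[Long] =
    if (scanned.contains(Size)) metrics.get(Size).map(_.toLong) else None
}
```

Returning an Option in rowCount keeps the "Size was not scanned" case explicit, so callers cannot mistake a missing metric for a count of zero.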
This change addresses the TODO comment in the AnalysisRunner.doAnalysisRun method regarding row count retrieval efficiency and reduces unnecessary Spark actions, thereby improving performance.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.