Refine row count retrieval to skip redundant Size() scans #605

lawofcycles · 2025-02-23T12:30:51Z

Issue #, if available:
#600
Description of changes:

This PR optimizes the use of Size() metrics in AnalysisRunner to avoid redundant
Spark scans when obtaining the row count for grouping analyzers.

Specifically:

- Adds a conditional check to include Size(None) only when:
  1. Grouping analyzers require a global row count (i.e., not a FrequencyBasedAnalyzer).
  2. We haven't already included Size(None) implicitly.
  3. There's at least one analyzer whose state isn't already loaded (so an actual scan is needed).
- Introduces a helper method actuallyNeedsScanning to detect whether all required analyzer states are already available from aggregateWith. If so, we can skip a new scan entirely.
- Extracts the row count from the Size() metric only if it was actually included in the scanning analyzers, preventing unnecessary computations.
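
For readers who want to see the shape of the three conditions above, here is a minimal, self-contained sketch. It is not the code from this PR: the types `ScanAnalyzer` and `GroupingAnalyzer`, the `frequencyBased` flag, and the functions `withRowCount` and `rowCount` are hypothetical stand-ins, and the real `actuallyNeedsScanning` in AnalysisRunner may have a different signature. It also assumes one reading of condition 1, namely that only grouping analyzers that are not frequency-based need a separate global row count.

```scala
// Hypothetical, simplified stand-ins for Deequ's analyzer types (not the real API).
sealed trait Analyzer
case class Size(where: Option[String] = None) extends Analyzer
case class ScanAnalyzer(name: String) extends Analyzer
// frequencyBased = true models an analyzer assumed to obtain numRows from its grouping pass.
case class GroupingAnalyzer(name: String, frequencyBased: Boolean) extends Analyzer

object RowCountPlanning {

  // True when at least one analyzer has no state loaded, i.e. a scan is still required.
  def actuallyNeedsScanning(analyzers: Seq[Analyzer], loaded: Set[Analyzer]): Boolean =
    analyzers.exists(a => !loaded.contains(a))

  // Appends Size(None) to the scanning analyzers only if all three conditions hold.
  def withRowCount(
      scanning: Seq[Analyzer],
      grouping: Seq[GroupingAnalyzer],
      loaded: Set[Analyzer]): Seq[Analyzer] = {
    val needsGlobalRowCount = grouping.exists(g => !g.frequencyBased) // condition 1
    val sizeAlreadyIncluded = scanning.contains(Size(None))           // condition 2
    val needsScan = actuallyNeedsScanning(scanning ++ grouping, loaded) // condition 3
    if (needsGlobalRowCount && !sizeAlreadyIncluded && needsScan) scanning :+ Size(None)
    else scanning
  }

  // Reads the row count from the Size metric only if Size(None) was actually scanned.
  def rowCount(scanned: Seq[Analyzer], metrics: Map[Analyzer, Double]): Option[Long] =
    if (scanned.contains(Size(None))) metrics.get(Size(None)).map(_.toLong) else None
}
```

In this sketch, a run whose analyzer states are all already loaded leaves the scanning list untouched, so no Size(None) is added and `rowCount` returns `None` rather than triggering extra work.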

This change addresses the TODO comment in the AnalysisRunner.doAnalysisRun method regarding row count retrieval efficiency and reduces unnecessary Spark actions, thereby improving performance.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

lawofcycles · 2025-02-24T00:58:04Z

I am conducting tests to see how these changes lead to improved performance.

lawofcycles · 2025-03-03T10:09:38Z

While this PR does fulfill the TODO about avoiding redundant row-count scans in certain edge cases—particularly if a grouping analyzer already retrieves numRows—in most typical scenarios the performance improvement is quite limited. For instance, I ran tests using TPC-DS at the 3 TB scale factor and saw no noticeable difference in runtime. Deequ already consolidates many scan-based analyzers into a single pass, so the main benefit occurs when users unintentionally include Size() even though a grouping analyzer is sufficient.

That said, the change still tidies up the logic and prevents double-counting in rare cases. I’m happy to leave the decision on merging or closing in your hands—just let me know if you’d like any additional updates or tests!

Optimize Size usage to avoid redundant scans

f645bd0

lawofcycles changed the title ~~Optimize Size usage to avoid redundant scans and improve performance~~ [WIP] Optimize Size usage to avoid redundant scans and improve performance Feb 24, 2025

lawofcycles changed the title ~~[WIP] Optimize Size usage to avoid redundant scans and improve performance~~ Optimize Size usage to avoid redundant scans and improve performance Mar 3, 2025

lawofcycles changed the title ~~Optimize Size usage to avoid redundant scans and improve performance~~ Refine row count retrieval to skip redundant Size() scans Mar 3, 2025
