Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refine row count retrieval to skip redundant Size() scans #605

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

lawofcycles
Copy link

Issue #, if available:
#600
Description of changes:

This PR optimizes the use of Size() metrics in AnalysisRunner to avoid redundant
Spark scans when obtaining the row count for grouping analyzers.

Specifically:

  • Adds a conditional check to include Size(None) only when:

    1. Grouping analyzers require a global row count (i.e., not a FrequencyBasedAnalyzer).
    2. We haven't already included Size(None) implicitly.
    3. There's at least one analyzer whose state isn't already loaded (so an actual scan is needed).
  • Introduces a helper method actuallyNeedsScanning to detect whether all
    required analyzer states are already available from aggregateWith. If so, we can skip a new scan entirely.

  • Extracts the row count from the Size() metric only if it was actually included in the scanning analyzers, preventing unnecessary computations.

This change addresses the TODO comment in the AnalysisRunner.doAnalysisRun method regarding row count retrieval efficiency and reduces unnecessary Spark actions, thereby improving performance.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@lawofcycles lawofcycles changed the title Optimize Size usage to avoid redundant scans and improve performance [WIP] Optimize Size usage to avoid redundant scans and improve performance Feb 24, 2025
@lawofcycles
Copy link
Author

I am conducting tests to see how these changes lead to improved performance.

@lawofcycles
Copy link
Author

While this PR does fulfill the TODO about avoiding redundant row-count scans in certain edge cases—particularly if a grouping analyzer already retrieves numRows—in most typical scenarios the performance improvement is quite limited. For instance, I ran tests using TPC-DS at the 3 TB scale factor and saw no noticeable difference in runtime. Deequ already consolidates many scan-based analyzers into a single pass, so the main benefit occurs when users unintentionally include Size() even though a grouping analyzer is sufficient.

That said, the change still tidies up the logic and prevents double-counting in rare cases. I’m happy to leave the decision on merging or closing in your hands—just let me know if you’d like any additional updates or tests!

@lawofcycles lawofcycles changed the title [WIP] Optimize Size usage to avoid redundant scans and improve performance Optimize Size usage to avoid redundant scans and improve performance Mar 3, 2025
@lawofcycles lawofcycles changed the title Optimize Size usage to avoid redundant scans and improve performance Refine row count retrieval to skip redundant Size() scans Mar 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant