@gene-bordegaray gene-bordegaray commented Nov 19, 2025

Will finish later

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Benchmark results:

TPC-H remains unchanged, as expected, since its files don't use Hive partitioning.

[Screenshot: TPC-H benchmark results, 2025-11-19]
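
For context on what "Hive partitioning" means here, the following is a minimal sketch of how partition values are encoded as `key=value` directory names in a file path. The paths and the helper `hive_partition_values` are hypothetical; the actual benchmark layout is not shown in this PR, and this is not DataFusion's parsing code.

```python
from pathlib import PurePosixPath


def hive_partition_values(path: str) -> dict[str, str]:
    """Extract key=value directory segments from a Hive-style path."""
    values = {}
    # Only directory components are inspected; the file name is skipped.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


# Hypothetical paths, for illustration only.
partitioned = "hive_facts/f_dkey=A/part-0.parquet"
flat = "tpch/lineitem/part-0.parquet"

print(hive_partition_values(partitioned))  # {'f_dkey': 'A'}
print(hive_partition_values(flat))         # {}
```

A flat layout like the second path carries no partition columns, which is consistent with TPC-H seeing no change.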

I created my own benchmark using partitioned files on large datasets and saw these speed-ups:

Query 1: Simple Aggregate by Partition Key

  SELECT f_dkey, COUNT(*), SUM(value), AVG(value)
  FROM hive_facts
  GROUP BY f_dkey

  | Version                   | Time   | Speedup      |
  |---------------------------|--------|--------------|
  | Before (FinalPartitioned) | 0.599s | baseline     |
  | After (SinglePartitioned) | 0.303s | 1.98× faster |
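
To illustrate why a single-phase aggregate can be valid here, the sketch below compares a two-phase aggregation (partial per partition, then a cross-partition merge) with a single-phase one that aggregates each partition independently. It assumes every `f_dkey` value lives in exactly one partition. The rows and function names are hypothetical, and mapping the "Before"/"After" labels onto two-phase versus single-phase is my reading of the table, not DataFusion's implementation.

```python
from collections import defaultdict

# Hypothetical rows (f_dkey, value), split into non-overlapping partitions:
# every f_dkey value appears in exactly one partition.
partitions = [
    [("A", 1.0), ("A", 3.0)],
    [("B", 2.0), ("B", 4.0), ("B", 6.0)],
]


def aggregate(rows):
    """Return {f_dkey: (count, sum)} for one batch of rows."""
    acc = defaultdict(lambda: [0, 0.0])
    for key, value in rows:
        acc[key][0] += 1
        acc[key][1] += value
    return {k: (c, s) for k, (c, s) in acc.items()}


def two_phase(parts):
    """Partial aggregate per partition, then merge states across partitions."""
    merged = defaultdict(lambda: [0, 0.0])
    for part in parts:
        for key, (count, total) in aggregate(part).items():
            merged[key][0] += count
            merged[key][1] += total
    return {k: (c, s) for k, (c, s) in merged.items()}


def single_phase(parts):
    """Aggregate each partition on its own and concatenate the results."""
    result = {}
    for part in parts:
        local = aggregate(part)
        # Only valid because no group key spans two partitions.
        assert result.keys().isdisjoint(local.keys())
        result.update(local)
    return result


def finalize(states):
    """Turn (count, sum) states into (COUNT, SUM, AVG) tuples."""
    return {k: (c, s, s / c) for k, (c, s) in states.items()}


print(finalize(single_phase(partitions)) == finalize(two_phase(partitions)))
```

Both paths produce the same rows; the single-phase path simply skips the merge step, which is where the saving would be expected to come from.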

Query 2: Prefix Matching (Partition + Timestamp)

  SELECT f_dkey, timestamp, COUNT(*), SUM(value)
  FROM hive_facts
  GROUP BY f_dkey, timestamp
  LIMIT 10

  | Version                   | Time   | Speedup      |
  |---------------------------|--------|--------------|
  | Before (FinalPartitioned) | 1.160s | baseline     |
  | After (SinglePartitioned) | 0.852s | 1.36× faster |
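
The "prefix" case relies on the observation that if data is partitioned by `f_dkey`, then any grouping whose keys include `f_dkey` also has groups that never span partitions. The sketch below checks this on hypothetical rows, and shows that grouping by `timestamp` alone does not have the property. The helper `keys_disjoint` is illustrative and not part of DataFusion.

```python
# Hypothetical rows (f_dkey, timestamp, value), partitioned by f_dkey only.
partitions = [
    [("A", 1, 10.0), ("A", 1, 5.0), ("A", 2, 1.0)],
    [("B", 1, 7.0), ("B", 2, 3.0)],
]


def keys_disjoint(parts, key_fn):
    """Return True if no group key (as computed by key_fn) occurs in two partitions."""
    seen = set()
    for part in parts:
        keys = {key_fn(row) for row in part}
        if seen & keys:
            return False
        seen |= keys
    return True


# GROUP BY f_dkey, timestamp: the partition column is part of the key,
# so each group is confined to one partition.
print(keys_disjoint(partitions, lambda r: (r[0], r[1])))  # True

# GROUP BY timestamp: the same timestamp appears in both partitions,
# so a cross-partition merge would still be needed.
print(keys_disjoint(partitions, lambda r: r[1]))  # False
```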

Are there any user-facing changes?

@github-actions (bot) added the labels physical-expr (Changes to the physical-expr crates), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and datasource (Changes to the datasource crate) on Nov 19, 2025.
@gene-bordegaray changed the title from "Parallelize Aggregates when Partitioned By Group By Superset" to "Enable Parallel Aggregation for Non-Overlapping Partitioned Data" on Nov 19, 2025.
@gene-bordegaray force-pushed the gene.bordegaray/2025/11/parallelize_aggregations_when_partitioning_allows branch from 78e9af0 to be508b7 on November 19, 2025 20:49.
@github-actions (bot) added the labels proto (Related to proto crate) and physical-plan (Changes to the physical-plan crate) on Nov 19, 2025.
@gene-bordegaray force-pushed the gene.bordegaray/2025/11/parallelize_aggregations_when_partitioning_allows branch from 48b2e63 to 57aa0f1 on November 19, 2025 22:11.
@gene-bordegaray
Contributor Author

cc: @NGA-TRAN

@NGA-TRAN
Contributor

Thanks @gene-bordegaray — impressive performance improvement! I’ll plan to review this during the first week of December. Adding a feature design description to the PR would be helpful for context.

If you identify any follow‑up work, feel free to open tickets, link this PR there, and continue working on next PRs on top of this branch.

Successfully merging this pull request may close these issues.

Enable Parallel Aggregation for Non-Overlapping Partitioned Data
