Skip to content

Conversation

@heyihong
Copy link
Contributor

@heyihong heyihong commented Dec 29, 2025

What changes were proposed in this pull request?

This PR introduces a new method foreachWithSubqueriesAndPruning in QueryPlan.scala that provides a pruning-enabled variant of foreachWithSubqueries. The method only traverses nodes that match a given condition, improving efficiency. The PR also updates two existing usages:

  1. SparkConnectPlanner - Changed from transformUpWithSubqueriesAndPruning to foreachWithSubqueriesAndPruning since the code was only collecting observations without transforming the plan
  2. ObservationManager - Changed from foreach to foreachWithSubqueriesAndPruning with a condition to only visit nodes containing COLLECT_METRICS pattern

Why are the changes needed?

The changes are needed to:

  1. Provide a more efficient way to traverse query plans when only specific nodes matching certain patterns need to be visited (avoiding unnecessary traversal of irrelevant subtrees)
  2. Optimize observation management in ObservationManager by only traversing nodes that contain COLLECT_METRICS pattern instead of visiting every node

Does this PR introduce any user-facing change?

No. This is an internal optimization that improves performance and code correctness without changing any user-facing behavior or APIs.

How was this patch tested?

build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.plans.QueryPlanSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 2.2.44

@heyihong heyihong force-pushed the SPARK-54865 branch 2 times, most recently from 5c1f30e to b0ed386 Compare December 29, 2025 17:20
@heyihong heyihong changed the title [SPARK-54865] Add foreachWithSubqueriesAndPruning method to QueryPlan [SPARK-54865][CONNECT][SQL] Add foreachWithSubqueriesAndPruning method to QueryPlan Dec 29, 2025
return
}
f(this)
subqueries.foreach(_.foreachWithSubqueriesAndPruning(cond)(f))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think foreachWithSubqueries traverse the children first, not subqueries. Shall we follow it here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or it's actually interleaved in foreachWithSubqueries

Copy link
Contributor Author

@heyihong heyihong Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The traversal order of foreachWithSubqueries should also be subqueries first, then children, but its traversal implementation is a bit hard to read and understand, in my opinion.

After taking a closer look at the code, I think it may be a style choice for TreeNode methods to handle plan traversal logic (including withPruning), while QueryPlan simply leverages that to perform additional tasks, such as handling subqueries.

So the correctness may not be affected, but style-wise, it makes more sense to move the pruning logic to TreeNode.

Copy link
Contributor Author

@heyihong heyihong Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @cloud-fan offline, while the current convention is to let TreeNode methods to handle plan traversal logic (including withPruning), for foreachWithSubqueriesAndPruning, a clearer implementation should avoid doing so.

* Only traverses nodes that match the given condition.
*/
def foreachWithSubqueriesAndPruning(
cond: TreePatternBits => Boolean)(f: PlanType => Unit): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other pruning methods also have a ruleId parameter, shall we follow?

Copy link
Contributor Author

@heyihong heyihong Jan 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, ruleId is not applicable here, since it is used for transformation—it represents the transformation rule ID. Also, ruleId is not used by default, as the default value is UnknownRuleId.

If we really need this parameter, we can extend it later (less is more).

@heyihong heyihong requested a review from cloud-fan January 5, 2026 13:31
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 1e8048a Jan 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants