[SPARK-52975][SQL] Simplify field names in pushdown join sql #51686

dengziming · 2025-07-28T12:16:59Z

What changes were proposed in this pull request?

When pushing down join SQL, we generate aliases for duplicated column names, but the aliases are too long to read and are nondeterministic.

Before this change:

SELECT "ID_bf822dc6_e06d_492c_a489_1e92a6fe84a0","AMOUNT_c9f3fc67_62f8_4ec6_9c3f_b7ee7bafcb5a","ADDRESS_d937a313_3e09_4b97_b91f_b2a47ef5e31d","ID","AMOUNT","ADDRESS" FROM xxxx     

RelationV2[ID_bf822dc6_e06d_492c_a489_1e92a6fe84a0#18, AMOUNT_c9f3fc67_62f8_4ec6_9c3f_b7ee7bafcb5a#19, ADDRESS_d937a313_3e09_4b97_b91f_b2a47ef5e31d#20, ID#21, AMOUNT#22, ADDRESS#23] join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1

After this change:

SELECT "ID","AMOUNT","ADDRESS","ID_1","AMOUNT_1","ADDRESS_1" FROM xxx   

RelationV2[ID#18, AMOUNT#19, ADDRESS#20, ID_1#21, AMOUNT_1#22, ADDRESS_1#23] join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1

Why are the changes needed?

Make code-generated JDBC SQL clearer and deterministic.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests can ensure no side effects are introduced.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Trae.

dengziming · 2025-07-28T12:22:18Z

cc @PetarVasiljevic-DB

PetarVasiljevic-DB

@dengziming this way, if you have left side column COL and right side columns COL, COL_0, alias generator will generate COL_0 which would conflict with COL_0 from right side.

dengziming · 2025-07-28T12:47:18Z

> @dengziming this way, if you have left side column COL and right side columns COL, COL_0, alias generator will generate COL_0 which would conflict with COL_0 from right side.

Good catch @PetarVasiljevic-DB, let me think of another way.
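For readers following along, the collision described above can be reproduced with a few lines. This is only a sketch: `NaiveAliasCollision`, `naiveAliases`, and `collisions` are hypothetical names, and the fixed `_0` suffix is an assumed stand-in for the naive scheme, not the code from this PR.

```scala
object NaiveAliasCollision {
  // Hypothetical naive scheme: keep left-side names, and append "_0" to a
  // right-side name whenever the left side already uses that name.
  def naiveAliases(left: Seq[String], right: Seq[String]): Seq[String] = {
    val leftNames = left.toSet
    left ++ right.map(n => if (leftNames.contains(n)) s"${n}_0" else n)
  }

  // Output names that occur more than once, i.e. the conflicts.
  def collisions(names: Seq[String]): Set[String] =
    names.groupBy(identity).collect { case (n, occ) if occ.size > 1 => n }.toSet

  def main(args: Array[String]): Unit = {
    val out = naiveAliases(Seq("COL"), Seq("COL", "COL_0"))
    println(out)
    println(collisions(out))
  }
}
```

With left side `COL` and right side `COL, COL_0`, the generated alias `COL_0` lands on a name the right side already has, so any scheme needs to check candidates against names that are already taken.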

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

PetarVasiljevic-DB

LGTM, the generated text is much clearer, and more importantly, it is deterministic now. Thanks for the change!

By the way, could we move generateColumnAliasesForDuplicatedName below pushdownJoin? Or above, it doesn't really matter; I just find it too big to have as a nested method.

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourcePushdownTestUtils.scala

dengziming · 2025-07-30T08:22:12Z

...ore/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCV2JoinPushdownIntegrationSuiteBase.scala

@@ -657,7 +657,7 @@ trait JDBCV2JoinPushdownIntegrationSuiteBase
    withSQLConf(SQLConf.DATA_SOURCE_V2_JOIN_PUSHDOWN.key -> "true") {
      val df = sql(sqlQuery)
      val row = df.collect()(0)
-      assert(row == Row(0, 1, 2, 3, 0, -1, -2, -3))
+      assert(row.toString == Row(0, 1, 2, 3, 0, -1, -2, -3).toString)


It seems that Oracle will use DecimalType, so we can't compare Row directly.
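A minimal sketch of why the direct comparison fails, without needing Spark or Oracle. It assumes the decimal-typed columns surface as `java.math.BigDecimal` values while the expected row holds plain `Int`s; `DecimalRowComparison` and its members are hypothetical names, and `Seq[Any]` stands in for `Row`.

```scala
object DecimalRowComparison {
  // Expected values written as plain Ints, as in the test assertion.
  val expected: Seq[Any] = Seq(0, 1, -1)

  // The same values as java.math.BigDecimal, standing in for what a
  // decimal-typed column would return (an assumption for this sketch).
  val fromDecimalColumns: Seq[Any] =
    Seq(0, 1, -1).map(i => java.math.BigDecimal.valueOf(i.toLong))

  // Element-wise equality: an Int never equals a java.math.BigDecimal.
  def sameValues: Boolean = expected == fromDecimalColumns

  // String rendering: both sides print the same digits.
  def sameText: Boolean =
    expected.mkString(",") == fromDecimalColumns.mkString(",")

  def main(args: Array[String]): Unit = {
    println(sameValues)
    println(sameText)
  }
}
```

Comparing the string forms is deliberately looser: it ignores the type difference, which is what the test needs here, but it would also hide a real type regression.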

dengziming · 2025-07-30T08:56:53Z

Hello @cloud-fan
Please take a look at this. We have conducted a thorough check and @PetarVasiljevic-DB has already approved.

cloud-fan · 2025-07-30T09:31:24Z

...ore/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCV2JoinPushdownIntegrationSuiteBase.scala

+
+  test("Test complex duplicate column name alias") {
+    sql(s"create table $catalogAndNamespace.t1(id int, id_1 int, id_2 int, id_1_1 int)")
+    sql(s"create table $catalogAndNamespace.t2(id int, id_1 int, id_2 int, id_2_1 int)")


can we create them in def tablePreparation?

cloud-fan · 2025-07-30T09:33:47Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+    //  Count occurrences of each column name across both sides to identify duplicates.
+    val allRequiredColumnNames = leftSideRequiredColumnNames ++ rightSideRequiredColumnNames
+    val allNameCounts: Map[String, Int] =
+      allRequiredColumnNames.groupBy(identity).view.mapValues(_.size).toMap


Shall we consider case sensitivity? If the left side has col and the right side has COL, do we need to generate an alias?

After some investigation, I don't think it's necessary. If our SQL is select * from a(id,sid) join b(id,Sid), there are two versions of the SQL that we could push down to the database:

select id, sid, id_1, Sid from (select id, sid from a) join (select id as id_1, Sid from b)

select id, sid, id_1, sid_1 from (select id, sid from a) join (select id as id_1, Sid as sid_1 from b)

I added this to my test case to show that version 1 also works, and that version 2 doesn't make the SQL any clearer.
Is it possible that we will hit AMBIGUOUS_REFERENCE in version 1?

The generated SQL is processed by the underlying database, so are we assuming all dialects are case sensitive?

No. I thought they were case sensitive at first, but I tested locally and found that SQL Server is not case sensitive, so I have updated this PR. Please review my latest commit and my latest comment here: #51686 (comment)

cloud-fan · 2025-07-30T09:38:23Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+      allRequiredColumnNames.groupBy(identity).view.mapValues(_.size).toMap
+    // Use Set for O(1) lookups when checking existing column names, claim all names
+    // that appears only once to ensure they have highest priority.
+    val allClaimedAliases = mutable.HashSet.empty ++ allNameCounts.filter(_._2 == 1).keySet


Suggested change

-    val allClaimedAliases = mutable.HashSet.empty ++ allNameCounts.filter(_._2 == 1).keySet
+    val allClaimedAliases = allNameCounts.filter(_._2 == 1).keySet.to[mutable.Set]
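The "claim unique names first" idea from the snippet above can be sketched end to end as follows. This is an illustration under assumptions, not the merged implementation: `AliasSketch`, `dedupe`, and `assign` are hypothetical names, the comparison is case sensitive, the `_N` suffix format is taken from the "After this change" example in the description, and Scala 2.13+ collections are assumed (`mutable.HashSet.from`).

```scala
import scala.collection.mutable

object AliasSketch {
  // Returns the output names for the left and right side. A name is kept
  // as-is unless it is duplicated; then the first occurrence keeps the name
  // and later occurrences get the first free "_N" suffix.
  def dedupe(left: Seq[String], right: Seq[String]): (Seq[String], Seq[String]) = {
    val all = left ++ right
    val counts: Map[String, Int] =
      all.groupBy(identity).map { case (k, v) => k -> v.size }

    // Names that occur exactly once are claimed up front, so no generated
    // alias can ever take them.
    val claimed: mutable.Set[String] =
      mutable.HashSet.from(counts.collect { case (n, 1) => n })

    // Next suffix to try per base name, so repeats do not rescan from 1.
    val nextSuffix = mutable.HashMap.empty[String, Int].withDefaultValue(0)

    def assign(name: String): String =
      if (counts(name) == 1) name
      else {
        var i = nextSuffix(name)
        var candidate = if (i == 0) name else s"${name}_$i"
        while (claimed.contains(candidate)) {
          i += 1
          candidate = s"${name}_$i"
        }
        claimed += candidate
        nextSuffix(name) = i + 1
        candidate
      }

    // Left side is processed first, so it keeps the unsuffixed names.
    val leftOut = left.map(assign)
    val rightOut = right.map(assign)
    (leftOut, rightOut)
  }
}
```

Under these assumptions, `dedupe(Seq("ID", "AMOUNT", "ADDRESS"), Seq("ID", "AMOUNT", "ADDRESS"))` yields `ID, AMOUNT, ADDRESS` on the left and `ID_1, AMOUNT_1, ADDRESS_1` on the right, and the `COL` / `COL, COL_0` case yields `COL_1` for the duplicated right-side column because `COL_0` was claimed before any alias was generated.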

dengziming · 2025-07-31T03:38:09Z

@cloud-fan, your idea is worth considering. SQL Server reports "Ambiguous column name 'sid'" when running my test, so we need to generate different aliases when two column names are equal ignoring case. Please review my latest commit. cc @PetarVasiljevic-DB
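As a rough illustration of the case-insensitive check, the sketch below groups column names by their lowercase form and reports the groups that a case-insensitive database would treat as ambiguous. `CaseInsensitiveDuplicates` and `ambiguous` are hypothetical names, and lowercasing with `Locale.ROOT` is an assumed normalization, not necessarily what the commit does.

```scala
import java.util.Locale

object CaseInsensitiveDuplicates {
  // Groups names by lowercase form and keeps only groups with two or more
  // members, i.e. names that collide when case is ignored.
  def ambiguous(names: Seq[String]): Map[String, Seq[String]] =
    names
      .groupBy(_.toLowerCase(Locale.ROOT))
      .filter { case (_, group) => group.size > 1 }

  def main(args: Array[String]): Unit = {
    // Columns of a(id, sid) followed by b(id, Sid) from the discussion above.
    println(ambiguous(Seq("id", "sid", "id", "Sid")))
  }
}
```

For `a(id, sid)` joined with `b(id, Sid)`, both `id` and `sid`/`Sid` show up as collisions, which matches the SQL Server error quoted above for `sid`.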

...c/test/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDownSuite.scala

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

cloud-fan · 2025-07-31T13:20:25Z

...c/test/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDownSuite.scala

+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.connector.read.SupportsPushDownJoin.ColumnWithAlias
+
+class V2ScanRelationPushDownSuite extends SparkFunSuite {


Suggested change

-class V2ScanRelationPushDownSuite extends SparkFunSuite {
+class DSV2JoinPushDownAliasGenerationSuite extends SparkFunSuite {

...ore/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCV2JoinPushdownIntegrationSuiteBase.scala

dengziming · 2025-08-02T08:41:38Z

@cloud-fan comments resolved.

cloud-fan · 2025-08-04T05:28:34Z

thanks, merging to master!

dongjoon-hyun

Hi, @dengziming and @cloud-fan .

This seems to break non-ANSI GitHub CI. Could you take a look at the failure?

https://github.com/apache/spark/actions/workflows/build_non_ansi.yml

[info] - scan with filter push-down with date time functions *** FAILED *** (531 milliseconds)
[info]   List(Filter (month(cast(DATE1#3188 as date)) = 5)
[info]   +- RelationV2[NAME#3187, DATE1#3188] oracle.SYSTEM.DATETIME
[info]   ) was not empty (DataSourcePushdownTestUtils.scala:44)

dengziming · 2025-08-11T02:36:37Z

> This seems to break non-ANSI GitHub CI. Could you take a look at the failure?

I will take a look right now.

dongjoon-hyun · 2025-08-11T04:11:23Z

Thank you, @dengziming .

github-actions bot added the SQL label Jul 28, 2025

PetarVasiljevic-DB suggested changes Jul 28, 2025

PetarVasiljevic-DB reviewed Jul 29, 2025

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala (outdated, resolved)

PetarVasiljevic-DB approved these changes Jul 29, 2025

dengziming force-pushed the SPARK-52975 branch from c0dd5f0 to 3241b26 July 29, 2025 10:04

PetarVasiljevic-DB approved these changes Jul 29, 2025

abhiips07 reviewed Jul 29, 2025

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala (outdated, resolved)

dengziming added 6 commits July 30, 2025 08:41

[SPARK-52975][SQL] Simplify field names in pushdown join sql

91d9e17

More deterministic way

fafd878

More deterministic way

93ae8e6

Avoid o(n^2) worst case

6ad97ac

refactor: move big method out.

6a97817

more improvement

f71d4a3

dengziming force-pushed the SPARK-52975 branch from 5de7386 to f71d4a3 July 30, 2025 00:41

dengziming added 2 commits July 30, 2025 11:00

Improve test cases

59efef2

Improve test cases

a732f8b

dengziming commented Jul 30, 2025

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourcePushdownTestUtils.scala (resolved)

OracleJoinPushdownIntegrationSuite: [0,1,2,3,0,-1,-2,-3] did not equal [0,1,2,3,0,-1,-2,-3]

c255fe1

dengziming commented Jul 30, 2025

cloud-fan reviewed Jul 30, 2025

dengziming added 3 commits July 30, 2025 21:22

code optimized.

9fe40b4

test consider case sensitivity

adf1c2c

Make generation case-sensitive(sql-server is different)

76d0fe2

PetarVasiljevic-DB reviewed Jul 31, 2025

...c/test/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDownSuite.scala (outdated, resolved)

PetarVasiljevic-DB approved these changes Jul 31, 2025

improve test style and assert

ba8fff5

cloud-fan reviewed Jul 31, 2025

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala (outdated, resolved)

cloud-fan reviewed Jul 31, 2025

...ore/src/test/scala/org/apache/spark/sql/jdbc/v2/JDBCV2JoinPushdownIntegrationSuiteBase.scala (outdated, resolved)

cloud-fan approved these changes Jul 31, 2025

resolve comments

f420c85

cloud-fan closed this in 34d7a3c Aug 4, 2025

dongjoon-hyun reviewed Aug 8, 2025

