dengziming
Member

@dengziming dengziming commented Jul 28, 2025

What changes were proposed in this pull request?

When pushing down join SQL, we generate aliases for duplicated column names, but the aliases are too long to read and are nondeterministic.

Before this change:

SELECT "ID_bf822dc6_e06d_492c_a489_1e92a6fe84a0","AMOUNT_c9f3fc67_62f8_4ec6_9c3f_b7ee7bafcb5a","ADDRESS_d937a313_3e09_4b97_b91f_b2a47ef5e31d","ID","AMOUNT","ADDRESS" FROM xxxx     

RelationV2[ID_bf822dc6_e06d_492c_a489_1e92a6fe84a0#18, AMOUNT_c9f3fc67_62f8_4ec6_9c3f_b7ee7bafcb5a#19, ADDRESS_d937a313_3e09_4b97_b91f_b2a47ef5e31d#20, ID#21, AMOUNT#22, ADDRESS#23] join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1

After this change:

SELECT "ID","AMOUNT","ADDRESS","ID_1","AMOUNT_1","ADDRESS_1" FROM xxx   

RelationV2[ID#18, AMOUNT#19, ADDRESS#20, ID_1#21, AMOUNT_1#22, ADDRESS_1#23] join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1

Why are the changes needed?

Make code-generated JDBC SQL clearer and deterministic.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests can ensure no side effects are introduced.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Trae.

@github-actions github-actions bot added the SQL label Jul 28, 2025
@dengziming
Member Author

cc @PetarVasiljevic-DB

Contributor

@PetarVasiljevic-DB PetarVasiljevic-DB left a comment

@dengziming with this approach, if the left side has column COL and the right side has columns COL and COL_0, the alias generator will generate COL_0, which would conflict with COL_0 from the right side.
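
The collision described in this comment can be avoided by reserving every name that occurs only once before any alias is handed out. The following is a minimal sketch of that idea, not the code in this PR: the object `AliasCollisionSketch` and the method `generateAliases` are hypothetical names, the `_1`, `_2`, ... suffix scheme follows the PR description rather than the `COL_0` in the comment, and the sketch assumes that no column name repeats within a single side of the join.

```scala
import scala.collection.mutable

// Hypothetical illustration only; not the Spark implementation.
object AliasCollisionSketch {

  /** Returns the output names for the left side and the right side of a join. */
  def generateAliases(left: Seq[String], right: Seq[String]): (Seq[String], Seq[String]) = {
    // Count how often each name occurs across both sides.
    val counts: Map[String, Int] =
      (left ++ right).groupBy(identity).map { case (name, occurrences) => name -> occurrences.size }

    // Reserve every name that occurs exactly once, so that no generated
    // alias can take the name of an existing, non-duplicated column.
    val claimed = mutable.HashSet.empty[String] ++ counts.collect { case (name, 1) => name }

    def assign(name: String): String = {
      if (counts(name) == 1) {
        name // unique name, already reserved
      } else if (claimed.add(name)) {
        name // first occurrence of a duplicated name keeps the original name
      } else {
        // Later occurrences get the first free numbered suffix.
        val alias = Iterator.from(1).map(i => s"${name}_$i").find(a => !claimed.contains(a)).get
        claimed += alias
        alias
      }
    }

    // The left side is processed first, so it keeps the original names.
    (left.map(assign), right.map(assign))
  }
}
```

With left column `COL` and right columns `COL`, `COL_1`, this sketch keeps `COL` on the left and produces `COL_2`, `COL_1` on the right, because `COL_1` was reserved up front. The result depends only on the input order, so it is deterministic.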

@dengziming
Member Author

@dengziming with this approach, if the left side has column COL and the right side has columns COL and COL_0, the alias generator will generate COL_0, which would conflict with COL_0 from the right side.

Good catch @PetarVasiljevic-DB, let me think of another way.

Contributor

@PetarVasiljevic-DB PetarVasiljevic-DB left a comment

LGTM, the generated text is much clearer, and more importantly, it is deterministic now. Thanks for the change!

By the way, could we move generateColumnAliasesForDuplicatedName below pushdownJoin? Or above, it doesn't really matter; I just find it too big to have as a nested method.

@@ -657,7 +657,7 @@ trait JDBCV2JoinPushdownIntegrationSuiteBase
       withSQLConf(SQLConf.DATA_SOURCE_V2_JOIN_PUSHDOWN.key -> "true") {
         val df = sql(sqlQuery)
         val row = df.collect()(0)
-        assert(row == Row(0, 1, 2, 3, 0, -1, -2, -3))
+        assert(row.toString == Row(0, 1, 2, 3, 0, -1, -2, -3).toString)
Member Author

It seems that Oracle will use DecimalType, so we can't compare Row directly.
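
The reason a direct comparison fails can be illustrated without Spark. The sketch below is a simplified model, not the test code: it uses plain `Seq[Any]` values in place of `Row`, and it assumes that decimal columns surface as `java.math.BigDecimal` while the expected values are `Int`. The object name `DecimalRowComparisonSketch` is hypothetical.

```scala
// Hypothetical illustration only; Seq[Any] stands in for a Spark Row.
object DecimalRowComparisonSketch {
  // Expected values as written in the test: plain integers.
  val expected: Seq[Any] = Seq(0, 1, 2, 3)

  // Values as a dialect that maps numeric columns to a decimal type might return them.
  val fromDatabase: Seq[Any] = Seq(0, 1, 2, 3).map(i => new java.math.BigDecimal(i))

  // Element-wise equality fails: an Integer never equals a java.math.BigDecimal.
  val equalDirectly: Boolean = expected == fromDatabase

  // The string forms agree, which is what the toString-based assertion relies on.
  val equalAsStrings: Boolean =
    expected.mkString("[", ",", "]") == fromDatabase.mkString("[", ",", "]")
}
```

Comparing string forms is a pragmatic choice for a suite shared across dialects; it trades type strictness for portability.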

@dengziming
Member Author

Hello @cloud-fan
Please take a look at this. We have conducted a thorough check and @PetarVasiljevic-DB has already approved.


test("Test complex duplicate column name alias") {
  sql(s"create table $catalogAndNamespace.t1(id int, id_1 int, id_2 int, id_1_1 int)")
  sql(s"create table $catalogAndNamespace.t2(id int, id_1 int, id_2 int, id_2_1 int)")
Contributor

can we create them in def tablePreparation?

// Count occurrences of each column name across both sides to identify duplicates.
val allRequiredColumnNames = leftSideRequiredColumnNames ++ rightSideRequiredColumnNames
val allNameCounts: Map[String, Int] =
  allRequiredColumnNames.groupBy(identity).view.mapValues(_.size).toMap
Contributor

Shall we consider case sensitivity? If the left side has col and the right side has COL, do we need to generate an alias?

Member Author

After some investigation, I don't think it's necessary. If our SQL is select * from a(id,sid) join b(id,Sid), there are two versions of the SQL that we could push down to the database:

  1. select id, sid, id_1, Sid from (select id, sid from a) join (select id as id_1, Sid from b)
  2. select id, sid, id_1, sid_1 from (select id, sid from a) join (select id as id_1, Sid as sid_1 from b)

I added this to my test case to show that version 1 also works, and version 2 doesn't make the SQL any clearer.
Is it possible that we will hit AMBIGUOUS_REFERENCE with version 1?

Contributor

The generated SQL is processed by the underlying database, so are we assuming that all dialects are case sensitive?

Member Author

No. I thought they were case sensitive at first, but I tested locally and found that SQL Server is not case sensitive, so I have updated this PR. Please review my latest commit and my latest comment here: #51686 (comment)

allRequiredColumnNames.groupBy(identity).view.mapValues(_.size).toMap
// Use Set for O(1) lookups when checking existing column names, claim all names
// that appear only once to ensure they have highest priority.
val allClaimedAliases = mutable.HashSet.empty ++ allNameCounts.filter(_._2 == 1).keySet
Contributor

Suggested change
- val allClaimedAliases = mutable.HashSet.empty ++ allNameCounts.filter(_._2 == 1).keySet
+ val allClaimedAliases = allNameCounts.filter(_._2 == 1).keySet.to[mutable.Set]

@dengziming
Member Author

@cloud-fan, your idea is worth considering. SQL Server reports "Ambiguous column name 'sid'" when running my test, so we need to generate different aliases if two column names are equal ignoring case. Please review my latest commit. cc @PetarVasiljevic-DB
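
Detecting duplicates while ignoring case could look roughly like the sketch below. It is an illustration under assumptions, not the code from the latest commit: the object `CaseInsensitiveDuplicateSketch` and its methods are hypothetical, and lower-casing with `Locale.ROOT` is just one possible normalization.

```scala
import java.util.Locale

// Hypothetical illustration only; not the Spark implementation.
object CaseInsensitiveDuplicateSketch {

  // Normalize a column name so that "sid" and "Sid" share one key.
  private def key(name: String): String = name.toLowerCase(Locale.ROOT)

  /** Counts occurrences of each column name across both sides, ignoring case. */
  def nameCountsIgnoringCase(left: Seq[String], right: Seq[String]): Map[String, Int] =
    (left ++ right).groupBy(key).map { case (k, names) => k -> names.size }

  /** A name needs an alias candidate when it occurs more than once, ignoring case. */
  def needsAlias(name: String, counts: Map[String, Int]): Boolean =
    counts.getOrElse(key(name), 0) > 1
}
```

For `a(id, sid)` joined with `b(id, Sid)`, both `id` and `sid` are counted twice, so `Sid` is treated as a duplicate of `sid` and would receive its own alias.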

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.connector.read.SupportsPushDownJoin.ColumnWithAlias

class V2ScanRelationPushDownSuite extends SparkFunSuite {
Contributor

Suggested change
- class V2ScanRelationPushDownSuite extends SparkFunSuite {
+ class DSV2JoinPushDownAliasGenerationSuite extends SparkFunSuite {

Member Author

Done!

@dengziming
Member Author

@cloud-fan comments resolved.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 34d7a3c Aug 4, 2025
Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @dengziming and @cloud-fan .

This seems to break non-ANSI GitHub CI. Could you take a look at the failure?

https://github.com/apache/spark/actions/workflows/build_non_ansi.yml

[info] - scan with filter push-down with date time functions *** FAILED *** (531 milliseconds)
[info]   List(Filter (month(cast(DATE1#3188 as date)) = 5)
[info]   +- RelationV2[NAME#3187, DATE1#3188] oracle.SYSTEM.DATETIME
[info]   ) was not empty (DataSourcePushdownTestUtils.scala:44)

@dengziming
Member Author

This seems to break non-ANSI GitHub CI. Could you take a look at the failure?

I will take a look right now.

@dongjoon-hyun
Member

Thank you, @dengziming .
