Conversation

pan3793
Member

pan3793 commented Aug 25, 2025

What changes were proposed in this pull request?

This PR proposes introducing a JDBC Driver for Spark Connect Server.

Note: The JDBC standard defines hundreds of APIs; most JDBC drivers implement only a subset of them. This PR is PoC work that implements only a small subset of the JDBC API, but enough to integrate with BeeLine and use it as a SQL CLI.

This PoC PR handles only NULL, BOOLEAN, BYTE, SHORT, INT, BIGINT, FLOAT, DOUBLE, and STRING in ResultSet.

The JDBC URL reuses the current URL used by the Spark Connect client, with an additional prefix jdbc:, e.g., jdbc:sc://localhost:15002
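For illustration, here is a minimal sketch of how an application might use such a driver through the standard java.sql API. The class and the acceptsUrl helper are hypothetical (not part of this PR); the helper only mirrors the URL-prefix check a JDBC Driver.acceptsURL implementation would typically perform, and the connection step assumes a running Connect Server with the driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SparkConnectJdbcExample {

    // Hypothetical helper mirroring the prefix check a Driver.acceptsURL
    // implementation would do for the "jdbc:sc://" scheme described above.
    static boolean acceptsUrl(String url) {
        return url != null && url.startsWith("jdbc:sc://");
    }

    public static void main(String[] args) {
        String url = "jdbc:sc://localhost:15002";
        if (!acceptsUrl(url)) {
            throw new IllegalArgumentException("Not a Spark Connect JDBC URL: " + url);
        }
        // Requires a running Spark Connect server and the driver on the classpath.
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 'Hello, Spark Connect'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        } catch (SQLException e) {
            // No server or driver available in this environment.
            System.out.println("Connection unavailable: " + e.getMessage());
        }
    }
}
```

Because the driver plugs into the standard DriverManager mechanism, any JDBC-aware tool that can construct a `jdbc:sc://` URL should be able to reuse this path, which is exactly how BeeLine is wired up in the manual test below.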

Why are the changes needed?

This enables pure-SQL use cases for the Spark Connect Server.

Does this PR introduce any user-facing change?

Yes, a new feature.

How was this patch tested?

1. Added some basic unit tests.

2. Manual testing with BeeLine.

Start a Connect Server first (using Spark 4.0.0 as an example).

$ sbin/start-connect-server.sh

Package with Hive and STS (required by BeeLine).

$ build/sbt -Phive,hive-thriftserver package

Run BeeLine in interactive mode.

$ SPARK_PREPEND_CLASSES=true bin/beeline -u jdbc:sc://localhost:15002
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
Connecting to jdbc:sc://localhost:15002
Connected to: Apache Spark Connect Server (version 4.0.0)
Driver: Apache Spark Connect JDBC Driver (version 4.1.0-SNAPSHOT)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.10 by Apache Hive
0: jdbc:sc://localhost:15002> select 'Hello, Spark Connect', version();
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/25 12:04:09 WARN Utils: Your hostname, H27212-MAC-01.local, resolves to a loopback address: 127.0.0.1; using 10.242.159.140 instead (on interface en0)
25/08/25 12:04:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
+-----------------------+-------------------------------------------------+
| Hello, Spark Connect  |                    version()                    |
+-----------------------+-------------------------------------------------+
| Hello, Spark Connect  | 4.0.0 fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4  |
+-----------------------+-------------------------------------------------+
1 row selected (1.759 seconds)
0: jdbc:sc://localhost:15002>

Run BeeLine to execute a SQL file

$ cat > /tmp/select.sql <<EOF
select 'Hello, Spark Connect';
select version();
EOF
$ SPARK_PREPEND_CLASSES=true bin/beeline -u jdbc:sc://localhost:15002 -f /tmp/select.sql
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
Connecting to jdbc:sc://localhost:15002
Connected to: Apache Spark Connect Server (version 4.0.0)
Driver: Apache Spark Connect JDBC Driver (version 4.1.0-SNAPSHOT)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:sc://localhost:15002> select 'Hello, Spark Connect';
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/25 12:04:52 WARN Utils: Your hostname, H27212-MAC-01.local, resolves to a loopback address: 127.0.0.1; using 10.242.159.140 instead (on interface en0)
25/08/25 12:04:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
+-----------------------+
| Hello, Spark Connect  |
+-----------------------+
| Hello, Spark Connect  |
+-----------------------+
1 row selected (0.462 seconds)
0: jdbc:sc://localhost:15002> select version();
+-------------------------------------------------+
|                    version()                    |
+-------------------------------------------------+
| 4.0.0 fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4  |
+-------------------------------------------------+
1 row selected (0.046 seconds)
0: jdbc:sc://localhost:15002>
0: jdbc:sc://localhost:15002> Closing: 0: jdbc:sc://localhost:15002
$

Was this patch authored or co-authored using generative AI tooling?

No.

Member Author

pan3793 commented Aug 25, 2025

cc @HyukjinKwon @grundprinzip @hvanhovell @LuciferYang @yaooqinn

Please let me know if the Spark community likes this feature; if so, I will continue the work.


itskals commented Aug 25, 2025

This appears promising. JDBC constructs will facilitate easier integration with numerous clients. Please specify the scope of work involved and your expectations to call it done-done. It seems that an SPIP or a more detailed document than the current proposal may be appropriate.

@LuciferYang
Contributor

If this new approach has the potential to be fully compatible with and replace STS, and can enable Spark to completely remove the STS code in a future version, I will strongly support the introduction of this new feature.

Contributor

LuciferYang commented Aug 25, 2025

also cc @cloud-fan and @zhengruifeng

Member Author

pan3793 commented Aug 25, 2025

@itskals thanks for your response!

Please specify the scope of work involved ...

JDBC has well-defined APIs, so there isn't much room for implementation flexibility.

... your expectations to call it done-done.

This is a good question. I can imagine 3 milestones:

  1. Usable: supports all Spark primitive data types, and works well with BeeLine as a SQL CLI to execute SQL and retrieve results.
  2. Done: supports all Spark data types, and JDBC API implementation coverage reaches the level of the Hive JDBC driver, so it can compete with the Spark Thrift Server.
  3. Improvement: implements more JDBC APIs to enable the Spark Connect JDBC driver to integrate with more tools, e.g., DBeaver.

It seems that an SPIP or a more detailed document than the current proposal may be appropriate.

If my two answers above do not address your concerns, I can follow the SPIP guide for this feature. Thank you again for your quick reply!

Member Author

pan3793 commented Aug 25, 2025

If this new approach has the potential to be fully compatible with and replace STS, and can enable Spark to completely remove the STS code in a future version, I will strongly support the introduction of this new feature.

@LuciferYang thanks for your reply!

I suppose this feature could make the Connect Server a drop-in replacement for STS in two typical use cases: 1) using spark-sql/beeline to run SQL; 2) using the JDBC driver to access STS. But for users who access STS through other APIs, e.g., an ODBC driver or the Thrift APIs, additional work is required to migrate from STS to the Connect Server.

@HyukjinKwon
Member

I feel like this would need an SPIP ...

Member Author

pan3793 commented Aug 25, 2025

Okay, I will prepare an SPIP soon. Feedback is still welcome here :)

@grundprinzip
Contributor

I like the idea of a native JDBC driver based on Spark Connect; that makes a lot of sense! I'm supportive of going through an SPIP here, and I think that replacing the Spark Thrift Server is for sure a good idea :)

@hvanhovell
Contributor

@pan3793 nice work! Very much in favor of this!

Member Author

pan3793 commented Aug 26, 2025

@grundprinzip @hvanhovell thanks for your positive feedback. I will likely submit the SPIP doc and start the discussion next week.
