
[jvm-packages] Support spark connect #11381


Draft · wants to merge 1 commit into master

Conversation

@wbo4958 (Contributor) commented Mar 31, 2025

This PR makes the xgboost jvm package run on spark connect. Currently I put the xgboost4j python wrapper into jvm-packages/xgboost4j-spark/python, but I can move it somewhere else. @WeichenXu123 @trivialfis, let's discuss it on this thread.

To do:

  • add read/write
  • add XGBoostRegressor
  • add XGBoostRanker
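
For orientation, here is a sketch of what using the wrapper described above could look like under spark connect. Everything in it is illustrative rather than part of this PR: the import path, parameter names, and remote URL are assumptions; SparkSession.builder.remote is the standard connect entry point.

from pyspark.sql import SparkSession
from xgboost4j_spark import XGBoostClassifier  # hypothetical import path

# Connect to a Spark Connect server whose classpath includes xgboost4j-spark.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

train = spark.read.parquet("train.parquet")  # assumed "features"/"label" columns
clf = XGBoostClassifier(featuresCol="features", labelCol="label")
model = clf.fit(train)  # training executes in the JVM on the server side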

@trivialfis (Member) left a comment


I think the Python code should be part of the Python package; we can reuse a lot of code there, and it also helps unify the interfaces. In addition, Python packaging is not trivial: @hcho3 has been working on related projects, and I can feel the difficulty there. Let's not take it on for another package.

Comment on lines +31 to +44
"Environment :: GPU :: NVIDIA CUDA :: 11",
"Environment :: GPU :: NVIDIA CUDA :: 11.4",
"Environment :: GPU :: NVIDIA CUDA :: 11.5",
"Environment :: GPU :: NVIDIA CUDA :: 11.6",
"Environment :: GPU :: NVIDIA CUDA :: 11.7",
"Environment :: GPU :: NVIDIA CUDA :: 11.8",
"Environment :: GPU :: NVIDIA CUDA :: 12",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.0",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.1",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.2",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.3",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.4",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.5",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.6",
A Member replied:

These can be removed; we dropped support for all previous versions in 3.0.

from .params import XGBoostParams


class XGBoostClassifier(_JavaProbabilisticClassifier["XGBoostClassificationModel"], XGBoostParams):

@jjayadeep06 asked:

Will this be a new class that needs to be used for both normal and spark connect invocation? Could we instead modify the _fit method in SparkXGBClassifier to use the try_remote_fit decorator, or would that be a big change?
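
For readers unfamiliar with the decorator in question: PySpark's try_remote_* helpers in pyspark.ml.util branch on whether the active session is a Spark Connect session. Below is a simplified sketch of that pattern, not the actual PySpark source; is_remote is the real check from pyspark.sql.utils, while _fit_via_connect is a hypothetical placeholder for the connect-side path.

import functools

from pyspark.sql.utils import is_remote


def try_remote_fit(f):
    # Route Estimator._fit to the connect backend when the session is remote.
    @functools.wraps(f)
    def wrapper(self, dataset):
        if is_remote():
            # Hypothetical placeholder: the real decorator builds and sends
            # a connect ML command to the server instead.
            return self._fit_via_connect(dataset)
        # Classic path: py4j-backed call into the local JVM.
        return f(self, dataset)

    return wrapper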

@wbo4958 (Contributor, Author) replied:

Thanks @jjayadeep06 for your reply. When connect gets involved, things become complicated. We could make the existing xgboost-pyspark support spark connect by changing the RDD operations to DataFrame operations, without using any try_remote_xxxx decorators, and we do plan to do that.

This PR, however, makes the xgboost jvm package support connect by introducing a light-weight python wrapper. If we added that wrapper over the xgboost jvm package to the existing xgboost-python-pyspark, it would raise the question of which backend (the xgboost jvm package or the python package) should be chosen when running xgboost over connect.
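
For context, a minimal sketch of the light-weight wrapper approach described above, assuming the standard pyspark JavaEstimator pattern. Parameter plumbing and the model class are elided; ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier is the real xgboost4j-spark estimator, the rest is illustrative:

from pyspark.ml.classification import _JavaProbabilisticClassifier


class XGBoostClassifier(_JavaProbabilisticClassifier):
    """Thin Python handle around the Scala estimator; training runs in the JVM."""

    def __init__(self, **kwargs):
        super().__init__()
        # Instantiate the real xgboost4j-spark estimator on the JVM side.
        self._java_obj = self._new_java_obj(
            "ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier", self.uid
        )
        # Forward constructor arguments to the matching Params.
        self._set(**kwargs)

Under classic Spark this goes through py4j; under connect there is no py4j bridge, so the equivalent instantiation has to travel over the connect ML protocol, which is presumably the part this PR wires up.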

@jjayadeep06 replied:

Got it.

@wbo4958 (Contributor, Author) commented May 19, 2025

Hi @jjayadeep06, @WeichenXu123, @trivialfis, I propose integrating the python wrapper over the xgboost jvm package into the xgboost python package as an optional module, placed in a subdirectory such as xgboost/jvm/ or xgboost/4j, where the estimators would reside.

This would allow users to install the jvm wrapper independently, without requiring the full xgboost package, using the following command:

pip install xgboost[4j]

@trivialfis (Member) commented May 19, 2025

I don't think it's easy to merge the package into the existing one yet split it up during deployment. If it's merged, it's first class. You can put an experimental flag on it if you want.

I just want to say that the design of connect ML sets a very high bar for anyone who wants to dig in. One needs to know four different programming languages, plus some familiarity with spark connect ML, just to poke around; debugging native code is another level. I'm curious what led to this design.

Lastly, we need CI.

@jjayadeep06 commented:

Is the thought here that when someone wants to use spark connect, they can install only the python jvm wrapper instead of the entire xgboost package? If so, we could use the existing python folder structure with optional dependencies: pip install xgboost would install the current xgboost code, while pip install xgboost[connect] would install the jvm-only python wrapper.

@trivialfis (Member) replied:

The XGBoost JVM package still needs to live on the server side with a matching version.

> and if we do pip install xgboost[connect] then it will install the jvm only python wrapper ?

Based on my understanding of Python wheel packaging, xgboost[connect] means xgboost + connect-specific dependencies, just as xgboost[dask] installs xgboost in addition to all dask-related dependencies.
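
To make the extras semantics concrete, optional dependencies are declared in packaging metadata roughly like this. This is an illustrative pyproject.toml fragment, not XGBoost's actual metadata; the extra name and the pyspark pin are assumptions:

[project.optional-dependencies]
# "pip install xgboost[connect]" installs xgboost itself plus these extras.
connect = ["pyspark[connect]>=4.0"]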

@jjayadeep06 commented May 22, 2025

> The XGBoost JVM package still needs to live on the server side with a matching version.

In Spark Connect the server is always a JVM server, so I don't think we need the python jvm package on the server side. Also, this PR uses the standard scala/java APIs that other SparkML models use, so there is no additional deployment needed. @wbo4958 - please correct me.

> and if we do pip install xgboost[connect] then it will install the jvm only python wrapper ?
> Based on my understanding of Python wheel packaging, xgboost[connect] means xgboost + connect specific dependencies. Like xgboost[dask] will install xgboost in addition to all dask-related dependencies.

You are correct: it will install the additional connect dependencies along with the core xgboost dependency. If we want two separate packaging structures, we might have to alter the project structure.
