[GPU] GOSS boosting error on GPU H100 #6811

Open
SergeevVladislav opened this issue Feb 2, 2025 · 4 comments

SergeevVladislav commented Feb 2, 2025

Description

I encountered the following error while training a binary classification model with lightgbm 4.5.0 on an H100 with device="cuda":

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/pywrapper_utils/run_thread/full_batch_run_thread.py", line 47, in _execute_user_function
    result = self.user_main_function(**kwargs)
  File "/opt/module/source/main.py", line 31, in main
    model.perform_all_calculations()
  File "/opt/module/source/model/feature_selector.py", line 61, in perform_all_calculations
    selected_features: List[Tuple] = self.select_features(base_model, kfold)
  File "/opt/module/source/model/feature_selector.py", line 84, in select_features
    model.fit(X_train, y_train)
  File "/tmp/.local/lib/python3.9/site-packages/lightgbm/sklearn.py", line 1284, in fit
    super().fit(
  File "/tmp/.local/lib/python3.9/site-packages/lightgbm/sklearn.py", line 955, in fit
    self._Booster = train(
  File "/tmp/.local/lib/python3.9/site-packages/lightgbm/engine.py", line 307, in train
    booster.update(fobj=fobj)
  File "/tmp/.local/lib/python3.9/site-packages/lightgbm/basic.py", line 4135, in update
    _safe_call(
  File "/tmp/.local/lib/python3.9/site-packages/lightgbm/basic.py", line 296, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
lightgbm.basic.LightGBMError: [CUDA] invalid argument /tmp/pip-install-9rgzugd6/lightgbm_37941d8e64514c0e844ef71f72ef6b9c/src/boosting/goss.hpp 63

Environment info

python3.9
cuda 12.4
scikit-learn==1.6.1

Command(s) you used to install LightGBM

pip install lightgbm --config-settings=cmake.define.USE_CUDA=ON

jameslamb (Collaborator) commented

Thanks for using LightGBM.

Are you able to share a minimal, reproducible example? Or at least, the exact parameters you passed to LightGBM?

The LightGBM functions you use and the configuration you pass to them change which underlying code is called. Providing details like that reduces the effort required to investigate this.
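
One way to gather those details in a single paste-able block is sketched below. It uses only the Python standard library; the helper names `package_version` and `build_report` are hypothetical, and the parameter dictionary in the usage section is a placeholder for whatever was actually passed to LightGBM.

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_report(params, packages=("lightgbm", "scikit-learn", "numpy")):
    """Bundle interpreter, platform, package versions, and parameters as JSON."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "params": params,
    }
    # default=str keeps the dump from failing on non-JSON values such as callables.
    return json.dumps(report, indent=2, sort_keys=True, default=str)


if __name__ == "__main__":
    # Placeholder parameters; replace with the ones used in the failing run.
    print(build_report({"boosting_type": "goss", "objective": "binary", "device": "cuda"}))
```

For a scikit-learn style estimator, passing `model.get_params()` as `params` should capture the full configuration, including defaults that were never set explicitly.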

jameslamb changed the title from "GOSS boosting error on GPU H100" to "[GPU] GOSS boosting error on GPU H100" on Feb 7, 2025

SergeevVladislav commented Feb 8, 2025

Sorry, but I can't share the code because of an NDA.
But maybe this code will reproduce the error on an H100 with CUDA 12.4:

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

# Public binary-classification dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
train_set = lgb.Dataset(X, label=y)

# GOSS boosting on the CUDA device, as in the failing run.
params = {
    'boosting_type': 'goss',
    'objective': 'binary',
    'device': 'cuda',
}

model = lgb.train(params, train_set, num_boost_round=100)

jameslamb (Collaborator) commented

> But maybe this code will reproduce this error on H100 with CUDA 12.4

Does it for you, on the H100(s) you have access to?

You could help reduce the effort required to debug this by coming up with a self-contained minimal example like that one. It shouldn't be affected by any NDA if it uses publicly available data and non-proprietary code.
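
A minimal example is often easier to narrow down when the same training call is run under a few parameter variants, so that the failing combination stands out. The sketch below is one possible shape for that, not a confirmed reproducer: `run_variants` and `lightgbm_demo` are hypothetical helper names, and the variants listed are only a guess at what is worth comparing for this issue.

```python
def run_variants(train_fn, base_params, variants):
    """Call train_fn once per variant and record whether each run succeeded."""
    results = {}
    for name, overrides in variants.items():
        params = {**base_params, **overrides}
        try:
            train_fn(params)
        except Exception as exc:
            results[name] = f"failed: {type(exc).__name__}: {exc}"
        else:
            results[name] = "ok"
    return results


def lightgbm_demo():
    """Assumes lightgbm (built with CUDA support) and scikit-learn are installed."""
    import lightgbm as lgb
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    def train(params):
        # Build a fresh Dataset per run so no state is shared between variants.
        lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)

    base = {"objective": "binary", "verbose": -1}
    # The suspect combination goes last, in case a CUDA failure
    # leaves the process in a bad state for later runs.
    variants = {
        "cpu + goss": {"device": "cpu", "boosting_type": "goss"},
        "cuda + gbdt": {"device": "cuda", "boosting_type": "gbdt"},
        "cuda + goss": {"device": "cuda", "boosting_type": "goss"},
    }
    for name, outcome in run_variants(train, base, variants).items():
        print(f"{name}: {outcome}")
```

Calling `lightgbm_demo()` on the affected machine and pasting the printed lines into the issue would show whether the error is specific to GOSS on CUDA or also appears with other settings.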

SergeevVladislav (Author) commented

OK, I'll try it out.
I'll write back soon.
