Unexpected error from cudaGetDeviceCount() #694

Closed
2 tasks done
LindezaGrey opened this issue May 24, 2024 · 14 comments · Fixed by #697
Labels: bug (Something isn't working)

Comments

@LindezaGrey

Has this issue been opened before?

  • It is not in the FAQ, I checked.
  • It is not in the issues, I searched.

Describe the bug

With the current NVIDIA driver from 2024-05-21 (version 555.85), the container no longer starts; it fails with:

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found

From my research, this comes from a version mismatch between the CUDA components. I tried to find an updated base image for

FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

with CUDA 12.5 support, but I couldn't find one yet. My workaround was to downgrade my NVIDIA driver to version 552.44, which made things work again.
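
For anyone hitting this, a quick way to confirm the mismatch is to compare what the host driver reports with what PyTorch inside the container sees. This is only a rough sketch (the image tag is the one from the Dockerfile above; adjust it to whatever your compose file actually builds from):

# host side: note the Driver Version / CUDA Version pair
nvidia-smi

# container side: force CUDA initialization; on an affected setup this should
# raise the same "Error 500: named symbol not found" RuntimeError
docker run --rm --gpus all pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime \
  python -c "import torch; print(torch.version.cuda); torch.cuda.init()"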

Which UI

Tested with auto (AUTOMATIC1111).

Hardware / Software

  • OS: Windows 10
  • OS version: 22H2
  • Docker Version: 26.1.1
  • Docker compose version: v2.27.0-desktop.2
  • Repo version: latest
  • RAM: 64GB
  • GPU/VRAM: RTX 3080

Additional context
I opened this mainly for reference in case other people run into the same problem.

LindezaGrey added the bug label on May 24, 2024
@MarioLiebisch

Can confirm, seeing the same with another RTX 3080 and driver 555.85.

Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found: str
Traceback (most recent call last):
  File "/stable-diffusion-webui/modules/errors.py", line 98, in run
    code()
  File "/stable-diffusion-webui/modules/devices.py", line 106, in enable_tf32
    if cuda_no_autocast():
  File "/stable-diffusion-webui/modules/devices.py", line 28, in cuda_no_autocast
    device_id = get_cuda_device_id()
  File "/stable-diffusion-webui/modules/devices.py", line 40, in get_cuda_device_id
    ) or torch.cuda.current_device()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 769, in current_device
    _lazy_init()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        On  |   00000000:01:00.0  On |                  N/A |
| 36%   40C    P0             41W /  340W |    2348MiB /  10240MiB |     16%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        23      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        32      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

@AbdBarho
Owner

Thanks for the heads up. I will pin this issue for now.

Maybe an update to the NVIDIA Container Toolkit is necessary?
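
For reference, a minimal sketch of how to check which toolkit version is currently in play (on Docker Desktop the toolkit is bundled with the Desktop release; on Linux it is a separate package):

nvidia-ctk --version            # prints the installed NVIDIA Container Toolkit CLI version
docker info | grep -i runtime   # shows whether the nvidia runtime is registered with Docker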

AbdBarho pinned this issue on May 26, 2024
@maifeeulasad

@AbdBarho I reproduced it. I also tried pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel, which is the latest CUDA toolkit currently available for PyTorch, but the outcome was exactly the same.

Local CUDA details: NVIDIA-SMI 555.85, Driver Version: 555.85, CUDA Version: 12.5

Complete log:

webui-docker-auto-1  | Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found: str
webui-docker-auto-1  | Traceback (most recent call last):
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/errors.py", line 98, in run
webui-docker-auto-1  |     code()
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/devices.py", line 106, in enable_tf32
webui-docker-auto-1  |     if cuda_no_autocast():
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/devices.py", line 28, in cuda_no_autocast
webui-docker-auto-1  |     device_id = get_cuda_device_id()
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/devices.py", line 40, in get_cuda_device_id
webui-docker-auto-1  |     ) or torch.cuda.current_device()
webui-docker-auto-1  |   File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 769, in current_device
webui-docker-auto-1  |     _lazy_init()
webui-docker-auto-1  |   File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
webui-docker-auto-1  |     torch._C._cuda_init()
webui-docker-auto-1  | RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found
webui-docker-auto-1  |
webui-docker-auto-1  | During handling of the above exception, another exception occurred:
webui-docker-auto-1  |
webui-docker-auto-1  | Traceback (most recent call last):
webui-docker-auto-1  |   File "/stable-diffusion-webui/webui.py", line 13, in <module>
webui-docker-auto-1  |     initialize.imports()
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/initialize.py", line 36, in imports
webui-docker-auto-1  |     shared_init.initialize()
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/shared_init.py", line 17, in initialize
webui-docker-auto-1  |     from modules import options, shared_options
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/shared_options.py", line 4, in <module>
webui-docker-auto-1  |     from modules import localization, ui_components, shared_items, shared, interrogate, shared_gradio_themes, util, sd_emphasis
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/interrogate.py", line 13, in <module>
webui-docker-auto-1  |     from modules import devices, paths, shared, lowvram, modelloader, errors, torch_utils
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/devices.py", line 113, in <module>
webui-docker-auto-1  |     errors.run(enable_tf32, "Enabling TF32")
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/errors.py", line 100, in run
webui-docker-auto-1  |     display(task, e)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/errors.py", line 68, in display
webui-docker-auto-1  |     te = traceback.TracebackException.from_exception(e)
webui-docker-auto-1  |   File "/opt/conda/lib/python3.10/traceback.py", line 572, in from_exception
webui-docker-auto-1  |     return cls(type(exc), exc, exc.__traceback__, *args, **kwargs)
webui-docker-auto-1  | AttributeError: 'str' object has no attribute '__traceback__'
webui-docker-auto-1 exited with code 1

@Andreybest

Hey guys, I have the same issue. I can confirm that downgrading to 552.44 fixes it.

@maifeeulasad

I can confirm that the configuration below runs the system without any issues:

NVIDIA-SMI 551.78
Driver Version: 551.78
CUDA Version: 12.4

And in services/invoke/Dockerfile, we need to use:

FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel

ref (the line currently in that Dockerfile):

FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
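
If you try this locally, the image has to be rebuilt for the Dockerfile change to take effect. Roughly along these lines (the profile name is an assumption based on this repo's compose setup; use the one matching your UI):

docker compose --profile invoke build --no-cache   # rebuild from the edited Dockerfile
docker compose --profile invoke up                 # start the rebuilt container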

@AbdBarho
Owner

I have updated the containers. Please check again with the latest master; if the issue still persists, please re-open this issue.

@gnetget

gnetget commented May 28, 2024

The issue still persists.

@AbdBarho AbdBarho reopened this May 29, 2024
@MarioLiebisch

I just checked the NVIDIA driver feedback thread, and it's actually a listed known issue:

PyTorch-CUDA Docker not compatible with CUDA 12.5/GRD 555.85 [4668302]

@maifeeulasad

@MarioLiebisch the list of supported versions can be found here:
[screenshot: PyTorch "Get Started" matrix listing the supported CUDA versions]

ref: https://pytorch.org/

@MarioLiebisch

MarioLiebisch commented May 29, 2024

There's a (partial) fix available: NVIDIA/nvidia-container-toolkit#520

@cliffwoolley

Docker Desktop 4.31 was released yesterday and includes NVIDIA Container Toolkit 1.15.0, which resolves this issue.
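
A quick sanity check after updating, sketched here with a generic CUDA base image (the tag is an assumption; any available CUDA base tag works):

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi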

@MarioLiebisch

Ah, very nice. Seems like I barely missed the release when looking for updates earlier yesterday.

@AbdBarho
Owner

Docker Desktop 4.31 was released yesterday and includes NVIDIA Container Toolkit 1.15.0, which resolves this issue.

Can confirm, an update to NVIDIA Container Toolkit 1.15.0 has resolved the issue.

On Linux: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
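
For Debian/Ubuntu hosts, after setting up NVIDIA's apt repository as described in the linked guide, the update roughly boils down to this (a sketch, not a substitute for the guide):

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit     # needs to end up at 1.15.0 or newer
sudo nvidia-ctk runtime configure --runtime=docker   # re-register the nvidia runtime with Docker
sudo systemctl restart docker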

@Qs-Tim

Qs-Tim commented Nov 5, 2024

I ran into the same problem and solved it by updating my Docker Desktop.
