Unexpected error from cudaGetDeviceCount() #694

Closed
2 tasks done
LindezaGrey opened this issue May 24, 2024 · 14 comments · Fixed by #697
Labels: bug (Something isn't working)

Comments

@LindezaGrey

Has this issue been opened before?

  • It is not in the FAQ, I checked.
  • It is not in the issues, I searched.

Describe the bug

With the current NVIDIA driver from 2024-05-21 (version 555.85), the container no longer starts; it fails with:

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found

From my research, this comes from a version mismatch between the CUDA components. I tried to find an updated base image for

FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime

with CUDA 12.5 support, but I couldn't find one yet. My workaround was to downgrade my NVIDIA driver to version 552.44, which made things work again.
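
For anyone hitting this, a quick way to confirm the mismatch is to compare what the host driver reports with what PyTorch inside the container sees. This is only a rough sketch (the image tag is the one from the Dockerfile above; adjust it to whatever your compose file actually builds from):

# host side: note the Driver Version / CUDA Version pair
nvidia-smi

# container side: force CUDA initialization; on an affected setup this should
# raise the same "Error 500: named symbol not found" RuntimeError
docker run --rm --gpus all pytorch/pytorch:2.1.2-cuda12.1-cudnn8-runtime \
  python -c "import torch; print(torch.version.cuda); torch.cuda.init()"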

Which UI

Tested with auto (AUTOMATIC1111).

Hardware / Software

  • OS: Windows 10
  • OS version: 22H2
  • Docker Version: 26.1.1
  • Docker compose version: v2.27.0-desktop.2
  • Repo version: latest
  • RAM: 64GB
  • GPU/VRAM: RTX 3080

Additional context
I opened this mainly for reference in case other people run into the same problem.

LindezaGrey added the bug label on May 24, 2024
@MarioLiebisch

Can confirm, seeing the same with another RTX 3080 and driver 555.85.

Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found: str
Traceback (most recent call last):
  File "/stable-diffusion-webui/modules/errors.py", line 98, in run
    code()
  File "/stable-diffusion-webui/modules/devices.py", line 106, in enable_tf32
    if cuda_no_autocast():
  File "/stable-diffusion-webui/modules/devices.py", line 28, in cuda_no_autocast
    device_id = get_cuda_device_id()
  File "/stable-diffusion-webui/modules/devices.py", line 40, in get_cuda_device_id
    ) or torch.cuda.current_device()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 769, in current_device
    _lazy_init()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        On  |   00000000:01:00.0  On |                  N/A |
| 36%   40C    P0             41W /  340W |    2348MiB /  10240MiB |     16%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        23      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        32      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

@AbdBarho
Owner

Thanks for the heads up. I will pin this issue for now.

Maybe an update to the NVIDIA Container Toolkit is necessary?
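
For reference, a minimal sketch of how to check which toolkit version is currently in play (on Docker Desktop the toolkit is bundled with the Desktop release; on Linux it is a separate package):

nvidia-ctk --version            # prints the installed NVIDIA Container Toolkit CLI version
docker info | grep -i runtime   # shows whether the nvidia runtime is registered with Docker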

AbdBarho pinned this issue on May 26, 2024
@maifeeulasad

@AbdBarho I reproduced it. I also tried pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel, which is the latest CUDA toolkit currently available for PyTorch, but the outcome was exactly the same.

Local CUDA details: NVIDIA-SMI 555.85, Driver Version: 555.85, CUDA Version: 12.5

Complete log:

webui-docker-auto-1  | Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found: str
webui-docker-auto-1  | Traceback (most recent call last):
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/errors.py", line 98, in run
webui-docker-auto-1  |     code()
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/devices.py", line 106, in enable_tf32
webui-docker-auto-1  |     if cuda_no_autocast():
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/devices.py", line 28, in cuda_no_autocast
webui-docker-auto-1  |     device_id = get_cuda_device_id()
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/devices.py", line 40, in get_cuda_device_id
webui-docker-auto-1  |     ) or torch.cuda.current_device()
webui-docker-auto-1  |   File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 769, in current_device
webui-docker-auto-1  |     _lazy_init()
webui-docker-auto-1  |   File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 298, in _lazy_init
webui-docker-auto-1  |     torch._C._cuda_init()
webui-docker-auto-1  | RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found
webui-docker-auto-1  |
webui-docker-auto-1  | During handling of the above exception, another exception occurred:
webui-docker-auto-1  |
webui-docker-auto-1  | Traceback (most recent call last):
webui-docker-auto-1  |   File "/stable-diffusion-webui/webui.py", line 13, in <module>
webui-docker-auto-1  |     initialize.imports()
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/initialize.py", line 36, in imports
webui-docker-auto-1  |     shared_init.initialize()
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/shared_init.py", line 17, in initialize
webui-docker-auto-1  |     from modules import options, shared_options
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/shared_options.py", line 4, in <module>
webui-docker-auto-1  |     from modules import localization, ui_components, shared_items, shared, interrogate, shared_gradio_themes, util, sd_emphasis
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/interrogate.py", line 13, in <module>
webui-docker-auto-1  |     from modules import devices, paths, shared, lowvram, modelloader, errors, torch_utils
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/devices.py", line 113, in <module>
webui-docker-auto-1  |     errors.run(enable_tf32, "Enabling TF32")
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/errors.py", line 100, in run
webui-docker-auto-1  |     display(task, e)
webui-docker-auto-1  |   File "/stable-diffusion-webui/modules/errors.py", line 68, in display
webui-docker-auto-1  |     te = traceback.TracebackException.from_exception(e)
webui-docker-auto-1  |   File "/opt/conda/lib/python3.10/traceback.py", line 572, in from_exception
webui-docker-auto-1  |     return cls(type(exc), exc, exc.__traceback__, *args, **kwargs)
webui-docker-auto-1  | AttributeError: 'str' object has no attribute '__traceback__'
webui-docker-auto-1 exited with code 1

@Andreybest

Hey guys, I have the same issue. I can confirm that downgrading to 552.44 fixes it.

@maifeeulasad

I can confirm that the configuration below runs the system without any issues:

NVIDIA-SMI 551.78
Driver Version: 551.78
CUDA Version: 12.4

And in services/invoke/Dockerfile, we need to use:

FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel

ref (the line currently in that Dockerfile):

FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
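
If you try this locally, the image has to be rebuilt for the Dockerfile change to take effect. Roughly along these lines (the profile name is an assumption based on this repo's compose setup; use the one matching your UI):

docker compose --profile invoke build --no-cache   # rebuild from the edited Dockerfile
docker compose --profile invoke up                 # start the rebuilt container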

@AbdBarho
Owner

I have updated the containers. Please check again with the latest master; if the issue still persists, please re-open this issue.

@gnetget

gnetget commented May 28, 2024

The issue still persists.

@AbdBarho AbdBarho reopened this May 29, 2024
@MarioLiebisch

I just checked the NVIDIA driver feedback thread, and it's actually a listed known issue:

PyTorch-CUDA Docker not compatible with CUDA 12.5/GRD 555.85 [4668302]

@maifeeulasad

@MarioLiebisch the list of supported versions can be found here:
[screenshot: PyTorch "Get Started" matrix listing the supported CUDA versions]

ref: https://pytorch.org/

@MarioLiebisch

MarioLiebisch commented May 29, 2024

There's a (partial) fix available: NVIDIA/nvidia-container-toolkit#520

@cliffwoolley

Docker Desktop 4.31 was released yesterday and includes NVIDIA Container Toolkit 1.15.0, which resolves this issue.
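
A quick sanity check after updating, sketched here with a generic CUDA base image (the tag is an assumption; any available CUDA base tag works):

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi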

@MarioLiebisch

Ah, very nice. Seems like I barely missed the release when looking for updates earlier yesterday.

@AbdBarho
Owner

Docker Desktop 4.31 was released yesterday and includes NVIDIA Container Toolkit 1.15.0, which resolves this issue.

Can confirm, an update to NVIDIA Container Toolkit 1.15.0 has resolved the issue.

On Linux: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
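
For Debian/Ubuntu hosts, after setting up NVIDIA's apt repository as described in the linked guide, the update roughly boils down to this (a sketch, not a substitute for the guide):

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit     # needs to end up at 1.15.0 or newer
sudo nvidia-ctk runtime configure --runtime=docker   # re-register the nvidia runtime with Docker
sudo systemctl restart docker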

@Qs-Tim

Qs-Tim commented Nov 5, 2024

I ran into the same problem and solved it by updating my Docker Desktop.
