
nvidia-modprobe to potentially early out when nvidia blacklisted (Wayland + driver init issue) #5

Open
tim-rex opened this issue Jan 4, 2024 · 3 comments


tim-rex commented Jan 4, 2024

I've been exploring the purpose of nvidia-modprobe recently, and the implications for anyone using a dual-gpu setup and occasionally needing to blacklist the nvidia drivers. I'm using Wayland exclusively.

It's my understanding that nvidia-modprobe is provided as a fallback mechanism to ensure the nvidia driver is initialised with root privileges (should it not already be properly initialised). The call to nvidia-modprobe appears to be triggered by the nvidia libraries themselves when they are invoked by the relevant ICD, e.g.:

libnvidia-egl-gbm.so
libGLX_nvidia.so
libnvidia-egl-wayland.so

I've found that even when the nvidia drivers themselves are blacklisted, any program that tries to invoke or interrogate the ICDs for available devices causes nvidia-modprobe to be called (which, in turn, attempts to modprobe nvidia as root).

Unfortunately, modprobe isn't the quickest in town and it takes a while for it to fail when the nvidia drivers are blacklisted (close to 1 second in my testing).

The problem is compounded by diagnostic tools such as inxi.
For example, inxi -Fxz will repeatedly poll the ICD layer (approximately 33 times), which in turn loads the nvidia shared libraries (33 times), which triggers nvidia-modprobe (33 times).

This chain of events takes approximately 30 seconds to complete, while my journal logs show (correctly) that Module nvidia is blacklisted (33 times).

This isn't the end of the world, though I've tried to mitigate the issue as follows:

Workaround
It's been suggested that I should be able to move nvidia-modprobe out of the way, short-circuiting this chain of events somewhat. This does have the desired effect when the nvidia drivers are blacklisted.

Problem
This has a side effect when the nVidia drivers are not blacklisted.
Specifically, despite the nvidia module being present and accounted for (via lsmod), it seems the appropriate device files have not been created (or the driver is otherwise not fully initialised).

This is evidenced by the likes of eglinfo / vulkaninfo not showing the nVidia device whatsoever.
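A rough way to confirm this state (module loaded, device files absent) is sketched below. It assumes the driver's usual device nodes are /dev/nvidiactl and /dev/nvidia0; the exact set varies with GPU count and features, and the function names are made up for illustration.

```python
from pathlib import Path

# Assumed set of device nodes; adjust for systems with more GPUs or
# extra features enabled.
EXPECTED_NODES = ("nvidiactl", "nvidia0")


def module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is listed in the text of /proc/modules."""
    # Each line of /proc/modules begins with the module name.
    return any(
        line.split()[:1] == [name] for line in proc_modules_text.splitlines()
    )


def missing_device_nodes(dev_dir: str = "/dev", expected=EXPECTED_NODES) -> list[str]:
    """List the expected device nodes that do not exist under dev_dir."""
    return [node for node in expected if not (Path(dev_dir) / node).exists()]
```

On an affected system, `module_loaded("nvidia", Path("/proc/modules").read_text())` would be expected to return True while `missing_device_nodes()` still reports missing entries.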

This can be rectified by any one of the following approaches:

  • Manually run the renamed nvidia-modprobe
  • Run vulkaninfo as root
  • Run nvidia-debugdump --list as root

Theory
I believe that this isn't an issue for X11 users, as the Xorg service runs as root and thus has no trouble when the nvidia shared libraries are instantiated (thus, the driver fully initialises without need for the nvidia-modprobe fallback mechanism).

For GDM and Wayland users, this isn't the case. Since these services do not run with superuser privileges, the nvidia drivers will ultimately be loaded without special privileges and will try to initiate the fallback mechanism by default. That obviously does not work if nvidia-modprobe cannot be found.


So, to restate the problem (with the above taken into account)...

A Linux system running Wayland without nvidia-modprobe will be unable to initialise the nVidia device without user intervention.


Potential paths forward

  1. Accept that when the nvidia device is blacklisted, nvidia-modprobe will trigger a modprobe any time a userspace application tries to query or use the available ICDs, and that this may not be immediate.
  2. Accept that the removal of nvidia-modprobe will prevent proper initialisation of nVidia devices under Wayland.

Or we could consider a check within nvidia-modprobe (or indeed the shared libraries/drivers themselves):

  3. Have nvidia-modprobe proactively check if the nvidia drivers are blacklisted before calling out to /sbin/modprobe, and fail fast if that is the case.
  4. Have the nvidia shared libraries proactively check if the nvidia drivers are blacklisted before attempting the fallback nvidia-modprobe mechanism.

Option 1 is a minor irritation (it drove me to research this issue).
Option 2 could be scripted around via user code or udev rules, but doesn't help the wider community.

Perhaps option 3 or 4 could be considered, if it doesn't introduce too much complexity?
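To illustrate what a proactive blacklist check might involve, here is a rough Python sketch. A real change would live in nvidia-modprobe's C code. The sketch assumes blacklisting is done either through the module_blacklist= or modprobe.blacklist= kernel parameters or through a "blacklist" line in a modprobe.d file. The function names are hypothetical.

```python
def _norm(name: str) -> str:
    # '-' and '_' are treated as equivalent in module names.
    return name.replace("-", "_")


def blacklisted_on_cmdline(module: str, cmdline: str) -> bool:
    """Check the kernel command line (e.g. the text of /proc/cmdline)."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in ("module_blacklist", "modprobe.blacklist"):
            if _norm(module) in (_norm(m) for m in value.split(",")):
                return True
    return False


def blacklisted_in_conf(module: str, conf_text: str) -> bool:
    """Check the text of one modprobe.d file for 'blacklist <module>'."""
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if (
            len(fields) == 2
            and fields[0] == "blacklist"
            and _norm(fields[1]) == _norm(module)
        ):
            return True
    return False
```

Both helpers take text rather than paths, so the caller decides which files to read (for example /proc/cmdline and the files under /etc/modprobe.d). This covers only the common cases and is not a substitute for modprobe's own configuration parsing.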


Background:
My specific setup includes a GTX 960 with drivers 545.29.06.
I've been testing across both Arch Linux and Fedora Linux (same drivers + kernel). It's worth noting that on Arch I'm using regular kernel modules, while Fedora uses akmods. I do not observe any difference in behaviour between the two.

I'm also running with an AMD RX 580.
For development purposes, I frequently switch between nvidia, nouveau and amdgpu drivers using boot-time kernel parameters to blacklist as appropriate.

Related forum posts here and here

@kbrenneman
Collaborator

Is it the /sbin/modprobe subprocess that's taking that time?

I would have expected that modprobe itself would have an early-out path in this case.

@tim-rex
Author

tim-rex commented Jan 4, 2024

You're right... it appears /sbin/modprobe does provide an early out mechanism, but only when run with the -b or --use-blacklist switch.

By default, it appears to mmap the entire module before making the init_module syscall, which tries to map it into kernel space (and only then does the blacklisting seem to apply, as part of the init_module syscall).

Would there be any downside to nvidia-modprobe calling out to modprobe with the --use-blacklist flag?

Edit: Yes, it is /sbin/modprobe that is taking the time here.
Running with -b completes in 0.006 seconds (versus 0.8 seconds without it).
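For anyone wanting to reproduce the comparison, a minimal timing sketch follows. It assumes modprobe lives at /sbin/modprobe and that the script is run as root. The helper names are made up for illustration, and the timings will differ between systems.

```python
import subprocess
import time


def modprobe_argv(module: str, use_blacklist: bool) -> list[str]:
    """Build the modprobe command line, optionally with --use-blacklist."""
    argv = ["/sbin/modprobe"]
    if use_blacklist:
        argv.append("--use-blacklist")  # long form of -b
    argv.append(module)
    return argv


def time_command(argv: list[str]) -> tuple[int, float]:
    """Run argv once and return (exit status, elapsed seconds)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode, time.perf_counter() - start
```

Calling `time_command(modprobe_argv("nvidia", False))` and then `time_command(modprobe_argv("nvidia", True))` on a system with the driver blacklisted should show the gap described above.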

tim-rex added a commit to tim-rex/nvidia-modprobe that referenced this issue Jan 5, 2024
This allows modprobe to early out in the event that the nvidia driver
has been blacklisted

NVIDIA#5
@tim-rex
Author

tim-rex commented Jan 20, 2024

FWIW, the pull request (#6) has been working well in my usage across Arch and Fedora
