compatibility with k3s + nvidia container #153

Open
bglgwyng opened this issue Jan 24, 2025 · 2 comments


@bglgwyng
Contributor

I tried k3s + nvidia container + nix-snapshotter and found it didn't work well.
I tested k3s + nvidia container and k3s + nix-snapshotter individually, and each combination worked well.
However, when I put them all together, there were some problems.

Here is the Nix script I tried.
I wrote it following the k3s configuration guide in the NixOS docs and the nix-snapshotter docs.
I can provide the entire working NixOS configuration if needed, so please ask me.
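For context, here is a minimal sketch of roughly what such a configuration looks like, assembled from the nix-snapshotter README and the NixOS k3s/NVIDIA modules; it is illustrative, not my exact script, and the option set may differ slightly between NixOS releases:

```nix
{
  # nix-snapshotter plus the external containerd it plugs into
  services.nix-snapshotter.enable = true;
  virtualisation.containerd.enable = true;

  # k3s with the GPU node label (matches the flags shown below);
  # note nix-snapshotter's module may expect its own flags option instead
  services.k3s = {
    enable = true;
    extraFlags = "--node-label=nvidia.com/gpu.present=true";
  };

  # NVIDIA container runtime via CDI
  hardware.nvidia-container-toolkit.enable = true;
}
```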

The problem is that when I run a container with this k8s configuration,
it fails to find the nvidia runtime. It worked well before I added the nix-snapshotter configuration.

Warning  FailedCreatePodSandBox  2m27s (x1378 over 5h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "34209d7f367586d856ce61dfc35010997619cfcae8280d4eb319b389f790a64f": no runtime for "nvidia" is configured

Here I share some of my speculations

When nix-snapshotter is enabled, the extra flags passed to k3s are

--container-runtime-endpoint unix:///run/containerd/containerd.sock --image-service-endpoint unix:///run/nix-snapshotter/nix-snapshotter.sock --node-name=hserver6 --tls-san=k8s.internal --node-label=nvidia.com/gpu.present=true

This is when nix-snapshotter is NOT enabled:

--node-name=hserver6 --tls-san=k8s.internal --node-label=nvidia.com/gpu.present=true

Does --container-runtime-endpoint unix:///run/containerd/containerd.sock cause this problem?
I removed that flag, but it still failed.
I also checked that /var/lib/rancher/k3s/agent/etc/containerd/config.toml is properly configured to include the following part, but it still failed.

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/run/current-system/sw/bin/nvidia-container-runtime.cdi"
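One way to sanity-check which config file actually declares the runtime is to grep for the nvidia runtime table in each candidate file. A minimal self-contained sketch (it writes a sample file so it can run anywhere; on a real node, point it at /var/lib/rancher/k3s/agent/etc/containerd/config.toml and at the file the external containerd uses, and compare):

```shell
#!/bin/sh
# Sketch: does a given containerd config declare the "nvidia" runtime?
# The sample file below stands in for the generated config; on a real
# node pass the real path as the first argument instead.
CONFIG="${1:-/tmp/sample-config.toml}"

cat > "$CONFIG" <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
EOF

if grep -q 'runtimes\.nvidia' "$CONFIG"; then
  echo "nvidia runtime declared in $CONFIG"
else
  echo "no nvidia runtime in $CONFIG"
fi
```

If only the k3s-generated file contains the table while the containerd daemon that k3s actually talks to loads a different file, that would explain the "no runtime for nvidia" error.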

Also, nix-snapshotter introduces k3s.moreFlags as a replacement for k3s.extraFlags. Is that relevant? To be honest, I don't see the necessity of that option; it doesn't help resolve conflicts between multiple flag declarations in any way.

Has anyone experienced this issue?

@bglgwyng bglgwyng changed the title compatibility k3s + nvidia container compatibility with k3s + nvidia container Jan 25, 2025
@bglgwyng
Contributor Author

bglgwyng commented Jan 31, 2025

It seems that the k3s patched by nix-snapshotter doesn't use /var/lib/rancher/k3s/agent/etc/containerd/config.toml as its config.

nix-snapshotter sets virtualisation.containerd.args.config to a file that contains the nix-snapshotter settings, and k3s may be using that instead. How is k3s configured to do so?
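If that speculation is right, then the runtime would need to be declared in the external containerd's config rather than in k3s's generated one. A hedged sketch of what that might look like (the virtualisation.containerd.settings option is the NixOS freeform settings tree for that daemon; the nvidia table is copied from the k3s config above, and I have not verified this fixes the error):

```nix
{
  # Assumption: the external containerd started by virtualisation.containerd,
  # not k3s's embedded one, is what resolves the "nvidia" runtime.
  virtualisation.containerd.settings.plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia = {
    runtime_type = "io.containerd.runc.v2";
    options.BinaryName = "/run/current-system/sw/bin/nvidia-container-runtime.cdi";
  };
}
```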

@bglgwyng
Contributor Author

bglgwyng commented Feb 3, 2025

k3s-io/k3s#11695

I posted this question in the k3s discussions and got the answer that the config.toml path is not configurable.
However, I find that the k3s patched by nix-snapshotter ignores config.toml, and I can't see the relevant modification in the patch.
