Add safe NVIDIA driver installer with rollback by businesscurry123 · Pull Request #71 · cxlinux-ai/cx-distro

businesscurry123 · 2026-05-15T16:20:00Z

Summary

Fixes #24.

Adds a packaged cx-nvidia-safe-install workflow for safe NVIDIA driver installs:

Detects NVIDIA hardware via nvidia-smi -L or lspci.
Determines the current driver state and chooses a target from ubuntu-drivers or apt.
Checks running kernel module/header readiness, DKMS availability, and Secure Boot state.
Creates a pre-install rollback snapshot under /var/lib/cx/nvidia-installer/snapshots.
Runs an apt simulation before changing packages.
Validates nvidia-smi, modprobe -n nvidia, DKMS state, kernel taint, and OpenGL when available.
Automatically rolls back NVIDIA/CUDA packages if install or validation fails.
Installs the helper and documentation through the cx-gpu-nvidia package.
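
The `lspci` fallback in the detection step can be sketched as a small filter. This is only an illustration, not the PR's `detect_gpu`: the function name `filter_nvidia_gpus` and the sample input are hypothetical, and the sketch assumes the usual `lspci` line layout of slot, device class, and vendor string.

```bash
# Hypothetical helper: reads `lspci` output on stdin and prints only the
# lines that describe an NVIDIA display device. Exits non-zero when none
# are found, so callers can use it directly in an `if`.
filter_nvidia_gpus() {
    grep -i 'nvidia' | grep -Ei 'vga compatible controller|3d controller|display controller'
}

# Sample input standing in for real `lspci` output (illustrative values).
sample_lspci='00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630
01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GA106 High Definition Audio Controller (rev a1)'

# Prints only the 01:00.0 line: the Intel GPU and the NVIDIA audio
# function are both filtered out.
printf '%s\n' "$sample_lspci" | filter_nvidia_gpus
```

On a real system the input would come from `lspci` itself, for example `lspci | filter_nvidia_gpus`.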

Rollback Model

Snapshots capture GPU state, nvidia-smi, NVIDIA/CUDA package versions, full dpkg selections, apt-mark state, DKMS status, loaded NVIDIA/nouveau modules, and NVIDIA/nouveau modprobe config files.

Rollback purges NVIDIA/CUDA packages that were not present in the snapshot, reinstalls recorded package versions with --allow-downgrades, restores NVIDIA-related manual apt marks, and refreshes initramfs when available.
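
The "purge what was not in the snapshot" step amounts to a set difference between two package lists. The sketch below shows one way to compute it. The helper name `packages_added_since_snapshot`, the one-name-per-line file layout, and the package names are assumptions for illustration, not the format the PR actually records.

```bash
# Hypothetical helper: prints package names that appear in the current list
# ($2) but not in the snapshot list ($1). Each file holds one name per line.
# -x matches whole lines, -F treats names as fixed strings, -v inverts.
packages_added_since_snapshot() {
    grep -vxF -f "$1" "$2" || true   # grep exits 1 when nothing was added
}

# Illustrative data only: a snapshot taken on a 535 driver, then the state
# after a 550 install attempt.
demo_dir=$(mktemp -d)
printf '%s\n' libnvidia-gl-535 nvidia-driver-535 >"$demo_dir/snapshot.list"
printf '%s\n' libnvidia-gl-535 nvidia-driver-550 nvidia-dkms-550 >"$demo_dir/current.list"

# Prints nvidia-driver-550 and nvidia-dkms-550: the purge candidates.
packages_added_since_snapshot "$demo_dir/snapshot.list" "$demo_dir/current.list"
rm -rf "$demo_dir"
```

Packages present in both lists are left for the version-restore step rather than being purged.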

Documentation

Added package docs at:

packages/cx-gpu-nvidia/usr/share/doc/cx-gpu-nvidia/nvidia-safe-install.md

Testing

bash -n packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install
bash -n tests/cx-nvidia-safe-install-tests.sh
bash tests/cx-nvidia-safe-install-tests.sh -> 5 passed
WSL Ubuntu 24.04: bash tests/cx-nvidia-safe-install-tests.sh -> 5 passed
git diff --check
git diff --cached --check

The mocked unit suite covers snapshot creation, status reporting, validation success, expected NVIDIA taint handling under --strict-taint, and automatic rollback when post-install nvidia-smi validation fails.
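
For readers unfamiliar with this style of test, mocking a system tool usually means putting a stub executable first on `PATH`. The following is a minimal sketch of that idea, not the PR's test harness: `make_mock` and its arguments are hypothetical, and the stub output must not contain single quotes.

```bash
# Hypothetical helper: writes an executable stub named $2 into directory $1.
# The stub prints $3 and exits with status $4, standing in for the real tool.
make_mock() {
    mock_dir=$1
    mock_name=$2
    mock_output=$3
    mock_status=$4
    mkdir -p "$mock_dir"
    cat >"$mock_dir/$mock_name" <<EOF
#!/bin/sh
printf '%s\n' '$mock_output'
exit $mock_status
EOF
    chmod +x "$mock_dir/$mock_name"
}

mock_bin=$(mktemp -d)
make_mock "$mock_bin" nvidia-smi 'GPU 0: Mock GPU (UUID: GPU-0000)' 0

# Run with the stub directory first on PATH; the subshell keeps the
# modified PATH from leaking into the rest of the script.
(PATH="$mock_bin:$PATH"; export PATH; nvidia-smi -L)
rm -rf "$mock_bin"
```

Re-creating the stub with a non-zero exit status is one way to simulate the post-install `nvidia-smi` failure that should trigger rollback.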

I could not run dpkg-buildpackage locally because the Debian packaging tools are not installed in this Windows/WSL checkout.

coderabbitai · 2026-05-15T16:20:28Z

📝 Walkthrough

This pull request introduces cx-nvidia-safe-install, a safe NVIDIA driver management tool with pre-install compatibility checks, atomic snapshot-based rollback, post-install validation, and automatic recovery. Changes include Debian packaging, CLI implementation, comprehensive test coverage, and documentation.

Changes

Safe NVIDIA Driver Installer Feature

Layer / File(s)	Summary
Package Declaration and Build Configuration `packages/cx-gpu-nvidia/debian/control`, `packages/cx-gpu-nvidia/debian/rules`	`cx-gpu-nvidia` package declares new runtime dependencies (`apt`, `bash`, `dpkg`, `kmod`, `pciutils`), adds safe-install and validation feature bullets, updates Secure Boot instructions to run `cx-nvidia-safe-install install`. Build configuration installs the CLI tool into `usr/sbin` and documentation into `usr/share/doc`.
User-Facing Documentation `README.md`, `packages/cx-gpu-nvidia/usr/share/doc/cx-gpu-nvidia/nvidia-safe-install.md`	Feature list and topics checklist in README document safe installation workflow with examples. Complete documentation describes available commands (status, install, rollback, validate), pre-install checks (GPU/kernel/DKMS/Secure Boot), rollback snapshot contents, post-install validation suite (nvidia-smi, modules, DKMS, taint, OpenGL), and testing guidance with `--dry-run --force` and environment variables.
CLI Initialization and Utility Functions `packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install` (lines 1–144)	Script declares identity, state/log paths, and comprehensive usage text. Implements logging with levels, fatal error exit, command existence checks, shell-safe quoting, dry-run execution mode, root enforcement, state directory management, and argument parser for all flags (`--driver`, `--snapshot`, `--dry-run`, `--yes`, `--no-update`, `--force`, `--strict-taint`, `--skip-opengl`).
Hardware and Environment Detection `packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install` (lines 146–297)	GPU detection via `nvidia-smi` and `lspci` parsing, current driver version discovery, target driver recommendation from `ubuntu-drivers`, NVIDIA presence gating with `--force` for non-GPU systems, and kernel/module/DKMS/header/Secure Boot compatibility validation that blocks unsafe conditions. APT candidate verification ensures target package exists before attempting install.
Snapshot Creation and Lifecycle `packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install` (lines 299–419)	Snapshot helpers capture diagnostic outputs (gpu info, driver versions, DKMS status, kernel modules, modprobe config). Timestamped snapshots record `meta.env` containing environment variables and diagnostic files. Snapshot path resolution handles "latest" symlink and explicit snapshot IDs. Package recording captures installed NVIDIA/CUDA versions for restoration.
Rollback and Validation Implementation `packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install` (lines 421–560)	Rollback reconstructs system state by purging newly installed NVIDIA/CUDA packages and reinstalling snapshot versions with downgrade allowance, restores apt manual package marks, optionally refreshes initramfs. Validation checks `nvidia-smi` responsiveness, kernel module resolvability/loading signals, DKMS "failed/bad" patterns, kernel taint behavior with `--strict-taint` strictness, and optional OpenGL renderer availability via `glxinfo` when display present.
Install Orchestration and Command Dispatch `packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install` (lines 562–681)	Main `install_driver` function orchestrates full workflow: root/state validation, GPU/kernel/Secure Boot checks, target driver selection, snapshot creation, APT simulation preflight, optional dry-run early exit, APT installation, and automatic rollback on install or validation failure. Records `last-success` metadata. Status reporting shows version, state directory, latest snapshot, current driver, detected hardware, and install history. Main dispatcher routes subcommands.
Test Harness and Test Execution `tests/cx-nvidia-safe-install-tests.sh`, `Makefile`	Test script establishes isolated environment with mocked `nvidia-smi`, `lspci`, `apt-get`, `dpkg`, `dkms`, and other system tools. Runs four test cases validating dry-run snapshot creation with metadata, status output, rollback on validation failure with apt-get purge logging, and successful validation. Assertions check snapshot files and log grep patterns. Makefile `test` target invokes script before other verifications.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

cortexlinux/cortex#440: This PR implements GPU driver installation via NVIDIA safe installer from the referenced pain points research, directly realizing the "Safe NVIDIA Driver Installer with Rollback" feature with snapshot management, Secure Boot checks, DKMS/kernel validation, and automatic rollback on failure.
Safe NVIDIA Driver Installer with Rollback #24 (linked issue #24): This PR fulfills all acceptance criteria for the linked issue: GPU detection, kernel compatibility checks, pre-install snapshots, post-install validation suite, automatic rollback, Secure Boot compatibility, comprehensive tests, and documentation with usage examples.

Poem

🐰 A bunny's song for safer drivers true,
Pre-flight checks and snapshots too,
When NVIDIA breaks, just rollback clean,
The safest install you've ever seen! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Add safe NVIDIA driver installer with rollback' accurately and concisely captures the primary change: introducing a new cx-nvidia-safe-install tool with rollback capability.
Linked Issues check	✅ Passed	The PR fully implements all coding-related acceptance criteria from issue `#24`: GPU detection, kernel compatibility checks, pre-install snapshot creation, post-install validation, automatic rollback, Secure Boot compatibility, and comprehensive test coverage.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with issue `#24` objectives: the new cx-nvidia-safe-install script, packaging updates, tests, and documentation all serve the safe driver installer feature with no extraneous modifications.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


gemini-code-assist

Code Review

This pull request introduces cx-nvidia-safe-install, a guarded utility for NVIDIA driver management that includes pre-install snapshots, hardware/kernel validation, and automatic rollback capabilities. Feedback focuses on enhancing the script's security and reliability, specifically by addressing insecure temporary file creation in /tmp, preventing potential snapshot ID collisions with a random suffix, and ensuring non-interactive package operations by explicitly handling dpkg configuration prompts.

gemini-code-assist · 2026-05-15T16:21:55Z

+    if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then
+        printf 'NVIDIA hardware:\n'
+        sed 's/^/  /' "/tmp/cx-nvidia-gpu-status.$$"
+    else
+        printf 'NVIDIA hardware: not detected\n'
+    fi
+    rm -f "/tmp/cx-nvidia-gpu-status.$$"


Using a predictable filename in /tmp (with $$) is insecure and vulnerable to symlink attacks, especially in a script intended to run as root. It is safer to capture the output directly into a variable.

    local gpu_info
    if gpu_info="$(detect_gpu 2>/dev/null)"; then
        printf 'NVIDIA hardware:\n'
        printf '%s\n' "$gpu_info" | sed 's/^/  /'
    else
        printf 'NVIDIA hardware: not detected\n'
    fi

gemini-code-assist · 2026-05-15T16:21:55Z

+    local current_driver
+
+    ensure_state_dirs
+    snapshot_id="$(date -u +%Y%m%dT%H%M%SZ)"


The snapshot ID generation relies solely on the current timestamp with second precision. This can lead to directory collisions if the script is invoked multiple times within the same second. Adding a random component ensures uniqueness.

snapshot_id="$(date -u +%Y%m%dT%H%M%SZ)-${RANDOM:-0}"
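
A random suffix makes collisions unlikely but does not detect them. A complementary approach is to let `mkdir` without `-p` fail when the directory already exists and then retry with a new name. The sketch below uses a counter instead of `$RANDOM` to stay portable. `create_snapshot_dir` is a hypothetical name and this is not the code that was merged.

```bash
# Hypothetical helper: creates a uniquely named snapshot directory under $1
# (which must already exist) and prints its path. `mkdir` without -p fails
# if the directory exists, so a name collision is detected, not reused.
create_snapshot_dir() {
    base_dir=$1
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    attempt=0
    while [ "$attempt" -lt 100 ]; do
        candidate="$base_dir/$stamp-$attempt"
        if mkdir "$candidate" 2>/dev/null; then
            printf '%s\n' "$candidate"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1   # gave up: base directory missing or too many collisions
}

snap_base=$(mktemp -d)
create_snapshot_dir "$snap_base"   # first directory for this timestamp
create_snapshot_dir "$snap_base"   # a second, distinct directory
rm -rf "$snap_base"
```

Because directory creation is atomic, two invocations racing within the same second end up with different directories.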

gemini-code-assist · 2026-05-15T16:21:55Z

+
+    if [ "${#remove_pkgs[@]}" -gt 0 ]; then
+        info "Rolling back newly installed NVIDIA/CUDA packages: ${remove_pkgs[*]}"
+        run_cmd env DEBIAN_FRONTEND=noninteractive apt-get purge -y "${remove_pkgs[@]}"


When performing non-interactive package operations, it is best practice to include options that prevent dpkg from prompting for configuration file conflicts. This ensures the rollback process doesn't hang indefinitely.

run_cmd env DEBIAN_FRONTEND=noninteractive apt-get purge -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" "${remove_pkgs[@]}"

gemini-code-assist · 2026-05-15T16:21:55Z

+
+    if [ "${#restore_specs[@]}" -gt 0 ]; then
+        info "Restoring NVIDIA/CUDA package versions from snapshot."
+        run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-downgrades "${restore_specs[@]}"


The installation command during rollback should also use non-interactive options to handle potential configuration file prompts safely.

run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-downgrades -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" "${restore_specs[@]}"

gemini-code-assist · 2026-05-15T16:21:55Z

+        return 0
+    fi
+
+    if ! run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y "$target"; then


The driver installation command should use non-interactive options to avoid hanging on configuration file prompts.

if ! run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" "$target"; then
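
Since the same pair of dpkg options is suggested for the purge, restore, and install paths, one way to avoid repeating them is a small wrapper. This is a sketch under assumptions: `apt_get_noninteractive` and the `APT_WRAPPER_DRY_RUN` variable are hypothetical names and are not part of the PR, which routes commands through its own `run_cmd`.

```bash
# Hypothetical wrapper: runs apt-get with the config-preserving dpkg options
# in one place. When APT_WRAPPER_DRY_RUN=1 (an illustrative variable) it
# prints the command line instead of executing it.
apt_get_noninteractive() {
    # Prepend the fixed prefix to the caller's arguments.
    set -- env DEBIAN_FRONTEND=noninteractive apt-get \
        -o Dpkg::Options::=--force-confdef \
        -o Dpkg::Options::=--force-confold \
        "$@"
    if [ "${APT_WRAPPER_DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
        return 0
    fi
    "$@"
}

# Dry run only: shows the command that would be executed.
APT_WRAPPER_DRY_RUN=1
apt_get_noninteractive purge -y nvidia-driver-550
```

With `--force-confdef` and `--force-confold` together, dpkg takes the default action where one exists and otherwise keeps the existing configuration file, so no prompt is shown.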

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install`:
- Around line 632-638: The snippet uses a predictable temp file
"/tmp/cx-nvidia-gpu-status.$$" when running detect_gpu; replace that with a
secure mktemp-created file and ensure it's removed on exit (use a trap); update
the detect_gpu call and subsequent sed/read to use the mktemp path and ensure
the cleanup handles interruptions so stale files aren't left behind and
race/symlink attacks are prevented.
- Around line 531-544: The current check treats any nonzero
/proc/sys/kernel/tainted as fatal under STRICT_TAINT; change it to ignore
expected NVIDIA bits (bit 0, value 1, for proprietary modules and bit 12, value
4096, for out-of-tree modules) or compare against a saved pre-install tainted
snapshot; specifically, when reading tainted into the tainted variable in the
cx-nvidia-safe-install script, compute a masked value (e.g., tainted &
~(1|4096)) or compute tainted_delta = tainted & ~preinstall_tainted and only fail
if dangerous bits (D/M/W/R) are set in the remaining mask, and use that masked
result in the STRICT_TAINT branch to decide whether to error/warn so legitimate
NVIDIA-set bits do not always trigger rollback.
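
The masking described in this finding can be worked through with shell arithmetic. The sketch below is illustrative only: `unexpected_taint` and `new_taint_bits` are hypothetical helpers, and the PR's own `--strict-taint` implementation may differ.

```bash
# Taint bits a working NVIDIA proprietary driver is expected to set:
#   bit 0  (1)    proprietary module loaded
#   bit 12 (4096) out-of-tree module loaded
EXPECTED_NVIDIA_TAINT=$((1 | 4096))

# Hypothetical helper: prints the taint bits in $1 that are NOT in the
# expected NVIDIA set. A result of 0 means nothing unexpected is set.
unexpected_taint() {
    printf '%s\n' "$(( $1 & ~EXPECTED_NVIDIA_TAINT ))"
}

# Hypothetical helper: prints bits set now ($2) that were not set before
# the install ($1). This is a bitwise delta, not an arithmetic subtraction.
new_taint_bits() {
    printf '%s\n' "$(( $2 & ~$1 ))"
}

unexpected_taint 4097   # prints 0: only the expected bits are set
unexpected_taint 4609   # prints 512: bit 9 (kernel warning) is also set
new_taint_bits 1 4097   # prints 4096: only the out-of-tree bit is new
```

In real use the input would be read from `/proc/sys/kernel/tainted`, and a strict mode would fail only when the masked value is non-zero.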

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ad510fc1-e08a-409e-804c-73eb8f309614

📥 Commits

Reviewing files that changed from the base of the PR and between cfd0627 and 416191d.

📒 Files selected for processing (7)

Makefile
README.md
packages/cx-gpu-nvidia/debian/control
packages/cx-gpu-nvidia/debian/rules
packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install
packages/cx-gpu-nvidia/usr/share/doc/cx-gpu-nvidia/nvidia-safe-install.md
tests/cx-nvidia-safe-install-tests.sh

coderabbitai · 2026-05-15T16:26:26Z

+    if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then
+        printf 'NVIDIA hardware:\n'
+        sed 's/^/  /' "/tmp/cx-nvidia-gpu-status.$$"
+    else
+        printf 'NVIDIA hardware: not detected\n'
+    fi
+    rm -f "/tmp/cx-nvidia-gpu-status.$$"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid predictable filenames under /tmp; use mktemp.

/tmp/cx-nvidia-gpu-status.$$ is a guessable path. Since status is commonly run via sudo, an unprivileged local user can pre-create a symlink at that path and have the > redirect on Line 632 truncate an arbitrary file owned by root. Even with fs.protected_symlinks=1 mitigating most attacks, mktemp is the standard, portable hardening — and avoids leaving stale files if the script is interrupted before Line 638.

🔒 Suggested fix

-    if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then
-        printf 'NVIDIA hardware:\n'
-        sed 's/^/  /' "/tmp/cx-nvidia-gpu-status.$$"
-    else
-        printf 'NVIDIA hardware: not detected\n'
-    fi
-    rm -f "/tmp/cx-nvidia-gpu-status.$$"
+    local gpu_status_tmp
+    gpu_status_tmp="$(mktemp -t cx-nvidia-gpu-status.XXXXXX)"
+    trap 'rm -f "$gpu_status_tmp"' RETURN
+    if detect_gpu >"$gpu_status_tmp" 2>/dev/null; then
+        printf 'NVIDIA hardware:\n'
+        sed 's/^/  /' "$gpu_status_tmp"
+    else
+        printf 'NVIDIA hardware: not detected\n'
+    fi

businesscurry123 · 2026-05-15T16:40:12Z

Pushed a follow-up update for the automated review feedback:

removed the predictable /tmp status file path in show_status
added a random suffix to snapshot IDs
added dpkg noninteractive config-preservation options for apt install/purge/rollback paths
adjusted --strict-taint so expected NVIDIA proprietary/out-of-tree taint bits are allowed while unexpected taints still fail
removed the unrelated root README diff so the PR now only touches the package/helper/test scope

The latest branch head is the cleaner implementation used for the local test pass documented in the PR body.

businesscurry123 requested a review from a team as a code owner May 15, 2026 16:20

gemini-code-assist Bot reviewed May 15, 2026

coderabbitai Bot reviewed May 15, 2026

Add safe NVIDIA driver installer

b8cdbba

businesscurry123 force-pushed the bounty/24-safe-nvidia-installer branch from 416191d to b8cdbba Compare May 15, 2026 16:37
