Add safe NVIDIA driver installer with rollback #71

Open
businesscurry123 wants to merge 1 commit into cxlinux-ai:main from businesscurry123:bounty/24-safe-nvidia-installer

Conversation

@businesscurry123 businesscurry123 commented May 15, 2026

Summary

Fixes #24.

Adds a packaged cx-nvidia-safe-install workflow for safe NVIDIA driver installs:

  • Detects NVIDIA hardware via nvidia-smi -L or lspci.
  • Determines the current driver state and chooses a target from ubuntu-drivers or apt.
  • Checks running kernel module/header readiness, DKMS availability, and Secure Boot state.
  • Creates a pre-install rollback snapshot under /var/lib/cx/nvidia-installer/snapshots.
  • Runs an apt simulation before changing packages.
  • Validates nvidia-smi, modprobe -n nvidia, DKMS state, kernel taint, and OpenGL when available.
  • Automatically rolls back NVIDIA/CUDA packages if install or validation fails.
  • Installs the helper and documentation through the cx-gpu-nvidia package.
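
The first step in the list above, hardware detection, can be sketched roughly as follows. This is a minimal illustration and not the PR's actual `detect_gpu` implementation: the function name `detect_gpu_sketch`, the `grep` filters, and the output format are assumptions. It tries `nvidia-smi -L` first and falls back to `lspci`, which works even when no driver is loaded.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of GPU detection; names and filters are assumptions.
detect_gpu_sketch() {
    local out
    # Prefer nvidia-smi, which only answers when a working driver is loaded.
    if command -v nvidia-smi >/dev/null 2>&1 \
        && out="$(nvidia-smi -L 2>/dev/null)" && [ -n "$out" ]; then
        printf '%s\n' "$out"
        return 0
    fi
    # Fall back to the PCI bus listing, which needs no NVIDIA driver.
    if command -v lspci >/dev/null 2>&1; then
        out="$(lspci 2>/dev/null | grep -i 'nvidia' | grep -Ei 'vga|3d|display' || true)"
        if [ -n "$out" ]; then
            printf '%s\n' "$out"
            return 0
        fi
    fi
    # Nothing that looks like an NVIDIA display device was found.
    return 1
}
```

A caller could gate the install on the exit status, for example `detect_gpu_sketch >/dev/null || echo "no NVIDIA hardware found"`.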

Rollback Model

Snapshots capture GPU state, nvidia-smi, NVIDIA/CUDA package versions, full dpkg selections, apt-mark state, DKMS status, loaded NVIDIA/nouveau modules, and NVIDIA/nouveau modprobe config files.

Rollback purges NVIDIA/CUDA packages that were not present in the snapshot, reinstalls recorded package versions with --allow-downgrades, restores NVIDIA-related manual apt marks, and refreshes initramfs when available.
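
One way to picture the rollback plan described above is as a comparison of two package lists. The sketch below is illustrative only: `plan_rollback` and the one-pair-per-line file layout (`package version`) are assumptions, not the PR's on-disk snapshot format. Packages present now but absent from the snapshot become purge candidates, and every snapshot entry becomes a pinned `name=version` reinstall spec.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the file layout is an assumption. Each input file holds
# one "package version" pair per line, for example as printed by:
#   dpkg-query -W -f='${Package} ${Version}\n'
plan_rollback() {
    local snapshot_file="$1" current_file="$2"
    # Packages installed now but absent from the snapshot are purge candidates.
    # FILENAME==ARGV[1] (rather than NR==FNR) stays correct for an empty snapshot.
    awk 'FILENAME==ARGV[1] { seen[$1]=1; next }
         !($1 in seen)     { print "purge " $1 }' \
        "$snapshot_file" "$current_file"
    # Every snapshot entry becomes a pinned "name=version" reinstall spec.
    awk 'NF>=2 { print "restore " $1 "=" $2 }' "$snapshot_file"
}
```

The `restore` specs are in the form that `apt-get install --allow-downgrades` accepts for pinning a version; the sketch only prints the plan and changes nothing on the system.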

Documentation

Added package docs at:

  • packages/cx-gpu-nvidia/usr/share/doc/cx-gpu-nvidia/nvidia-safe-install.md

Testing

  • bash -n packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install
  • bash -n tests/cx-nvidia-safe-install-tests.sh
  • bash tests/cx-nvidia-safe-install-tests.sh -> 5 passed
  • WSL Ubuntu 24.04: bash tests/cx-nvidia-safe-install-tests.sh -> 5 passed
  • git diff --check
  • git diff --cached --check

The mocked unit suite covers snapshot creation, status reporting, validation success, expected NVIDIA taint handling under --strict-taint, and automatic rollback when post-install nvidia-smi validation fails.
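
For readers unfamiliar with this style of test, a mocked suite like the one described above typically works by placing stub executables ahead of the real tools on `PATH`. The following is a minimal sketch under that assumption; `make_stub` and the `calls.log` format are hypothetical and not taken from `tests/cx-nvidia-safe-install-tests.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical test helper; names and log format are assumptions.
# Usage: make_stub DIR NAME EXIT_CODE [OUTPUT]
make_stub() {
    local dir="$1" name="$2" code="$3" output="${4:-}"
    mkdir -p "$dir"
    {
        printf '#!/bin/sh\n'
        # Record every invocation so tests can assert on the arguments used.
        printf 'echo "%s $*" >> "%s/calls.log"\n' "$name" "$dir"
        # Optionally print canned output (plain text without quotes only).
        [ -n "$output" ] && printf 'echo "%s"\n' "$output"
        printf 'exit %s\n' "$code"
    } > "$dir/$name"
    chmod +x "$dir/$name"
}
```

A test can then simulate a failing post-install check with `make_stub "$stub_dir" nvidia-smi 1`, run the script under `PATH="$stub_dir:$PATH"`, and grep `calls.log` for the expected `apt-get purge` call.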

I could not run dpkg-buildpackage locally because the Debian packaging tools are not installed in this Windows/WSL checkout.

@businesscurry123 businesscurry123 requested a review from a team as a code owner May 15, 2026 16:20
coderabbitai Bot commented May 15, 2026

Walkthrough

This pull request introduces cx-nvidia-safe-install, a safe NVIDIA driver management tool with pre-install compatibility checks, atomic snapshot-based rollback, post-install validation, and automatic recovery. Changes include Debian packaging, CLI implementation, comprehensive test coverage, and documentation.

Changes

Safe NVIDIA Driver Installer Feature

Layer / File(s) Summary
Package Declaration and Build Configuration
packages/cx-gpu-nvidia/debian/control, packages/cx-gpu-nvidia/debian/rules
cx-gpu-nvidia package declares new runtime dependencies (apt, bash, dpkg, kmod, pciutils), adds safe-install and validation feature bullets, updates Secure Boot instructions to run cx-nvidia-safe-install install. Build configuration installs the CLI tool into usr/sbin and documentation into usr/share/doc.
User-Facing Documentation
README.md, packages/cx-gpu-nvidia/usr/share/doc/cx-gpu-nvidia/nvidia-safe-install.md
Feature list and topics checklist in README document safe installation workflow with examples. Complete documentation describes available commands (status, install, rollback, validate), pre-install checks (GPU/kernel/DKMS/Secure Boot), rollback snapshot contents, post-install validation suite (nvidia-smi, modules, DKMS, taint, OpenGL), and testing guidance with --dry-run --force and environment variables.
CLI Initialization and Utility Functions
packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install (lines 1–144)
Script declares identity, state/log paths, and comprehensive usage text. Implements logging with levels, fatal error exit, command existence checks, shell-safe quoting, dry-run execution mode, root enforcement, state directory management, and argument parser for all flags (--driver, --snapshot, --dry-run, --yes, --no-update, --force, --strict-taint, --skip-opengl).
Hardware and Environment Detection
packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install (lines 146–297)
GPU detection via nvidia-smi and lspci parsing, current driver version discovery, target driver recommendation from ubuntu-drivers, NVIDIA presence gating with --force for non-GPU systems, and kernel/module/DKMS/header/Secure Boot compatibility validation that blocks unsafe conditions. APT candidate verification ensures target package exists before attempting install.
Snapshot Creation and Lifecycle
packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install (lines 299–419)
Snapshot helpers capture diagnostic outputs (gpu info, driver versions, DKMS status, kernel modules, modprobe config). Timestamped snapshots record meta.env containing environment variables and diagnostic files. Snapshot path resolution handles "latest" symlink and explicit snapshot IDs. Package recording captures installed NVIDIA/CUDA versions for restoration.
Rollback and Validation Implementation
packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install (lines 421–560)
Rollback reconstructs system state by purging newly installed NVIDIA/CUDA packages and reinstalling snapshot versions with downgrade allowance, restores apt manual package marks, optionally refreshes initramfs. Validation checks nvidia-smi responsiveness, kernel module resolvability/loading signals, DKMS "failed/bad" patterns, kernel taint behavior with --strict-taint strictness, and optional OpenGL renderer availability via glxinfo when display present.
Install Orchestration and Command Dispatch
packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install (lines 562–681)
Main install_driver function orchestrates full workflow: root/state validation, GPU/kernel/Secure Boot checks, target driver selection, snapshot creation, APT simulation preflight, optional dry-run early exit, APT installation, and automatic rollback on install or validation failure. Records last-success metadata. Status reporting shows version, state directory, latest snapshot, current driver, detected hardware, and install history. Main dispatcher routes subcommands.
Test Harness and Test Execution
tests/cx-nvidia-safe-install-tests.sh, Makefile
Test script establishes isolated environment with mocked nvidia-smi, lspci, apt-get, dpkg, dkms, and other system tools. Runs four test cases validating dry-run snapshot creation with metadata, status output, rollback on validation failure with apt-get purge logging, and successful validation. Assertions check snapshot files and log grep patterns. Makefile test target invokes script before other verifications.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • cortexlinux/cortex#440: This PR implements GPU driver installation via NVIDIA safe installer from the referenced pain points research, directly realizing the "Safe NVIDIA Driver Installer with Rollback" feature with snapshot management, Secure Boot checks, DKMS/kernel validation, and automatic rollback on failure.
  • Safe NVIDIA Driver Installer with Rollback #24 (linked issue #24): This PR fulfills all acceptance criteria for the linked issue: GPU detection, kernel compatibility checks, pre-install snapshots, post-install validation suite, automatic rollback, Secure Boot compatibility, comprehensive tests, and documentation with usage examples.

Poem

🐰 A bunny's song for safer drivers true,
Pre-flight checks and snapshots too,
When NVIDIA breaks, just rollback clean,
The safest install you've ever seen!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add safe NVIDIA driver installer with rollback' accurately and concisely captures the primary change: introducing a new cx-nvidia-safe-install tool with rollback capability.
Linked Issues check ✅ Passed The PR fully implements all coding-related acceptance criteria from issue #24: GPU detection, kernel compatibility checks, pre-install snapshot creation, post-install validation, automatic rollback, Secure Boot compatibility, and comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #24 objectives: the new cx-nvidia-safe-install script, packaging updates, tests, and documentation all serve the safe driver installer feature with no extraneous modifications.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces cx-nvidia-safe-install, a guarded utility for NVIDIA driver management that includes pre-install snapshots, hardware/kernel validation, and automatic rollback capabilities. Feedback focuses on enhancing the script's security and reliability, specifically by addressing insecure temporary file creation in /tmp, preventing potential snapshot ID collisions with a random suffix, and ensuring non-interactive package operations by explicitly handling dpkg configuration prompts.

Comment on lines +632 to +638
    if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then
        printf 'NVIDIA hardware:\n'
        sed 's/^/  /' "/tmp/cx-nvidia-gpu-status.$$"
    else
        printf 'NVIDIA hardware: not detected\n'
    fi
    rm -f "/tmp/cx-nvidia-gpu-status.$$"
security-high high

Using a predictable filename in /tmp (with $$) is insecure and vulnerable to symlink attacks, especially in a script intended to run as root. It is safer to capture the output directly into a variable.

    local gpu_info
    if gpu_info="$(detect_gpu 2>/dev/null)"; then
        printf 'NVIDIA hardware:\n'
        printf '%s\n' "$gpu_info" | sed 's/^/  /'
    else
        printf 'NVIDIA hardware: not detected\n'
    fi

local current_driver

ensure_state_dirs
snapshot_id="$(date -u +%Y%m%dT%H%M%SZ)"
medium

The snapshot ID generation relies solely on the current timestamp with second precision. This can lead to directory collisions if the script is invoked multiple times within the same second. Adding a random component ensures uniqueness.

    snapshot_id="$(date -u +%Y%m%dT%H%M%SZ)-${RANDOM:-0}"
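
A random suffix makes collisions unlikely but does not rule them out. As an alternative technique (not the one suggested above and not necessarily what the PR adopted), the snapshot directory itself can be created with `mktemp -d`, which picks the suffix and creates the directory in one atomic step. The helper name `create_snapshot_dir` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; swaps the $RANDOM suffix for mktemp -d.
create_snapshot_dir() {
    local root="$1" stamp dir
    stamp="$(date -u +%Y%m%dT%H%M%SZ)"
    # mktemp -d replaces the X's with random characters and creates the
    # directory atomically, failing rather than reusing an existing path.
    dir="$(mktemp -d "$root/$stamp-XXXXXX")" || return 1
    # Print only the snapshot ID (the directory's base name).
    basename "$dir"
}
```

Usage would look like `snapshot_id="$(create_snapshot_dir "$snapshot_root")"`, where `snapshot_root` stands in for the script's snapshot directory. Note that `mktemp -d` creates the directory with mode 0700, which may or may not match the permissions the installer wants.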


if [ "${#remove_pkgs[@]}" -gt 0 ]; then
info "Rolling back newly installed NVIDIA/CUDA packages: ${remove_pkgs[*]}"
run_cmd env DEBIAN_FRONTEND=noninteractive apt-get purge -y "${remove_pkgs[@]}"
medium

When performing non-interactive package operations, it is best practice to include options that prevent dpkg from prompting for configuration file conflicts. This ensures the rollback process doesn't hang indefinitely.

        run_cmd env DEBIAN_FRONTEND=noninteractive apt-get purge -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" "${remove_pkgs[@]}"


if [ "${#restore_specs[@]}" -gt 0 ]; then
info "Restoring NVIDIA/CUDA package versions from snapshot."
run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-downgrades "${restore_specs[@]}"
medium

The installation command during rollback should also use non-interactive options to handle potential configuration file prompts safely.

        run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-downgrades -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" "${restore_specs[@]}"

return 0
fi

if ! run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y "$target"; then
medium

The driver installation command should use non-interactive options to avoid hanging on configuration file prompts.

    if ! run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" "$target"; then

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install`:
- Around line 632-638: The snippet uses a predictable temp file
"/tmp/cx-nvidia-gpu-status.$$" when running detect_gpu; replace that with a
secure mktemp-created file and ensure it's removed on exit (use a trap); update
the detect_gpu call and subsequent sed/read to use the mktemp path and ensure
the cleanup handles interruptions so stale files aren't left behind and
race/symlink attacks are prevented.
- Around line 531-544: The current check treats any nonzero
/proc/sys/kernel/tainted as fatal under STRICT_TAINT; change it to ignore
expected NVIDIA bits (bit 0, value 1, for proprietary modules and bit 12, value
4096, for out-of-tree modules) or compare against a saved pre-install tainted
snapshot. Specifically, when reading tainted into the tainted variable in the
cx-nvidia-safe-install script, compute a masked value (e.g., tainted &
~(1|4096)) or compute the newly set bits as tainted & ~preinstall_tainted (a
bitwise mask, not an arithmetic difference), fail only if dangerous bits
(D/M/W/R) are set in the remaining mask, and use that masked result in the
STRICT_TAINT branch to decide whether to error or warn, so that legitimate
NVIDIA-set bits do not always trigger rollback.
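
The masking described in the second item can be sketched as below. This is an illustration, not the PR's code: `unexpected_taint` and its optional baseline argument are hypothetical, and the bit values follow the kernel's tainted-kernels documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not the PR's implementation.
# Taint bits that a working NVIDIA install is expected to set:
#   bit 0  (value 1)    proprietary module loaded
#   bit 12 (value 4096) out-of-tree module loaded
# Usage: unexpected_taint CURRENT [BASELINE]
unexpected_taint() {
    local tainted="$1" baseline="${2:-0}"
    local expected=$(( 1 | 4096 ))
    # Clear the expected NVIDIA bits and any bit already set before install;
    # whatever remains is taint that the install cannot account for.
    printf '%s\n' "$(( tainted & ~expected & ~baseline ))"
}
```

For example, `unexpected_taint "$(cat /proc/sys/kernel/tainted)" "$preinstall_tainted"` prints 0 when only the expected or pre-existing bits are set, and a strict mode could fail on any nonzero result. Here `preinstall_tainted` is an assumed variable holding the value recorded in the snapshot.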
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ad510fc1-e08a-409e-804c-73eb8f309614

📥 Commits

Reviewing files that changed from the base of the PR and between cfd0627 and 416191d.

📒 Files selected for processing (7)
  • Makefile
  • README.md
  • packages/cx-gpu-nvidia/debian/control
  • packages/cx-gpu-nvidia/debian/rules
  • packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install
  • packages/cx-gpu-nvidia/usr/share/doc/cx-gpu-nvidia/nvidia-safe-install.md
  • tests/cx-nvidia-safe-install-tests.sh

Comment thread packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install Outdated
Comment on lines +632 to +638
    if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then
        printf 'NVIDIA hardware:\n'
        sed 's/^/  /' "/tmp/cx-nvidia-gpu-status.$$"
    else
        printf 'NVIDIA hardware: not detected\n'
    fi
    rm -f "/tmp/cx-nvidia-gpu-status.$$"
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid predictable filenames under /tmp; use mktemp.

/tmp/cx-nvidia-gpu-status.$$ is a guessable path. Since status is commonly run via sudo, an unprivileged local user can pre-create a symlink at that path and have the > redirect on Line 632 truncate an arbitrary file owned by root. Even with fs.protected_symlinks=1 mitigating most attacks, mktemp is the standard, portable hardening — and avoids leaving stale files if the script is interrupted before Line 638.

🔒 Suggested fix
-    if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then
-        printf 'NVIDIA hardware:\n'
-        sed 's/^/  /' "/tmp/cx-nvidia-gpu-status.$$"
-    else
-        printf 'NVIDIA hardware: not detected\n'
-    fi
-    rm -f "/tmp/cx-nvidia-gpu-status.$$"
+    local gpu_status_tmp
+    gpu_status_tmp="$(mktemp -t cx-nvidia-gpu-status.XXXXXX)"
+    trap 'rm -f "$gpu_status_tmp"' RETURN
+    if detect_gpu >"$gpu_status_tmp" 2>/dev/null; then
+        printf 'NVIDIA hardware:\n'
+        sed 's/^/  /' "$gpu_status_tmp"
+    else
+        printf 'NVIDIA hardware: not detected\n'
+    fi
@businesscurry123 businesscurry123 force-pushed the bounty/24-safe-nvidia-installer branch from 416191d to b8cdbba Compare May 15, 2026 16:37
@businesscurry123
Author

Pushed a follow-up update for the automated review feedback:

  • removed the predictable /tmp status file path in show_status
  • added a random suffix to snapshot IDs
  • added dpkg noninteractive config-preservation options for apt install/purge/rollback paths
  • adjusted --strict-taint so expected NVIDIA proprietary/out-of-tree taint bits are allowed while unexpected taints still fail
  • removed the unrelated root README diff so the PR now only touches the package/helper/test scope

The latest branch head is the cleaner implementation used for the local test pass documented in the PR body.
