Add safe NVIDIA driver installer with rollback#71
Conversation
📝 WalkthroughWalkthroughThis pull request introduces ChangesSafe NVIDIA Driver Installer Feature
Sequence Diagram(s)(Diagram included in hidden review stack artifact above) Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces cx-nvidia-safe-install, a guarded utility for NVIDIA driver management that includes pre-install snapshots, hardware/kernel validation, and automatic rollback capabilities. Feedback focuses on enhancing the script's security and reliability, specifically by addressing insecure temporary file creation in /tmp, preventing potential snapshot ID collisions with a random suffix, and ensuring non-interactive package operations by explicitly handling dpkg configuration prompts.
| if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then | ||
| printf 'NVIDIA hardware:\n' | ||
| sed 's/^/ /' "/tmp/cx-nvidia-gpu-status.$$" | ||
| else | ||
| printf 'NVIDIA hardware: not detected\n' | ||
| fi | ||
| rm -f "/tmp/cx-nvidia-gpu-status.$$" |
There was a problem hiding this comment.
Using a predictable filename in /tmp (with $$) is insecure and vulnerable to symlink attacks, especially in a script intended to run as root. It is safer to capture the output directly into a variable.
local gpu_info
if gpu_info="$(detect_gpu 2>/dev/null)"; then
printf 'NVIDIA hardware:\n'
printf '%s\n' "$gpu_info" | sed 's/^/ /'
else
printf 'NVIDIA hardware: not detected\n'
fi
| local current_driver | ||
|
|
||
| ensure_state_dirs | ||
| snapshot_id="$(date -u +%Y%m%dT%H%M%SZ)" |
There was a problem hiding this comment.
|
|
||
| if [ "${#remove_pkgs[@]}" -gt 0 ]; then | ||
| info "Rolling back newly installed NVIDIA/CUDA packages: ${remove_pkgs[*]}" | ||
| run_cmd env DEBIAN_FRONTEND=noninteractive apt-get purge -y "${remove_pkgs[@]}" |
There was a problem hiding this comment.
When performing non-interactive package operations, it is best practice to include options that prevent dpkg from prompting for configuration file conflicts. This ensures the rollback process doesn't hang indefinitely.
run_cmd env DEBIAN_FRONTEND=noninteractive apt-get purge -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" "${remove_pkgs[@]}"
|
|
||
| if [ "${#restore_specs[@]}" -gt 0 ]; then | ||
| info "Restoring NVIDIA/CUDA package versions from snapshot." | ||
| run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-downgrades "${restore_specs[@]}" |
There was a problem hiding this comment.
The installation command during rollback should also use non-interactive options to handle potential configuration file prompts safely.
run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-downgrades -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" "${restore_specs[@]}"
| return 0 | ||
| fi | ||
|
|
||
| if ! run_cmd env DEBIAN_FRONTEND=noninteractive apt-get install -y "$target"; then |
There was a problem hiding this comment.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install`:
- Around line 632-638: The snippet uses a predictable temp file
"/tmp/cx-nvidia-gpu-status.$$" when running detect_gpu; replace that with a
secure mktemp-created file and ensure it's removed on exit (use a trap); update
the detect_gpu call and subsequent sed/read to use the mktemp path and ensure
the cleanup handles interruptions so stale files aren't left behind and
race/symlink attacks are prevented.
- Around line 531-544: The current check treats any nonzero
/proc/sys/kernel/tainted as fatal under STRICT_TAINT; change it to ignore
expected NVIDIA bits (bit 0 value 1 for proprietary modules and bit 12 value
4096 for out-of-tree modules) or compare against a saved pre-install tainted
snapshot; specifically, when reading tainted into the tainted variable in the
cx-nvidia-safe-install script, compute a masked value (e.g., tainted &
~(1|4096)) or compute tainted_delta = tainted - preinstall_tainted and only fail
if dangerous bits (D/M/W/R) are set in the remaining mask, and use that masked
result in the STRICT_TAINT branch to decide whether to error/warn so legitimate
NVIDIA-set bits do not always trigger rollback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ad510fc1-e08a-409e-804c-73eb8f309614
📒 Files selected for processing (7)
MakefileREADME.mdpackages/cx-gpu-nvidia/debian/controlpackages/cx-gpu-nvidia/debian/rulespackages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-installpackages/cx-gpu-nvidia/usr/share/doc/cx-gpu-nvidia/nvidia-safe-install.mdtests/cx-nvidia-safe-install-tests.sh
| if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then | ||
| printf 'NVIDIA hardware:\n' | ||
| sed 's/^/ /' "/tmp/cx-nvidia-gpu-status.$$" | ||
| else | ||
| printf 'NVIDIA hardware: not detected\n' | ||
| fi | ||
| rm -f "/tmp/cx-nvidia-gpu-status.$$" |
There was a problem hiding this comment.
Avoid predictable filenames under /tmp; use mktemp.
/tmp/cx-nvidia-gpu-status.$$ is a guessable path. Since status is commonly run via sudo, an unprivileged local user can pre-create a symlink at that path and have the > redirect on Line 632 truncate an arbitrary file owned by root. Even with fs.protected_symlinks=1 mitigating most attacks, mktemp is the standard, portable hardening — and avoids leaving stale files if the script is interrupted before Line 638.
🔒 Suggested fix
- if detect_gpu >/tmp/cx-nvidia-gpu-status.$$ 2>/dev/null; then
- printf 'NVIDIA hardware:\n'
- sed 's/^/ /' "/tmp/cx-nvidia-gpu-status.$$"
- else
- printf 'NVIDIA hardware: not detected\n'
- fi
- rm -f "/tmp/cx-nvidia-gpu-status.$$"
+ local gpu_status_tmp
+ gpu_status_tmp="$(mktemp -t cx-nvidia-gpu-status.XXXXXX)"
+ trap 'rm -f "$gpu_status_tmp"' RETURN
+ if detect_gpu >"$gpu_status_tmp" 2>/dev/null; then
+ printf 'NVIDIA hardware:\n'
+ sed 's/^/ /' "$gpu_status_tmp"
+ else
+ printf 'NVIDIA hardware: not detected\n'
+ fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-install` around lines 632 -
638, The snippet uses a predictable temp file "/tmp/cx-nvidia-gpu-status.$$"
when running detect_gpu; replace that with a secure mktemp-created file and
ensure it's removed on exit (use a trap); update the detect_gpu call and
subsequent sed/read to use the mktemp path and ensure the cleanup handles
interruptions so stale files aren't left behind and race/symlink attacks are
prevented.
416191d to
b8cdbba
Compare
|
Pushed a follow-up update for the automated review feedback:
The latest branch head is the cleaner implementation used for the local test pass documented in the PR body. |
Summary
Fixes #24.
Adds a packaged
cx-nvidia-safe-installworkflow for safe NVIDIA driver installs:nvidia-smi -Lorlspci.ubuntu-driversor apt./var/lib/cx/nvidia-installer/snapshots.nvidia-smi,modprobe -n nvidia, DKMS state, kernel taint, and OpenGL when available.cx-gpu-nvidiapackage.Rollback Model
Snapshots capture GPU state,
nvidia-smi, NVIDIA/CUDA package versions, full dpkg selections, apt-mark state, DKMS status, loaded NVIDIA/nouveau modules, and NVIDIA/nouveau modprobe config files.Rollback purges NVIDIA/CUDA packages that were not present in the snapshot, reinstalls recorded package versions with
--allow-downgrades, restores NVIDIA-related manual apt marks, and refreshes initramfs when available.Documentation
Added package docs at:
packages/cx-gpu-nvidia/usr/share/doc/cx-gpu-nvidia/nvidia-safe-install.mdTesting
bash -n packages/cx-gpu-nvidia/usr/sbin/cx-nvidia-safe-installbash -n tests/cx-nvidia-safe-install-tests.shbash tests/cx-nvidia-safe-install-tests.sh-> 5 passedbash tests/cx-nvidia-safe-install-tests.sh-> 5 passedgit diff --checkgit diff --cached --checkThe mocked unit suite covers snapshot creation, status reporting, validation success, expected NVIDIA taint handling under
--strict-taint, and automatic rollback when post-installnvidia-smivalidation fails.I could not run
dpkg-buildpackagelocally because the Debian packaging tools are not installed in this Windows/WSL checkout.