GH200 systems require open GPU kernel module driver #6

Merged 1 commit on Jan 31, 2024
4 changes: 2 additions & 2 deletions gpu-operator/gpu-operator-rdma.rst
@@ -35,7 +35,7 @@ new kernel module ``nvidia-peermem`` is included in the standard NVIDIA driver i
kernel module provides Mellanox Infiniband-based HCAs direct peer-to-peer read and write access to the GPU's memory.

Starting with v23.9.1 of the Operator, the Operator uses GDS driver version 2.17.5 or newer.
-This version and higher is only supported with the NVIDIA open kernel driver.
+This version and higher is only supported with the NVIDIA Open GPU Kernel module driver.
The sample commands for installing the Operator include the ``--set useOpenKernelModules=true``
command-line argument for Helm.

@@ -386,7 +386,7 @@ The following section is applicable to the following configurations and describe
* Kubernetes on bare metal and on vSphere VMs with GPU passthrough and vGPU.

Starting with v22.9.1, the GPU Operator provides an option to load the ``nvidia-fs`` kernel module during the bootstrap of the NVIDIA driver daemonset.
-Starting with v23.9.1, the GPU Operator deploys a version of GDS that requires using the NVIDIA open kernel driver.
+Starting with v23.9.1, the GPU Operator deploys a version of GDS that requires using the NVIDIA Open GPU Kernel module driver.

The following sample command applies to clusters that use the Network Operator to install the MLNX_OFED drivers.

2 changes: 1 addition & 1 deletion gpu-operator/life-cycle-policy.rst
@@ -159,7 +159,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
.. _gds-open-kernel:

:sup:`1`
-This release of the GDS driver requires that you use the NVIDIA open kernel driver for the GPUs.
+This release of the GDS driver requires that you use the NVIDIA Open GPU Kernel module driver for the GPUs.
Refer to :doc:`gpu-operator-rdma` for more information.

.. note::
16 changes: 14 additions & 2 deletions gpu-operator/platform-support.rst
@@ -41,19 +41,31 @@ Supported NVIDIA Data Center GPUs and Systems

The following NVIDIA data center GPUs are supported on x86 based platforms:

+.. _open-kern-module: #requires-open-kernel-module
+.. |open-kern-module| replace:: :sup:`1`

.. tab-set::

.. tab-item:: GH-series Products


.. list-table::
:header-rows: 1

* - Product
- Architecture

-* - NVIDIA GH200
+* - NVIDIA GH200 |open-kern-module|_
- NVIDIA Grace Hopper

+.. _requires-open-kernel-module:
+
+:sup:`1`
+NVIDIA GH200 systems require the NVIDIA Open GPU Kernel module driver.
+You can install the open kernel modules by specifying the ``driver.useOpenKernelModules=true``
+argument to the ``helm`` command.
+Refer to :ref:`chart customization options` for more information.

.. tab-item:: A, H and L-series Products
:selected:

@@ -466,7 +478,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect Storage.
.. note::

Version v2.17.5 and higher of the NVIDIA GPUDirect Storage kernel driver, ``nvidia-fs``,
-requires the NVIDIA open kernel modules.
+requires the NVIDIA Open GPU Kernel module driver.
You can install the open kernel modules by specifying the ``driver.useOpenKernelModules=true``
argument to the ``helm`` command.
Refer to :ref:`chart customization options` for more information.
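
For readers of this diff, the ``helm`` argument that the notes above mention might be used roughly as follows. This is a minimal sketch and not part of the change itself: the ``nvidia`` repository alias, the ``gpu-operator`` namespace, and the ``--generate-name`` release naming are assumptions chosen for illustration, and only the ``driver.useOpenKernelModules=true`` setting comes from the text in this pull request.

```shell
# Assumption: the NVIDIA Helm repository is registered under the alias "nvidia".
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator and ask the driver container to use the
# NVIDIA Open GPU Kernel module driver (required for GH200 systems and
# for GDS driver v2.17.5 and higher, per the notes in this diff).
# Assumption: the "gpu-operator" namespace is used for the installation.
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.useOpenKernelModules=true
```

Other ``--set`` options, such as those for GDS or RDMA, would be appended to the same command as described in the chart customization options.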
5 changes: 4 additions & 1 deletion gpu-operator/release-notes.rst
@@ -50,15 +50,18 @@ New Features

- Run Ubuntu 22.04 and an NVIDIA Linux kernel, such as one provided with a ``linux-nvidia-<x.x>`` package.
- Add ``init_on_alloc=0`` and ``memhp_default_state=online_movable`` as Linux kernel boot parameters.
+- Run the NVIDIA Open GPU Kernel module driver.

-* Added support for configuring the driver container to use the NVIDIA open kernel modules.
+* Added support for configuring the driver container to use the NVIDIA Open GPU Kernel module driver.
Support is limited to installation using the runfile installer.
Support for precompiled driver containers with open kernel modules is not available.

For clusters that use GPUDirect Storage (GDS), beginning with CUDA toolkit 12.2.2 and
the NVIDIA GPUDirect Storage kernel driver version v2.17.5, are only supported
with the open kernel modules.

+NVIDIA GH200 Grace Hopper Superchip systems are only supported with the open kernel modules.

- Refer to :ref:`gpu-operator-helm-chart-options` for information about setting
``useOpenKernelModules`` if you manage the driver containers with the NVIDIA cluster policy custom resource definition.
- Refer to :doc:`gpu-driver-configuration` for information about setting ``spec.useOpenKernelModules``