Commit 8622ce6

addressing feedback
1 parent 742c4c2 commit 8622ce6

4 files changed: 33 additions and 24 deletions

latest/ug/ml/ml-eks-k8s-device-plugin.adoc

Lines changed: 14 additions & 9 deletions
@@ -92,7 +92,7 @@ ip-192-168-11-225.us-west-2.compute.internal 1
 ip-192-168-24-96.us-west-2.compute.internal 1
 ----
 +
-. Create a file named `nvidia-smi.yaml` with the following contents. Replace [.replaceable]`12.9.1-base-amzn2023` with your desired tag for https://hub.docker.com/r/nvidia/cuda/tags[nvidia/cuda]. This manifest launches an https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] container that runs `nvidia-smi` on a node.
+. Create a file named `nvidia-smi.yaml` with the following contents. This manifest launches a https://docs.aws.amazon.com/linux/al2023/ug/minimal-container.html[minimal AL2023 container image] that runs `nvidia-smi` on a node.
 +
 [source,yaml,subs="verbatim,attributes,quotes"]
 ----
@@ -103,13 +103,18 @@ metadata:
 spec:
   restartPolicy: OnFailure
   containers:
-  - name: nvidia-smi
-    image: nvidia/cuda:12.9.1-base-amzn2023
-    args:
-    - "nvidia-smi"
-    resources:
-      limits:
-        nvidia.com/gpu: 1
+  - name: gpu-demo
+    image: public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
+    command: ['/bin/sh', '-c']
+    args: ['nvidia-smi && tail -f /dev/null']
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+  tolerations:
+  - key: 'nvidia.com/gpu'
+    operator: 'Equal'
+    value: 'true'
+    effect: 'NoSchedule'
 ----
 +
 . Apply the manifest with the following command.
@@ -158,7 +163,7 @@ The following procedure describes how to install the Neuron Kubernetes device pl
 === Prerequisites
 
 * EKS cluster created
-* Neuron GPU nodes running in the cluster using EKS-optimized AL2023 or Bottlerocket Neuron AMI
+* Neuron GPU nodes running in the cluster using EKS-optimized AL2023 Neuron AMI or Bottlerocket AMI
 * Helm installed in your command-line environment, see <<helm,Setup Helm instructions>>.
 
 === Procedure
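
Reviewer note on the `nvidia-smi.yaml` change above: the following is a minimal Python sketch, not part of the commit, that assembles the Pod spec from the added (`+`) lines and prints it as JSON, which `kubectl apply -f` also accepts. The `apiVersion`, `kind`, and pod name fall outside the hunk, so `v1`, `Pod`, and `POD_NAME` are assumptions made for illustration. The nesting of `tolerations` under `spec` is also an assumption, based on it being a Pod-level field in Kubernetes.

```python
import json

# Hypothetical placeholder: the real pod name sits outside the hunk shown above.
POD_NAME = "nvidia-smi-example"


def build_pod_manifest(name: str) -> dict:
    """Build the Pod manifest using the values from the added (+) diff lines."""
    return {
        "apiVersion": "v1",  # assumption: not visible in the hunk
        "kind": "Pod",       # assumption: not visible in the hunk
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": "gpu-demo",
                    "image": "public.ecr.aws/amazonlinux/amazonlinux:2023-minimal",
                    "command": ["/bin/sh", "-c"],
                    # Runs nvidia-smi once, then keeps the container alive.
                    "args": ["nvidia-smi && tail -f /dev/null"],
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
            # Lets the Pod schedule onto nodes tainted nvidia.com/gpu=true:NoSchedule.
            "tolerations": [
                {
                    "key": "nvidia.com/gpu",
                    "operator": "Equal",
                    "value": "true",
                    "effect": "NoSchedule",
                }
            ],
        },
    }


manifest = build_pod_manifest(POD_NAME)
print(json.dumps(manifest, indent=2))
```

Because the new `args` end with `tail -f /dev/null`, the container keeps running after `nvidia-smi` prints its output, so the Pod would be expected to stay in `Running` rather than complete, unlike the previous manifest.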

latest/ug/ml/ml-eks-optimized-ami.adoc

Lines changed: 11 additions & 7 deletions
@@ -14,7 +14,7 @@ The table below shows the supported GPU instance types for each EKS-optimized ac
 |EKS AMI variant | EC2 instance types
 
 |AL2023 x86_64 NVIDIA
-|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g5, g4dn
+|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g6f, gr6f, g5, g4dn
 
 |AL2023 ARM NVIDIA
 |p6e-gb200, g5g
@@ -23,7 +23,7 @@ The table below shows the supported GPU instance types for each EKS-optimized ac
 |inf1, inf2, trn1, trn2
 
 |Bottlerocket x86_64 aws-k8s-nvidia
-|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g5, g4dn
+|p6-b200, p5, p5e, p5en, p4d, p4de, p3, p3dn, gr6, g6, g6e, g6f, gr6f, g5, g4dn
 
 |Bottlerocket aarch64/arm64 aws-k8s-nvidia
 |g5g
@@ -49,18 +49,20 @@ When using the https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/late
 In addition to the standard EKS AMI components, the EKS-optimized AL2023 NVIDIA AMIs include the following components.
 
 * NVIDIA driver
-* NVIDIA CUDA runtime libraries
+* NVIDIA CUDA user mode driver
 * NVIDIA container toolkit
 * NVIDIA fabric manager
 * NVIDIA persistenced
 * NVIDIA IMEX driver
 * NVIDIA NVLink Subnet Manager
 * EFA minimal (kernel module and rdma-core)
 
-See the EKS AL2023 NVIDIA AMI https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation script] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[kernel loading script] for details on how the EKS AMIs configure the NVIDIA dependencies. See the EKS-optimized https://github.com/awslabs/amazon-eks-ami/releases[AL2023 releases] on GitHub to see the component versions included in the AMIs. You can find the list of installed packages and their versions on a running EC2 instance with the `dnf list installed` command.
+For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility[NVIDIA documentation]. The CUDA version shown from `nvidia-smi` is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers.
 
 To track the status of the EKS-optimized NVIDIA AMIs upgrade to NVIDIA driver 580 version, see https://github.com/awslabs/amazon-eks-ami/issues/2470[GitHub issue #2470]. The NVIDIA 580 driver is required to use CUDA 13+.
 
+See the EKS AL2023 NVIDIA AMI https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation script] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[kernel loading script] for details on how the EKS AMIs configure the NVIDIA dependencies. See the EKS-optimized https://github.com/awslabs/amazon-eks-ami/releases[AL2023 releases] on GitHub to see the component versions included in the AMIs. You can find the list of installed packages and their versions on a running EC2 instance with the `dnf list installed` command.
+
 When building custom AMIs with the EKS-optimized AMIs as the base, it is not recommended or supported to run an operating system upgrade (ie. `dnf upgrade`) or upgrade any of the Kubernetes or GPU packages that are included in the EKS-optimized AMIs, as this risks breaking component compatibility. If you do upgrade the operating system or packages that are included in the EKS-optimized AMIs, it is recommended to thoroughly test in a development or staging environment before deploying to production.
 
 When building custom AMIs for GPU instances, it is recommended to build separate custom AMIs for each instance type generation and family that you will run. The EKS-optimized accelerated AMIs selectively install drivers and packages at runtime based on the underlying instance type generation and family. For more information, see the EKS AMI scripts for https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/provisioners/install-nvidia-driver.sh[installation] and https://github.com/awslabs/amazon-eks-ami/blob/main/templates/al2023/runtime/gpu/nvidia-kmod-load.sh[runtime].
@@ -70,15 +72,17 @@ When building custom AMIs for GPU instances, it is recommended to build separate
 
 When using the https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html[NVIDIA GPU operator] with the EKS-optimized Bottlerocket NVIDIA AMIs, you must disable the operator installation of the driver, toolkit, and device plugin as these are already included in the EKS AMIs.
 
-In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket NVIDIA AMIs include the following components.
+In addition to the standard EKS AMI components, the EKS-optimized Bottlerocket NVIDIA AMIs include the following components. The minimal dependencies for EFA (kernel module and rdma-core) are installed in all Bottlerocket variants.
 
 * NVIDIA driver
-* NVIDIA CUDA runtime libraries
+* NVIDIA CUDA user mode driver
 * NVIDIA container toolkit
 * NVIDIA fabric manager
+* NVIDIA persistenced
 * NVIDIA IMEX driver
 * NVIDIA NVLink Subnet Manager
-* EFA minimal (kernel module and rdma-core)
+
+For details on the NVIDIA CUDA user mode driver and the CUDA runtime/libraries used within application containers, see the https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility[NVIDIA documentation]. The CUDA version shown from `nvidia-smi` is the version of the NVIDIA CUDA user mode driver installed on the host, which must be compatible with the CUDA runtime/libraries used in application containers.
 
 See the Bottlerocket Version Information in the https://bottlerocket.dev/en/[Bottlerocket documentation] for details on the installed packages and their versions. The EKS-optimized Bottlerocket NVIDIA AMIs support kernel 6.12 and NVIDIA driver 580 version for Kubernetes versions 1.33 and above. The NVIDIA 580 driver is required to use CUDA 13+.
 
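
Reviewer note on the added compatibility paragraphs: the sketch below, which is not part of the commit, illustrates the host/container relationship they describe. It compares the CUDA version reported by `nvidia-smi` on the host (the user mode driver) with the CUDA runtime version bundled in a container image. The function names are hypothetical, and the rule is deliberately conservative: it only accepts a runtime that is no newer than the host driver. NVIDIA's minor-version and forward-compatibility mechanisms may permit more combinations, so treat the linked NVIDIA documentation as authoritative.

```python
def parse_cuda_version(version: str) -> tuple[int, int]:
    """Parse a CUDA version such as '12.9' or '13' into (major, minor)."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def runtime_supported(driver_cuda: str, runtime_cuda: str) -> bool:
    """Conservative check: the container runtime must not be newer than the host driver.

    driver_cuda  -- CUDA version shown by nvidia-smi on the host
    runtime_cuda -- CUDA runtime version used inside the application container
    """
    return parse_cuda_version(runtime_cuda) <= parse_cuda_version(driver_cuda)


# A CUDA 12.4 container on a host whose driver reports CUDA 12.9.
print(runtime_supported("12.9", "12.4"))  # True
# A CUDA 13.0 container on the same host needs a newer driver.
print(runtime_supported("12.9", "13.0"))  # False
```

The second case matches the statement in the diff that CUDA 13+ requires the NVIDIA 580 driver: a host whose driver reports a 12.x CUDA version would fail this check for a 13.x container.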

latest/ug/nodes/eks-ami-deprecation-faqs.adoc

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ Kubernetes version 1.32 is the last version for which Amazon EKS will release AL
 |Fedora/CentOS 9
 |N/A
 
-|CUDA Toolkit
+|https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html#why-cuda-compatibility[CUDA user mode driver]
 |12.x
 |12.x
 |12.x,13.x

latest/ug/nodes/eks-optimized-ami.adoc

Lines changed: 7 additions & 7 deletions
@@ -7,10 +7,10 @@ include::../attributes.txt[]
 
 [abstract]
 --
-The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023). They are configured to serve as the base images for Amazon EKS nodes.
+The Amazon EKS-optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023). They are configured to serve as the base images for Amazon EKS nodes.
 --
 
-The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023). They are configured to serve as the base images for Amazon EKS nodes. The AMIs are configured to work with Amazon EKS and they include the following components:
+The Amazon EKS-optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (AL2) and Amazon Linux 2023 (AL2023). They are configured to serve as the base images for Amazon EKS nodes. The AMIs are configured to work with Amazon EKS and they include the following components:
 
 * `kubelet`
 * {aws} IAM Authenticator
@@ -20,7 +20,7 @@ The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (A
 ====
 
 * You can track security or privacy events for Amazon Linux at the https://alas.aws.amazon.com/[Amazon Linux security center] by choosing the tab for your desired version. You can also subscribe to the applicable RSS feed. Security and privacy events include an overview of the issue, what packages are affected, and how to update your instances to correct the issue.
-* Before deploying an accelerated or Arm AMI, review the information in <<gpu-ami,Amazon EKS optimized accelerated Amazon Linux AMIs>> and <<arm-ami>>.
+* Before deploying an accelerated or Arm AMI, review the information in <<gpu-ami,Amazon EKS-optimized accelerated Amazon Linux AMIs>> and <<arm-ami>>.
 * Amazon EC2 `P2` instances aren't supported on Amazon EKS because they require `NVIDIA` driver version 470 or earlier.
 * Any newly created managed node groups in clusters on version `1.30` or newer will automatically default to using AL2023 as the node operating system. Previously, new node groups would default to AL2. You can continue to use AL2 by choosing it as the AMI type when creating a new node group.
 * Amazon EKS will no longer publish EKS-optimized Amazon Linux 2 (AL2) AMIs after November 26th, 2025. Additionally, Kubernetes version `1.32` is the last version for which Amazon EKS will release AL2 AMIs. From version `1.33` onwards, Amazon EKS will continue to release AL2023 and Bottlerocket based AMIs.
@@ -30,7 +30,7 @@ The Amazon EKS optimized Amazon Linux AMIs are built on top of Amazon Linux 2 (A
 [#gpu-ami]
 == Amazon EKS-optimized accelerated Amazon Linux AMIs
 
-The Amazon EKS-optimized accelerated Amazon Linux AMIs are built on top of the standard Amazon EKS optimized Amazon Linux AMIs. They are configured to serve as optional images for Amazon EKS nodes to support GPU, link:machine-learning/inferentia/[Inferentia,type="marketing"], and link:machine-learning/trainium/[Trainium,type="marketing"] based workloads.
+The Amazon EKS-optimized accelerated Amazon Linux AMIs are built on top of the standard Amazon EKS-optimized Amazon Linux AMIs. They are configured to serve as optional images for Amazon EKS nodes to support GPU, link:machine-learning/inferentia/[Inferentia,type="marketing"], and link:machine-learning/trainium/[Trainium,type="marketing"] based workloads.
 
 For more information, see <<ml-eks-optimized-ami>>.
 
@@ -47,13 +47,13 @@ Arm instances deliver significant cost savings for scale-out and Arm-based appli
 [#linux-more-information]
 == More information
 
-For more information about using Amazon EKS optimized Amazon Linux AMIs, see the following sections:
+For more information about using Amazon EKS-optimized Amazon Linux AMIs, see the following sections:
 
 * To use Amazon Linux with managed node groups, see <<managed-node-groups>>.
 * To launch self-managed Amazon Linux nodes, see <<retrieve-ami-id>>.
 * For version information, see <<eks-linux-ami-versions>>.
-* To retrieve the latest IDs of the Amazon EKS optimized Amazon Linux AMIs, see <<retrieve-ami-id>>.
-* For open-source scripts that are used to build the Amazon EKS optimized AMIs, see <<eks-ami-build-scripts>>.
+* To retrieve the latest IDs of the Amazon EKS-optimized Amazon Linux AMIs, see <<retrieve-ami-id>>.
+* For open-source scripts that are used to build the Amazon EKS-optimized AMIs, see <<eks-ami-build-scripts>>.
 
 include::al2023.adoc[leveloffset=+1]
 