
Commit be61965

Support fully-automatic GRES configuration for nvml (#202)
* auto set GresTypes
* auto gres - v1
* auto gres v2
* auto gres v3
* auto gres v4
* auto gres v5 - proper top-level/override
* v5 - fix grestypes when none
* fixup validation
* update README
* disable waffly AI PR summary
* fixup README example
* fixup library boilerplate
* fix README typos
* try to avoid jmespath failures in CI
* fix multiple gres specs resulting in NodeName= lines missing newlines
* fix regex for in-nodegroup gres name extraction
1 parent 9b63015 commit be61965

11 files changed: +204 -175 lines

.gemini/config.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+code_review:
+  pull_request_opened:
+    summary: false

.github/workflows/ci.yml

Lines changed: 2 additions & 1 deletion
@@ -86,7 +86,8 @@ jobs:
 
     - name: Install test dependencies.
       run: |
-        pip3 install -U pip ansible>=2.9.0 molecule-plugins[podman]==23.5.0 yamllint ansible-lint
+        pip3 install -U pip
+        pip install -r molecule/requirements.txt
         ansible-galaxy collection install containers.podman:>=1.10.1 # otherwise get https://github.com/containers/ansible-podman-collections/issues/428
 
    - name: Display ansible version
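The pinned test dependencies now live in `molecule/requirements.txt`, which is referenced by the new install step but not shown in this view. Assuming it simply carries over the pins that were previously passed to `pip3` inline, a plausible sketch of its contents is:

```
# molecule/requirements.txt -- hypothetical contents, reconstructed from the removed pip line above
ansible>=2.9.0
molecule-plugins[podman]==23.5.0
yamllint
ansible-lint
```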

README.md

Lines changed: 70 additions & 67 deletions
@@ -68,12 +68,20 @@ unique set of homogenous nodes:
   `free --mebi` total * `openhpc_ram_multiplier`.
 * `ram_multiplier`: Optional. An override for the top-level definition
   `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
-* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need the define the `conf` key. See [GRES autodetection](#gres-autodetection) section below.
-* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
-    - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
-    - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
-
-  Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
+* `gres_autodetect`: Optional. The [hardware autodetection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect)
+  to use for [generic resources](https://slurm.schedmd.com/gres.html).
+  **NB:** A value of `'off'` (the default) must be quoted to avoid yaml
+  conversion to `false`.
+* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html).
+  Not required if using `nvml` GRES autodetection. Keys/values in dicts are:
+  - `conf`: A string defining the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1)
+    in the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`.
+  - `file`: A string defining device path(s) as per [File](https://slurm.schedmd.com/gres.conf.html#OPT_File),
+    e.g. `/dev/nvidia[0-1]`. Not required if using any GRES autodetection.
+
+  Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) is
+  automatically set from the defined GRES or GRES autodetection. See [GRES Configuration](#gres-configuration)
+  for more discussion.
 * `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
 * `node_params`: Optional. Mapping of additional parameters and values for
   [node configuration](https://slurm.schedmd.com/slurm.conf.html#lbAE).
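To make these two keys concrete, a minimal nodegroup entry using them might look as follows; the nodegroup name is hypothetical, and the GPU model and device paths are taken from the examples in the bullets above:

```yaml
openhpc_nodegroups:
  - name: gpu                    # hypothetical nodegroup name
    gres_autodetect: 'off'       # quoted: a bare off is parsed by YAML as boolean false
    gres:
      - conf: gpu:A100:2         # <name>:<type>:<number>
        file: /dev/nvidia[0-1]   # device path(s); needed here because autodetection is off
```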
@@ -106,6 +114,10 @@ partition. Each partition mapping may contain:
 If this variable is not set one partition per nodegroup is created, with default
 partition configuration for each.
 
+`openhpc_gres_autodetect`: Optional. A global default for `openhpc_nodegroups.gres_autodetect`
+defined above. **NB:** A value of `'off'` (the default) must be quoted to avoid
+yaml conversion to `false`.
+
 `openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days), see
 [slurm.conf:MaxTime](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
 **NB:** This should be quoted to avoid Ansible conversions.
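A sketch of how the global default and the per-nodegroup `gres_autodetect` key might be combined, assuming (as described above) that the per-nodegroup value overrides the global one; nodegroup names are hypothetical:

```yaml
openhpc_gres_autodetect: nvml    # global default applied to all nodegroups
openhpc_nodegroups:
  - name: gpu                    # inherits nvml autodetection from the global default
  - name: general
    gres_autodetect: 'off'       # per-nodegroup override, quoted to avoid YAML boolean conversion
```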
@@ -278,7 +290,7 @@ cluster-control
 
 This example shows how partitions can span multiple types of compute node.
 
-This example inventory describes three types of compute node (login and
+Assume an inventory containing two types of compute node (login and
 control nodes are omitted for brevity):
 
 ```ini
@@ -293,17 +305,12 @@ cluster-general-1
 # large memory nodes
 cluster-largemem-0
 cluster-largemem-1
-
-[hpc_gpu]
-# GPU nodes
-cluster-a100-0
-cluster-a100-1
 ...
 ```
 
-Firstly the `openhpc_nodegroups` is set to capture these inventory groups and
-apply any node-level parameters - in this case the `largemem` nodes have
-2x cores reserved for some reason, and GRES is configured for the GPU nodes:
+Firstly `openhpc_nodegroups` maps to these inventory groups and applies any
+node-level parameters - in this case the `largemem` nodes have 2x cores
+reserved for some reason:
 
 ```yaml
 openhpc_cluster_name: hpc
@@ -312,104 +319,100 @@ openhpc_nodegroups:
   - name: large
     node_params:
       CoreSpecCount: 2
-  - name: gpu
-    gres:
-      - conf: gpu:A100:2
-        file: /dev/nvidia[0-1]
 ```
-or if using the NVML gres_autodection mechamism (NOTE: this requires recompilation of the slurm binaries to link against the [NVIDIA Management libray](#gres-autodetection)):
 
-```yaml
-openhpc_cluster_name: hpc
-openhpc_nodegroups:
-  - name: general
-  - name: large
-    node_params:
-      CoreSpecCount: 2
-  - name: gpu
-    gres_autodetect: nvml
-    gres:
-      - conf: gpu:A100:2
-```
-Now two partitions can be configured - a default one with a short timelimit and
-no large memory nodes for testing jobs, and another with all hardware and longer
-job runtime for "production" jobs:
+Now two partitions can be configured using `openhpc_partitions`: A default
+partition for testing jobs with a short timelimit and no large memory nodes,
+and another partition with all hardware and longer job runtime for "production"
+jobs:
 
 ```yaml
 openhpc_partitions:
   - name: test
     nodegroups:
       - general
-      - gpu
     maxtime: '1:0:0' # 1 hour
     default: 'YES'
   - name: general
     nodegroups:
       - general
       - large
-      - gpu
     maxtime: '2-0' # 2 days
     default: 'NO'
 ```
 Users will select the partition using `--partition` argument and request nodes
-with appropriate memory or GPUs using the `--mem` and `--gres` or `--gpus*`
-options for `sbatch` or `srun`.
+with appropriate memory using the `--mem` option for `sbatch` or `srun`.
 
-Finally here some additional configuration must be provided for GRES:
-```yaml
-openhpc_config:
-    GresTypes:
-        -gpu
-```
+## GRES Configuration
 
-## GRES autodetection
+### Autodetection
 
-Some autodetection mechanisms require recompilation of the slurm packages to
-link against external libraries. Examples are shown in the sections below.
+Some autodetection mechanisms require recompilation of Slurm packages to link
+against external libraries. Examples are shown in the sections below.
 
-### Recompiling slurm binaries against the [NVIDIA Management libray](https://developer.nvidia.com/management-library-nvml)
+#### Recompiling Slurm binaries against the [NVIDIA Management library](https://developer.nvidia.com/management-library-nvml)
 
-This will allow you to use `gres_autodetect: nvml` in your `nodegroup`
-definitions.
+This allows using `openhpc_gres_autodetect: nvml` or `openhpc_nodegroups.gres_autodetect: nvml`.
 
 First, [install the complete cuda toolkit from NVIDIA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
 You can then recompile the slurm packages from the source RPMS as follows:
 
 ```sh
 dnf download --source slurm-slurmd-ohpc
-
 rpm -i slurm-ohpc-*.src.rpm
-
 cd /root/rpmbuild/SPECS
-
 dnf builddep slurm.spec
-
 rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-12.8/targets/x86_64-linux/" slurm.spec | tee /tmp/build.txt
 ```
 
 NOTE: This will need to be adapted for the version of CUDA installed (12.8 is used in the example).
 
-The RPMs will be created in ` /root/rpmbuild/RPMS/x86_64/`. The method to distribute these RPMs to
-each compute node is out of scope of this document. You can either use a custom package repository
-or simply install them manually on each node with Ansible.
+The RPMs will be created in `/root/rpmbuild/RPMS/x86_64/`. The method to distribute these RPMs to
+each compute node is out of scope of this document.
 
-#### Configuration example
+## GRES configuration examples
 
-A configuration snippet is shown below:
+For NVIDIA GPUs, `nvml` GRES autodetection can be used. This requires:
+- The relevant GPU nodes to have the `nvidia-smi` binary installed
+- Slurm to be compiled against the NVIDIA management library as above
+
+Autodetection can then be enabled either for all nodegroups:
 
 ```yaml
-openhpc_cluster_name: hpc
+openhpc_gres_autodetect: nvml
+```
+
+or for individual nodegroups, e.g.:
+```yaml
+openhpc_nodegroups:
+  - name: example
+    gres_autodetect: nvml
+  ...
+```
+
+In either case no additional configuration of GRES is required. Any nodegroups
+with NVIDIA GPUs will automatically get `gpu` GRES defined for all GPUs found.
+GPUs within a node do not need to be the same model but nodes in a nodegroup
+must be homogeneous. GRES types are set to the autodetected model names e.g. `H100`.
+
+For `nvml` GRES autodetection, per-nodegroup `gres_autodetect` and/or `gres` keys
+can still be provided. These can be used to disable/override the default
+autodetection method, or to allow checking autodetected resources against
+expectations as described by the [gres.conf documentation](https://slurm.schedmd.com/gres.conf.html).
+
+Without any autodetection, a GRES configuration for NVIDIA GPUs might be:
+
+```
 openhpc_nodegroups:
   - name: general
-  - name: large
-    node_params:
-      CoreSpecCount: 2
   - name: gpu
-    gres_autodetect: nvml
     gres:
-      - conf: gpu:A100:2
+      - conf: gpu:H200:2
+        file: /dev/nvidia[0-1]
 ```
-for additional context refer to the GPU example in: [Multiple Nodegroups](#multiple-nodegroups).
 
+Note that the `nvml` autodetection is special in this role. Other autodetection
+mechanisms, e.g. `nvidia` or `rsmi` allow the `gres.file:` specification to be
+omitted but still require `gres.conf:` to be defined.
 
 <b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)
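One combination the new README text describes but does not show as a snippet is using `nvml` autodetection together with an explicit `gres` entry, so that the autodetected GPUs can be checked against expectations. A sketch with a hypothetical nodegroup name and GPU count:

```yaml
openhpc_nodegroups:
  - name: gpu               # hypothetical nodegroup name
    gres_autodetect: nvml   # devices and types autodetected via NVML
    gres:
      - conf: gpu:H100:4    # expected resources to check the autodetected GPUs against
                            # no file: key needed when autodetection is used
```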

defaults/main.yml

Lines changed: 12 additions & 0 deletions
@@ -12,6 +12,7 @@ openhpc_packages:
 openhpc_resume_timeout: 300
 openhpc_retry_delay: 10
 openhpc_job_maxtime: '60-0' # quote this to avoid ansible converting some formats to seconds, which is interpreted as minutes by Slurm
+openhpc_gres_autodetect: 'off'
 openhpc_default_config:
   # This only defines values which are not Slurm defaults
   SlurmctldHost: "{{ openhpc_slurm_control_host }}{% if openhpc_slurm_control_host_address is defined %}({{ openhpc_slurm_control_host_address }}){% endif %}"
@@ -40,6 +41,7 @@ openhpc_default_config:
   PropagateResourceLimitsExcept: MEMLOCK
   Epilog: /etc/slurm/slurm.epilog.clean
   ReturnToService: 2
+  GresTypes: "{{ ohpc_gres_types if ohpc_gres_types != '' else 'omit' }}"
 openhpc_cgroup_default_config:
   ConstrainCores: "yes"
   ConstrainDevices: "yes"
@@ -48,6 +50,16 @@ openhpc_cgroup_default_config:
 
 openhpc_config: {}
 openhpc_cgroup_config: {}
+ohpc_gres_types: >-
+  {{
+    (
+      ['gpu'] if openhpc_gres_autodetect == 'nvml' else [] +
+      ['gpu'] if openhpc_nodegroups | map(attribute='gres_autodetect', default='') | unique | select('eq', 'nvml') else [] +
+      openhpc_nodegroups |
+        community.general.json_query('[].gres[].conf') |
+        map('regex_search', '^(\w+)')
+    ) | flatten | reject('eq', '') | sort | unique | join(',')
+  }}
 openhpc_gres_template: gres.conf.j2
 openhpc_cgroup_template: cgroup.conf.j2
 
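As a worked example of how the new `ohpc_gres_types` default resolves (nodegroup names are hypothetical; the logic is as in the expression above): a nodegroup using `nvml` autodetection contributes `gpu`, and each explicit `gres.conf` entry contributes its leading name via `regex_search('^(\w+)')`, with the results flattened, de-duplicated, sorted and comma-joined:

```yaml
openhpc_nodegroups:
  - name: dgx                # hypothetical: contributes 'gpu' via gres_autodetect
    gres_autodetect: nvml
  - name: legacy-gpu         # hypothetical: 'gpu' extracted from the conf string
    gres:
      - conf: gpu:A100:2
        file: /dev/nvidia[0-1]
# ohpc_gres_types then evaluates to "gpu", so GresTypes=gpu lands in openhpc_default_config;
# with no GRES and no autodetection anywhere it evaluates to '' and GresTypes is omitted.
```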

files/nodegroup.schema

Lines changed: 0 additions & 86 deletions
This file was deleted.
