Releases: stackhpc/ansible-slurm-appliance
v1.161
What's Changed
Bumps Slurm versions to fix CVE-2025-43904:
- Upgrade to OpenHPC/Slurm versions RL9=3.1.1/24.11.5 RL8=2.9.1/23.11.11 by @sjpb in #668
- Perform Slurm database upgrade if necessary by @sjpb in #670
- Automate image release by @sjpb in #671
Caution
This is a Slurm major version update for RockyLinux 9 (= OpenHPC v3) clusters.
These clusters will perform a Slurm database upgrade on slurmdbd startup. They will back up the entire state volume via a volume snapshot before performing the upgrade. See #670 and linked dependency PRs for full information.
Full Changelog: v1.160...v1.161
Images
Two new images are available:
- RockyLinux 8: openhpc-RL8-250514-1502-5a923b2c
- RockyLinux 9: openhpc-RL9-250514-1502-5a923b2c
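The image names listed in these notes appear to follow a regular pattern: an optional variant (e.g. cuda), the RockyLinux major version, a build date and time, and a short git hash. The Python sketch below parses a name on that assumption. The pattern is inferred from the names shown here rather than from a documented specification, and `parse_image_name` is a hypothetical helper, not part of the appliance.

```python
import re
from datetime import datetime

# Assumed layout (inferred from the listed names, not a documented format):
#   openhpc[-<variant>]-RL<major>-<YYMMDD>-<HHMM>-<8-char git hash>
IMAGE_NAME = re.compile(
    r"^openhpc-(?:(?P<variant>[a-z]+)-)?RL(?P<rocky>\d+)-"
    r"(?P<date>\d{6})-(?P<time>\d{4})-(?P<commit>[0-9a-f]{8})$"
)


def parse_image_name(name):
    """Split an image name into variant, RockyLinux major version, build time and commit."""
    match = IMAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised image name: {name!r}")
    return {
        "variant": match["variant"],  # None for the plain (non-CUDA) images
        "rocky_major": int(match["rocky"]),
        "built": datetime.strptime(match["date"] + match["time"], "%y%m%d%H%M"),
        "commit": match["commit"],
    }


if __name__ == "__main__":
    print(parse_image_name("openhpc-RL9-250514-1502-5a923b2c"))
```

A check like this can help confirm that the image referenced in a cluster's configuration corresponds to the intended release before a reimage.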
v1.161-rc1
What's Changed
Bumps Slurm versions to fix CVE-2025-43904.
Caution
This is a Slurm major version update for RockyLinux 9 (= OpenHPC v3) clusters.
These clusters will perform a Slurm database upgrade on slurmdbd startup. The startup timeout for that service has been increased to 45 minutes to allow for that. However, it is recommended that this database (in /var/lib/state/mysql on the control node) is backed up before starting slurmdbd, for example by snapshotting the $CLUSTER_NAME-state volume after the reimage (so the service is stopped) but before running the site.yml playbook.
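As an illustration of the recommended manual backup, the Python sketch below builds an OpenStack CLI command that snapshots the state volume. It assumes the `openstack` client is installed and authenticated on the deploy host. The function name and the snapshot naming scheme are hypothetical, and the command is printed for review rather than executed.

```python
import shlex
from datetime import datetime, timezone


def state_snapshot_command(cluster_name, when=None):
    """Build an `openstack volume snapshot create` command for the state volume."""
    when = when or datetime.now(timezone.utc)
    # Hypothetical naming scheme for the snapshot; adjust to site conventions.
    snapshot_name = f"{cluster_name}-state-pre-upgrade-{when:%Y%m%d-%H%M}"
    return [
        "openstack", "volume", "snapshot", "create",
        "--volume", f"{cluster_name}-state",
        # --force allows a snapshot of a volume that is still attached to an instance.
        "--force",
        snapshot_name,
    ]


if __name__ == "__main__":
    # Print the command so it can be checked before being run by hand.
    print(shlex.join(state_snapshot_command("mycluster")))
```

Run the resulting command after the control node has been reimaged but before running site.yml, so that slurmdbd is not writing to the database while the snapshot is taken.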
Full Changelog: v1.160...v1.161
Images
Two new images are available:
- RockyLinux 8: openhpc-RL8-250513-1045-ca44f898
- RockyLinux 9: openhpc-RL9-250513-1046-ca44f898
v1.160
v1.159
What's Changed
In summary:
- Updated OS dnf packages
- Updated NVIDIA driver and CUDA packages, for sites building images including the cuda group
- Updated grafana to v10
- Various fixes (mostly for root-squashed NFS home directory mounts) and feature completion
- Improved documentation
- Fixes the Zenith proxy in CaaS clusters for RL9
- Compute-init: cope with root-squashed nfs clients by @bertiethorpe in #627
- Update terraform provider openstack to v3 by @sd109 in #578
- Fix some typos by @priteau in #629
- Add docs with sequence diagrams for operations by @sjpb in #456
- Update nvidia drivers (to 570-open) CUDA packages (to 12.8.1-1) and samples playbook by @priteau in #628
- Fix dropin directory creation by @jovial in #631
- Test upgrade from latest release to current branch image in CI by @sjpb in #576
- Compute-Init: wait for cloud-init before NFS mount by @JohnGarbutt in #635
- Update dnf repos using latest Pulp timestamps (plus tooling) by @sjpb in #621
- Ensure no_proxy entries are unique by @technowhizz in #633
- Fix typos in docs by @priteau in #639
- Correct vnic_types var name in skeleton variables by @MoteHue in #640
- Document (and test) slurm controlled rebuild configuration and usage by @sjpb in #634
- Fix site.yml hanging on initial deploy by @sjpb in #648
- Fix cuda installs by @MoteHue in #652
- Use checksum verification for CernVM-FS GPG key by @priteau in #641
- fix nightly cleanup for duplicate server names by @bertiethorpe in #653
- Add support for alertmanager by @sjpb in #649
- Fix fatimage build without alertmanager secret by @sjpb in #655
- Fix typos in docs by @priteau in #658
- Change fat image build to create raw image for speed by @JohnGarbutt in #650
- Allow empty items in extra package and user lists by @priteau in #637
- Fix nightly-cleanup workflow by @bertiethorpe in #660
- Fix creation of hpctests directory by @priteau in #659
- Default hpctests_group to hpctests_user by @sjpb in #663
- Fix caas zenith/hpctests/basic_users by @sjpb in #662
- Update grafana to v10 using Ark rpms by @sjpb in #664
- Allow modifying nodes fully-qualified name by @sjpb in #651
Full Changelog: v1.158...v1.159
Images
Two new images are available:
- RockyLinux 8: openhpc-RL8-250506-1259-abb6394b
- RockyLinux 9: openhpc-RL9-250506-1259-abb6394b
v1.158
What's Changed
New features
- Support multiple networks in OpenTofu configurations by @sjpb in #548
- Support attaching FIPs to login nodes by @sjpb in #572
- Support for configuring chrony by @jovial in #575
- Control default routes on boot by @sjpb in #617
- Support mapping compute & login instances to Ironic nodes by @sjpb in #573
- Add support for configuring CA certificates by @sjpb in #574
Important fixes and changes from previous release
- Support lustre on Rocky 8 by @jovial in #566
- Fix lustre IP route detection if there is no gateway by @jovial in #567
- Support sshd password authentication on Rocky 8 by @jovial in #565
- Ensure oddjobd is enabled/started by @jovial in #564
- Add lustre_repo variable by @jovial in #563
- Define login nodes using an opentofu module by @sjpb in #547
- Lower hpl memory fraction to reduce stress from defaults by @sjpb in #591
- Root-squash nfs exports by default by @sjpb in #599
- Restrict all nfs shares to nfs group IPs by @sjpb in #607
- Lustre: Harden mount options by @jovial in #618
- Manila/CephFS and NFS: harden mounts to prevent setuid and devices by @sjpb in #619
Other changes
- Read k3s_token from secrets.yml file by @sjpb in #540
- Remove slurm_openstack_tools collection by @sjpb in #537
- Rename terraform/ directories to tofu/ by @sjpb in #541
- Fix squid/dnf ordering problem by @sjpb in #546
- Optionally ignore image changes in TF by @bertiethorpe in #545
- Change docs/ references from Terraform to OpenTofu by @bertiethorpe in #544
- avoid tf updates to login/compute on control delete/recreate by @sjpb in #555
- Set k3s node IP from access network IP by @sjpb in #556
- docs: update README to use new network syntax by @priteau in #560
- Support compute node rebuild/reboot via Slurm RebootProgram by @bertiethorpe in #553
- Document compute-init image requirements by @sjpb in #569
- Support tuned in compute-init by @sjpb in #570
- Support memory limits and pam no-login in compute-init by @bertiethorpe in #568
- docs: fix OpenTofu file names in README by @priteau in #562
- Support sssd and sshd in compute-init by @bertiethorpe in #571
- Reword recommendation about image by @priteau in #580
- Fix link to Open OnDemand documentation by @priteau in #584
- Fix some typos by @priteau in #583
- Make no_proxy list more configurable by @sd109 in #579
- Fix wrong path to Ansible inventory by @priteau in #587
- Support setting PYTHON_VERSION by @priteau in #588
- Disable compute-init by default & warn of security issue by @sjpb in #585
- Fix basic_users not modifying default nfs-shared home correctly by @sjpb in #590
- Support disabling port security by @sjpb in #592
- Use bootstrap tokens provisioned by ansible for K3s instead of persistent tokens in cloud-init metadata by @wtripp180901 in #589
- Fixed bootstrap tokens not being idempotent by @wtripp180901 in #597
- Fix: Support networks not owned by openstack project by @bertiethorpe in #598
- Remove support for setting VNIC binding profiles by @priteau in #586
- Prevent nfs being mounted by tunnelling/forwarding through login node by @sjpb in #595
- Enable lustre in compute-init by @bertiethorpe in #581
- Fix OpenTofu execution as admin by @priteau in #582
- FIX: Tofu attempts to apply security groups when port_security_enabled is false by @bertiethorpe in #601
- Add file deletion to cleanup play by @sjpb in #600
- Disable nightly builds by @bertiethorpe in #603
- Fix chrony for nodes w/o network access (yet) by @sjpb in #605
- Fix typo in variables.tf by @technowhizz in #609
- Compute-init: Optimise dir copies + Numerical sort playbook + new nodes to existing cluster by @bertiethorpe in #611
- Fix builds not in stackhpc env by @sjpb in #615
- Fix documentation of sssd_install_ldap variable by @priteau in #613
- docs: fix typo by @priteau in #623
- Updated README so image consistent with codebase by @wtripp180901 in #610
- Add image share script by @sjpb in #624
- Enable creating users with local homedirs by @sjpb in #626
New Contributors
- @technowhizz made their first contribution in #609
Full Changelog: v1.157...v1.158
New images
Two new images are available:
- RockyLinux 8: openhpc-RL8-250312-1522-7e5c051d
- RockyLinux 9: openhpc-RL9-250312-1435-7e5c051d
v1.157
What's Changed
- Update ceph to use ark packages and move RL9 to ceph reef by @wtripp180901 in #519
- Add more information re. configuring production sites by @sjpb in #508
- Change defaults so a cookiecutter environment is fully functional by @wtripp180901 in #473
- Fix epel not using Ark repos for RL8 by @wtripp180901 in #526
- Fix volume_backed_instances not working for compute nodes by @sjpb in #527
- Generate and persist hostkeys for ondemand and login nodes by @wtripp180901 in #525
- Support additional volumes on compute nodes by @sjpb in #528
- Support SSSD and optionally LDAP by @sjpb in #438
- Fix nightly cleanup to deal with duplicate server names by @bertiethorpe in #532
- Fix various typos in documentation by @priteau in #530
- Fix environment creation steps by @priteau in #531
- Support and test "re-imageable" compute nodes via compute node metadata by @bertiethorpe in #518
- Document required security groups by @priteau in #534
- Bump Zenith client to latest from azimuth-cloud namespace by @m-bull in #437
- Fix yaml formatting in operations docs by @sjpb in #535
- Enable image builds to install extra packages by default by @sjpb in #536
Image Details
Two new images are available:
- RL8: openhpc-RL8-250114-1627-bccc88b5
- RL9: openhpc-RL9-250114-1626-bccc88b5
Full Changelog: v1.156...v1.157
v1.156
What's Changed
Due to the size of this release, PRs are grouped below. In brief:
- This release addresses various breakages caused by changes to upstream repos. As a result, as of this release the StackHPC images (see below) ship with all dnf repos disabled and either credentials for StackHPC's ark server or a local Pulp server mirrored from ark are required in order to build images.
- OFED and CUDA are no longer shipped in StackHPC images and require an image build to add.
- StackHPC images move to RockyLinux 9.5 and 8.10.
- Added support for NVIDIA DOCA instead of OFED.
- Added support for Lustre clients.
- OpenHPC role supports using the same nodes in multiple partitions/groups.
- Additional packages can be added via appliances_default_extra_packages.
Isolation from upstream dnf repos
- Remove CUDA and OFED builds from CI by @bertiethorpe in #479
- Use rocky 9.4 release train snapshots for builds by @wtripp180901 in #486
- Support site Pulp server for image builds by @wtripp180901 in #490
- Pin nvidia-driver and cuda packages to working packages by @sjpb in #496
- Bump RL9.4 repo timestamps to latest snapshots by @wtripp180901 in #497
- Refactor pulp/dnf roles to avoid having to redefine Ark URLs by @wtripp180901 in #507
- Release train support for Rocky 8.10 by @wtripp180901 in #501
- Bump appliance to Rocky 9.5 + release train support by @wtripp180901 in #503
- Fix python/ansible/pulp squeezer versions for RL8 deploy hosts by @sjpb in #516
- Add Release Train OpenHPC repos by @wtripp180901 in #515
New functionality
- Support lustre client by @sjpb in #447
- Install k3s cluster with ansible init by @wtripp180901 in #441
- Make block device detection work on ESXi by @mkjpryor in #481
- Add role to install NVIDIA DOCA on top of an existing "fat" image by @sjpb in #492
- Fix DOCA install cleanup deleting /tmp by @sjpb in #494
- Add list of additional package installs by @wtripp180901 in #499
- EXPERIMENTAL: add machinery to allow compute nodes to rejoin cluster on reimage by @sjpb in #500
- Ansible-init compute node script by @bertiethorpe in #476
Docs
- Add missing bits re. initial setup to refactored README by @sjpb in #464
- Add generic upgrade docs by @sjpb in #462
- Add note about login node reboot when changing OOD servername by @sd109 in #510
Fixes
- Remove local DNS as a dependency for k3s by @sjpb in #442
- Fix adhoc/rebuild wait_for_connection race condition by @bertiethorpe in #483
- Fix Lustre deleting rdma packages and bump to v2.15.6 for RL9.5 support by @wtripp180901 in #502
Upgrades
- Upgrade RL8 ceph to quincy + trivy rate limit and OOD false positives fix by @wtripp180901 in #477
- Bump openhpc role for slurm restart, templating and nodes in multiple groups by @sjpb in #488
Internal CI changes/fixes
- Don't run trivy scan on nightly builds by @sjpb in #467
- Unset signature_verified property from nightly/latest images by @sjpb in #474
- Don't fail cluster cleanup when prefix not found by @bertiethorpe in #480
- Fix nightly images getting timestamp/git hash by @sjpb in #493
- Fix nightly build version (v2) by @sjpb in #495
- Remove use of FIPs for leafcloud packer builds by @sjpb in #498
Image Details
Two new images are available (neither of which now contains OFED):
- RL8: openhpc-RL8-250106-0916-f8603056
- RL9: openhpc-RL9-250106-0916-f8603056
Full Changelog: v1.155...v1.156
v1.155
What's Changed
- Prevent ansible-init running during packer build by @wtripp180901 in #439
- Ensure podman copes with a hard reboot by @sjpb in #460
- Add workflow to cleanup CI clusters by @sjpb in #451
Image Details
Three new images are available, all with OFED:
- openhpc-RL8-241022-0441-a5affa58
- openhpc-RL9-241022-0038-a5affa58
- openhpc-cuda-RL9-241022-0441-a5affa58
New Contributors
- @wtripp180901 made their first contribution in #439
Full Changelog: v1.154...v1.155
v1.154
What's Changed
- Add description of image to build by @sjpb in #444
- Nightly Slurm CI Rocky update workflow by @bertiethorpe in #440
- stub s3-image-sync workflow for easier ci by @bertiethorpe in #450
- Upload main images to Arcus S3 and sync clouds by @bertiethorpe in #448
- Update docs to include operations by @sjpb in #422
- Fix error in packer build command for nightly builds by @bertiethorpe in #455
- Bump terraform collection to fix race with waiting for ssh by @sjpb in #457
Image details
Three new images are available, all with OFED:
- openhpc-RL8-241009-1523-354b048a
- openhpc-RL9-241009-1523-354b048a
- openhpc-cuda-RL9-241009-1523-354b048a
These require a 15GB root disk except for the image with CUDA which requires 30GB.
Full Changelog: v1.153.1...v1.154
v1.153.1
What's Changed
- Fix up the outputs, after the fip fix by @JohnGarbutt in #446
Full Changelog: v1.153...v1.153.1
No new images provided at this release.