Closed
28 commits
60123ee
swap to no-ohpc version of openhpc role
sjpb Sep 20, 2023
1d47689
use specific openhpc install play files
sjpb Sep 20, 2023
cee6770
bugfix slurm user not existing on non-control nodes
sjpb Sep 20, 2023
251389f
add default openhpc_install_type
sjpb Sep 20, 2023
a49f480
openhpc_ role config for custom binaries
sjpb Sep 20, 2023
fbaaba2
NFS export localhost directory to cluster for /slurm
sjpb Sep 20, 2023
9f888c1
modify hpctests to support non-OpenHPC slurm
sjpb Sep 20, 2023
446ec91
add stackhpc config for hpctests with non-openhpc slurm
sjpb Sep 20, 2023
9dea950
use GenericCloud image, i.e. w/o OpenHPC
sjpb Sep 20, 2023
8cdc4a6
move slurm build to .stackhpc environment
sjpb Sep 20, 2023
2a81f63
add containerised Slurm build
sjpb Sep 20, 2023
79a2cc5
simplify localhosts' NFS definition
sjpb Sep 20, 2023
aec14fd
Merge branch 'main' into feat/no-ohpc
sjpb Sep 26, 2023
49038fb
fix tags for openhpc role (need to run entire playbook due to changes…
sjpb Sep 26, 2023
29c8018
use /nopt/slurm/... directories, with prefix/sysconfdir set in build too
sjpb Sep 26, 2023
f4b02ce
Merge branch 'main' into feat/no-ohpc
sjpb Nov 10, 2023
2253fb1
remove stackhpc demo config for openhpc-less slurm
sjpb Nov 10, 2023
3440e6e
Merge branch 'main' into feat/no-ohpc-mergeable
sjpb Jan 24, 2024
875f61e
add changes from branch rl9
sjpb Jan 24, 2024
458cc0a
fix unqualified names for container pulls
sjpb Jan 24, 2024
64071cf
fix openondemand install
sjpb Jan 24, 2024
5ffbf33
bugfix filebeat unit reload
sjpb Jan 24, 2024
0982f41
comment on required image
sjpb Jan 24, 2024
37e799a
bump fat image base to RL9.3
sjpb Jan 24, 2024
9d5688e
fix dbus-launch command for OOD desktop
sjpb Jan 24, 2024
6e8ea6c
fix OOD desktop launch
sjpb Jan 24, 2024
e5608d9
fix useradd warning: {grafana,prometheus}'s uid * outside of the UID_…
sjpb Jan 25, 2024
6ace72e
Merge branch 'rl9_v2' into feat/no-ohpc-mergeable-rl9
sjpb Jan 25, 2024
4 changes: 2 additions & 2 deletions ansible/fatimage.yml
@@ -58,9 +58,9 @@
 name: mysql
 tasks_from: install.yml
 - name: OpenHPC
-import_role:
+include_role:
 name: stackhpc.openhpc
-tasks_from: install.yml
+tasks_from: "install-{{ openhpc_install_type }}.yml"

 - name: Include distribution variables for osc.ood
 include_vars: "{{ appliances_repository_root }}/ansible/roles/osc.ood/vars/Rocky/8.yml"
1 change: 1 addition & 0 deletions ansible/roles/filebeat/tasks/install.yml
@@ -15,3 +15,4 @@
 - name: Reload filebeat unit file
 command: systemctl daemon-reload
 when: _filebeat_unit.changed
+become: true
2 changes: 2 additions & 0 deletions ansible/roles/hpctests/README.md
@@ -38,6 +38,8 @@ The following variables should not generally be changed:
 - `hpctests_pingpong_plot`: Whether to plot pingpong results. Default `yes`.
 - `hpctests_hpl_modules`: As above but for hpl tests.
 - `hpctests_hpl_version`: Version of HPL
+- `hpctests_extra_paths`: List of additional paths to add to $PATH in `pingpong` and `pingmatrix` sbatch scripts.
+- `hpctests_pingpong_command`: Command to use to run IMB-MPI1 pingpong.

 Dependencies
 ------------
2 changes: 2 additions & 0 deletions ansible/roles/hpctests/defaults/main.yml
@@ -1,7 +1,9 @@
 ---
 hpctests_rootdir:
+hpctests_extra_paths: []
 hpctests_pingmatrix_modules: [gnu12 openmpi4]
 hpctests_pingpong_modules: [gnu12 openmpi4 imb]
+hpctests_pingpong_command: 'mpirun IMB-MPI1 pingpong' # NB 'srun --mpi=pmi2 IMB-MPI1 pingpong' doesn't work in ohpc v2.1
 hpctests_pingpong_plot: yes
 hpctests_hpl_modules: [gnu12 openmpi4 openblas]
 hpctests_outdir: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/hpctests"
3 changes: 2 additions & 1 deletion ansible/roles/hpctests/templates/pingmatrix.sh.j2
@@ -12,7 +12,8 @@ export UCX_NET_DEVICES={{ hpctests_ucx_net_devices }}
 echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
 echo SLURM_JOB_ID: $SLURM_JOB_ID
 echo UCX_NET_DEVICES: $UCX_NET_DEVICES
-module load {{ hpctests_pingmatrix_modules | join(' ' ) }}
+{% if hpctests_pingmatrix_modules %}module load {{ hpctests_pingmatrix_modules | join(' ' ) }}{% endif %}
+{% if hpctests_extra_paths %}export PATH={{ hpctests_extra_paths | join(':') }}:$PATH{% endif %}

 mpicc -o nxnlatbw mpi_nxnlatbw.c
 mpirun nxnlatbw
6 changes: 3 additions & 3 deletions ansible/roles/hpctests/templates/pingpong.sh.j2
@@ -12,7 +12,7 @@ export UCX_NET_DEVICES={{ hpctests_ucx_net_devices }}
 echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
 echo SLURM_JOB_ID: $SLURM_JOB_ID
 echo UCX_NET_DEVICES: $UCX_NET_DEVICES
-module load {{ hpctests_pingpong_modules | join(' ' ) }}
+{% if hpctests_pingpong_modules %}module load {{ hpctests_pingpong_modules | join(' ' ) }}{% endif %}
+{% if hpctests_extra_paths %}export PATH={{ hpctests_extra_paths | join(':') }}:$PATH{% endif %}

-#srun --mpi=pmi2 IMB-MPI1 pingpong # doesn't work in ohpc v2.1
-mpirun IMB-MPI1 pingpong
+{{ hpctests_pingpong_command }}
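
Read together, the hpctests changes above make both `module load` and the `$PATH` export conditional, so a cluster without OpenHPC modules can supply MPI binaries by path instead. A minimal sketch of such an override follows. The file location and the directories under `/nopt` are hypothetical, chosen only to illustrate the variables, and are not taken from this PR.

```yaml
# Hypothetical override, e.g. in an environment's group_vars (location assumed).
# Empty module lists make the templates skip `module load` entirely.
hpctests_pingmatrix_modules: []
hpctests_pingpong_modules: []
# Assumed install locations for mpicc/mpirun and IMB-MPI1; adjust to the site.
hpctests_extra_paths:
  - /nopt/openmpi/bin
  - /nopt/imb/bin
# Same as the role default; shown only to mark where a site would change it.
hpctests_pingpong_command: 'mpirun IMB-MPI1 pingpong'
```

With these values the rendered sbatch scripts would contain `export PATH=/nopt/openmpi/bin:/nopt/imb/bin:$PATH` and no `module load` line.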
10 changes: 8 additions & 2 deletions ansible/roles/mysql/tasks/install.yml
@@ -1,6 +1,12 @@
+- name: Install pip
+dnf:
+name: python3-pip
+
 - name: Install python mysql client
 pip:
-name: pymysql
+name:
+- pymysql
+- cryptography
 state: present

 - name: Create systemd mysql container unit file
@@ -11,6 +17,6 @@

 - name: Pull container image
 containers.podman.podman_image:
-name: "mysql"
+name: docker.io/library/mysql
 tag: "{{ mysql_tag }}"
 become_user: "{{ mysql_podman_user }}"
2 changes: 1 addition & 1 deletion ansible/roles/mysql/templates/mysql.service.j2
@@ -26,7 +26,7 @@ ExecStart=/usr/bin/podman run \
 --volume {{ mysql_datadir }}:/var/lib/mysql:U \
 --publish 3306:3306 \
 --env MYSQL_ROOT_PASSWORD=${MYSQL_INITIAL_ROOT_PASSWORD} \
-mysql:{{ mysql_tag }}{%- for opt in mysql_mysqld_options %} \
+docker.io/library/mysql:{{ mysql_tag }}{%- for opt in mysql_mysqld_options %} \
 --{{ opt }}{% endfor %}

 ExecStop=/usr/bin/podman stop --ignore mysql -t 10
2 changes: 1 addition & 1 deletion ansible/roles/openondemand/tasks/main.yml
@@ -10,7 +10,7 @@
 - include_role:
 name: osc.ood
 tasks_from: install-package.yml
-vars_from: Rocky/8.yml
+vars_from: "Rocky/{{ ansible_distribution_major_version }}.yml"
 public: yes # Expose the vars from this role to the rest of the play
 # can't set vars: from a dict hence the workaround above

1 change: 1 addition & 0 deletions ansible/roles/openondemand/tasks/vnc_compute.yml
@@ -17,6 +17,7 @@
 - turbovnc-3.0.1
 - nmap-ncat
 - python3.9
+- dbus-x11

 - name: Install Xfce desktop
 tags: install
2 changes: 1 addition & 1 deletion ansible/roles/opensearch/tasks/install.yml
@@ -16,7 +16,7 @@

 - name: Pull container image
 containers.podman.podman_image:
-name: "opensearchproject/opensearch"
+name: docker.io/opensearchproject/opensearch
 tag: "{{ opensearch_version }}"
 become_user: "{{ opensearch_podman_user }}"

2 changes: 1 addition & 1 deletion ansible/roles/opensearch/templates/opensearch.service.j2
@@ -29,7 +29,7 @@ ExecStart=/usr/bin/podman run \
 --env bootstrap.memory_lock=true \
 --env "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" \
 --env DISABLE_INSTALL_DEMO_CONFIG=true \
-opensearchproject/opensearch:{{ opensearch_version }}
+docker.io/opensearchproject/opensearch:{{ opensearch_version }}
 ExecStop=/usr/bin/podman stop --ignore opensearch -t 10
 # note for some reason this returns status=143 which makes systemd show the unit as failed, not stopped
 ExecStopPost=/usr/bin/podman rm --ignore -f opensearch
8 changes: 7 additions & 1 deletion ansible/slurm.yml
@@ -25,8 +25,14 @@
 tags:
 - openhpc
 tasks:
-- import_role:
+- include_role:
 name: stackhpc.openhpc
+tasks_from: "install-{{ openhpc_install_type }}.yml"
+tags: install
+- include_role:
+name: stackhpc.openhpc
+tasks_from: runtime.yml
+tags: runtime

 - name: Set locked memory limits on user-facing nodes
 hosts:
2 changes: 1 addition & 1 deletion environments/.stackhpc/ARCUS.pkrvars.hcl
@@ -4,7 +4,7 @@ volume_size = 12 # GB. Compatible with SMS-lab's general.v1.tiny
 image_disk_format = "qcow2"
 networks = ["4b6b2722-ee5b-40ec-8e52-a6610e14cc51"] # portal-internal (DNS broken on ilab-60)
 source_image_name = "openhpc-230804-1754-80b8d714" # https://github.com/stackhpc/ansible-slurm-appliance/pull/298
-fatimage_source_image_name = "Rocky-8-GenericCloud-Base-8.9-20231119.0.x86_64.qcow2"
+fatimage_source_image_name = "Rocky-9-GenericCloud-Base-9.3-20231113.0.x86_64.qcow2"
 ssh_keypair_name = "slurm-app-ci"
 ssh_private_key_file = "~/.ssh/id_rsa"
 security_groups = ["default", "SSH"]
2 changes: 1 addition & 1 deletion environments/.stackhpc/SMS.pkrvars.hcl
@@ -1,7 +1,7 @@
 flavor = "general.v1.tiny"
 networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"] # stackhpc-ipv4-geneve
 source_image_name = "openhpc-230503-0944-bf8c3f63" # https://github.com/stackhpc/ansible-slurm-appliance/pull/252
-fatimage_source_image_name = "Rocky-8-GenericCloud-Base-8.9-20231119.0.x86_64.qcow2"
+fatimage_source_image_name = "Rocky-9-GenericCloud-Base-9.3-20231113.0.x86_64.qcow2"
 ssh_keypair_name = "slurm-app-ci"
 ssh_private_key_file = "~/.ssh/id_rsa"
 security_groups = ["default", "SSH"]
11 changes: 6 additions & 5 deletions environments/.stackhpc/hooks/post-bootstrap.yml
@@ -3,14 +3,15 @@
 gather_facts: false
 tags: podman
 tasks:
-- name: Configure container image registry for unqualified searches to avoid docker.io ratelimits
+- name: Configure container image registry to avoid docker.io ratelimits
 copy:
-dest: /etc/containers/registries.conf.d/003-arcus-unqualfied-overrides.conf
+dest: /etc/containers/registries.conf.d/003-arcus-mirror.conf
 content: |
-unqualified-search-registries = ['{{ podman_registry_address | split('/') | first }}', 'registry.access.redhat.com', 'registry.redhat.io', 'docker.io']
-
 [[registry]]
-prefix = "{{ podman_registry_address }}"
+location="docker.io/library/"
+prefix="docker.io/library/"
+
+[[registry.mirror]]
 location = "{{ podman_registry_address }}"
 insecure = true
 when: "ci_cloud == 'ARCUS'"
6 changes: 2 additions & 4 deletions environments/.stackhpc/terraform/main.tf
@@ -13,8 +13,8 @@ variable "cluster_name" {
 variable "cluster_image" {
 description = "single image for all cluster nodes - a convenience for CI"
 type = string
-default = "openhpc-240116-1156-aa8dba7d" # https://github.com/stackhpc/ansible-slurm-appliance/pull/351
-# default = "Rocky-8-GenericCloud-Base-8.9-20231119.0.x86_64.qcow2"
+# default = "openhpc-240116-1156-aa8dba7d" # https://github.com/stackhpc/ansible-slurm-appliance/pull/351
+default = "Rocky-9-GenericCloud-Base-9.3-20231113.0.x86_64.qcow2" # TODO: create packer build
 }

 variable "cluster_net" {}
@@ -62,8 +62,6 @@ module "cluster" {
 compute_nodes = {
 compute-0: "small"
 compute-1: "small"
-compute-2: "extra"
-compute-3: "extra"
 }
 volume_backed_instances = var.volume_backed_instances

2 changes: 2 additions & 0 deletions environments/common/inventory/group_vars/all/defaults.yml
@@ -58,6 +58,7 @@ appliances_local_users_default:
 uid: 981
 home: "{{ prometheus_db_dir }}"
 shell: /usr/sbin/nologin
+system: true
 enable: "{{ 'prometheus' in group_names }}"

 - group:
@@ -69,6 +70,7 @@ appliances_local_users_default:
 uid: 984
 home: /usr/share/grafana
 shell: /sbin/nologin
+system: true
 enable: "{{ 'grafana' in group_names }}"

 # Override this to add extra users whilst keeping the defaults.
2 changes: 1 addition & 1 deletion environments/common/inventory/group_vars/all/openhpc.yml
@@ -2,7 +2,7 @@

 # See: https://github.com/stackhpc/ansible-role-openhpc
 # for variable definitions
-
+openhpc_install_type: ohpc # use "ohpc" for an OpenHPC-based system or "generic" if providing binaries
 openhpc_enable:
 control: "{{ inventory_hostname in groups['control'] }}"
 batch: "{{ inventory_hostname in groups['compute'] }}"
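
The new `openhpc_install_type` variable feeds the `tasks_from: "install-{{ openhpc_install_type }}.yml"` expressions in `ansible/fatimage.yml` and `ansible/slurm.yml`. As a sketch, an environment that provides its own Slurm binaries might override the default like this (the file path in the comment is a hypothetical example, not a path from this PR):

```yaml
# environments/<your-env>/inventory/group_vars/all/openhpc.yml (hypothetical path)
openhpc_install_type: generic  # tasks_from then resolves to install-generic.yml
```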
@@ -49,6 +49,8 @@ openondemand_clusters:
 module purge

 export PATH=/opt/TurboVNC/bin:$PATH
+# avoid "Failed to create secure directory (/run/user/*/pulse)"
+export XDG_RUNTIME_DIR="$TMPDIR/xdg_runtime"

 # Workaround to avoid "Unable to contact settings server" when
 # launching xfce4-session
2 changes: 1 addition & 1 deletion requirements.yml
@@ -3,7 +3,7 @@ roles:
 - src: stackhpc.nfs
 version: v23.12.1 # Tolerate stale NFS file handles
 - src: https://github.com/stackhpc/ansible-role-openhpc.git
-version: v0.23.0 # https://github.com/stackhpc/ansible-role-openhpc/pull/165
+version: feat/no-ohpc # https://github.com/stackhpc/ansible-role-openhpc/pull/162
 name: stackhpc.openhpc
 - src: https://github.com/stackhpc/ansible-node-exporter.git
 version: stackhpc