# Development notes
## What information is required as input to the cluster/nodes?
Groups: `login`, `compute`, `control`
Group/host vars:
Odd things
- For smslabs, the control node needs to know the login node's private IP, because `openondemand_servername` is defined using it in group_vars/all/openondemand.yml (we use a SOCKS proxy for access). But generally, `grafana` (default: control) will need to know the openondemand (default: login) external address.
The full list for the `everything` cluster is shown below.
Note that `api_address` and `internal_address` for hosts both default to `inventory_hostname`.
- `openhpc_cluster_name`: Cluster name. No default, must be set.
- `openhpc_slurm_control_host`: Slurmctld address. Default in common:all:openhpc is `{{ groups['control'] | first }}`.
  - NB: maybe should use `.internal_address`?
  - Required for all `openhpc` hosts. Is needed as a `delegate_to` target, so it must be an inventory hostname. It is also used as the address of the slurm controller, which is really overloading it.
  - Note Slurm assumes slurmdbd and slurm.conf are in the same directory; how does this work configless?
  - For `slurmd` nodes, we could rewrite /etc/sysconfig/slurmd using cloud-config's `write_files` (a sketch follows below). Note this works because the (upstream) unit file (installed to /usr/lib/systemd/system/slurmd.service) specifies the above `EnvironmentFile` path, and uses environment variables in `ExecStart`.
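A minimal cloud-init sketch of that idea follows; the control host name and the exact `SLURMD_OPTIONS` value are illustrative assumptions, not current appliance behaviour:

```yaml
#cloud-config
# Sketch: write /etc/sysconfig/slurmd so the upstream slurmd.service unit
# (which sets EnvironmentFile=/etc/sysconfig/slurmd and uses $SLURMD_OPTIONS
# in ExecStart) picks up per-cluster options without running Ansible.
write_files:
  - path: /etc/sysconfig/slurmd
    owner: root:root
    permissions: '0644'
    content: |
      # Illustrative: point configless slurmd at the control host
      SLURMD_OPTIONS='--conf-server control-0.internal'
```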
- `openhpc_slurm_partitions`: Partition definitions. Default in common:all:openhpc is a single 'compute' partition. NB: requires group `"{{ openhpc_cluster_name }}_compute"` in the environment inventory (an illustrative definition is sketched below). Could check groups during validation?
  - Host requirements & comments as above (but for control only).
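For illustration only, a definition matching that default and its implied inventory group might look like the following; the extra 'gpu' partition is hypothetical:

```yaml
# Each partition name implies an inventory group
# "{{ openhpc_cluster_name }}_<name>" listing that partition's nodes.
openhpc_slurm_partitions:
  - name: compute    # default: needs group <openhpc_cluster_name>_compute
  - name: gpu        # hypothetical extra partition -> <openhpc_cluster_name>_gpu
```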
 
- `nfs_server`: Default in common:all:nfs is `nfs_server_default` -> `"{{ hostvars[groups['control'] | first ].internal_address }}"`. Required for all clients.
  - For client nodes, could rewrite fstab (done by https://github.com/stackhpc/ansible-role-cluster-nfs/blob/master/tasks/nfs-clients.yml) using cloud-config's mounts module (see the sketch below).
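A sketch of the cloud-init equivalent using its `mounts` module; the export path and server address are illustrative:

```yaml
#cloud-config
# Sketch: add the NFS client mount via cloud-init rather than letting
# ansible-role-cluster-nfs rewrite fstab. Entries use fstab field order.
mounts:
  - ["control-0.internal:/exports/home", "/home", "nfs", "defaults,_netdev", "0", "0"]
```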
- `elasticsearch_address`: Default in common:all:defaults is `{{ hostvars[groups['opendistro'].0].api_address }}`. Required for `filebeat` and `grafana` hosts.
  - Usage: usage search
  - ansible/roles/filebeat/tasks/config.yml templates out from `filebeat_config_path`, which is [environments/common/files/filebeat/filebeat.yml](https://github.com/stackhpc/ansible-slurm-appliance/blob/main/environments/common/files/filebeat/filebeat.yml). This contains:
 
```yaml
output.elasticsearch:
  hosts: ["{{ elasticsearch_address }}:9200"]
  protocol: "https"
  ssl.verification_mode: none
  username: "admin"
  password: "{{ vault_elasticsearch_admin_password }}"
```
(docs). It looks like these settings support environment variables, so this could potentially be set using a systemd unit file fragment (a sketch follows below). The current systemd unit file is in the appliance at ansible/roles/filebeat/templates/filebeat.service.j2.
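A rough sketch of that idea, writing a systemd drop-in via cloud-init. The variable name ELASTICSEARCH_ADDRESS, the drop-in path and the address are assumptions, and since filebeat runs as a container the variable would still need passing through to the filebeat process:

```yaml
#cloud-config
# Sketch: supply the Elasticsearch endpoint as an environment variable via a
# systemd drop-in, assuming filebeat.yml were changed to use
#   hosts: ["${ELASTICSEARCH_ADDRESS}:9200"]
# (Beats config files expand ${VAR} from the environment).
write_files:
  - path: /etc/systemd/system/filebeat.service.d/elasticsearch.conf
    permissions: '0644'
    content: |
      [Service]
      Environment=ELASTICSEARCH_ADDRESS=10.0.0.10
```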
- `prometheus_address`: Default in common:all:defaults is `{{ hostvars[groups['prometheus'].0].api_address }}`. Required for `prometheus` and `grafana` hosts - link.
- `openondemand_address`: Default in common:all:defaults is `{{ hostvars[groups['openondemand'].0].api_address if groups['openondemand'] | count > 0 else '' }}`. Required for the prometheus host.
  - NB: this should probably be in prometheus group vars.
- `grafana_address`: Default in common:all:grafana is `{{ hostvars[groups['grafana'].0].api_address }}`. Required for the grafana host - link.
  - This should probably be moved to common:all:defaults in line with other service endpoints.
 
- `openondemand_servername`: Non-functional default `''`, must be set. Required for the `openondemand` host and the `grafana` host (link) when both grafana and openondemand exist (which they do for `everything`). NB: this probably requires either a) a FIP or b) a fixed IP when using the SOCKS proxy. In the latter case this means the control host needs to have the login node's fixed IP available.
- All the secrets in environment:all:secrets - see the secret role's defaults:
  - grafana, elasticsearch, mysql (x2) passwords (all potentially depending on group placement)
  - `vault_openhpc_mungekey` -> `openhpc_munge_key` (for all openhpc nodes):
    - could rewrite /etc/munge/munge.key using cloud-init `write_files` (see the sketch below).
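A minimal cloud-init sketch of that, assuming the vaulted key is stored base64-encoded; the templated value is only illustrative of where it would come from:

```yaml
#cloud-config
# Sketch: write the munge key directly at boot instead of via the openhpc role.
# Assumes the key material is supplied base64-encoded.
write_files:
  - path: /etc/munge/munge.key
    owner: munge:munge
    permissions: '0400'
    encoding: b64
    content: "{{ vault_openhpc_mungekey }}"
```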
 
 
## Which roles can we ONLY run the install tasks from, to build a cluster-independent(*)/no-config image?
In-appliance roles:
- basic_users: n/a
 - block_devices: n/a
 - filebeat: n/a (but downloads Docker container at service start)
 - grafana-dashboards: Downloads grafana dashboards
 - grafana-datasources: n/a
 - hpctests: n/a but required packages are installed as part of `openhpc_default_packages`.
 - opendistro: n/a but downloads Docker container at service start.
 - openondemand:
   - main.yml unnamed task does rpm installs using osc.ood:install-rpm.yml
   - main.yml unnamed task does rpm installs using pam_auth.yml
   - main.yml unnamed task does git downloads using osc.ood:install-apps.yml
   - jupyter_compute.yml: does package installs
   - vnc_compute.yml: does package installs
 - passwords: n/a
 - podman: prereqs.yml does package installs
Out of appliance roles:
- stackhpc.nfs: [main.yml](https://github.com/stackhpc/ansible-role-cluster-nfs/blob/master/tasks/main.yml) installs packages.
 - stackhpc.openhpc: Required, and `openhpc_packages` (see above) are installed in install.yml, but this requires the `openhpc_slurm_service` fact set from main.yml.
 - cloudalchemy.node_exporter:
   - install.yml does binary download from github but also propagation. Could pre-download it and use `node_exporter_binary_local_dir` (see the sketch below), but install.yml still needs running as it does user creation too.
   - selinux.yml also does package installations.
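A hedged group_vars sketch of that pre-download approach; the controller-side directory is an assumption:

```yaml
# Sketch: point cloudalchemy.node_exporter at a directory on the Ansible
# controller that already contains the node_exporter binary, so install.yml
# copies it out rather than downloading from GitHub. install.yml must still
# run for user creation.
node_exporter_binary_local_dir: /opt/prebuilt/node_exporter
```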
 - cloudalchemy.blackbox-exporter: Currently unused.
 - cloudalchemy.prometheus: install.yml. Same comments as for cloudalchemy.node_exporter above.
 - cloudalchemy.alertmanager: Currently unused.
 - cloudalchemy.grafana: install.yml does package updates.
 - geerlingguy.mysql: setup-RedHat.yml does package updates BUT needs variables.yml running to load appropriate variables.
 - jriguera.configdrive: Unused, should be deleted.
 - osc.ood: See openondemand above.

(*) It's not really cluster-independent, as which features are turned on where may vary.