I've uploaded version 2.7.1 to the cluster (after downloading the blobs from GitHub). I've modified the template to include a GPU nodearray along with parameter definitions for it (roughly sketched below), selected NC6 as the GPU machine type, and started the cluster. Everything starts fine and I'm able to allocate the F32_vs nodes that correspond to the hpc partition. When allocating the GPU node, however, the node starts without any error reported in the console, yet Slurm does not appear to recognize this and is still stuck on
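For reference, the template change is along these lines (an abridged sketch rather than my exact diff; the `gpu` nodearray and `GPUMachineType` parameter names are just what I chose, and everything else is copied from the stock hpc/htc definitions):

```ini
    # New nodearray, alongside the existing hpc and htc nodearrays
    [[nodearray gpu]]
    MachineType = $GPUMachineType

        [[[configuration]]]
        slurm.autoscale = true
        slurm.hpc = false

        # Parameter definition, added next to the existing
        # HPCMachineType / HTCMachineType parameters
        [[[parameter GPUMachineType]]]
        Label = GPU VM Type
        Description = The VM type for GPU execute nodes
        ParameterType = Cloud.MachineType
        DefaultValue = Standard_NC6
```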
The behavior is the same on both the CentOS 7 and AlmaLinux 8 operating systems. I have the following cloud-init scripts to install Singularity in each case. These don't appear to be the problem, as any error in a cloud-init script typically gets reported as an error in the web console.
Which image are you using? Does it have the GPU drivers installed? If not, it's probably not reporting the right number of GPUs, so Slurm doesn't consider the node to be "healthy". You won't see this error in the console; you would have to catch it at just the right time to see the node in a DOWN state in Slurm. If you can get the slurmd logs from the node and the slurmctld logs from the scheduler, those would tell us for sure. I would suggest opening a support ticket in the Azure portal so one of our engineers can help.
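For example, something along these lines (the node name is a placeholder, and the log locations depend on SlurmdLogFile / SlurmctldLogFile in your slurm.conf):

```bash
# On the scheduler: check whether the node is DOWN/DRAINED and why
sinfo -R
scontrol show node <gpu-node-name> | grep -iE "state|reason|gres"

# On the GPU node: confirm the driver actually sees the GPUs
nvidia-smi

# Logs (paths depend on SlurmdLogFile / SlurmctldLogFile in slurm.conf)
sudo tail -n 100 /var/log/slurmd.log       # compute node
sudo tail -n 100 /var/log/slurmctld.log    # scheduler
```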
I figured this out pretty much exactly as you described. However, a bigger problem is that AlmaLinux 8 - HPC (the default image of cyclecloud-slurm) doesn't support NC-series instances (I confirmed this by creating a standalone NC6 VM with the almalinux-hpc image), probably due to a driver incompatibility. It does work with NV-series nodes, though.
I just wish these sorts of things were well documented (either in the cyclecloud-slurm README or elsewhere).
I have contacted the support channels.
CentOS 7:
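It's roughly along these lines (a simplified sketch, not the exact script; installing from EPEL is one way to do it, and the package may be named `singularity` or `apptainer` depending on the EPEL snapshot):

```yaml
#cloud-config
# Simplified sketch: install Singularity from EPEL on CentOS 7.
runcmd:
  - yum install -y epel-release
  - yum install -y singularity
```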
AlmaLinux 8:
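And similarly (again a sketch, using dnf and EPEL 8; the same singularity/apptainer naming caveat applies):

```yaml
#cloud-config
# Simplified sketch: install Singularity from EPEL on AlmaLinux 8.
runcmd:
  - dnf install -y epel-release
  - dnf install -y singularity
```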