GPU allocation stuck on "Waiting for resource configuration" #114

Open
maharjun opened this issue Mar 23, 2023 · 2 comments

@maharjun

I've uploaded version 2.7.1 to the cluster (after downloading the blobs from GitHub) and modified the template to include a GPU cluster, along with parameter definitions corresponding to that cluster. I selected NC6 as the GPU machine type and started the cluster. Everything starts fine, and I'm able to allocate the F32_vs nodes that correspond to the hpc partition. However, when I allocate a GPU node, the node starts without any error reported in the web console, yet Slurm does not appear to recognize this and stays stuck on:

[hpcadmin@ip-0A030006 ~]$ salloc -p gpu -n 1
salloc: Granted job allocation 3
salloc: Waiting for resource configuration
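
While the allocation hangs, the node state can be inspected from the scheduler. As a rough sketch using standard Slurm commands (the actual node name will differ):

sinfo -p gpu -N -l                  # per-node state of the gpu partition
sinfo -R                            # nodes that are down/drained, with the reason
scontrol show node <gpu-node-name>  # full node record, including State= and Reason=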

This is a problem on both the CentOS 7 and AlmaLinux 8 operating systems. I use the following cloud-init scripts to install Singularity in each case. These don't appear to be the problem, since any error in the script is typically reported as an error in the web console.

CentOS 7:

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el7.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el7.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el7.x86_64.rpm

AlmaLinux 8:

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el8.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el8.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el8.x86_64.rpm
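
To rule out the cloud-init step itself, a quick sanity check on a booted node (a sketch, assuming the install completed) is:

singularity --version   # should report 3.9.9
rpm -q singularity-ce   # confirms the RPM is actually installed
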
@anhoward
Contributor

Which image are you using? Does it have the GPU drivers installed? If not, the node is probably not reporting the right number of GPUs, so Slurm doesn't consider it "healthy". You won't see this error in the console, and you'd have to catch it at just the right time to see the node in a DOWN state in Slurm. If you can get the slurmd logs from the node and the slurmctld logs from the scheduler, that would tell for sure. I'd suggest opening a support ticket in the Azure portal so one of our engineers can help.
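
Roughly, as a sketch (default log locations shown; slurm.conf may put them elsewhere):

# on the GPU node: is the driver functional, and what is slurmd reporting?
nvidia-smi
sudo tail -n 100 /var/log/slurmd.log

# on the scheduler: why does slurmctld consider the node unhealthy?
sudo tail -n 100 /var/log/slurmctld.log
scontrol show node <gpu-node-name> | grep -Ei 'state|reason|gres'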

@maharjun
Author

maharjun commented Mar 30, 2023

I figured this out pretty much exactly as you described. However, the bigger problem is that AlmaLinux 8 - HPC (the default image for cyclecloud-slurm) doesn't support NC-series instances, probably due to a driver incompatibility; I confirmed this by creating a standalone NC6 VM with the almalinux-hpc image. It does work with NV-series nodes, though.
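
For anyone reproducing this, a quick check on a standalone VM is something like the following (a sketch; it assumes the image is expected to ship a working NVIDIA driver):

lspci | grep -i nvidia   # the NC6's Tesla K80 should be listed as a PCI device
nvidia-smi               # errors out if the installed driver doesn't support the GPU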

I just wish this sort of thing were well documented (either in the cyclecloud-slurm README or elsewhere).
I have contacted the support channels.
