GPU allocation stuck on "Waiting for resource configuration" #114

Open
maharjun opened this issue Mar 23, 2023 · 2 comments

@maharjun

I've uploaded version 2.7.1 to the cluster (after downloading the blobs from GitHub) and modified the template to include a GPU cluster, along with parameter definitions corresponding to that cluster. I selected NC6 as the GPU machine type and started the cluster. Everything starts fine, and I'm able to allocate the F32_vs nodes that correspond to the hpc partition. However, when I allocate a GPU node, the node starts without any error reported in the web console, yet Slurm does not appear to recognize this and stays stuck on:

[hpcadmin@ip-0A030006 ~]$ salloc -p gpu -n 1
salloc: Granted job allocation 3
salloc: Waiting for resource configuration
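
While the allocation hangs, the node state can be inspected from the scheduler. As a rough sketch using standard Slurm commands (the actual node name will differ):

sinfo -p gpu -N -l                  # per-node state of the gpu partition
sinfo -R                            # nodes that are down/drained, with the reason
scontrol show node <gpu-node-name>  # full node record, including State= and Reason=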

This is a problem on both the CentOS 7 and AlmaLinux 8 operating systems. I use the following cloud-init scripts to install Singularity in each case. These don't appear to be the problem, since any error in the script is typically reported as an error in the web console.

CentOS 7:

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el7.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el7.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el7.x86_64.rpm

AlmaLinux 8:

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el8.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el8.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el8.x86_64.rpm
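
To rule out the cloud-init step itself, a quick sanity check on a booted node (a sketch, assuming the install completed) is:

singularity --version   # should report 3.9.9
rpm -q singularity-ce   # confirms the RPM is actually installed
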
@anhoward
Contributor

Which image are you using? Does it have the GPU drivers installed? If not, the node is probably not reporting the right number of GPUs, so Slurm doesn't consider it "healthy". You won't see this error in the console, and you'd have to catch it at just the right time to see the node in a DOWN state in Slurm. If you can get the slurmd logs from the node and the slurmctld logs from the scheduler, that would tell for sure. I'd suggest opening a support ticket in the Azure portal so one of our engineers can help.
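
Roughly, as a sketch (default log locations shown; slurm.conf may put them elsewhere):

# on the GPU node: is the driver functional, and what is slurmd reporting?
nvidia-smi
sudo tail -n 100 /var/log/slurmd.log

# on the scheduler: why does slurmctld consider the node unhealthy?
sudo tail -n 100 /var/log/slurmctld.log
scontrol show node <gpu-node-name> | grep -Ei 'state|reason|gres'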

@maharjun
Author

maharjun commented Mar 30, 2023

I figured this out pretty much exactly as you described. However, the bigger problem is that AlmaLinux 8 - HPC (the default image for cyclecloud-slurm) doesn't support NC-series instances, probably due to a driver incompatibility; I confirmed this by creating a standalone NC6 VM with the almalinux-hpc image. It does work with NV-series nodes, though.
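
For anyone reproducing this, a quick check on a standalone VM is something like the following (a sketch; it assumes the image is expected to ship a working NVIDIA driver):

lspci | grep -i nvidia   # the NC6's Tesla K80 should be listed as a PCI device
nvidia-smi               # errors out if the installed driver doesn't support the GPU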

I just wish this sort of thing were well documented (either in the cyclecloud-slurm README or elsewhere).
I have contacted the support channels.
