Crash when node names are too long #55
Comments
Fix vpenso#55
When node names are over 20 characters long, sinfo truncates the name but does not preserve the space between NodeList and AllocMem.
Example: cpu-always-on-st-t30 1 0/2/0/2 idle
This commit explicitly tells sinfo to append a space after each entry, fixing the issue.
On our current infrastructure it is not possible to reproduce this problem, because our hostnames are shorter than 10 characters. On Slurm 18.08.8 (CentOS 7.8), I get quite different output.
The second output will most likely crash. Which version of Slurm are you using? From the crash log you have posted and the hostnames you are using, I assume you are running a cluster on AWS, but we have no operational experience with that environment.
We are using AWS ParallelCluster 2.10.0, running on Ubuntu 18.04 using Slurm version 20.02.4. With this version, the output of
So I guess there was a change between 18.08.8 and 20.02.4 that changed the interface and output of sinfo.
We have faced this problem in the past. Since this exporter is basically parsing the output of The
Recent versions of sinfo have
If node names are over 20 characters long, the output of
sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong"
(used at node.go:85) looks like this:
You can see that the node name and the memory value are not separated by whitespace.
This results in a crash with the following output:
The parser expects 5 fields separated by whitespace but finds only 4, which results in an out-of-bounds array access and a panic.
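To illustrate the failure mode, here is a minimal Go sketch of a defensive parse that checks the field count before indexing. The helper name `splitNodeLine`, the constant `expectedFields`, and the sample lines are hypothetical and are not taken from the exporter's node.go; the sample lines are made up and are not real sinfo output.

```go
package main

import (
	"fmt"
	"strings"
)

// expectedFields is the number of columns requested from sinfo
// (NodeList, AllocMem, Memory, CPUsState, StateLong).
const expectedFields = 5

// splitNodeLine is a hypothetical helper: it splits one sinfo output line on
// whitespace and returns an error instead of panicking when two columns have
// been fused together and the field count is wrong.
func splitNodeLine(line string) ([]string, error) {
	fields := strings.Fields(line)
	if len(fields) != expectedFields {
		return nil, fmt.Errorf("expected %d fields, got %d in %q",
			expectedFields, len(fields), line)
	}
	return fields, nil
}

func main() {
	// Illustrative lines only, not real sinfo output.
	good := "node01 0 2000 0/2/0/2 idle"
	fused := "a-very-long-node-nam0 2000 0/2/0/2 idle"

	for _, line := range []string{good, fused} {
		fields, err := splitNodeLine(line)
		if err != nil {
			fmt.Println("skipping line:", err)
			continue
		}
		// Safe: the length check above guarantees index 4 exists.
		fmt.Println("state:", fields[4])
	}
}
```

A check like this only turns the panic into a skipped line; it does not recover the fused values, so the sinfo format string still has to be fixed.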
A possible fix is to change
sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong"
to
sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: "
explicitly telling Slurm to append a space after each value.
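As a rough sketch of how the proposed format string could be assembled in Go, the snippet below appends the `: ` suffix to every field name and builds the argument list for sinfo. The function name `sinfoFormat` is hypothetical and not part of the exporter, and the snippet only prints the arguments rather than running sinfo.

```go
package main

import (
	"fmt"
	"strings"
)

// sinfoFormat is a hypothetical helper that appends the ": " suffix proposed
// in this issue to every field name and joins the results with commas, so
// each column in the sinfo output is followed by a space.
func sinfoFormat(fields []string) string {
	parts := make([]string, len(fields))
	for i, f := range fields {
		parts[i] = f + ": "
	}
	return strings.Join(parts, ",")
}

func main() {
	format := sinfoFormat([]string{
		"NodeList", "AllocMem", "Memory", "CPUsState", "StateLong",
	})
	// These arguments could be passed to exec.Command("sinfo", args...);
	// here they are only printed.
	args := []string{"-h", "-N", "-O", format}
	fmt.Printf("sinfo %q\n", args)
}
```

Building the string from a field list keeps the suffix consistent if more columns are added later.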