Failure to launch slurm jobs isn't obvious enough #100

Closed
JamesWrigley opened this issue Sep 27, 2023 · 1 comment

Comments

@JamesWrigley
Member

When launching a slurm job fails, the error message isn't very obvious. In this case the user didn't have access to the reservation:
[Screenshot: the sbatch error and the log message that followed it]

The only sign of a problem was that single error from sbatch, followed by a log message implying that the job had actually been launched (though the job ID was omitted, which I feel should also have caused an error).
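
To illustrate the point about the missing job ID, here is a minimal sketch of a check that refuses to continue when `sbatch --parsable` gives back nothing usable. It is not DAMNIT's actual code: the function name `parse_sbatch_job_id` is hypothetical, and it assumes the documented `--parsable` output format of a job ID optionally followed by `;` and the cluster name.

```python
from typing import Optional, Tuple


def parse_sbatch_job_id(stdout: str) -> Tuple[str, Optional[str]]:
    """Extract (job_id, cluster) from `sbatch --parsable` output.

    Hypothetical helper: raises ValueError instead of silently carrying on
    when no job ID was printed, e.g. because the submission failed.
    """
    text = stdout.strip()
    # --parsable prints "jobid" or "jobid;cluster"
    job_id, _, cluster = text.partition(";")
    if not job_id.isdigit():
        raise ValueError(f"sbatch did not return a job ID (output: {text!r})")
    return job_id, (cluster or None)
```

With a check like this, an empty result from sbatch becomes an error at the point of submission rather than a log line with a blank job ID.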

@takluyver
Member

After #270, a bad partition name will show something like this in logs:

sbatch: error: Batch job submission failed: User's group not permitted to use this partition
Traceback (most recent call last):
  File "/gpfs/exfel/sw/software/xfel_anaconda3/amore-mid/conda_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfs/exfel/sw/software/xfel_anaconda3/amore-mid/conda_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/kluyvert/Code/DAMNIT/damnit/backend/extract_data.py", line 262, in <module>
    main()
  File "/home/kluyvert/Code/DAMNIT/damnit/backend/extract_data.py", line 254, in main
    Extractor().extract_and_ingest(args.proposal, args.run,
  File "/home/kluyvert/Code/DAMNIT/damnit/backend/extract_data.py", line 226, in extract_and_ingest
    submitter.submit(cluster_req)
  File "/home/kluyvert/Code/DAMNIT/damnit/backend/extraction_control.py", line 122, in submit
    res = subprocess.run(
  File "/gpfs/exfel/sw/software/xfel_anaconda3/amore-mid/conda_env/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--parsable', '--clusters', 'maxwell', '--time', '02:00:00', '--reservation', 'nonexistant_1', '-o', PosixPath('/gpfs/exfel/data/scratch/kluyvert/damnit-p6616/process_logs/r10-p6616.out'), '--open-mode=append', '--job-name', 'r10-p6616-damnit', '--wrap', '/home/kluyvert/Code/DAMNIT/env/bin/python -m damnit.backend.extract_data 6616 10 all --cluster-job']' returned non-zero exit status 1.
srun: error: max-wn008: task 0: Exited with exit code 1

If you're using the --watch option to process runs one by one, it will also error out at this point.

The traceback might be a bit too much? But it definitely makes it very obvious that something has gone wrong.
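
If the traceback does turn out to be too much, one option would be to catch the `CalledProcessError` and re-raise something shorter. The following is only a sketch under assumptions: `SubmitError` and `run_submit` are hypothetical names, not part of DAMNIT, and capturing stderr this way means the sbatch message is reported through the exception rather than printed directly by sbatch.

```python
import subprocess
import sys


class SubmitError(RuntimeError):
    """Hypothetical exception for a failed job submission."""


def run_submit(cmd):
    """Run a submission command and return its stripped stdout.

    On a non-zero exit status, raise SubmitError carrying the command's
    stderr, so the log shows one clear line instead of a long traceback.
    """
    try:
        res = subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        detail = (e.stderr or "").strip() or "no error output"
        raise SubmitError(
            f"Job submission failed (exit status {e.returncode}): {detail}"
        ) from None
    return res.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a failing sbatch call, so the sketch runs without Slurm
    fake_cmd = [sys.executable, "-c",
                "import sys; sys.stderr.write('submission failed'); sys.exit(1)"]
    try:
        run_submit(fake_cmd)
    except SubmitError as e:
        print(e)
```

The caller can then decide whether to log the message and stop (as `--watch` does now by erroring out) or to let the exception propagate.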
