Nextflow failing to detect task exit/failure due to slurm failed node #5276

sereeena commented Sep 3, 2024

Bug report

Expected behavior and actual behavior

I use sbatch to launch a Nextflow workflow, which in turn submits other Slurm jobs. If a Slurm node fails, Nextflow does not detect that the task has terminated, so the workflow hangs with the next process waiting forever.

executor >  slurm (46)
[34/c77efe] process > batch_pipeline:BATCHCLEAN (... [100%] 1 of 1 ✔
[07/78ef5d] process > batch_pipeline:PLOTPCA (all... [100%] 1 of 1 ✔
[89/b34062] process > batch_pipeline:IMPUTATION_P... [100%] 22 of 22 ✔
[5c/b1bb0d] process > batch_pipeline:PRS_PER_CHR ... [ 95%] 21 of 22
[-        ] process > batch_pipeline:PRS_SUMMARY     -

Apologies, I have lost the .nextflow.log file, but Nextflow kept polling for job status and continued to show the job as running; it never detected the job as having exited with an error, even though Slurm showed that the job had failed:

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
3826         nf-batch_+    compute                     1  NODE_FAIL      1:0 
3826.batch        batch                                1  CANCELLED     0:15

In my nextflow.config I have errorStrategy = 'retry', which normally resubmits the job when a process fails, but Nextflow never detects this process as failed even though Slurm reports a non-zero exit code.
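
For context, the retry setup in my nextflow.config looks roughly like this (a minimal sketch; the maxRetries value is just illustrative):

process {
    executor      = 'slurm'
    errorStrategy = 'retry'
    maxRetries    = 2   // illustrative; the retry is never triggered here because the failure is not detected
}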

I saw a similar issue (https://github.com/nextflow-io/nextflow/issues/3422#issuecomment-1323855649) suggesting this may be because Nextflow launches jobs with the --no-requeue Slurm option. If that is the cause, could Nextflow expose a way to control whether --no-requeue is used, for example in the executor config section? Requeuing jobs after node failures seems a far more useful default than disabling that functionality.
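
As a possible workaround I could try forcing requeue through the clusterOptions directive, though I'm not sure whether an extra sbatch option actually takes precedence over the --no-requeue that Nextflow itself adds to the job script, so this is only a sketch:

process {
    executor = 'slurm'
    // sketch only: pass --requeue as a native sbatch option; it is unclear
    // whether this overrides the --no-requeue that Nextflow emits
    clusterOptions = '--requeue'
}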

Steps to reproduce the problem

Difficult to reproduce, as node failures are intermittent. I occasionally see node failures like this in slurmctld.log:

[2024-09-02T16:28:29.026] sched: Allocate JobId=3826 NodeList=seonixhpc-compute-ghpc-0 #CPUs=1 Partition=compute
[2024-09-02T16:29:28.127] Batch JobId=3826 missing from batch node seonixhpc-compute-ghpc-0 (not found BatchStartTime after startup)
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 WTERMSIG 126
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 cancelled by node failure
[2024-09-02T16:29:28.129] _job_complete: JobId=3826 done

Environment

  • Nextflow version: 23.10.1 build 5891
  • Java version: ?
  • Operating system: CentOS
  • Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)