Nextflow failing to detect task exit/failure due to slurm failed node #5276

sereeena commented Sep 3, 2024

Bug report

Expected behavior and actual behavior

I use sbatch to launch a Nextflow workflow, which in turn submits other Slurm jobs. If a Slurm node fails, Nextflow does not detect that the task has terminated, so the workflow hangs with the next process waiting forever.

executor >  slurm (46)
[34/c77efe] process > batch_pipeline:BATCHCLEAN (... [100%] 1 of 1 ✔
[07/78ef5d] process > batch_pipeline:PLOTPCA (all... [100%] 1 of 1 ✔
[89/b34062] process > batch_pipeline:IMPUTATION_P... [100%] 22 of 22 ✔
[5c/b1bb0d] process > batch_pipeline:PRS_PER_CHR ... [ 95%] 21 of 22
[-        ] process > batch_pipeline:PRS_SUMMARY     -

Apologies, I have lost the .nextflow.log file, but Nextflow kept polling for job status and continued to show the job as running; it never detected the job as having exited with an error, even though Slurm showed that the job had failed:

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
3826         nf-batch_+    compute                     1  NODE_FAIL      1:0 
3826.batch        batch                                1  CANCELLED     0:15

In my nextflow.config I have errorStrategy = 'retry', which normally resubmits the job when a process fails, but Nextflow never detects this process as failed even though Slurm reports a non-zero exit code.
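
For context, the retry setup in my nextflow.config looks roughly like this (a minimal sketch; the maxRetries value is just illustrative):

process {
    executor      = 'slurm'
    errorStrategy = 'retry'
    maxRetries    = 2   // illustrative; the retry is never triggered here because the failure is not detected
}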

I saw a similar issue (https://github.com/nextflow-io/nextflow/issues/3422#issuecomment-1323855649) suggesting this may be because Nextflow launches jobs with the --no-requeue Slurm option. If that is the cause, could Nextflow expose a way to control whether --no-requeue is used, for example in the executor config section? Requeuing jobs after node failures seems a far more useful default than disabling that functionality.
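
As a possible workaround I could try forcing requeue through the clusterOptions directive, though I'm not sure whether an extra sbatch option actually takes precedence over the --no-requeue that Nextflow itself adds to the job script, so this is only a sketch:

process {
    executor = 'slurm'
    // sketch only: pass --requeue as a native sbatch option; it is unclear
    // whether this overrides the --no-requeue that Nextflow emits
    clusterOptions = '--requeue'
}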

Steps to reproduce the problem

Difficult to reproduce, as node failures are intermittent. I occasionally see node failures like this in slurmctld.log:

[2024-09-02T16:28:29.026] sched: Allocate JobId=3826 NodeList=seonixhpc-compute-ghpc-0 #CPUs=1 Partition=compute
[2024-09-02T16:29:28.127] Batch JobId=3826 missing from batch node seonixhpc-compute-ghpc-0 (not found BatchStartTime after startup)
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 WTERMSIG 126
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 cancelled by node failure
[2024-09-02T16:29:28.129] _job_complete: JobId=3826 done

Environment

  • Nextflow version: 23.10.1 build 5891
  • Java version: ?
  • Operating system: CentOS
  • Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)