
Conversation

@bentsherman (Member) commented Mar 4, 2025

When a pipeline runs multiple GPU-enabled tasks on the same node, each task sees all of the GPUs, and the tasks make no attempt to coordinate which GPU each one should use.

NVIDIA provides the CUDA_VISIBLE_DEVICES environment variable to control which GPUs a process can see, but users generally have to manage this variable themselves. Some HPC schedulers can assign it automatically, or use cgroups to control GPU visibility at a lower level.

Nextflow should be able to manage this variable for the local executor, so that the user doesn't have to add complex pipeline logic to do the same. Running a GPU workload locally on a multi-GPU node is a common use case, so it is worth doing.

See the docs in the PR for usage.

To use this with containers, you might have to add CUDA_VISIBLE_DEVICES to docker.envWhitelist (see the sketch after this list).

  • I don't remember whether CUDA_VISIBLE_DEVICES works with containers or whether you have to set NVIDIA_VISIBLE_DEVICES instead.
  • I don't remember whether you have to pass --gpus to the docker command in order to use the GPUs at all; if so, that can be set in docker.runOptions.
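For reference, a container setup along these lines might be needed. This is an untested sketch; whether CUDA_VISIBLE_DEVICES is honoured inside the container and whether --gpus is required are exactly the open questions above.

docker {
    enabled = true
    // pass the per-task device assignment through to the container
    envWhitelist = 'CUDA_VISIBLE_DEVICES'
    // may be required for the container to see any GPU at all
    runOptions = '--gpus all'
}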

See also: #5570


@thealanjason commented Apr 28, 2025

@bentsherman The feature implemented in this PR would really help us use all of the local GPUs without having to schedule tasks onto them manually.

Do you know when this feature will be released? Or is it even planned?

@bentsherman (Member, Author) commented:

Right now I've just put it out so that people can experiment with it, so I encourage you to try it with a local build of this PR. In principle we do want to have this; we just haven't decided whether it should be part of local or a separate executor like local-gpu.

@pditommaso force-pushed the master branch 3 times, most recently from b4b321e to 069653d on June 4, 2025 18:54
@thealanjason commented Jun 11, 2025

Hi @bentsherman or @pditommaso, I finally got some time to try this out. However, I was not able to compile Nextflow from source.

I used the following steps:

  1. Install Java (version 17) using SDKMAN
  2. Clone the nextflow repository
  3. Check out this PR branch (local-gpu-executor)
  4. cd nextflow
  5. make compile

The error I get is shown below:

ajc@mbd:~/Work/git_EXT/nextflow$ make compile
./gradlew compile exportClasspath
> Task :nextflow:compileGroovy FAILED

[Incubating] Problems report is available at: file:///home/ajc/Work/git_EXT/nextflow/build/reports/problems/problems-report.html

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':nextflow:compileGroovy'.
> Could not resolve all files for configuration ':nextflow:compileClasspath'.
   > Could not resolve com.github.nextflow-io.language-server:compiler:main-SNAPSHOT.
     Required by:
         project :nextflow
      > Could not resolve com.github.nextflow-io.language-server:compiler:main-SNAPSHOT.
         > Unable to load Maven meta-data from https://s3-eu-west-1.amazonaws.com/maven.seqera.io/releases/com/github/nextflow-io/language-server/compiler/main-SNAPSHOT/maven-metadata.xml.
            > Could not GET 'https://s3-eu-west-1.amazonaws.com/maven.seqera.io/releases/com/github/nextflow-io/language-server/compiler/main-SNAPSHOT/maven-metadata.xml'. Received status code 403 from server: Forbidden
      > Could not resolve com.github.nextflow-io.language-server:compiler:main-SNAPSHOT.
         > Unable to load Maven meta-data from https://s3-eu-west-1.amazonaws.com/maven.seqera.io/snapshots/com/github/nextflow-io/language-server/compiler/main-SNAPSHOT/maven-metadata.xml.
            > Could not GET 'https://s3-eu-west-1.amazonaws.com/maven.seqera.io/snapshots/com/github/nextflow-io/language-server/compiler/main-SNAPSHOT/maven-metadata.xml'. Received status code 403 from server: Forbidden

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
> Get more help at https://help.gradle.org.

Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

For more on this, please refer to https://docs.gradle.org/8.12.1/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.

BUILD FAILED in 4s
27 actionable tasks: 21 executed, 2 from cache, 4 up-to-date
make: *** [Makefile:31: compile] Error 1

I'm not really sure why I receive the 403 status code during download. Do you have any ideas on how to fix this?

I would really like to try out this feature on our local GPU machines.

@thealanjason commented:

Hi @bentsherman or @pditommaso, I'd be happy if you could have a look at this PR:
#6189

@thealanjason commented:

Hi @bentsherman or @pditommaso, could you please also have a look at this PR:
#6218

It proposes a fix to respect GPU IDs already set in CUDA_VISIBLE_DEVICES before Nextflow is launched.

@bentsherman changed the title from "Manage NVIDIA GPU slots in local executor" to "Support accelerator directive for local executor" on Jun 26, 2025
@bentsherman (Member, Author) commented:

@thealanjason thank you again; you actually inspired me to improve the overall approach and make it more generic.

I removed the executor.gpus config setting and now rely solely on the CUDA_VISIBLE_DEVICES environment variable to make GPUs visible to Nextflow (a rough usage sketch follows below). This is also easily extended to support other runtimes like AMD ROCm and HIP, since they follow the same convention.
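As a rough usage sketch of this approach: the process name and script below are made up, the accelerator directive follows the PR title, and GPUs are assumed to be exposed by setting CUDA_VISIBLE_DEVICES in the environment that launches Nextflow, e.g. CUDA_VISIBLE_DEVICES=0,1,2,3 nextflow run main.nf.

process gpuTask {
    executor 'local'
    accelerator 1   // request one GPU slot per task instance

    script:
    """
    nvidia-smi -L   # each task should see only the GPU(s) assigned to it
    """
}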

@pditommaso I think this PR is ready for serious consideration. Using CUDA_VISIBLE_DEVICES (and its variants) for everything gives us a seamless way for GPU users to integrate with Nextflow. The AcceleratorTracker is a nice abstraction that can be extended to support new devices and strategies without complicating the local executor.
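For illustration only, a tracker of this kind might hand out device slots roughly like the following Groovy sketch. This is not the PR's actual AcceleratorTracker; the class and method names are made up.

// Illustrative only -- not the actual AcceleratorTracker from this PR.
// Hands out device IDs from the set made visible at launch time
// (e.g. CUDA_VISIBLE_DEVICES=0,1,2,3) and returns them when a task ends.
class DeviceSlotTracker {

    private final List<String> free

    DeviceSlotTracker(String visibleDevices) {
        free = visibleDevices.tokenize(',')
    }

    // Reserve `count` devices; the result would be exported as
    // CUDA_VISIBLE_DEVICES in the task's environment.
    synchronized String acquire(int count) {
        if( free.size() < count )
            throw new IllegalStateException('Not enough free GPU slots')
        def taken = (1..count).collect { free.remove(0) }
        return taken.join(',')
    }

    // Return the devices to the pool once the task has completed.
    synchronized void release(String devices) {
        free.addAll(devices.tokenize(','))
    }
}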

@bentsherman bentsherman marked this pull request as ready for review June 26, 2025 16:09
@bentsherman bentsherman requested a review from a team as a code owner June 26, 2025 16:09
@bentsherman bentsherman requested a review from pditommaso June 26, 2025 16:09
@thealanjason left a comment:

It's great that now NVIDIA, AMD, and HIP devices can be handled generically :)

@bentsherman bentsherman added this to the 25.10 milestone Jul 14, 2025
@ECM893 commented Sep 3, 2025

Just commenting to say this would help on so many of our cloud compute deployments.

@bentsherman (Member, Author) commented:

@ECM893 do you typically use the local executor in the cloud for GPUs? If so, I'm curious what your process looks like.

@ECM893 commented Sep 3, 2025

Yes.
I use nf-core Docker-based pipelines in cloud instance VMs, on Azure and GCP.
I would like to be able to say, "Hey, here's a pool of 8 graphics cards attached to this machine; just use them as needed."
