
slurm_scheduler: handle OCI images #345

Open
@d4l3k

Description

Add support for running TorchX components via the Slurm OCI interface.

Motivation/Background

Slurm 21.08+ supports running jobs inside OCI containers. This matches well with the Docker/Kubernetes images we use by default, and with workspaces + OCI we could support Slurm the same way we support the Docker-based environments.
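
For reference, a minimal sketch of what submission could look like once a bundle is on disk. The `--container=<path_to_oci_bundle>` flag comes from the linked Slurm containers docs; the bundle path and entrypoint here are hypothetical:

```python
import subprocess

# Hypothetical bundle path: Slurm's OCI support expects an unpacked OCI
# runtime bundle at a path visible to the compute nodes.
bundle = "/shared/torchx/images/my-app-bundle"

# On Slurm 21.08+, srun/sbatch accept --container=<path_to_oci_bundle>
# (see https://slurm.schedmd.com/containers.html).
subprocess.run(
    ["srun", f"--container={bundle}", "python", "-m", "my_app.main"],
    check=True,
)
```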

Detailed Proposal

The new Slurm container support doesn't handle image resolution the way Docker/Podman do. Images need to already be present on disk, much like a virtualenv would be, at a path that would have to be user configurable (see the sketch below).
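
One way to surface that user-configurable path is as a scheduler run option. A minimal sketch using TorchX's `runopts`; the `image_dir` option name and its default path are assumptions, not an agreed-upon API:

```python
from torchx.specs import runopts


def run_opts() -> runopts:
    # Sketch of exposing the OCI bundle location as a slurm scheduler
    # run option; the option name and default path are assumptions.
    opts = runopts()
    opts.add(
        "image_dir",
        type_=str,
        default="/shared/torchx/images",
        help="shared-storage directory where OCI bundles are unpacked",
    )
    return opts
```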

This also means we have to interact with docker/buildah to download images and export them to an OCI bundle on disk. There are open questions about image management, e.g. how to avoid running out of disk space.
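
One possible pull-and-export flow, sketched here with skopeo + umoci (both standard OCI tooling) rather than docker/buildah directly; the directory layout and naming are assumptions, and there is no dedup or garbage collection in this sketch:

```python
import subprocess
from pathlib import Path


def export_oci_bundle(image: str, image_dir: str) -> Path:
    """Pull ``image`` and unpack it into an OCI runtime bundle under ``image_dir``."""
    name = image.replace("/", "_").replace(":", "_")
    layout = Path(image_dir) / f"{name}-layout"
    bundle = Path(image_dir) / f"{name}-bundle"

    # Fetch the image from the registry into an OCI image layout on disk.
    subprocess.run(
        ["skopeo", "copy", f"docker://{image}", f"oci:{layout}:latest"],
        check=True,
    )
    # Unpack the layout into a runtime bundle (rootfs/ + config.json)
    # that srun --container can point at.
    subprocess.run(
        ["umoci", "unpack", "--image", f"{layout}:latest", str(bundle)],
        check=True,
    )
    return bundle
```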

The cluster would have to be configured with nvidia-container-runtime for use with GPUs.

Alternatives

Additional context/links

https://slurm.schedmd.com/containers.html

Metadata

Labels

enhancement (New feature or request), module: runner (issues related to the torchx.runner and torchx.scheduler modules), slurm (slurm scheduler)
