[RFE] Change num_gpus to a dict to support arbitrary accelerators #467

@Xaenalt

Name of Feature or Improvement

I'd like to replace the hardcoded nvidia.com/gpu resource with a dict (or similar) of resources. There are other accelerators, and it would be nice to specify them with arbitrary key/value pairs rather than hardcoding nvidia.com/gpu.

Description of Problem the Feature Should Solve

Hardcoding nvidia.com/gpu is suboptimal because other accelerators exist, habana.ai/gaudi to name one, along with other potential resources and accelerators, some possibly not even public. Being able to specify these additional resources without editing the template would improve usability.

Describe the Solution You Would Like to See

I'd like to see a constructor something like:

cluster = Cluster(ClusterConfiguration(
    name='raytest',
    namespace='ray-demo',
    num_workers=2,
    min_cpus=8,
    max_cpus=8,
    min_memory=12,
    max_memory=12,
    resources={"habana.ai/gaudi": 1},
    image="quay.io/spryor/ray:synapseai-1.13-torch",
    instascale=False
))

This would just add the keys/values from the resources variable to the resources requests/limits section. An option to set requests and limits separately could come later, but for a first pass it's totally fine if requests == limits, since for hardware devices they are required to be equal.
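As a rough illustration of the merge described above, here is a minimal sketch. The helper name `build_container_resources` and the "G" memory suffix are assumptions made for this example; they are not taken from the SDK or its template.

```python
def build_container_resources(min_cpus, max_cpus, min_memory, max_memory, resources=None):
    """Build a Kubernetes-style resources section as a plain dict.

    Hypothetical helper for illustration only, not the SDK's implementation.
    """
    # The "G" suffix for memory is an assumption made for this sketch.
    requests = {"cpu": str(min_cpus), "memory": f"{min_memory}G"}
    limits = {"cpu": str(max_cpus), "memory": f"{max_memory}G"}

    # Copy every extra resource into both sections so requests == limits.
    for name, count in (resources or {}).items():
        requests[name] = count
        limits[name] = count

    return {"requests": requests, "limits": limits}


section = build_container_resources(8, 8, 12, 12, resources={"habana.ai/gaudi": 1})
print(section)
```

Passing no `resources` argument leaves only the cpu and memory entries, so existing configurations would not need to change under this approach.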

Describe Alternatives You Have Considered

Some alternative formats: separate min_resources and max_resources arguments, or a string format like "someresource": "1/2" for request 1, limit 2.
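To make the string format concrete, here is one way it might be parsed. The function name `split_resource_spec` is hypothetical, and treating a bare value as request == limit is an assumption of this sketch rather than part of the proposal.

```python
def split_resource_spec(resources):
    """Split {"name": "request/limit"} entries into requests and limits dicts.

    Hypothetical helper for illustration only. A value without a "/" is
    treated as request == limit (an assumption of this sketch).
    """
    requests, limits = {}, {}
    for name, spec in resources.items():
        request, sep, limit = str(spec).partition("/")
        requests[name] = int(request)
        limits[name] = int(limit) if sep else int(request)
    return requests, limits


requests, limits = split_resource_spec({"someresource": "1/2", "habana.ai/gaudi": 1})
print(requests, limits)
```

The separate min_resources and max_resources arguments would avoid string parsing entirely, at the cost of two parameters instead of one.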

Additional Context

In this case, the request is about Habana Gaudi devices, but the need extends beyond them.

Activity

anishasthana (Contributor) commented on Feb 22, 2024

Bobbins228 (Contributor) commented on Feb 26, 2024

This sounds like a useful change 👍

KPostOffice (Contributor) commented on Sep 19, 2024

Solved with #531

          [RFE] Change num_gpus to a dict to support arbitrary accelerators · Issue #467 · project-codeflare/codeflare-sdk