@remote support for multi-instance training job #4125

Open
@sateeshmannar

Describe the feature you'd like
We need the ability to use @remote to run a training job across multiple instances for distributed training.

How would this feature be used? Please describe.
Distributed training packages such as h2o could then be used with @remote. Currently, @remote restricts the instance count of the training job to one instance.

Describe alternatives you've considered
Use sagemaker.estimator.Estimator to configure the distributed training job. This requires duplicating code when switching between local mode and instance-based training.
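
For illustration, here is a minimal sketch of how that Estimator-based switch between local mode and multi-instance training might look. The helper names `estimator_kwargs` and `build_estimator`, the default instance type, and the default instance count are assumptions made for this example and are not part of the SDK. Only `image_uri`, `role`, `instance_type`, `instance_count`, and the special `"local"` instance type come from `sagemaker.estimator.Estimator` itself.

```python
from typing import Any, Dict


def estimator_kwargs(
    mode: str,
    image_uri: str,
    role: str,
    instance_type: str = "ml.m5.2xlarge",  # assumed default, pick your own
    instance_count: int = 2,               # assumed default, pick your own
) -> Dict[str, Any]:
    """Build keyword arguments for sagemaker.estimator.Estimator.

    Hypothetical helper: 'local' selects SDK local mode, 'distributed'
    selects a multi-instance training job.
    """
    if mode == "local":
        # instance_type="local" asks the SDK to run the training container
        # on the current machine instead of launching managed instances.
        return {
            "image_uri": image_uri,
            "role": role,
            "instance_type": "local",
            "instance_count": 1,
        }
    if mode == "distributed":
        if instance_count < 2:
            raise ValueError("distributed mode needs instance_count >= 2")
        return {
            "image_uri": image_uri,
            "role": role,
            "instance_type": instance_type,
            "instance_count": instance_count,
        }
    raise ValueError(f"unknown mode: {mode!r}")


def build_estimator(mode: str, image_uri: str, role: str, **overrides: Any):
    """Create the Estimator; requires the sagemaker package to be installed."""
    # Imported lazily so estimator_kwargs can be used without the SDK.
    from sagemaker.estimator import Estimator

    return Estimator(**estimator_kwargs(mode, image_uri, role, **overrides))
```

With a helper like this, the training template stays the same and only the `mode` argument changes. @remote has no equivalent switch today because it is limited to a single instance.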

Additional context
We are in the process of switching from SageMaker notebook instances to SageMaker Studio. SageMaker Studio does not support local mode at this time, so we are using @remote for testing in place of local mode. However, to train on large datasets we use distributed training. In the SageMaker notebook instance environment, sagemaker.estimator.Estimator let us switch easily between local and multi-instance training. In Studio, the lack of an SDK function offering a comparable local/distributed training option is causing a lot of template rework. Enhancing @remote to train on multiple instances would address this concern.
