Skip to content

NVIDIA Resiliency Extension v0.1.3

Compare
Choose a tag to compare
@srogawski-nvidia srogawski-nvidia released this 15 Oct 02:47
· 16 commits to main since this release

Release Notes

NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures and interruptions.

NVIDIA Resiliency Extension v0.1.3

Highlights

We are excited to announce the first release of NVIDIA Resiliency Extension v0.1.3!

Straggler Detection API provides tools for user to mark the section of code and configure the threshold to detect slow running GPU.

Fault Tolerance API provides the rank monitor server and client, and modified torchrun launcher based on TorchElastic to automatically detect hang and ability to in-job restart the training.

Contributors

@jbieniusiewi @srogawski-nvidia @yzhautouskay