NVIDIA Resiliency Extension v0.1.3
Release Notes
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures and interruptions.
NVIDIA Resiliency Extension v0.1.3
Highlights
We are excited to announce the first release of NVIDIA Resiliency Extension v0.1.3!
Straggler Detection API provides tools for user to mark the section of code and configure the threshold to detect slow running GPU.
Fault Tolerance API provides the rank monitor server and client, and modified torchrun launcher based on TorchElastic to automatically detect hang and ability to in-job restart the training.