commit d815452
Showing 138 changed files with 25,251 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
build | ||
dist | ||
*.egg-info | ||
__pycache__ | ||
cupti_module.*.so |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
default_language_version: | ||
python: python3 | ||
|
||
repos: | ||
|
||
- repo: https://github.com/PyCQA/isort | ||
rev: 5.13.2 | ||
hooks: | ||
- id: isort | ||
exclude: docs/ | ||
|
||
- repo: https://github.com/psf/black-pre-commit-mirror | ||
rev: 24.10.0 | ||
hooks: | ||
- id: black | ||
language_version: python3.10 | ||
|
||
- repo: https://github.com/astral-sh/ruff-pre-commit | ||
rev: v0.6.9 | ||
hooks: | ||
- id: ruff |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,124 @@ | ||
|
||
## Nvidia Resiliency Extension (NVRx) OSS Contribution Rules | ||
|
||
#### Issue Tracking | ||
|
||
* All enhancement, bugfix, or change requests must begin with the creation of an [NVRx Issue Request](TBD). | ||
* The issue request must be reviewed by NVRx engineers and approved prior to code review. | ||
|
||
|
||
#### Coding Guidelines | ||
|
||
- All source code contributions must follow the existing conventions in the relevant file, submodule, module, and project when you add new code or when you extend/fix existing functionality. | ||
|
||
- Avoid introducing unnecessary complexity into existing code so that maintainability and readability are preserved. | ||
|
||
- Try to keep pull requests (PRs) as concise as possible: | ||
- Avoid committing commented-out code. | ||
- Wherever possible, each PR should address a single concern. If there are several otherwise-unrelated things that should be fixed to reach a desired endpoint, our recommendation is to open several PRs and indicate the dependencies in the description. The more complex the changes are in a single PR, the more time it will take to review those changes. | ||
|
||
- To ensure code consistency and maintainability across the project, please format and lint your code using the following tools before committing any changes: | ||
- We use black to automatically format Python code. It enforces a consistent style by reformatting code according to a set of rules. | ||
- To format your code, run: | ||
``` | ||
black . | ||
``` | ||
- isort is used to sort and format import statements automatically. Ensure that your imports are ordered correctly by running: | ||
``` | ||
isort . | ||
``` | ||
- ruff is a fast Python linter that helps catch common issues. Please run ruff to check for and fix linting problems: | ||
``` | ||
ruff check . | ||
``` | ||
|
||
- Write commit titles in the imperative mood, following [these rules](https://chris.beams.io/posts/git-commit/), and reference the Issue number corresponding to the PR. The recommended format for commit messages is: | ||
``` | ||
#<Issue Number> - <Commit Title> | ||
<Commit Body> | ||
``` | ||
|
||
- Ensure that the build log is clean, meaning no warnings or errors should be present. | ||
|
||
- Ensure that all unit tests pass prior to submitting your code. | ||
|
||
- All OSS components must contain accompanying documentation (READMEs) describing the functionality, dependencies, and known issues. | ||
|
||
- See `README.md` for existing samples and plugins for reference. | ||
|
||
- All OSS components must have an accompanying test. | ||
|
||
- If introducing a new component, such as a plugin, provide a test sample to verify the functionality. | ||
|
||
- Make sure that you can contribute your work to open source (no license and/or patent conflict is introduced by your code). You will need to [`sign`](#signing-your-work) your commit. | ||
|
||
- Thanks in advance for your patience as we review your contributions; we do appreciate them! | ||
|
||
|
||
#### Pull Requests | ||
Developer workflow for code contributions is as follows: | ||
|
||
1. Developers must first [fork](https://help.github.com/en/articles/fork-a-repo) the [upstream](TBD) NVRx OSS repository. | ||
|
||
2. Git clone the forked repository and push changes to the personal fork. | ||
|
||
```bash | ||
git clone https://github.com/YOUR_USERNAME/YOUR_FORK.git NVRx | ||
# Checkout the targeted branch and commit changes | ||
# Push the commits to a branch on the fork (remote). | ||
git push -u origin <local-branch>:<remote-branch> | ||
``` | ||
|
||
3. Once the code changes are staged on the fork and ready for review, a [Pull Request](https://help.github.com/en/articles/about-pull-requests) (PR) can be [requested](https://help.github.com/en/articles/creating-a-pull-request) to merge the changes from a branch of the fork into a selected branch of upstream. | ||
* Exercise caution when selecting the source and target branches for the PR. | ||
Note that versioned releases of NVRx OSS are posted to `release/` branches of the upstream repo. | ||
* Creating a PR kicks off the code review process. | ||
* At least one NVRx engineer will be assigned to the review. | ||
* While under review, mark your PRs as work-in-progress by prefixing the PR title with [WIP]. | ||
|
||
4. Since there is no CI/CD process in place yet, the PR will be accepted and the corresponding issue closed only after adequate testing has been completed, manually, by the developer and/or NVRx engineer reviewing the code. | ||
|
||
|
||
#### Signing Your Work | ||
|
||
* We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license. | ||
|
||
* Any contribution which contains commits that are not Signed-Off will not be accepted. | ||
|
||
* To sign off on a commit you simply use the `--signoff` (or `-s`) option when committing your changes: | ||
```bash | ||
$ git commit -s -m "Add cool feature." | ||
``` | ||
This will append the following to your commit message: | ||
``` | ||
Signed-off-by: Your Name <your@email.com> | ||
``` | ||
|
||
* Full text of the DCO: | ||
|
||
``` | ||
Developer Certificate of Origin | ||
Version 1.1 | ||
Copyright (C) 2004, 2006 The Linux Foundation and its contributors. | ||
1 Letterman Drive | ||
Suite D4700 | ||
San Francisco, CA, 94129 | ||
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. | ||
``` | ||
|
||
``` | ||
Developer's Certificate of Origin 1.1 | ||
By making a contribution to this project, I certify that: | ||
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or | ||
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or | ||
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. | ||
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved. | ||
``` |
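
The two commit conventions in these rules, the `#<Issue Number> - <Commit Title>` title format and the `Signed-off-by` trailer, can be checked mechanically before pushing. The following is a minimal sketch and not part of the repository: the function name `check_commit_message` and both regular expressions are illustrative assumptions about how strictly the format should be read.

```python
import re
from typing import List

# Assumed reading of the recommended title format: "#<Issue Number> - <Commit Title>".
TITLE_RE = re.compile(r"^#(\d+) - (\S.*)$")
# Trailer appended by `git commit -s`: "Signed-off-by: Name <email>".
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def check_commit_message(message: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the message passes."""
    problems = []
    lines = message.strip().splitlines()
    title = lines[0] if lines else ""
    if not TITLE_RE.match(title):
        problems.append("title must look like '#<Issue Number> - <Commit Title>'")
    if not SIGNOFF_RE.search(message):
        problems.append("missing 'Signed-off-by: Name <email>' trailer (use 'git commit -s')")
    return problems
```

A check like this could be wired into a local `commit-msg` hook, although the rules above do not require one.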
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# The purpose of this image is to build "nvidia_resiliency_ext" wheels using different Python versions. | ||
# Python 3.10, 3.11 and 3.12 are installed. | ||
# The base image is CUDA because the Straggler Detection package uses CUPTI. | ||
# A wheel for Python 3.10 can be created with "python3.10 -m build --wheel", and likewise for the other versions. | ||
|
||
# Choose a base CUDA image from NVIDIA | ||
# nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04, nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04 etc. | ||
ARG BASE_CUDA_IMG=nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 | ||
FROM ${BASE_CUDA_IMG} | ||
|
||
# Set environment variables to non-interactive to avoid prompts during package installation | ||
ENV DEBIAN_FRONTEND=noninteractive | ||
|
||
# Repo with Pythons | ||
RUN apt update && apt install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa | ||
|
||
# Install common dependencies | ||
RUN apt-get update && apt-get install -y \ | ||
python3.10 python3.10-dev python3.10-distutils \ | ||
python3.11 python3.11-dev python3.11-distutils \ | ||
python3.12 python3.12-dev python3.12-distutils \ | ||
wget curl build-essential gcc-10 g++-10\ | ||
&& rm -rf /var/lib/apt/lists/* | ||
|
||
# Install pip for each Python version | ||
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10 && \ | ||
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11 && \ | ||
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12 | ||
|
||
# Install deps, | ||
# FIXME: for some reason six needs to be manually updated | ||
# otherwise wheel building fails with: ModuleNotFoundError: No module named 'six' | ||
RUN python3.10 -m pip install build poetry && \ | ||
python3.11 -m pip install build poetry && \ | ||
python3.12 -m pip install build poetry && \ | ||
python3.10 -m pip install -U six && \ | ||
python3.11 -m pip install -U six && \ | ||
python3.12 -m pip install -U six | ||
|
||
# Set the working directory | ||
WORKDIR /workspace | ||
ENTRYPOINT ["/bin/bash", "-c"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
# Nvidia Resiliency Extension | ||
|
||
This project combines multiple resiliency-related solutions. | ||
- Fault Tolerance package | ||
- Straggler Detection package | ||
- PyTorch Lightning callbacks | ||
|
||
|
||
## Installation: | ||
|
||
### From sources | ||
- `git clone --recursive <this repo URL>` | ||
- `cd <repo>` | ||
- `pip install .` | ||
|
||
Requirements: | ||
- Python >= 3.10 | ||
- gcc >= 8.0 | ||
- CUDA >= 11.8 | ||
|
||
## Fault Tolerance integration guide | ||
|
||
This section describes Fault Tolerance callback integration with a PTL-based workload (e.g. NeMo). | ||
|
||
Let's define some terms used in this section: | ||
- `PTL` is PyTorch Lightning | ||
- `Fault Tolerance`, `FT` is the `fault_tolerance` package, included in `nvidia_resiliency_ext`. | ||
- `FT callback`, `FaultToleranceCallback` is a PTL callback defined in `ptl_resiliency` package, included in `nvidia_resiliency_ext`. | ||
- `ft_launcher` is a launcher tool included in the FT, which is based on `torchrun`. | ||
- `heartbeat` is a lightweight message sent from a rank to its rank monitor that indicates that a rank is alive. | ||
- `rank monitor` is a special side-process started by `ft_launcher` that monitors heartbeats from its rank. | ||
- `timeouts` are time intervals used by a rank monitor to detect that a rank is not alive. | ||
There are 2 separate timeouts: for the initial heartbeat and the subsequent heartbeats. | ||
- `launcher script` is a bash script that invokes `ft_launcher`. | ||
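
To make the heartbeat and timeout terms concrete, the sketch below shows one way a rank monitor could apply the two separate timeouts. It is an illustration only and not the `fault_tolerance` implementation: the class name `HeartbeatWatch`, its methods, and the explicit `now` argument (used so the logic can be exercised without sleeping) are all assumptions.

```python
import time
from typing import Optional


class HeartbeatWatch:
    """Tracks heartbeats from one rank and applies two separate timeouts."""

    def __init__(self, initial_timeout: float, subsequent_timeout: float,
                 now: Optional[float] = None):
        self.initial_timeout = initial_timeout        # applies until the first heartbeat arrives
        self.subsequent_timeout = subsequent_timeout  # applies between later heartbeats
        self.started_at = time.monotonic() if now is None else now
        self.last_heartbeat: Optional[float] = None

    def record_heartbeat(self, now: Optional[float] = None) -> None:
        """Remember when the most recent heartbeat was received."""
        self.last_heartbeat = time.monotonic() if now is None else now

    def is_alive(self, now: Optional[float] = None) -> bool:
        """Return False once the applicable timeout has elapsed."""
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is None:
            # No heartbeat seen yet: measure against the initial timeout.
            return now - self.started_at <= self.initial_timeout
        return now - self.last_heartbeat <= self.subsequent_timeout
```

Keeping a separate, usually larger, initial timeout leaves room for slow startup work before the first heartbeat without loosening detection during steady-state training.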
|
||
### 0. Use `ft_launcher` to start the workload | ||
|
||
`ft_launcher` is similar to `torchrun` but it starts a rank monitor for each started rank. | ||
`ft_launcher` takes the FT configuration in a YAML file (`--fault-tol-cfg-path`) or via CLI args (`--ft-param-...`). | ||
FT configuration items are described in the `FaultToleranceConfig` docstring. | ||
|
||
### 1. Add FT callback to the trainer | ||
|
||
Add FT callback to PTL callbacks. | ||
|
||
``` | ||
fault_tol_cb = FaultToleranceCallback( | ||
autoresume=True, | ||
calculate_timeouts=True, | ||
logger_name="test_logger", | ||
exp_dir=tmp_path, | ||
) | ||
trainer = pl.Trainer( | ||
... | ||
callbacks=[..., fault_tol_cb], | ||
) | ||
``` | ||
|
||
|
||
Core FT callback functionality is: | ||
- Establishing a connection with a rank monitor | ||
- Sending heartbeats during training and evaluation steps | ||
- Disconnecting from a rank monitor | ||
|
||
Optionally, it can also: | ||
- Compute timeouts that will be used instead of timeouts defined in the FT config | ||
- Create a flag file when the training is completed | ||
|
||
FT callback initialization params: | ||
``` | ||
def __init__( | ||
self, | ||
autoresume: bool, | ||
calculate_timeouts: bool, | ||
simulated_fault_params: Optional[Any] = None, | ||
exp_dir: Union[str, pathlib.Path, None] = None, | ||
logger_name: Optional[str] = "nemo_logger.FaultToleranceCallback", | ||
): | ||
""" | ||
Initialize callback instance. | ||
This is a lightweight initialization. Most of the initialization is conducted in the 'setup' hook. | ||
Args: | ||
autoresume (bool): Set to `True` if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run). | ||
calculate_timeouts (bool): Set to `True` if FT timeouts should be calculated based on observed heartbeat intervals. | ||
Calculated timeouts overwrite the timeouts from the FT config. | ||
Timeouts are computed at the end of a training job, if there was checkpoint loading and saving. | ||
For example, for training started from scratch, the timeouts are computed at the end of the second job. | ||
simulated_fault_params (Optional[Any], optional): Simulated fault spec. It's for debugging only. Defaults to None. | ||
exp_dir (Union[str, pathlib.Path, None], optional): Directory where the FT state should be saved. | ||
Must be available for all training jobs. NOTE: Beware that PTL/NeMo can move files written directly to `trainer.log_dir`. | ||
Defaults to None, in which case it defaults to `trainer.log_dir/ft_state/`. | ||
logger_name (Optional[str], optional): Logger name to be used. | ||
Defaults to "nemo_logger.FaultToleranceCallback". | ||
""" | ||
``` | ||
|
||
### 2. Implementing auto-resume | ||
|
||
Auto-resume is a feature that simplifies running a training that consists of multiple consecutive training jobs. | ||
|
||
NOTE: Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the `FaultToleranceCallback`. | ||
|
||
`FaultToleranceCallback` exposes an "interface" that allows implementing an auto-resume launcher script. | ||
Specifically, if `autoresume=True` the FT callback creates a special marker file when a training is completed. | ||
The marker file location is expected to be set in the `FAULT_TOL_FINISHED_FLAG_FILE` environment variable. | ||
|
||
The following mechanism can be used to implement an auto-resuming launcher script: | ||
- Launcher script starts ranks with `ft_launcher` | ||
- `FAULT_TOL_FINISHED_FLAG_FILE` should be passed to rank processes | ||
- When `ft_launcher` exits, the launcher script checks whether the `FAULT_TOL_FINISHED_FLAG_FILE` file was created. | ||
- If `FAULT_TOL_FINISHED_FLAG_FILE` exists, the auto-resume loop can be broken, as the training is completed. | ||
- If `FAULT_TOL_FINISHED_FLAG_FILE` does not exist, the continuation job can be issued | ||
(other conditions can be checked first, e.g. that the maximum number of failures has not been reached). | ||
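
The mechanism above can be sketched as a small loop. The text describes the launcher script as a bash script; the same logic is shown here in Python for illustration. The function name `run_with_autoresume`, the `max_jobs` limit, and the removal of a stale flag file before the first job are assumptions, and the actual `ft_launcher` command line is deliberately left to the caller.

```python
import os
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def run_with_autoresume(launch_cmd: Sequence[str], flag_file: Path, max_jobs: int = 5) -> bool:
    """Re-run launch_cmd until the finished-flag file appears or max_jobs is reached.

    Returns True if the flag file was created, i.e. the training completed.
    """
    env = dict(os.environ)
    # Pass the flag file location to the rank processes via the environment.
    env["FAULT_TOL_FINISHED_FLAG_FILE"] = str(flag_file)
    # Assumption: a flag left over from an earlier training should not count.
    flag_file.unlink(missing_ok=True)
    for job_idx in range(max_jobs):
        result = subprocess.run(list(launch_cmd), env=env)
        if flag_file.exists():
            return True  # training completed, break the auto-resume loop
        print(
            f"job {job_idx} exited with code {result.returncode}; "
            "flag file missing, issuing a continuation job",
            file=sys.stderr,
        )
    return False
```

In a real setup `launch_cmd` would be the `ft_launcher` invocation for the workload, and `flag_file` must live on storage visible to both the launcher script and the ranks.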
|
||
## Straggler Detection integration guide | ||
|
||
### Include `ptl_resiliency.StragglerDetectionCallback` in the PTL trainer's callbacks. | ||
|
||
``` | ||
straggler_cb_args = dict( | ||
report_time_interval=300.0, | ||
calc_relative_gpu_perf=True, | ||
calc_individual_gpu_perf=True, | ||
num_gpu_perf_scores_to_log=3, | ||
gpu_relative_perf_threshold=0.7, | ||
gpu_individual_perf_threshold=0.7, | ||
stop_if_detected=False, | ||
logger_name="test_logger", | ||
) | ||
straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args) | ||
trainer = pl.Trainer( | ||
... | ||
callbacks=[..., straggler_det_cb], | ||
) | ||
``` | ||
|
||
`StragglerDetectionCallback` initialization params: | ||
|
||
``` | ||
def __init__( | ||
self, | ||
report_time_interval: float, | ||
calc_relative_gpu_perf: bool, | ||
calc_individual_gpu_perf: bool, | ||
num_gpu_perf_scores_to_log: int, | ||
gpu_relative_perf_threshold: float, | ||
gpu_individual_perf_threshold: float, | ||
stop_if_detected: bool, | ||
logger_name: Optional[str] = "nemo_logger.StragglerDetectionCallback", | ||
): | ||
""" | ||
Initialize straggler detection callback instance. | ||
Args: | ||
report_time_interval (float): Interval [seconds] of the straggler check | ||
calc_relative_gpu_perf (bool): Calculate relative GPU performance | ||
calc_individual_gpu_perf (bool): Calculate individual GPU performance | ||
num_gpu_perf_scores_to_log (int): How many best and worst scores to log (0 - does not log periodically, but only if stragglers are detected) | ||
gpu_relative_perf_threshold (float): Threshold for relative GPU performance scores | ||
gpu_individual_perf_threshold (float): Threshold for individual GPU performance scores | ||
stop_if_detected (bool): Set to True, to terminate the workload if stragglers are detected | ||
logger_name (Optional[str], optional): Defaults to "nemo_logger.StragglerDetectionCallback". | ||
Raises: | ||
ValueError: If invalid config was provided. | ||
""" | ||
``` | ||
|
||
More info on straggler detection can be found in the straggler package's README. |
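
This README does not define how the performance scores are computed, so the following is only a sketch of how a relative score and a threshold such as `gpu_relative_perf_threshold=0.7` could interact. It assumes, hypothetically, that a rank's relative score is the fastest rank's step time divided by that rank's step time; the function names and the input format are illustrative and are not the straggler package's API.

```python
from typing import Dict, List


def relative_perf_scores(step_times: Dict[int, float]) -> Dict[int, float]:
    """Map rank -> score in (0, 1], where 1.0 is the fastest rank (assumed definition)."""
    fastest = min(step_times.values())
    return {rank: fastest / t for rank, t in step_times.items()}


def find_stragglers(scores: Dict[int, float], threshold: float) -> List[int]:
    """Return the ranks whose score falls below the threshold, in ascending order."""
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Hypothetical per-rank step times in seconds.
step_times = {0: 1.0, 1: 1.1, 2: 2.0}
scores = relative_perf_scores(step_times)
stragglers = find_stragglers(scores, threshold=0.7)
```

With these made-up timings, rank 2 takes twice as long as the fastest rank, scores 0.5, and is the only rank below the 0.7 threshold.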