Initial commit
jbieniusiewi committed Oct 14, 2024
0 parents commit d815452
Showing 138 changed files with 25,251 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
build
dist
*.egg-info
__pycache__
cupti_module.*.so
21 changes: 21 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,21 @@
default_language_version:
  python: python3

repos:

  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
        exclude: docs/

  - repo: https://github.com/psf/black-pre-commit-mirror
    rev: 24.10.0
    hooks:
      - id: black
        language_version: python3.10

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
124 changes: 124 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,124 @@

## Nvidia Resiliency Extension (NVRx) OSS Contribution Rules

#### Issue Tracking

* All enhancement, bugfix, or change requests must begin with the creation of an [NVRx Issue Request](TBD).
* The issue request must be reviewed by NVRx engineers and approved prior to code review.


#### Coding Guidelines

- All source code contributions must follow the existing conventions in the relevant file, submodule, module, and project when you add new code or when you extend/fix existing functionality.

- Avoid introducing unnecessary complexity into existing code so that maintainability and readability are preserved.

- Try to keep pull requests (PRs) as concise as possible:
  - Avoid committing commented-out code.
  - Wherever possible, each PR should address a single concern. If there are several otherwise-unrelated things that should be fixed to reach a desired endpoint, our recommendation is to open several PRs and indicate the dependencies in the description. The more complex the changes are in a single PR, the more time it will take to review those changes.

- To ensure code consistency and maintainability across the project, please format and lint your code using the following tools before committing any changes:
  - We use black to automatically format Python code. It enforces a consistent style by reformatting code according to a set of rules.
    To format your code, run:
    ```
    black .
    ```
  - isort is used to sort and format import statements automatically. Ensure that your imports are ordered correctly by running:
    ```
    isort .
    ```
  - ruff is a fast Python linter that helps catch common issues. Please run ruff to check for linting problems (add `--fix` to apply the automatic fixes it offers):
    ```
    ruff check .
    ```

- Write commit titles using imperative mood and [these rules](https://chris.beams.io/posts/git-commit/), and reference the Issue number corresponding to the PR. The recommended format for commit messages is:
  ```
  #<Issue Number> - <Commit Title>
  <Commit Body>
  ```

- Ensure that the build log is clean, meaning no warnings or errors should be present.

- Ensure that all unit tests pass prior to submitting your code.

- All OSS components must contain accompanying documentation (READMEs) describing the functionality, dependencies, and known issues.

- See `README.md` for existing samples and plugins for reference.

- All OSS components must have an accompanying test.

- If introducing a new component, such as a plugin, provide a test sample to verify the functionality.

- Make sure that you can contribute your work to open source (no license and/or patent conflict is introduced by your code). You will need to [`sign`](#signing-your-work) your commit.

- Thanks in advance for your patience as we review your contributions; we do appreciate them!
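
The commit title template in the guidelines above (`#<Issue Number> - <Commit Title>`) can be checked mechanically, for example in a local hook. The following is a minimal sketch, not a tool shipped with this repository: the function name `parse_commit_title` and the exact regular expression are assumptions made for illustration, since the guidelines only give the template.

```python
import re

# Assumed pattern for "#<Issue Number> - <Commit Title>". The guidelines only
# give the template, so this regex is an illustration, not an official rule.
TITLE_RE = re.compile(r"^#(?P<issue>\d+) - (?P<title>\S.*)$")


def parse_commit_title(message: str):
    """Return (issue_number, title) from the first line, or None if it does not match."""
    first_line = message.splitlines()[0] if message else ""
    match = TITLE_RE.match(first_line)
    if match is None:
        return None
    return int(match.group("issue")), match.group("title")
```

A hook could reject the commit when `parse_commit_title` returns `None` and otherwise use the issue number to link the commit to its issue.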


#### Pull Requests
Developer workflow for code contributions is as follows:

1. Developers must first [fork](https://help.github.com/en/articles/fork-a-repo) the [upstream](TBD) NVRx OSS repository.

2. Git clone the forked repository and push changes to the personal fork.

```bash
git clone https://github.com/YOUR_USERNAME/YOUR_FORK.git NVRx
# Checkout the targeted branch and commit changes
# Push the commits to a branch on the fork (remote).
git push -u origin <local-branch>:<remote-branch>
```

3. Once the code changes are staged on the fork and ready for review, a [Pull Request](https://help.github.com/en/articles/about-pull-requests) (PR) can be [requested](https://help.github.com/en/articles/creating-a-pull-request) to merge the changes from a branch of the fork into a selected branch of upstream.
    * Exercise caution when selecting the source and target branches for the PR.
      Note that versioned releases of NVRx OSS are posted to `release/` branches of the upstream repo.
    * Creation of a PR kicks off the code review process.
    * At least one NVRx engineer will be assigned for the review.
    * While under review, mark your PRs as work-in-progress by prefixing the PR title with [WIP].

4. Since there is no CI/CD process in place yet, the PR will be accepted and the corresponding issue closed only after adequate testing has been completed, manually, by the developer and/or NVRx engineer reviewing the code.


#### Signing Your Work

* We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.

* Any contribution which contains commits that are not Signed-Off will not be accepted.

* To sign off on a commit you simply use the `--signoff` (or `-s`) option when committing your changes:
```bash
$ git commit -s -m "Add cool feature."
```
This will append the following to your commit message:
```
Signed-off-by: Your Name <your.name@example.com>
```

* Full text of the DCO:

```
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
```

```
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
```
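
Because contributions without a sign-off are not accepted, it can help to check commit messages before opening a PR. The sketch below is only an illustration: `has_signoff` is a hypothetical helper, and the regular expression is an assumed, simplified form of the `Signed-off-by:` trailer shown above.

```python
import re

# Simplified, assumed shape of the trailer: "Signed-off-by: Name <user@host>".
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$")


def has_signoff(commit_message: str) -> bool:
    """Return True if any line of the commit message is a Signed-off-by trailer."""
    return any(SIGNOFF_RE.match(line.strip()) for line in commit_message.splitlines())
```

The message text for each commit can be obtained with `git log --format=%B` and passed to this function.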
43 changes: 43 additions & 0 deletions Dockerfile.builder
@@ -0,0 +1,43 @@
# The purpose of this image is to build "nvidia_resiliency_ext" wheels using different Python versions.
# Python 3.10, 3.11 and 3.12 are installed.
# Base image is CUDA, as Straggler Detection package uses CUPTI.
# Wheel for Python3.10 can be created with "python3.10 -m build --wheel" etc.

# Choose a base CUDA image from NVIDIA
# nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04, nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04 etc.
ARG BASE_CUDA_IMG=nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
FROM ${BASE_CUDA_IMG}

# Set environment variables to non-interactive to avoid prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive

# Repo with Pythons
RUN apt update && apt install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa

# Install common dependencies
RUN apt-get update && apt-get install -y \
python3.10 python3.10-dev python3.10-distutils \
python3.11 python3.11-dev python3.11-distutils \
python3.12 python3.12-dev python3.12-distutils \
wget curl build-essential gcc-10 g++-10\
&& rm -rf /var/lib/apt/lists/*

# Install pip for each Python version
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10 && \
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11 && \
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12

# Install deps,
# FIXME: for some reason six needs to be manually updated
# otherwise wheel building fails with: ModuleNotFoundError: No module named 'six'
RUN python3.10 -m pip install build poetry && \
python3.11 -m pip install build poetry && \
python3.12 -m pip install build poetry && \
python3.10 -m pip install -U six && \
python3.11 -m pip install -U six && \
python3.12 -m pip install -U six

# Set the working directory
WORKDIR /workspace
ENTRYPOINT ["/bin/bash", "-c"]
14 changes: 14 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
174 changes: 174 additions & 0 deletions README.md
@@ -0,0 +1,174 @@
# Nvidia Resiliency Extension

This project combines multiple resiliency-related solutions.
- Fault Tolerance package
- Straggler Detection package
- PyTorch Lightning callbacks


## Installation

### From sources
- `git clone --recursive <this repo URL>`
- `cd <repo>`
- `pip install .`

Requirements:
- Python >= 3.10
- gcc >= 8.0
- CUDA >= 11.8

## Fault Tolerance integration guide

This section describes Fault Tolerance callback integration with a PTL-based workload (e.g. NeMo).

Let's define some terms used in this section:
- `PTL` is PyTorch Lightning
- `Fault Tolerance`, `FT` is the `fault_tolerance` package, included in `nvidia_resiliency_ext`.
- `FT callback`, `FaultToleranceCallback` is a PTL callback defined in `ptl_resiliency` package, included in `nvidia_resiliency_ext`.
- `ft_launcher` is a launcher tool included in the FT, which is based on `torchrun`.
- `heartbeat` is a lightweight message sent from a rank to its rank monitor that indicates that a rank is alive.
- `rank monitor` is a special side-process started by `ft_launcher` that monitors heartbeats from its rank.
- `timeouts` are time intervals used by a rank monitor to detect that a rank is not alive.
There are 2 separate timeouts: for the initial heartbeat and the subsequent heartbeats.
- `launcher script` is a bash script that invokes `ft_launcher`.
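
To make the terms above concrete, here is a minimal sketch of how a rank monitor could apply the two timeouts. It is not the package's implementation: the class `RankMonitorSketch` and its field names are hypothetical, and time is passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankMonitorSketch:
    """Hypothetical model of a rank monitor; not the real FT implementation."""

    initial_timeout: float      # seconds allowed before the first heartbeat
    subsequent_timeout: float   # seconds allowed between later heartbeats
    started_at: float = 0.0
    last_heartbeat_at: Optional[float] = None

    def on_heartbeat(self, now: float) -> None:
        # Record the arrival time of a heartbeat sent by the monitored rank.
        self.last_heartbeat_at = now

    def is_rank_alive(self, now: float) -> bool:
        # Before any heartbeat arrives, only the initial timeout applies.
        if self.last_heartbeat_at is None:
            return now - self.started_at <= self.initial_timeout
        # Afterwards, the gap since the last heartbeat is checked.
        return now - self.last_heartbeat_at <= self.subsequent_timeout
```

Keeping two separate timeouts lets the initial one be generous enough to cover startup work before the first heartbeat, while the subsequent one stays tight.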

### 0. Use `ft_launcher` to start the workload

`ft_launcher` is similar to `torchrun` but it starts a rank monitor for each started rank.
`ft_launcher` takes the FT configuration in a YAML file (`--fault-tol-cfg-path`) or via CLI args (`--ft-param-...`).
FT configuration items are described in the `FaultToleranceConfig` docstring.

### 1. Add FT callback to the trainer

Add the FT callback to the PTL trainer callbacks.

```
fault_tol_cb = FaultToleranceCallback(
    autoresume=True,
    calculate_timeouts=True,
    logger_name="test_logger",
    exp_dir=tmp_path,
)
trainer = pl.Trainer(
    ...
    callbacks=[..., fault_tol_cb],
)
```


Core FT callback functionality is:
- Establishing a connection with a rank monitor
- Sending heartbeats during training and evaluation steps
- Disconnecting from a rank monitor

Optionally, it can also:
- Compute timeouts that will be used instead of timeouts defined in the FT config
- Create a flag file when the training is completed

FT callback initialization params:
```
def __init__(
    self,
    autoresume: bool,
    calculate_timeouts: bool,
    simulated_fault_params: Optional[Any] = None,
    exp_dir: Union[str, pathlib.Path, None] = None,
    logger_name: Optional[str] = "nemo_logger.FaultToleranceCallback",
):
    """
    Initialize callback instance.
    This is a lightweight initialization. Most of the initialization is conducted in the 'setup' hook.
    Args:
        autoresume (bool): Set to `True` if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).
        calculate_timeouts (bool): Set to `True` if FT timeouts should be calculated based on observed heartbeat intervals.
            Calculated timeouts overwrite the timeouts from the FT config.
            Timeouts are computed at the end of a training job, if there was checkpoint loading and saving.
            For example, for training started from scratch, the timeouts are computed at the end of the second job.
        simulated_fault_params (Optional[Any], optional): Simulated fault spec. It's for debugging only. Defaults to None.
        exp_dir (Union[str, pathlib.Path, None], optional): Directory where the FT state should be saved.
            Must be available for all training jobs. NOTE: Beware that PTL/NeMo can move files written directly to `trainer.log_dir`.
            Defaults to None, in which case it defaults to `trainer.log_dir/ft_state/`.
        logger_name (Optional[str], optional): Logger name to be used.
            Defaults to "nemo_logger.FaultToleranceCallback".
    """
```
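
The `calculate_timeouts` option derives timeouts from observed heartbeat intervals. The exact formula is not given here, so the following is only a sketch of one plausible approach, assuming the timeout is the largest observed gap scaled by a safety margin; `estimate_timeout` and `safety_factor` are hypothetical names.

```python
from typing import Sequence


def estimate_timeout(heartbeat_times: Sequence[float], safety_factor: float = 2.0) -> float:
    """Estimate a timeout as the largest observed heartbeat gap times a safety factor.

    Assumption for illustration only: the real FT package may use a different rule.
    """
    if len(heartbeat_times) < 2:
        raise ValueError("need at least two heartbeats to observe an interval")
    # Gaps between consecutive heartbeat timestamps (seconds).
    intervals = [later - earlier for earlier, later in zip(heartbeat_times, heartbeat_times[1:])]
    return safety_factor * max(intervals)
```

This also suggests why a job needs to include checkpoint loading and saving before timeouts are computed: those phases tend to produce the longest gaps between heartbeats, so an estimate made without them could be too tight.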

### 2. Implementing auto-resume

Auto-resume is a feature that simplifies running a training that consists of multiple consecutive training jobs.

NOTE: Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the `FaultToleranceCallback`.

`FaultToleranceCallback` exposes an "interface" that allows implementing an auto-resume launcher script.
Specifically, if `autoresume=True` the FT callback creates a special marker file when a training is completed.
The marker file location is expected to be set in the `FAULT_TOL_FINISHED_FLAG_FILE` environment variable.

The following mechanism can be used to implement an auto-resuming launcher script:
- The launcher script starts ranks with `ft_launcher`.
- `FAULT_TOL_FINISHED_FLAG_FILE` should be passed to the rank processes.
- When `ft_launcher` exits, the launcher script checks whether the `FAULT_TOL_FINISHED_FLAG_FILE` file was created.
  - If `FAULT_TOL_FINISHED_FLAG_FILE` exists, the auto-resume loop can be broken, as the training is completed.
  - If `FAULT_TOL_FINISHED_FLAG_FILE` does not exist, a continuation job can be issued
    (other conditions can be checked first, e.g. whether the maximum number of failures has been reached).
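
The mechanism above can be sketched as a small loop. The launcher script is described as a bash script; this Python version is only an illustration of the same control flow. The function `run_with_autoresume`, the `max_jobs` limit, and the injectable `run_job` parameter are assumptions introduced here, and `launch_cmd` stands for whatever `ft_launcher` command line the workload uses.

```python
import os
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def run_with_autoresume(
    launch_cmd: Sequence[str],
    flag_file: Path,
    max_jobs: int = 3,
    run_job: Callable[..., object] = subprocess.run,
) -> bool:
    """Re-issue the job until the finished-flag file appears or max_jobs is reached."""
    # Pass the flag file location to the rank processes through the environment.
    env = dict(os.environ, FAULT_TOL_FINISHED_FLAG_FILE=str(flag_file))
    flag_file.unlink(missing_ok=True)  # start without a stale flag from an earlier run
    for _ in range(max_jobs):
        run_job(list(launch_cmd), env=env)
        if flag_file.exists():
            return True  # the FT callback reported that training is completed
    return False  # gave up: the hypothetical job limit was reached
```

Making `run_job` a parameter keeps the loop testable without starting real processes; by default it calls `subprocess.run`.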

## Straggler Detection integration guide

### Include `ptl_resiliency.StragglerDetectionCallback` in the PTL trainer callbacks.

```
straggler_cb_args = dict(
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_log=3,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    logger_name="test_logger",
)
# Unpack the argument dict defined above into the callback constructor.
straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)
trainer = pl.Trainer(
    ...
    callbacks=[..., straggler_det_cb],
)
```

`StragglerDetectionCallback` initialization params:

```
def __init__(
    self,
    report_time_interval: float,
    calc_relative_gpu_perf: bool,
    calc_individual_gpu_perf: bool,
    num_gpu_perf_scores_to_log: int,
    gpu_relative_perf_threshold: float,
    gpu_individual_perf_threshold: float,
    stop_if_detected: bool,
    logger_name: Optional[str] = "nemo_logger.StragglerDetectionCallback",
):
    """
    Initialize straggler detection callback instance.
    Args:
        report_time_interval (float): Interval [seconds] of the straggler check
        calc_relative_gpu_perf (bool): Calculate relative GPU performance
        calc_individual_gpu_perf (bool): Calculate individual GPU performance
        num_gpu_perf_scores_to_log (int): How many best and worst scores to log (0 - does not log periodically, but only if stragglers are detected)
        gpu_relative_perf_threshold (float): Threshold for relative GPU performance scores
        gpu_individual_perf_threshold (float): Threshold for individual GPU performance scores
        stop_if_detected (bool): Set to True, to terminate the workload if stragglers are detected
        logger_name (Optional[str], optional): Defaults to "nemo_logger.StragglerDetectionCallback".
    Raises:
        ValueError: If invalid config was provided.
    """
```
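
As a rough illustration of how a performance threshold such as `gpu_relative_perf_threshold=0.7` might be applied, the sketch below flags ranks whose score falls below the threshold. It assumes that scores are normalized so that higher is better and that a score under the threshold marks a straggler; the scoring itself and the function `find_stragglers` are not part of the documented API.

```python
from typing import Dict, List


def find_stragglers(scores: Dict[int, float], threshold: float) -> List[int]:
    """Return ranks whose performance score is below the threshold, worst first.

    Assumption for illustration: higher scores mean better performance.
    """
    flagged = [rank for rank, score in scores.items() if score < threshold]
    return sorted(flagged, key=lambda rank: scores[rank])
```

With `stop_if_detected=True`, a non-empty result from a check like this would correspond to the case where the workload is terminated.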

More info on straggler detection can be found in the straggler package's README.