Initial commit

NVIDIA · Oct 14, 2024 · d815452
commit d815452

Showing 138 changed files with 25,251 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,5 @@
+build
+dist
+*.egg-info
+__pycache__
+cupti_module.*.so
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,21 @@
+default_language_version:
+  python: python3
+
+repos:
+
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.13.2
+    hooks:
+      - id: isort
+        exclude: docs/
+
+  - repo: https://github.com/psf/black-pre-commit-mirror
+    rev: 24.10.0
+    hooks:
+      - id: black
+        language_version: python3.10
+
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.6.9
+    hooks:
+      - id: ruff
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,124 @@
+
+## Nvidia Resiliency Extension (NVRx) OSS Contribution Rules
+
+#### Issue Tracking
+
+* All enhancement, bugfix, or change requests must begin with the creation of an [NVRx Issue Request](TBD).
+  * The issue request must be reviewed by NVRx engineers and approved prior to code review.
+
+
+#### Coding Guidelines
+
+- All source code contributions must follow the existing conventions in the relevant file, submodule, module, and project when you add new code or when you extend/fix existing functionality.
+
+- Avoid introducing unnecessary complexity into existing code so that maintainability and readability are preserved.
+
+- Try to keep pull requests (PRs) as concise as possible:
+  - Avoid committing commented-out code.
+  - Wherever possible, each PR should address a single concern. If there are several otherwise-unrelated things that should be fixed to reach a desired endpoint, our recommendation is to open several PRs and indicate the dependencies in the description. The more complex the changes are in a single PR, the more time it will take to review those changes.
+
+- To ensure code consistency and maintainability across the project, please format and lint your code using the following tools before committing any changes:
+  - We use black to automatically format Python code. It enforces a consistent style by reformatting code according to a set of rules.
+  - To format your code, run:
+```
+black .
+```
+  - isort is used to sort and format import statements automatically. Ensure that your imports are ordered correctly by running:
+```
+isort .
+```
+  - ruff is a fast Python linter that helps catch common issues. Please run ruff to check for and fix linting problems:
+```
+ruff check .
+```
+
+- Write commit titles using imperative mood and [these rules](https://chris.beams.io/posts/git-commit/), and reference the Issue number corresponding to the PR. Following is the recommended format for commit texts:
+```
+#<Issue Number> - <Commit Title>
+
+<Commit Body>
+```
+
+- Ensure that the build log is clean, meaning no warnings or errors should be present.
+
+- Ensure that all unit tests pass prior to submitting your code.
+
+- All OSS components must contain accompanying documentation (READMEs) describing the functionality, dependencies, and known issues.
+
+  - See `README.md` for existing samples and plugins for reference.
+
+- All OSS components must have an accompanying test.
+
+  - If introducing a new component, such as a plugin, provide a test sample to verify the functionality.
+
+- Make sure that you can contribute your work to open source (no license and/or patent conflict is introduced by your code). You will need to [`sign`](#signing-your-work) your commit.
+
+- Thanks in advance for your patience as we review your contributions; we do appreciate them!
+
+
+#### Pull Requests
+Developer workflow for code contributions is as follows:
+
+1. Developers must first [fork](https://help.github.com/en/articles/fork-a-repo) the [upstream](TBD) NVRx OSS repository.
+
+2. Git clone the forked repository and push changes to the personal fork.
+
+  ```bash
+git clone https://github.com/YOUR_USERNAME/YOUR_FORK.git NVRx 
+# Checkout the targeted branch and commit changes
+# Push the commits to a branch on the fork (remote).
+git push -u origin <local-branch>:<remote-branch>
+  ```
+
+3. Once the code changes are staged on the fork and ready for review, a [Pull Request](https://help.github.com/en/articles/about-pull-requests) (PR) can be [requested](https://help.github.com/en/articles/creating-a-pull-request) to merge the changes from a branch of the fork into a selected branch of upstream.
+  * Exercise caution when selecting the source and target branches for the PR.
+    Note that versioned releases of NVRx OSS are posted to `release/` branches of the upstream repo.
+  * Creation of a PR kicks off the code review process.
+  * At least one NVRx engineer will be assigned for the review.
+  * While under review, mark your PRs as work-in-progress by prefixing the PR title with [WIP].
+
+4. Since there is no CI/CD process in place yet, the PR will be accepted and the corresponding issue closed only after adequate testing has been completed, manually, by the developer and/or NVRx engineer reviewing the code.
+
+
+#### Signing Your Work
+
+* We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
+
+  * Any contribution which contains commits that are not Signed-Off will not be accepted.
+
+* To sign off on a commit you simply use the `--signoff` (or `-s`) option when committing your changes:
+  ```bash
+  $ git commit -s -m "Add cool feature."
+  ```
+  This will append the following to your commit message:
+  ```
+  Signed-off-by: Your Name <your@email.com>
+  ```
+
+* Full text of the DCO:
+
+  ```
+    Developer Certificate of Origin
+    Version 1.1
+    
+    Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
+    1 Letterman Drive
+    Suite D4700
+    San Francisco, CA, 94129
+    
+    Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
+  ```
+
+  ```
+    Developer's Certificate of Origin 1.1
+    
+    By making a contribution to this project, I certify that:
+    
+    (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
+    
+    (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
+    
+    (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
+    
+    (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
+  ```
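The commit rules in CONTRIBUTING.md above (title of the form `#<Issue Number> - <Commit Title>`, a blank line before the body, and a `Signed-off-by:` trailer) can be checked mechanically. The following is a minimal sketch only; `check_commit_message` and both regular expressions are hypothetical helpers written for illustration and are not part of the repository.

```python
import re
from typing import List

# Hypothetical patterns, derived from the format described in CONTRIBUTING.md.
TITLE_RE = re.compile(r"^#(\d+) - (\S.*)$")
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$")


def check_commit_message(message: str) -> List[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    problems = []
    lines = message.strip().splitlines()
    # The first line must carry the issue number and the title.
    if not lines or not TITLE_RE.match(lines[0]):
        problems.append("title must look like '#<Issue Number> - <Commit Title>'")
    # A body, if present, must be separated from the title by a blank line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("title must be followed by a blank line")
    # 'git commit -s' appends the sign-off trailer.
    if not any(SIGNOFF_RE.match(line.strip()) for line in lines):
        problems.append("missing 'Signed-off-by:' trailer (use 'git commit -s')")
    return problems
```

A function like this could be called from a local `commit-msg` hook; whether the project intends to enforce the format automatically is not stated in the commit.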
diff --git a/Dockerfile.builder b/Dockerfile.builder
@@ -0,0 +1,43 @@
+# The purpose of this image is to build "nvidia_resiliency_ext" wheels using different Python versions.
+# Python 3.10, 3.11 and 3.12 are installed.
+# Base image is CUDA, as Straggler Detection package uses CUPTI.
+# Wheel for Python3.10 can be created with "python3.10 -m build --wheel" etc.
+
+# Choose a base CUDA image from NVIDIA
+# nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04, nvidia/cuda:12.4.1-cudnn-devel-ubuntu20.04 etc.
+ARG BASE_CUDA_IMG=nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
+FROM ${BASE_CUDA_IMG}
+
+# Set environment variables to non-interactive to avoid prompts during package installation
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Repo with Pythons
+RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa
+
+# Install common dependencies
+RUN apt-get update && apt-get install -y \
+    python3.10 python3.10-dev python3.10-distutils \
+    python3.11 python3.11-dev python3.11-distutils \
+    python3.12 python3.12-dev python3.12-distutils \
+    wget curl build-essential gcc-10 g++-10 \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install pip for each Python version
+RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10 && \
+    curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11 && \
+    curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12
+
+# Install deps,
+# FIXME: for some reason six needs to be manually updated
+# otherwise wheel building fails with: ModuleNotFoundError: No module named 'six'
+RUN python3.10 -m pip install build poetry && \
+    python3.11 -m pip install build poetry && \
+    python3.12 -m pip install build poetry && \
+    python3.10 -m pip install -U six && \
+    python3.11 -m pip install -U six && \
+    python3.12 -m pip install -U six
+
+# Set the working directory
+WORKDIR /workspace
+
+ENTRYPOINT ["/bin/bash", "-c"]
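The header comment of Dockerfile.builder says a wheel can be created per interpreter with `python3.10 -m build --wheel` and so on. A small sketch of looping over the three installed interpreters is shown below. The helper names `wheel_build_commands` and `build_all` are hypothetical, and running in `/workspace` assumes the sources are mounted at the image's working directory.

```python
import subprocess
from typing import List, Sequence

# Interpreters installed by Dockerfile.builder.
PYTHON_VERSIONS = ("3.10", "3.11", "3.12")


def wheel_build_commands(versions: Sequence[str] = PYTHON_VERSIONS) -> List[List[str]]:
    """Return one 'pythonX.Y -m build --wheel' command per interpreter."""
    return [[f"python{v}", "-m", "build", "--wheel"] for v in versions]


def build_all(project_dir: str = "/workspace") -> None:
    """Run each build in project_dir, stopping at the first failure."""
    # Assumption: the project sources are available in project_dir.
    for cmd in wheel_build_commands():
        subprocess.run(cmd, cwd=project_dir, check=True)
```

Keeping command construction separate from execution makes the list easy to inspect or log before anything is run.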
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,14 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/README.md b/README.md
@@ -0,0 +1,174 @@
+# Nvidia Resiliency Extension
+
+This project combines multiple resiliency-related solutions.
+- Fault Tolerance package
+- Straggler Detection package
+- PyTorch Lightning callbacks
+
+
+## Installation:
+
+### From sources
+- `git clone --recursive <this repo URL>`
+- `cd <repo>`
+- `pip install .`
+
+Requirements:
+- Python >= 3.10
+- gcc >= 8.0
+- CUDA >= 11.8
+
+## Fault Tolerance integration guide
+
+This section describes Fault Tolerance callback integration with a PTL-based workload (e.g. NeMo).
+
+Let's define some terms used in this section:
+- `PTL` is PyTorch Lightning
+- `Fault Tolerance`, `FT` is the `fault_tolerance` package, included in `nvidia_resiliency_ext`. 
+- `FT callback`, `FaultToleranceCallback` is a PTL callback defined in `ptl_resiliency` package, included in `nvidia_resiliency_ext`.
+- `ft_launcher` is a launcher tool included in FT and based on `torchrun`.
+- `heartbeat` is a lightweight message sent from a rank to its rank monitor that indicates that a rank is alive.
+- `rank monitor` is a special side-process started by `ft_launcher` that monitors heartbeats from its rank.
+- `timeouts` are time intervals used by a rank monitor to detect that a rank is not alive. 
+    There are 2 separate timeouts: for the initial heartbeat and the subsequent heartbeats.
+- `launcher script` is a bash script that invokes `ft_launcher`.
+
+### 0. Use `ft_launcher` to start the workload
+
+`ft_launcher` is similar to `torchrun` but it starts a rank monitor for each started rank.  
+`ft_launcher` takes the FT configuration in a YAML file (`--fault-tol-cfg-path`) or via CLI args (`--ft-param-...`).  
+FT configuration items are described in the `FaultToleranceConfig` docstring.
+
+### 1. Add FT callback to the trainer
+
+Add the FT callback to the PTL trainer callbacks:
+
+```
+fault_tol_cb = FaultToleranceCallback(
+    autoresume=True,
+    calculate_timeouts=True,
+    logger_name="test_logger",
+    exp_dir=tmp_path,
+)
+
+trainer = pl.Trainer(
+    ...
+    callbacks=[..., fault_tol_cb],
+)
+```
+
+
+Core FT callback functionality is:
+- Establishing a connection with a rank monitor
+- Sending heartbeats during training and evaluation steps
+- Disconnecting from a rank monitor
+
+Optionally, it can also:
+- Compute timeouts that will be used instead of timeouts defined in the FT config
+- Create a flag file when the training is completed
+
+FT callback initialization params:
+```
+def __init__(
+    self,
+    autoresume: bool,
+    calculate_timeouts: bool,
+    simulated_fault_params: Optional[Any] = None,
+    exp_dir: Union[str, pathlib.Path, None] = None,
+    logger_name: Optional[str] = "nemo_logger.FaultToleranceCallback",
+):
+    """
+    Initialize callback instance.
+
+    This is a lightweight initialization. Most of the initialization is conducted in the 'setup' hook.
+
+    Args:
+        autoresume (bool): Set to `True` if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).
+        calculate_timeouts (bool): Set to `True` if FT timeouts should be calculated based on observed heartbeat intervals.
+            Calculated timeouts overwrite the timeouts from the FT config.
+            Timeouts are computed at the end of a training job, if there was checkpoint loading and saving.
+            For example, for training started from scratch, the timeouts are computed at the end of the second job.
+        simulated_fault_params (Optional[Any], optional): Simulated fault spec. It's for debugging only. Defaults to None.
+        exp_dir (Union[str, pathlib.Path, None], optional): Directory where the FT state should be saved.
+            Must be available for all training jobs. NOTE: Beware that PTL/NeMo can move files written directly to `trainer.log_dir`.
+            Defaults to None, in which case it defaults to `trainer.log_dir/ft_state/`.
+        logger_name (Optional[str], optional): Logger name to be used.
+            Defaults to "nemo_logger.FaultToleranceCallback".
+    """
+```
+
+### 2. Implementing auto-resume
+
+Auto-resume is a feature that simplifies running a training consisting of multiple subsequent training jobs. 
+
+NOTE: Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the `FaultToleranceCallback`. 
+
+`FaultToleranceCallback` exposes an "interface" that allows implementing an auto-resume launcher script.  
+Specifically, if `autoresume=True`, the FT callback creates a special marker file when training is completed.  
+The marker file location is expected to be set in the `FAULT_TOL_FINISHED_FLAG_FILE` environment variable.
+
+The following mechanism can be used to implement an auto-resuming launcher script:
+- The launcher script starts ranks with `ft_launcher`.
+- `FAULT_TOL_FINISHED_FLAG_FILE` is passed to the rank processes.
+- When `ft_launcher` exits, the launcher script checks whether the `FAULT_TOL_FINISHED_FLAG_FILE` file was created.
+    - If `FAULT_TOL_FINISHED_FLAG_FILE` exists, the auto-resume loop can be broken, as the training is completed.
+    - If `FAULT_TOL_FINISHED_FLAG_FILE` does not exist, a continuation job can be issued
+        (other conditions can also be checked, e.g. whether the maximum number of failures has been reached).
+
+## Straggler Detection integration guide
+
+### Include `ptl_resiliency.StragglerDetectionCallback` in the PTL trainer callbacks
+
+```
+straggler_cb_args = dict(
+    report_time_interval=300.0,
+    calc_relative_gpu_perf=True,
+    calc_individual_gpu_perf=True,
+    num_gpu_perf_scores_to_log=3,
+    gpu_relative_perf_threshold=0.7,
+    gpu_individual_perf_threshold=0.7,
+    stop_if_detected=False,
+    logger_name="test_logger",
+)
+
+straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)
+
+trainer = pl.Trainer(
+    ...
+    callbacks=[..., straggler_det_cb],
+)
+```
+
+`StragglerDetectionCallback` initialization params:
+
+```
+def __init__(
+    self,
+    report_time_interval: float,
+    calc_relative_gpu_perf: bool,
+    calc_individual_gpu_perf: bool,
+    num_gpu_perf_scores_to_log: int,
+    gpu_relative_perf_threshold: float,
+    gpu_individual_perf_threshold: float,
+    stop_if_detected: bool,
+    logger_name: Optional[str] = "nemo_logger.StragglerDetectionCallback",
+):
+    """
+    Initialize straggler detection callback instance.
+
+    Args:
+        report_time_interval (float): Interval [seconds] of the straggler check
+        calc_relative_gpu_perf (bool): Calculate relative GPU performance
+        calc_individual_gpu_perf (bool): Calculate individual GPU performance
+        num_gpu_perf_scores_to_log (int): How many best and worst scores to log (0 - does not log periodically, but only if stragglers are detected)
+        gpu_relative_perf_threshold (float): Threshold for relative GPU performance scores
+        gpu_individual_perf_threshold (float): Threshold for individual GPU performance scores
+        stop_if_detected (bool): Set to True to terminate the workload if stragglers are detected
+        logger_name (Optional[str], optional): Defaults to "nemo_logger.StragglerDetectionCallback".
+
+    Raises:
+        ValueError: If invalid config was provided.
+    """
+```
+
+More info on straggler detection can be found in the straggler package's README.
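The auto-resume mechanism described in the README's "Implementing auto-resume" section (start ranks with `ft_launcher`, pass `FAULT_TOL_FINISHED_FLAG_FILE` to them, check for the flag file after each exit) can be sketched as a loop. The README describes the launcher script as a bash script; the sketch below uses Python instead, purely for illustration. `run_with_autoresume`, its `max_jobs` limit and the injectable `run` parameter are hypothetical and not part of the package.

```python
import os
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def run_with_autoresume(
    launch_cmd: Sequence[str],
    flag_file: Path,
    max_jobs: int = 3,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Re-issue the job until the finished-flag file appears.

    Returns True if the flag file was created (training completed),
    False if max_jobs jobs ran without the flag appearing.
    """
    # The FT callback reads the flag location from this environment variable.
    env = dict(os.environ, FAULT_TOL_FINISHED_FLAG_FILE=str(flag_file))
    for _ in range(max_jobs):
        # check=False: a failed job is expected here and leads to a continuation job.
        run(list(launch_cmd), env=env, check=False)
        if flag_file.exists():
            return True
    return False
```

A call might look like `run_with_autoresume(["ft_launcher", "--fault-tol-cfg-path", "ft_cfg.yaml", "train.py"], Path("finished.flag"))`, where `ft_cfg.yaml`, `train.py` and `finished.flag` are placeholder names. The sketch does not clear a stale flag file from an earlier run; a real launcher script would need to decide how to handle that case.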