---
title: Sangho's Internship at Quansight with PyTorch-Ignite project
slug: pytorch-ignite-during-quansight-internship
description: PyTorch-Ignite project during internship at Quansight
date: 2023-02-03
tags:
  - Deep Learning
  - Machine Learning
  - PyTorch-Ignite
  - PyTorch
  - Internship
  - Metrics
  - Distributed
---

Hey, I'm Sangho Lee, a master's student at Seoul National University.
I participated in the [PyTorch-Ignite](https://pytorch-ignite.ai/) project internship at Quansight Labs, working on test code improvements and new features for distributed computations.

<!--more-->

Crossposted from https://labs.quansight.org/blog/sangho-internship-blogpost

The first part of my contributions consists of improvements to the test code for metric computation in the Distributed Data Parallel (DDP) configuration.
Then I worked on adding a `group` argument to the `all_reduce` and `all_gather` methods of the [`ignite.distributed`](https://pytorch.org/ignite/distributed.html) module.

## [About PyTorch-Ignite and distributed computations](https://pytorch-ignite.ai/tutorials/advanced/01-collective-communication/)

PyTorch-Ignite is a high-level library that helps with training and evaluating neural networks in PyTorch flexibly and transparently.
By using PyTorch-Ignite, we write less code than in pure PyTorch and get an extensible API for metrics, experiments, and other components.
In the same spirit, PyTorch-Ignite also supports distributed computations through its distributed module, `ignite.distributed`.

When the training workload is larger than a single machine's capacity, you need to parallelize the computation by distributing the data.
For this situation, Distributed Data-Parallel (DDP) training is widely adopted when training in a distributed configuration with PyTorch.
With DDP, the model is replicated on every process, and every model replica is fed with a different set of input data samples.
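
To make this concrete, here is a rough sketch of what each process runs with plain PyTorch DDP; it assumes the script is launched with a tool such as `torchrun`, which provides the rendezvous environment variables:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal sketch of native PyTorch DDP (per process), assuming the script
# is launched with `torchrun`, which sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT.
dist.init_process_group(backend="gloo")

model = torch.nn.Linear(10, 2)
ddp_model = DDP(model)  # each process holds a replica; gradients are all-reduced

# ... feed each replica a different shard of the data (e.g. via
# DistributedSampler), then train ddp_model as usual ...

dist.destroy_process_group()
```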

However, if we use another backend like `horovod` or `xla`, we have to rewrite this code for each configuration.
The PyTorch-Ignite distributed module (`ignite.distributed`) is a helper module that supports multiple backends such as `nccl`, `gloo`, `mpi`, `xla`, and `horovod`, so we can write distributed computations once and run them regardless of the backend.
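
As a small illustration (a sketch, not code taken from the project), the function below works unchanged whatever the active backend is, because `idist` dispatches the collective operations to it:

```python
import torch
import ignite.distributed as idist

def sum_and_gather_ranks():
    # The same code runs under nccl, gloo, mpi, xla or horovod;
    # idist picks the right implementation for the active backend.
    rank = idist.get_rank()
    value = torch.tensor([float(rank)], device=idist.device())

    gathered = idist.all_gather(value)  # tensor holding every rank's value
    total = idist.all_reduce(value)     # sum of the values over all processes

    print(f"backend={idist.backend()} rank={rank} gathered={gathered} total={total}")
```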
By simply specifying the backend, `ignite.distributed` provides a context manager that simplifies the distributed configuration setup for all of the supported backends above, and it also provides methods for reusing an existing configuration from each backend.
For example, `auto_model`, `auto_optim` and `auto_dataloader` adapt the provided model, optimizer and dataloader to the existing distributed configuration, and `Parallel` simplifies the distributed configuration setup itself across backends.
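
A training function using these helpers and launched through `Parallel` might look roughly like the following sketch (toy dataset, model and hyperparameters chosen only for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

import ignite.distributed as idist

def training(local_rank, config):
    # auto_* adapt the objects to the current distributed configuration:
    # a DistributedSampler for the dataloader, DDP (or the backend's
    # equivalent) wrapping for the model, and so on.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    train_loader = idist.auto_dataloader(dataset, batch_size=config["batch_size"], shuffle=True)
    model = idist.auto_model(nn.Linear(10, 2))
    optimizer = idist.auto_optim(torch.optim.SGD(model.parameters(), lr=0.01))

    for x, y in train_loader:
        x, y = x.to(idist.device()), y.to(idist.device())
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    # the same script runs with "nccl", "gloo", "xla-tpu" or "horovod"
    with idist.Parallel(backend="gloo", nproc_per_node=2) as parallel:
        parallel.run(training, {"batch_size": 32})
```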


## How I contributed: improving the test code in DDP configuration

Problem: the test code for metrics computation had incorrectly implemented correctness checks in a distributed configuration.

There are three items to check to ensure that the test code for each metric works correctly in a DDP configuration (a sketch follows the list):
1) Generate random input data on each rank and make sure it is different on each rank. This input data is used to compute the metric value with Ignite.
2) Gather the data with [`idist.all_gather`](https://pytorch-ignite.ai/tutorials/advanced/01-collective-communication/#all-gather) so that all ranks hold the same data before computing the reference metric.
3) Compare the computed metric value with the reference value (e.g. computed with scikit-learn).
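
Put together, the pattern looks roughly like the sketch below, using `Accuracy` with scikit-learn as the reference; the real tests are parametrized over many metrics and run under a distributed test fixture:

```python
import pytest
import torch
from sklearn.metrics import accuracy_score

import ignite.distributed as idist
from ignite.metrics import Accuracy

def check_accuracy_in_ddp():
    device = idist.device()

    # 1) different random input data on each rank
    rank = idist.get_rank()
    torch.manual_seed(12 + rank)
    y_true = torch.randint(0, 2, size=(100,), device=device)
    y_pred = torch.randint(0, 2, size=(100,), device=device)

    # metric value computed by Ignite (in a distributed configuration,
    # compute() all-reduces the internal counters across ranks)
    metric = Accuracy(device=device)
    metric.update((y_pred, y_true))
    ignite_value = metric.compute()

    # 2) gather the data so every rank can compute the same reference value
    y_true_all = idist.all_gather(y_true)
    y_pred_all = idist.all_gather(y_pred)
    reference = accuracy_score(y_true_all.cpu().numpy(), y_pred_all.cpu().numpy())

    # 3) compare Ignite's result with the reference
    assert ignite_value == pytest.approx(reference)
```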


## How I contributed a new feature: a `group` argument for `all_reduce` and `all_gather`

Problem: the existing methods in PyTorch-Ignite use all ranks; however, for certain use cases users may want to collect the data from only a subset of ranks.

As mentioned, the distributed part of Ignite is a wrapper over different backends like [horovod](https://horovod.ai/), [nccl](https://developer.nvidia.com/nccl), [gloo](https://github.com/facebookincubator/gloo) and [xla](https://github.com/pytorch/xla).
I added a new method for creating a process group depending on the backend in use, and modified `all_reduce` and `all_gather` to take a `group` argument so that users can select the ranks taking part in the collective operation.
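
For example, assuming the group-creation helper is exposed as `idist.new_group` (the name used in this sketch; please check the `ignite.distributed` documentation for the exact API), restricting an `all_reduce` to the even ranks could look like this:

```python
import torch
import ignite.distributed as idist

def reduce_on_even_ranks(local_rank):
    rank = idist.get_rank()
    value = torch.tensor([float(rank)], device=idist.device())

    # build a process group that contains only the even ranks
    even_ranks = [r for r in range(idist.get_world_size()) if r % 2 == 0]
    group = idist.new_group(even_ranks)

    if rank in even_ranks:
        # only the ranks belonging to the group take part in the collective
        value = idist.all_reduce(value, group=group)

    print(f"rank={rank} value={value}")

if __name__ == "__main__":
    with idist.Parallel(backend="gloo", nproc_per_node=4) as parallel:
        parallel.run(reduce_on_even_ranks)
```

Note that every process creates the group, but only the ranks inside it take part in the collective; the other ranks keep their local tensors untouched.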


### My contributions

### What I learned

These three months were a really precious time for me as an intern at Quansight.

PS: I want to thank my mentor [Victor Fomin](https://github.com/vfdev-5) for his teaching and support during the internship.