
Commit 4aa72d5

puhuk and vfdev-5 authored

blog cross post (#172)

* blogpost cross posting of quansight blog
* Update 2023-02-03-internship-blogpost.md
* Apply suggestions from code review
* Update 2023-02-03-internship-blogpost.md

Co-authored-by: vfdev <[email protected]>

1 parent 2d1ec7c commit 4aa72d5

5 files changed: 75 additions & 0 deletions
---
title: Sangho's Internship at Quansight with PyTorch-Ignite project
slug: pytorch-ignite-during-quansight-internship
description: PyTorch-Ignite project during internship at Quansight
date: 2023-02-03
tags:
- Deep Learning
- Machine Learning
- PyTorch-Ignite
- PyTorch
- Internship
- Metrics
- Distributed
---

Hey, I'm Sangho Lee, a master's student from Seoul National University.
I have participated in the [PyTorch-Ignite](https://pytorch-ignite.ai/) project internship at Quansight Labs, working on test code improvements and features for distributed computations.

<!--more-->

Crossposted from https://labs.quansight.org/blog/sangho-internship-blogpost

The first part of my contributions is a set of improvements to the test code for metric computation in a Distributed Data Parallel (DDP) configuration.
Then I worked on adding a `group` argument to the `all_reduce` and `all_gather` methods of the [`ignite.distributed`](https://pytorch.org/ignite/distributed.html) module.

## [About PyTorch-Ignite and distributed computations](https://pytorch-ignite.ai/tutorials/advanced/01-collective-communication/)
PyTorch-Ignite is a high-level library that helps with training and evaluating neural networks in PyTorch flexibly and transparently.
By using PyTorch-Ignite, we get the benefit of writing less code than in pure PyTorch, together with an extensible API for metrics, experiments, and other components.
In the same spirit, PyTorch-Ignite also supports distributed computations through its distributed module.

When you train on a dataset larger than a single machine can handle, you need to parallelize the computations by distributing the data.
In this situation, Distributed Data Parallel (DDP) training is widely adopted when training in a distributed configuration with `PyTorch`.
With DDP, the model is replicated on every process, and every model replica is fed a different set of input data samples.
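To make this concrete, here is a minimal sketch of a plain PyTorch DDP training step (an illustration of mine, not from the original post; the `torchrun` launch, model, and data are placeholder assumptions):

```python
# Minimal plain-PyTorch DDP sketch (illustrative only; assumes one process per GPU,
# launched e.g. with `torchrun --nproc_per_node=N train.py`).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # Placeholder model: every process holds a full replica.
    model = DDP(torch.nn.Linear(10, 2).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # In a real script each replica would read a different shard of the dataset;
    # here we just generate placeholder data per process.
    x = torch.randn(32, 10, device=device)
    y = torch.randint(0, 2, (32,), device=device)

    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()   # gradients are averaged across replicas during backward
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```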
However, when we use another backend like `horovod` or `xla`, we have to rewrite this code for each configuration.
The PyTorch-Ignite distributed module (`ignite.distributed`) is a helper module that supports distributed settings for multiple backends such as `nccl`, `gloo`, `mpi`, `xla`, and `horovod`; accordingly, we can use `ignite.distributed` for distributed computations regardless of the backend.

![idist configuration](/_images/2023-02-03-internship-blogpost_ddp0.png)

By simply specifying the backend, `ignite.distributed` provides a context manager that simplifies the distributed configuration setup for all the supported backends listed above. It also provides methods for reusing an existing configuration from each backend.
For example, `auto_model`, `auto_optim`, and `auto_dataloader` adapt the provided model, optimizer, and data loader to the existing configuration, and `Parallel` simplifies the distributed configuration setup for multiple backends.
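As an illustration, here is a minimal sketch of how `Parallel` and the `auto_*` helpers fit together, based on the public `ignite.distributed` API (the model, optimizer, and dataset are placeholders, not code from the internship):

```python
# Illustrative sketch of idist.Parallel with the auto_* helpers (placeholder model/data).
import torch
import ignite.distributed as idist


def training(local_rank, config):
    # auto_dataloader adds a distributed sampler (or the backend equivalent) when needed.
    dataset = torch.utils.data.TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
    loader = idist.auto_dataloader(dataset, batch_size=config["batch_size"], shuffle=True)

    # auto_model moves the model to the current device and wraps it (e.g. in DDP) if needed;
    # auto_optim adapts the optimizer to the backend (e.g. for horovod or xla).
    model = idist.auto_model(torch.nn.Linear(10, 2))
    optimizer = idist.auto_optim(torch.optim.SGD(model.parameters(), lr=0.01))

    for x, y in loader:
        x, y = x.to(idist.device()), y.to(idist.device())
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    # Switching the backend ("nccl", "gloo", "xla", "horovod", ...) is the only change needed.
    with idist.Parallel(backend="gloo") as parallel:
        parallel.run(training, {"batch_size": 32})
```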
## How I contributed by improving the test code in the DDP configuration

Problem: the test code for metrics computation had incorrectly implemented correctness checks in a distributed configuration.

There are 3 things to check to ensure that the test code for each metric works correctly in a DDP configuration (a minimal sketch of the pattern follows the list):
1) Generate random input data on each rank and make sure it is different on each rank. This input data is used to compute a metric value with Ignite.
2) Gather the data with [`idist.all_gather`](https://pytorch-ignite.ai/tutorials/advanced/01-collective-communication/#all-gather) so that every rank has the same data before computing the reference metric.
3) Compare the computed metric value with the reference value (e.g. computed with scikit-learn).
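The sketch below illustrates this pattern with a simple `Accuracy` check; it is a hypothetical example following the three steps above, not the actual test code that was merged:

```python
# Hypothetical sketch of the correctness-check pattern for a metric in DDP.
import pytest
import torch
import ignite.distributed as idist
from ignite.engine import Engine
from ignite.metrics import Accuracy
from sklearn.metrics import accuracy_score


def check_accuracy_in_ddp():
    rank = idist.get_rank()

    # 1) Random input data, seeded by rank so that every rank gets different data.
    torch.manual_seed(12 + rank)
    y_pred = torch.randint(0, 2, (100,))
    y_true = torch.randint(0, 2, (100,))

    # The Ignite metric reduces its internal counters across all ranks on compute().
    engine = Engine(lambda engine, batch: batch)
    Accuracy().attach(engine, "acc")
    state = engine.run([(y_pred, y_true)], max_epochs=1)

    # 2) Gather the per-rank data so every rank sees the full dataset ...
    all_pred = idist.all_gather(y_pred)
    all_true = idist.all_gather(y_true)

    # 3) ... and compare against a reference value computed with scikit-learn.
    reference = accuracy_score(all_true.numpy(), all_pred.numpy())
    assert state.metrics["acc"] == pytest.approx(reference)
```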
## How I contributed a new feature: a `group` argument for the `all_reduce` and `all_gather` methods

Problem: the existing methods in PyTorch-Ignite use all ranks; however, for certain use cases users may want to choose only a subset of ranks for collecting the data, as in the picture.

As mentioned, the distributed part of Ignite is a wrapper around different backends like [horovod](https://horovod.ai/), [nccl](https://developer.nvidia.com/nccl), [gloo](https://github.com/facebookincubator/gloo) and [xla](https://github.com/pytorch/xla).
I added a new method for creating a process group that depends on the backend in use, and modified `all_reduce` and `all_gather` to take a `group` argument so that users can select the participating ranks.
![Code snippets](/_images/2023-02-03-internship-blogpost_code1.png)
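For illustration, here is a hedged usage sketch of the feature; I am assuming the group is created with `idist.new_group` and that `all_reduce`/`all_gather` accept a `group` keyword, so the exact names may differ slightly from the snippets shown above:

```python
# Hedged sketch: collecting data only on a subset of ranks (assumed API, see note above).
import torch
import ignite.distributed as idist


def reduce_on_subset():
    rank = idist.get_rank()

    # Create a process group containing only ranks 0 and 1, backed by whatever
    # backend is currently in use (nccl, gloo, horovod, xla, ...).
    group = idist.new_group([0, 1])

    value = torch.tensor([float(rank)])

    # Only the ranks that belong to the group take part in the collective ops.
    if rank in (0, 1):
        summed = idist.all_reduce(value, group=group)    # sum over ranks 0 and 1 only
        gathered = idist.all_gather(value, group=group)  # values from ranks 0 and 1 only
        print(rank, summed, gathered)
```

As described in the post, the group creation has to be backend-aware so that the same call works whether the backend is `nccl`, `gloo`, `horovod`, or `xla`.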
### My contributions

![Contributions with improving test](/_images/2023-02-03-internship-blogpost_cont1.png)
![Contributions with improving test](/_images/2023-02-03-internship-blogpost_cont2.png)

### What I learned

These 3 months were a really precious time for me as an intern at Quansight.

PS: I want to thank my mentor [Victor Fomin](https://github.com/vfdev-5) for his teaching and support during the internship.