Closed
31 commits
36a5f84
Added the capacity of detecting topology information.
Jun 16, 2021
5719f60
Added RCCL Topology Visualizer
Jun 23, 2021
d25aabb
Update rccl_nccl_parser.py
hubertlu-tw Jul 7, 2021
01d1835
Update net_unique_topo.sh
hubertlu-tw Jul 7, 2021
45082e7
Added the debugging tool for device indices mapping for ROCM-SMI and …
hubertlu-tw Jul 9, 2021
cc11a91
Fix issues when there is HIP_VISIBLE_DEVICES is used.
hubertlu-tw Jul 9, 2021
ad4c9e6
Updated outputs for Device ID Mapping.
hubertlu-tw Jul 9, 2021
042d892
Updated README for Device ID Mapping.
hubertlu-tw Jul 9, 2021
52acdf8
Used rocm_smi_lib APIs instead of extracting outputs from rocm-smi --…
hubertlu-tw Jul 9, 2021
f50476a
Refactored the code for the output formatting with fixed width.
hubertlu-tw Jul 12, 2021
73a29a3
Fixed output formatting.
hubertlu-tw Jul 12, 2021
873a8f3
Add top info to the unique command script.
hubertlu-tw Jul 30, 2021
49426ee
Create install.sh
hubertlu-tw Jul 30, 2021
c7b2581
Update README.md
hubertlu-tw Jul 30, 2021
64b7773
Update README.md
hubertlu-tw Jul 30, 2021
931121a
Update README.md
hubertlu-tw Jul 30, 2021
8747d88
Replace --cuda to --new_log to prevent confusion
hubertlu-tw Jul 30, 2021
b4427ed
Update net_unique_topo.sh
hubertlu-tw Aug 2, 2021
b2b045c
Update README.md
hubertlu-tw Aug 3, 2021
974d4a9
Update generate_summary.py
hubertlu-tw Aug 3, 2021
0956ed9
Create generate_summary.py
hubertlu-tw Aug 3, 2021
dd7aba3
Update run_parser_and_generate_summary.py
hubertlu-tw Aug 3, 2021
d62dc3c
Update rccl_nccl_parser.py
hubertlu-tw Aug 3, 2021
ab1fd5a
Update README.md
hubertlu-tw Aug 3, 2021
0b96fab
Add coll trace processor.
Aug 3, 2021
4b2644e
Update log_processor.py
hubertlu-tw Aug 3, 2021
b2449f8
Fix collective trace processor
Aug 4, 2021
2dc1df1
Polish the branch
Aug 4, 2021
c0d6892
Update README.md
hubertlu-tw Aug 4, 2021
54fb727
Fix some bugs
Aug 4, 2021
f97a293
Merge branch 'main' into topology_hubert
hubertlu-tw Aug 4, 2021
31 changes: 23 additions & 8 deletions README.md
@@ -1,4 +1,4 @@
# nccl-rccl-parser
# Topology-aware nccl-rccl-parser
This tool dumps the rccl-tests/nccl-tests commands directly from an application run, to help identify potential scaling bottlenecks in RCCL/NCCL when running distributed applications.

To get started please clone the following repository:
@@ -11,8 +11,14 @@ To run the tests, we use the following repositories:

# Pre-requisites:
* RCCL/NCCL installed.
* rccl-tests or nccl-tests installed.

* Clone this repo with
```
git clone --recursive https://github.com/lcskrishna/nccl-rccl-parser.git
```
* Run the installation script with
```
sh install.sh
```
# How to use the tool:

### Run application and collect RCCL/NCCL Log:
@@ -29,7 +35,6 @@ NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL <application> |& tee nccl_debug_log.
HSA_FORCE_FINE_GRAIN_PCIE=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL <application> |& tee nccl_debug_log.txt
```


### Automated way:

Once you have the debug log, run the command below to gather the performance results.
@@ -38,25 +43,33 @@ On CUDA devices, use --cuda argument.

On ROCm devices, use --rocm argument.

With NCCL or RCCL 2.8 or below, the argument "--legacy-device-grouping" is required for device grouping in applications.

Note: If you omit these arguments, the automated script only dumps the output data from the parser.

**On ROCm:**

```
python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --rocm
python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --rocm --legacy-device-grouping
```

```
python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_new_log.txt --rocm
```

**On CUDA:**

```
python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --cuda
python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --cuda --legacy-device-grouping
```

### To run the tool manually step by step:

**Use Parser to dump out the test commands:**

Once the log has been collected, use the parser to dump out all the rccl/nccl test commands, or just the unique commands with their respective counts in the workload.
Note: To dump out the unique commands, use the --unique argument.
Note: To dump out the commands for applications with NCCL or RCCL 2.8 or below, use the --legacy-device-grouping argument.
Optional parameters: output-script-name, unique

Here is the usage of the script
@@ -65,6 +78,8 @@ Here is the usage of the script
python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net
(or)
python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net --unique
(or)
python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net --unique --legacy-device-grouping
```

The first command dumps out all the rccl/nccl tests in the order they are executed in the application (the net_rccl_nccl.sh file).
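
The idea behind `--unique` (unique test commands together with how often each occurs in the workload) can be sketched as below. This is only an illustration under assumptions, not the code in `rccl_nccl_parser.py`: the function name `count_unique_commands` and the sample command strings are hypothetical.

```python
from collections import Counter


def count_unique_commands(commands):
    """Return (command, count) pairs in the order each command was first seen."""
    # Counter is a dict subclass, so it keeps first-seen insertion order.
    counts = Counter(commands)
    return list(counts.items())


if __name__ == "__main__":
    # Hypothetical test commands, as they might appear in execution order.
    sample = [
        "./build/all_reduce_perf -d float -b 1024 -e 1024 -g 8",
        "./build/broadcast_perf -d float -b 4096 -e 4096 -g 8",
        "./build/all_reduce_perf -d float -b 1024 -e 1024 -g 8",
    ]
    for command, count in count_unique_commands(sample):
        print(count, command)
```

Keeping first-seen order means the unique script still roughly follows the order in which the application issued the collectives.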
@@ -75,7 +90,7 @@ The second command dumps out a script file with unique commands and a csv file w
Once you dump out the scripts, make sure to copy the script in nccl-tests/rccl-tests folder and run the script and gather the
Inside nccl-tests/rccl-tests repository:

```sh net_unique.sh |& tee rccl_perf_data.txt```
```sh net_unique_topo.sh |& tee topo_rccl_tests.txt```

Once you run the above script, the performance data of each command is redirected to a text file.

@@ -86,7 +101,7 @@ Now the final step is to use the above performance log and generate a summary in
To generate the summary, navigate to the tool nccl-rccl-parser:

```
python generate_summary.py --log-file rccl_perf_data.txt --output-file-name test_app_data --script-file net_unique.sh
python generate_summary.py --log-file topo_rccl_tests.txt --output-file-name net_summary --count-file net_counts.csv
```
This dumps out a csv file with performance data for further analysis.

Binary file added coll_trace_processor/0.png
39 changes: 39 additions & 0 deletions coll_trace_processor/README.md
@@ -0,0 +1,39 @@
# NCCL/RCCL Log Processor

This tool is used to collect RCCL collective traces, visualize topologies for rings and trees in RCCL, and get device grouping information.

## Requirement
The tool currently works for applications with RCCL 2.9 or above. However, the collective trace processor function also works with RCCL 2.8 or below for applications that do not use multiple device groups.

From ROCm 4.3:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL only enables the collective API trace. Collective trace mode is enabled separately by RCCL_KERNEL_COLL_TRACE_ENABLE=1, which produces output in the new format shown below:
```
[0] NCCL INFO ## [1703255.821541] [01:00] 000035 KL HWID 4230c540 AllReduceTreeLLSum_f32 nt 256 bi 0 nc 1 busId C3000
```
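
As a rough illustration of how a line in this format could be picked apart, here is a minimal sketch. It is not the code in `log_processor.py`. The regular expression is derived only from the single example line above, and the group names (`prefix`, `bracket_field`, `counter`, `record_type`) are placeholder assumptions rather than documented field meanings.

```python
import re

# Pattern inferred from the one example line above; real logs may vary.
TRACE_RE = re.compile(
    r"\[(?P<prefix>\d+)\] NCCL INFO ## "
    r"\[(?P<timestamp>\d+\.\d+)\] "
    r"\[(?P<bracket_field>[0-9A-Fa-f]+:[0-9A-Fa-f]+)\] "
    r"(?P<counter>[0-9A-Fa-f]+) "
    r"(?P<record_type>\w+) "
    r"HWID (?P<hwid>[0-9A-Fa-f]+) "
    r"(?P<kernel>\w+) "
    r"nt (?P<nt>\d+) bi (?P<bi>\d+) nc (?P<nc>\d+) "
    r"busId (?P<bus_id>[0-9A-Fa-f]+)"
)


def parse_trace_line(line):
    """Return a dict of fields for a matching trace line, or None otherwise."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the plainly numeric fields; the rest stay as strings.
    fields["timestamp"] = float(fields["timestamp"])
    for key in ("nt", "bi", "nc"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    example = (
        "[0] NCCL INFO ## [1703255.821541] [01:00] 000035 KL HWID 4230c540 "
        "AllReduceTreeLLSum_f32 nt 256 bi 0 nc 1 busId C3000"
    )
    print(parse_trace_line(example))
```

Lines that do not match (for example ordinary `INIT` messages) return `None`, so the function can be applied to every line of a mixed debug log.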
**Run application and collect RCCL/NCCL Log:**

```
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL,GRAPH RCCL_KERNEL_COLL_TRACE_ENABLE=1 <application> |& tee nccl_debug_log.txt
```
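
For cases where the log is collected from a Python driver rather than a shell, the same command can be approximated as below. This is a sketch under assumptions: `collect_log` is a hypothetical helper that is not part of this repository, and it only mirrors the environment variables and the `|& tee` behaviour of the command above.

```python
import os
import subprocess
import sys


def collect_log(command, log_path):
    """Run `command` with the RCCL debug variables set and tee its output to `log_path`."""
    env = os.environ.copy()
    env.update(
        {
            "NCCL_DEBUG": "INFO",
            "NCCL_DEBUG_SUBSYS": "INIT,COLL,GRAPH",
            "RCCL_KERNEL_COLL_TRACE_ENABLE": "1",
        }
    )
    with open(log_path, "w") as log_file:
        # stderr is merged into stdout, matching `|&` in the shell command.
        proc = subprocess.Popen(
            command,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # echo to the console, like tee
            log_file.write(line)
        return proc.wait()
```

A call such as `collect_log(["python", "train.py"], "nccl_debug_log.txt")` (with `train.py` standing in for the real application) would then produce a log file suitable for `log_processor.py`.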

## Usage
For more information about RCCL collective traces, see [RCCL Collective Trace](https://confluence.amd.com/display/MLSE/RCCL+Collective+Trace).

Example command lines:
```shell
python log_processor.py --rccl-debug-log gpt2_rccl_mp4_log_newPR.txt
```
Note that because NCCL and RCCL 2.8 or below do not provide sufficient information for device grouping, the "--cuda" flag must be specified, and the number of devices used in the application is also required.
```shell
python log_processor.py --rccl-debug-log base_2.8.log --cuda --num_devices 8
```

## Example Output
If ROCm 2.8 or above is used, the output consists of multiple RCCL topology graphs, time tables for each RCCL operation and device, bandwidth tables for each RCCL operation and device, and a text file that contains device grouping information. <br/>
For example, if there are 6 device groups in an application, there will be 12 (=6*2) output tables in csv files. The tables are numbered according to the line numbers in device_groups.txt.
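
The arithmetic above can be checked with a small sketch. It assumes, as the numbering rule suggests but the README does not state outright, that `device_groups.txt` holds one device group per non-empty line; `expected_table_count` is a hypothetical helper, not part of the tool.

```python
def expected_table_count(device_groups_path, tables_per_group=2):
    """Return the number of csv tables expected for the groups listed in the file."""
    with open(device_groups_path) as groups_file:
        # Assumption: each non-empty line describes exactly one device group.
        num_groups = sum(1 for line in groups_file if line.strip())
    # One time table and one bandwidth table per device group.
    return num_groups * tables_per_group
```

With 6 device groups listed, the function returns 12, matching the example in the text.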

![image info](0.png)


## Copyright
All source code and accompanying documentation are copyright (c) 2019-2020 Advanced Micro Devices, Inc. All rights reserved.