diff --git a/README.md b/README.md index 8e3baa9..7a242bf 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# nccl-rccl-parser +# Topology-aware nccl-rccl-parser This tool is used for dumping out the rccl-tests/nccl-test commands directly from an application to identify any potential bottlenecks of scaling while using RCCL/NCCL modules when running a distributed applications. To get started please clone the following repository: @@ -11,8 +11,14 @@ To run the tests, we use the following repositories: # Pre-requisites: * RCCL/NCCL installed. -* rccl-tests or nccl-tests installed. - +* Clone this repo with + ``` + git clone --recursive https://github.com/lcskrishna/nccl-rccl-parser.git + ``` +* Run installation script by + ``` + sh install.sh + ``` # How to use the tool: ### Run application and collect RCCL/NCCL Log:** @@ -29,7 +35,6 @@ NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL |& tee nccl_debug_log. HSA_FORCE_FINE_GRAIN_PCIE=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL |& tee nccl_debug_log.txt ``` - ### Automated way: To gather the performance results once you have the debug log with you. Run the below command. @@ -38,25 +43,33 @@ On CUDA devices, use --cuda argument. On ROCm devices, use --rocm argument. +With NCCL or RCCL 2.8 or below, the argument "--legacy-device-grouping" is required for device grouping in applications. + Note: If you don't mention the arguments the automated script only dumps out the output data from the parser. **On ROCm:** ``` -python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --rocm +python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --rocm --legacy-device-grouping +``` + +``` +python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_new_log.txt --rocm ``` **On CUDA:** ``` -python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --cuda +python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --cuda --legacy-device-grouping ``` + ### To run the tool manually step by step: **Use Parser to dump out the test commands:** Once the log is being collected, use the parser to dump out all the rccl/nccl test commands or just the unique commands with their respective counts of the workload. Note: To dump out the unique commands use the --unique argument. +Note: To dump out the commands for the applications with NCCL or RCCL 2.8 or below use --legacy-device-grouping argument. Optional parameters: output-script-name, unique Here is the usage of the script @@ -65,6 +78,8 @@ Here is the usage of the script python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net (or) python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net --unique +(or) +python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net --unique --legacy-device-grouping" ``` The first command dumps out all the rccl/nccl tests in the order they get executed in the application. (net_rccl_nccl.sh file). @@ -75,7 +90,7 @@ The second command dumps out a script file with unique commands and a csv file w Once you dump out the scripts, make sure to copy the script in nccl-tests/rccl-tests folder and run the script and gather the Inside nccl-tests/rccl-tests repository: -```sh net_unique.sh |& tee rccl_perf_data.txt``` +```sh net_unique_topo.sh |& tee topo_rccl_tests.txt``` Once you run the above script, the performance data of each command is redirected to a text file. @@ -86,7 +101,7 @@ Now the final step is to use the above performance log and generate a summary in To generate the summary, navigate to the tool nccl-rccl-parser: ``` -python generate_summary.py --log-file rccl_perf_data.txt --output-file-name test_app_data--script-file net_unique.sh +python generate_summary.py --log-file topo_rccl_tests.txt --output-file-name net_summary --count-file net_counts.csv ``` This dumps out a csv file with performance data for further analysis. diff --git a/coll_trace_processor/0.png b/coll_trace_processor/0.png new file mode 100644 index 0000000..408a293 Binary files /dev/null and b/coll_trace_processor/0.png differ diff --git a/coll_trace_processor/README.md b/coll_trace_processor/README.md new file mode 100644 index 0000000..8758b75 --- /dev/null +++ b/coll_trace_processor/README.md @@ -0,0 +1,39 @@ +# NCCL/RCCL Log Processor + +This tool is used to collect RCCL collective traces, visulize topologies for rings and trees in RCCL, and get device grouping information. + +## Requirement +The tool currently works for applications with RCCL 2.9 or above. However, the collective trace processor function works for an application without multiple device groups in RCCL 2.8 or below. + +From ROCm 4.3: +NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL only enable collective API trace. Collective trace mode is enabled separately by RCCL_KERNEL_COLL_TRACE_ENABLE=1 which has the outputs in the new format as below: +``` +[0] NCCL INFO ## [1703255.821541] [01:00] 000035 KL HWID 4230c540 AllReduceTreeLLSum_f32 nt 256 bi 0 nc 1 busId C3000 +``` +**Run application and collect RCCL/NCCL Log:** + +``` +NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL,GRAPH RCCL_KERNEL_COLL_TRACE_ENABLE=1 |& tee nccl_debug_log.txt +``` + +## Usage +For more information about RCCL collective traces, please go to [here](https://confluence.amd.com/display/MLSE/RCCL+Collective+Trace). + +Example command lines: +```shell +python log_processor.py --rccl-debug-log gpt2_rccl_mp4_log_newPR.txt +``` +Notice that since NCCL and RCCL 2.8 or below has no sufficient inforamtion for device grouping, "--cuda" flag needs to be specified and the number of devices used in the application is also required. +```shell +python log_processor.py --rccl-debug-log base_2.8.log --cuda --num_devices 8 +``` + +## Example Output +If ROCm 2.8 or above is used, there will be multiple RCCL topology graphs, time tables for each RCCL operations and devices, bandwidth tables for each RCCL operations and devices, and a text file which contains device grouping information.
+For example, if there are 6 device groups in an application, there will be 12 (=6*2) output tables in csv files. The numbering of the tables is followed by the line number in device_groups.txt. + +![image info](0.png) + + +## Copyright +All source code and accompanying documentation are copyright (c) 2019-2020 Advanced Micro Devices, Inc. All rights reserved. diff --git a/coll_trace_processor/extract_topo.awk b/coll_trace_processor/extract_topo.awk new file mode 100644 index 0000000..e91a232 --- /dev/null +++ b/coll_trace_processor/extract_topo.awk @@ -0,0 +1,291 @@ +#!/usr/bin/gawk -f +# Copyright (c) 2019-2021 Advanced Micro Devices, Inc. All rights reserved. +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in +# all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +# THE SOFTWARE.# Usage: + +BEGIN { + max_rank=0 + rings[""]=0 + max_ring=0 + treedns[""]=0 + max_treedn=0 + conn[""]=0 + has_collnet=0 + max_collnet=0 + max_collnet_rank=0 + collnet[""]=0 + collnet_conn[""]=0 + collnet_conn_type[""]=0 + col_start=2 + col_p1=col_start+1 + col_p2=col_start+2 + col_p3=col_start+3 + col_p4=col_start+4 + col_p5=col_start+5 + col_p6=col_start+6 + col_p7=col_start+7 + col_p8=col_start+8 +} + +{ + if($3=="NCCL" && $4=="INFO" && col_start==2) { + col_start=5 + col_p1=col_start+1 + col_p2=col_start+2 + col_p3=col_start+3 + col_p4=col_start+4 + col_p5=col_start+5 + col_p6=col_start+6 + col_p7=col_start+7 + col_p8=col_start+8 + } + + if($5=="NCCL" && $6=="INFO" && col_start==2) { + col_start=7 + col_p1=col_start+1 + col_p2=col_start+2 + col_p3=col_start+3 + col_p4=col_start+4 + col_p5=col_start+5 + col_p6=col_start+6 + col_p7=col_start+7 + col_p8=col_start+8 + } + + if($col_start=="Ring" && $col_p4=="->" && $col_p6=="->") { + chan=strtonum($col_p1) + rank=strtonum($col_p5) + next_rank=strtonum($col_p7) + rings[rank "," next_rank "," chan]="1" + if(chan>max_ring) + max_ring=chan + if(rank>max_rank) + max_rank=rank + } + + if($col_start=="Trees") { + col_1=col_start+1 + col_2=col_start+2 + do { + match($col_1, /\[([0-9]+)\]/, ary) + chan=strtonum(ary[1]) + where = match($col_2, /(\-?[0-9]+)\/(\-?[0-9]+)\/(\-?[0-9]+)\->(\-?[0-9]+)\->(\-?[0-9]+)\|(\-?[0-9]+)\->(\-?[0-9]+)\->(\-?[0-9]+)\/(\-?[0-9]+)\/(\-?[0-9]+)/, ary) + if(where != 0) { + if(ary[8]!="-1") + treedns[ary[7] "," ary[8] "," chan]="1" + if(ary[9]!="-1") + treedns[ary[7] "," ary[9] "," chan]="1" + if(ary[10]!="-1") + treedns[ary[7] "," ary[10] "," chan]="1" + } else { + where = match($col_2, /(\-?[0-9]+)\/(\-?[0-9]+)\/(\-?[0-9]+)\->(\-?[0-9]+)\->(\-?[0-9]+)/, ary) + if(where != 0) { + if(ary[1]!="-1") + treedns[ary[4] "," ary[1] "," chan]="1" + if(ary[2]!="-1") + treedns[ary[4] "," ary[2] "," chan]="1" + if(ary[3]!="-1") + treedns[ary[4] "," ary[3] "," chan]="1" + } + } + if(chan>max_treedn) + max_treedn=chan + col_1=col_1+2 + col_2=col_2+2 + } while ($col_1!="") + } + + if($col_start=="CollNet" && $col_p1=="Channel") { + chan=strtonum($col_p2) + rank=strtonum($col_p4) + up_rank=strtonum($col_p6) + collnet[up_rank "," rank "," chan]="1" + if(has_collnet==0) + has_collnet=1 + if(chan>max_collnet) + max_collnet=chan + if(up_rank>max_collnet_rank) + max_collnet_rank=up_rank + } + + if($col_start=="Coll" && $col_p2==":") { + chan=strtonum($col_p1) + rank=strtonum($col_p3) + if($col_p4=="[receive]") + collnet_conn[rank "," chan]=0 + else if($col_p4=="[send]") + collnet_conn[rank "," chan]=1 + else + printf "Error!\n" + collnet_conn_type[rank "," chan]=$col_p6 + } + + if($col_p6=="via") { + match($col_p1, /([0-9]+)/, ary) + chan=strtonum(ary[1]) + match($col_p3, /([0-9]+)\[.*\]/, ary) + s=ary[1] + match($col_p5, /([0-9]+)\[.*\]/, ary) + d=ary[1] + conn[s "," d "," chan]=$col_p7 + } + + if($col_p6=="[receive]" && $col_p7=="via") { + match($col_p1, /([0-9]+)/, ary) + chan=strtonum(ary[1]) + match($col_p3, /([0-9]+)\[.*\]/, ary) + s=ary[1] + match($col_p5, /([0-9]+)\[.*\]/, ary) + d=ary[1] + conn[s "," d "," chan]=$col_p8 + } +} + +END { + printf "digraph RCCL {\n" + for(r=0;r t%d_%d [label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, s, r, d, val, color, style + } + } + } + printf "\n" + for(s=0;s<=max_rank;s++) { + printf " t%d_%d [label=\"%d\",fontsize=\"28\"];\n", r, s, s + } + printf " }\n\n" + } + + for(r=0;r r%d_%d [label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, s, r, d, val, color, style + } + } + } + printf "\n" + for(s=0;s<=max_rank;s++) { + printf " r%d_%d [label=\"%d\",fontsize=\"28\"];\n", r, s, s + } + printf " }\n\n" + } + + for(r=0; has_collnet && r<=max_collnet; r++) { + printf " subgraph collnet_%d {\n", r + num_top_ranks=0 + for(s=0;s c%d_%d [dir=back label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, max_collnet_rank, r, rank, val, color, style + else + printf " c%d_%d -> c%d_%d [label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, max_collnet_rank, r, rank, val, color, style + while(1) { + for(s=0;s c%d_%d [dir=back label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, rank, r, s, val, color, style + else + printf " c%d_%d -> c%d_%d [label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, rank, r, s, val, color, style + rank=s + break; + } + } + if(s>=max_collnet_rank) { + break; + } + } + } + printf "\n" + for(s=0;s<=max_collnet_rank;s++) { + printf " c%d_%d [label=\"%d\",fontsize=\"28\"];\n", r, s, s + } + printf " }\n\n" + } + printf "}\n" +} \ No newline at end of file diff --git a/coll_trace_processor/log_processor.py b/coll_trace_processor/log_processor.py new file mode 100644 index 0000000..2f4d7ac --- /dev/null +++ b/coll_trace_processor/log_processor.py @@ -0,0 +1,480 @@ +import os +import sys +import subprocess +import argparse +import pkg_resources +import re +import time + +coll_op_map = { + "Broadcast": "broadcast_perf", + "Reduce": "reduce_perf", + "AllGather": "all_gather_perf", + "ReduceScatter": "reduce_scatter_perf", + "AllReduce": "all_reduce_perf", + "Gather": "gather_perf", + "Scatter": "scatter_perf", + "AllToAll": "alltoall_perf", +# "AllToAllv": "alltoallv_perf", + "Send": "sendrecv_perf", + "Recv": "sendrecv_perf", + } + +reduction_op_map = { + "0" : "sum", + "1" : "prod", + "2" : "max", + "3" : "min", + "4" : "all", + } + +data_types_map = { + "0" : "int8", + "1" : "uint8", + "2" : "int32", + "3" : "uint32", + "4" : "int64", + "5" : "uint64", + "6" : "half", + "7" : "float", + "8" : "double", + "9" : "bf16", + #"10" : "ncclNumTypes Equivalent?" + } + +data_type_bytes_map = { + "0" : 1, + "1" : 1, + "2" : 4, + "3" : 4, + "4" : 8, + "5" : 8, + "6" : 2, + "7" : 4, + "8" : 8, + "9" : 2, + #"10" : Not sure. + } + +# n: number of ranks +# n links of Bandwidth B to perform a operation +# def factor_1(n): +# return n +def all_gather_factor(n): + return (n-1)/n +def reduce_scatter_factor(n): + return (n-1)/n +def all_reduce_factor(n): + return 2*(n-1)/n +def all_to_all_factor(n): + return 2*(n-1)/n + +def coll_op_algobw_factor(coll_type, nranks): + nranks = float(nranks) + if coll_type == "AllGather": + return float(all_gather_factor(nranks)) + elif coll_type == "ReduceScatter": + return float(reduce_scatter_factor(nranks)) + elif coll_type == "AllReduce": + return float(all_reduce_factor(nranks)) + elif coll_type == "AllToAll": + return float(all_to_all_factor(nranks)) + else: + return float(1) + +def algobw_factor_times_size(coll_type, nranks, total_bytes): + nranks = float(nranks) + if coll_type == "AllGather": + return all_gather_factor(nranks) * float(total_bytes) + elif coll_type == "ReduceScatter": + return reduce_scatter_factor(nranks) * float(total_bytes) + elif coll_type == "AllReduce": + return all_reduce_factor(nranks) * float(total_bytes) + elif coll_type == "AllToAll": + return all_to_all_factor(nranks) * float(total_bytes) + else: + return float(1) * float(total_bytes) + +class DisjointSet(object): # https://stackoverflow.com/questions/3067529/a-set-union-find-algorithm + def __init__(self): + self.leader = {} # maps a member to the group's leader + self.group = {} # maps a group leader to the group (which is a set) + + def add(self, a, b): + leadera = self.leader.get(a) + leaderb = self.leader.get(b) + if leadera is not None: + if leaderb is not None: + if leadera == leaderb: return # nothing to do + groupa = self.group[leadera] + groupb = self.group[leaderb] + if len(groupa) < len(groupb): + a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa + groupa |= groupb + del self.group[leaderb] + for k in groupb: + self.leader[k] = leadera + else: + self.group[leadera].add(b) + self.leader[b] = leadera + else: + if leaderb is not None: + self.group[leaderb].add(a) + self.leader[a] = leaderb + else: + self.leader[a] = self.leader[b] = a + self.group[a] = set([a, b]) + + +def get_useful_info(log_file): + fs = open(log_file, 'r') + lines = fs.readlines() + fs.close() + + coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = [], [], [], [], [], [] + for j in range(len(lines)): + line = lines[j].rstrip() + if ("opCount" in line and "sendbuff" in line): + coll_lines.append(line) + elif ("Channel" in line and "via" in line): + conn_lines.append(line) + elif ("Init COMPLETE" in line and "busId" in line): + comm_lines.append(line) + elif ("NCCL INFO Ring" in line): + ring_lines.append(line) + elif ("NCCL INFO Trees" in line): + tree_lines.append(line) + elif ((" ## " in line) and ("KL HWID" in line or "KE" in line or "CE" in line)): # RCCL From ROCm 4.3 + # Bug: [ 6628.064978] we need to consider the case when there is a spac right after '[' + # Everything with split_list[index] will break. + if "[ " in line: + line = line.replace("[ ", "[") + coll_trace_lines.append(line) + + return coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines + +def coll_table_build(coll_lines): + opCount, coll, count, datatype, op_type, root, comm, nranks, data_size = [], [], [], [], [], [], [], [], [] + for line in coll_lines: + split_list = line.split(" ") + coll.append(split_list[4][:-1]) + opCount.append(int(split_list[6], 16)) + count.append(split_list[12]) + datatype.append(split_list[14]) + op_type.append(split_list[16]) + root.append(split_list[18]) + comm.append(split_list[20]) + nranks.append(int(next(item for item in split_list if 'nranks' in item).split("=")[1].replace("]", ""))) ### + data_size.append(int(split_list[split_list.index("count") + 1]) * data_type_bytes_map[split_list[split_list.index("datatype") + 1]]) + + dict_coll = {'coll': coll, 'opCount': opCount, 'datatype': datatype, 'count':count, 'op_type':op_type, 'root':root, 'comm':comm, 'nranks':nranks, 'data_size':data_size} + table = pd.DataFrame(dict_coll) + table['algobw_factor_times_size'] = table.apply(lambda row: + algobw_factor_times_size(row['coll'], row['nranks'], row['data_size']), axis=1) + table = table[['opCount','nranks','algobw_factor_times_size', 'data_size']].drop_duplicates(subset = ['opCount', 'data_size']) + return table + +## It works for connection information like XXX via "P2P/direct pointer%s".(useReadStr), --> pointer%s will not be collected +## "P2P/IPC%s".(useReadStr), +## "P2P/indirect/%d[%lx]%s".(intermediateRank,comm->peerInfo[intermediateRank].busId, useReadStr) +## "direct shared memory" --> shared memory +## However, it will not capture the information for CollNet for now. + +## xxx [4] NCCL INFO Channel 00 : 0[e3000] -> 1[c3000] via P2P/IPC comm 0x7f53bc000e50 nRanks 04', #(new output) + +def conn_table_build(conn_lines): # Only works for RCCL 2.9 or above + def process_string(line): + split_list = line.split("[") + return [split_list[0], split_list[1].split("]")[0]] + + start_rank, start_busid, end_rank, end_busid, connection, comm, nranks = [], [], [], [], [], [], [] + for line in conn_lines: + split_list = line.split(" ") + sr, sb = process_string(split_list[split_list.index(":") + 1]) # first device + er, eb = process_string(split_list[split_list.index("->") + 1]) # second device + start_rank.append(sr) + start_busid.append(sb) + end_rank.append(er) + end_busid.append(eb) + connection.append(split_list[split_list.index("via") + 1]) # if it is direct, it means the connection is done by direct shared memory + comm.append(split_list[split_list.index("comm") + 1]) + nranks.append(int(split_list[split_list.index("nRanks") + 1])) + + dict_conn = {'start_rank': start_rank, 'start_busid': start_busid, 'end_rank': end_rank, 'end_busid': end_busid, + 'connection': connection, 'comm':comm, 'nranks':nranks, 'conn_line':conn_lines} + return pd.DataFrame(dict_conn) + + +def comm_table_build(comm_lines): + comm, rank, nranks, cudaDev, busId = [], [], [], [], [] + for line in comm_lines: + split_list = line.rstrip().split(" ") + comm.append(split_list[5]) + rank.append(split_list[7]) + nranks.append(int(split_list[9])) + cudaDev.append(split_list[11]) + busId.append(split_list[13]) + dict_comm = {'comm':comm, 'rank':rank, 'nranks':nranks, 'cudaDev':cudaDev, 'busId':busId} + return pd.DataFrame(dict_comm) + + +def topo_table_build(topo_lines): + comm, nranks, busId, topo_line = [], [], [], [] + for line in topo_lines: + split_list = line.split(" ") + comm.append(split_list[split_list.index("comm") + 1]) + nranks.append(int(split_list[split_list.index("nRanks") + 1])) + busId.append(split_list[split_list.index("busId") + 1]) + topo_line.append(line) + dict_table = {'comm':comm, 'nranks':nranks, 'busId':busId, 'topo_line':topo_line} + return pd.DataFrame(dict_table) + +def create_table(log_name): + log_file = os.path.abspath(log_name) + coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = get_useful_info(log_file) + return coll_table_build(coll_lines), conn_table_build(conn_lines), comm_table_build(comm_lines), topo_table_build(ring_lines), topo_table_build(tree_lines), coll_trace_lines + + +def device_grouping(comm_table, conn_table): + groups = [] + for index, row in comm_table.iterrows(): + temp = [row['busId'], list(conn_table[(conn_table['comm'] == row['comm']) & (conn_table['start_busid'] == row['busId'])]['end_busid'].unique())] + groups.append(temp) + nranks = list(comm_table['nranks']) + outputs = [] + rank_outputs = [] + tempRank = None + for id, group in enumerate(groups): + if tempRank == None: + tempRank = nranks[id] + ds = DisjointSet() + else: + if tempRank != nranks[id]: + for _, v in ds.group.items(): + if v not in outputs: + outputs.append(v) + ds = DisjointSet() + tempRank = nranks[id] + for node in group[1]: + ds.add(group[0], node) + + if id == len(groups) - 1: + for _, v in ds.group.items(): + if v not in outputs: + outputs.append(v) + return outputs + +def collect_topo(comm, nranks, ring_table, tree_table, conn_table, lines): + ring_lines = ring_table[(ring_table['comm'] == comm) & (ring_table['nranks'] == nranks)]['topo_line'].tolist() + tree_lines = tree_table[(tree_table['comm'] == comm) & (tree_table['nranks'] == nranks)]['topo_line'].tolist() + conn_lines = conn_table[(conn_table['comm'] == comm) & (conn_table['nranks'] == nranks)]['conn_line'].tolist() + for line in (ring_lines + tree_lines + conn_lines): + lines.append(line) + return lines + +def modify_label(in_txt, out_txt, rank_busId_mapping): + fs = open(in_txt, 'r') + lines = fs.readlines() + fs.close() + with open(out_txt, 'w') as f: + for line in lines: + if "fontsize" in line: + for m in re.finditer('"([^"]*)"', line): + rank_quot = line[m.start(0):m.end(0)] + rank = line[m.start(0)+1:m.end(0)-1] # find the rank with no quotation + line = line.replace(rank_quot,'\"'+rank_busId_mapping[rank]+'\"', 1) + f.write("{}".format(line)) + break + else: + f.write("{}".format(line)) + + +def coll_trace_processor(coll_trace_lines, group_list, coll_table): + print(group_list) + def process_string(line): + split_list = line.split("[") + return split_list[1].split("]")[0] + # First pass + results, func_name = {}, {} + busIds = [] + RCCL_2_8 = False + for line in coll_trace_lines: + if ((" ## " in line) and ("KL HWID" in line or "KE" in line or "CE" in line)): # RCCL From ROCm 4.3 + split_list = line.split(" ") + t = float(process_string(split_list[5])) # seconds.microseconds + rk, blk = process_string(split_list[6]).split(":") # rank:block_id + op = int(split_list[7], 16) # hex to decimal + ####### Kernel Launch ####### + if split_list[8] == 'KL': + if "busId" in split_list: + busId = split_list[split_list.index("busId") + 1] + nranks = split_list[split_list.index("nRanks") + 1] + KL_key = str(op) + "," + str(busId) + "," + str(nranks) + ",t0" + else: + # RCCL 2.8 or older + busId = int(rk) + RCCL_2_8 = True + KL_key = str(op) + "," + str(busId) + ",t0" + if KL_key not in results or results[KL_key] > t: + results[KL_key] = t + if busId not in busIds: + busIds.append(busId) + if op not in func_name: + func_name[op] = split_list[11] # only work for KL + + ####### Kernel End ####### + elif split_list[8] == 'KE': + if "busId" in split_list: + busId = split_list[split_list.index("busId") + 1] + nranks = split_list[split_list.index("nRanks") + 1] + KE_key = str(op) + "," + str(busId) + "," + str(nranks) + ",t1" + else: + # RCCL 2.8 or below + busId = int(rk) + RCCL_2_8 = True + KE_key = str(op) + "," + str(busId) + ",t1" + if KE_key not in results or results[KE_key] < t: + results[KE_key] = t + + + elif split_list[8] == 'CE': + if "busId" in split_list: + busId = split_list[split_list.index("busId") + 1] + nranks = split_list[split_list.index("nRanks") + 1] + CE_key = str(op) + "," + str(busId) + "," + str(nranks) + ",t1" + else: + # RCCL 2.8 or older + busId = int(rk) + RCCL_2_8 = True + CE_key = str(op) + "," + str(busId) + ",t1" + + if CE_key not in results or results[CE_key] < t: + results[CE_key] = t + + elif ((" ## " in line) and ("Abort" in line)): + raise AssertionError("Abort") + + + # Second pass + bw_tables, time_tables = [], [] + for i, group in enumerate(group_list): + group = list(group) + bw_list, time_list = [], [] + for op in func_name: + found = False + time_start, time_end = 1e9, 0 + if RCCL_2_8: + for busId in range(len(group)): + key = str(op) + "," + str(busId) + if (key + ",t0") in results: + found = True + if busId == 0: + temp = [op, func_name[op]] # opCount, function name + t_start = results[key + ",t0"] + t_end = results[key + ",t1"] + temp.append(t_end - t_start) + if t_start < time_start: + time_start = t_start + if t_end > time_end: + time_end = t_end + else: + for busId in group: + key = str(op) + "," + str(busId) + "," + str(len(group)) + if (key + ",t0") in results: + found = True + if busId == group[0]: + temp = [op, func_name[op]] # opCount, function name + t_start = results[key + ",t0"] + t_end = results[key + ",t1"] + temp.append(t_end - t_start) + if t_start < time_start: + time_start = t_start + if t_end > time_end: + time_end = t_end + if found: + algobw_factor_times_size = coll_table[(coll_table['opCount'] == op) & (coll_table['nranks'] == len(group))]['algobw_factor_times_size'].unique()[0] + data_size = coll_table[(coll_table['opCount'] == op) & (coll_table['nranks'] == len(group))]['data_size'].unique()[0] + bw = temp[:2] + for k in range(len(group)): + bw.append(algobw_factor_times_size / temp[k + 2] /1e9) + + bw = bw + [data_size, algobw_factor_times_size/(time_end - time_start)/1e9] + temp.append(data_size) + time_list.append(temp) + bw_list.append(bw) + + time_tables.append(pd.DataFrame(time_list, columns = ['opCount', 'Function Name'] + group + ['data_size'])) + bw_tables.append(pd.DataFrame(bw_list, columns = ['opCount', 'Function Name'] + group + ['data_size', 'algBW'])) + return time_tables, bw_tables + +def rccl_log_process(): + debug_log = os.path.abspath(args.rccl_debug_log) + if args.cuda: + coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = get_useful_info(debug_log) + coll_table = coll_table_build(coll_lines) + device_group_list = [[i for i in range(args.num_devices)]] + print("Using CUDA or ROCm with RCCL 2.8 or below.") + else: + device_grouping_output = os.path.join(os.path.dirname(os.path.realpath(__file__)), "device_groups.txt") + coll_table, conn_table, comm_table, ring_table, tree_table, coll_trace_lines = create_table(debug_log) + device_group_list = device_grouping(comm_table, conn_table) + with open(device_grouping_output, 'w') as f: + for mySet in device_group_list: + f.write("%s\n" % str(mySet)) + + for i, group in enumerate(device_group_list): + group = list(group) + print(group) + temp_txt = os.path.join(os.path.dirname(os.path.realpath(__file__)), "temp_{}.txt".format(i)) + temp1_txt = os.path.join(os.path.dirname(os.path.realpath(__file__)), "temp_1_{}.txt".format(i)) + temp2_txt = os.path.join(os.path.dirname(os.path.realpath(__file__)), "temp_2_{}.txt".format(i)) + nranks = len(group) + lines = [] + rank_busId_mapping = {} + for device in group: + comms = comm_table[(comm_table['nranks'] == nranks) & (comm_table['busId'] == device)]['comm'].unique() + assert len(comms) == 1 # if not 1, it means different devices may have same nRanks and busId. + lines = collect_topo(comms[0], nranks, ring_table, tree_table, conn_table, lines) + rank = comm_table[(comm_table['nranks'] == nranks) & (comm_table['busId'] == device)]['rank'].unique()[0] + busId = comm_table[(comm_table['nranks'] == nranks) & (comm_table['busId'] == device)]['busId'].unique()[0] + rank_busId_mapping[rank] = busId + + with open(temp_txt, 'w') as f: + for line in lines: + f.write("{}\n".format(line)) + + p1 = subprocess.Popen("awk -f extract_topo.awk {} > {}".format(temp_txt, temp1_txt), + stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True) + time.sleep(1) + modify_label(temp1_txt, temp2_txt, rank_busId_mapping) + p2 = subprocess.Popen("dot -Tpng {} -o '{}.png'".format(temp2_txt, i), stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True) + p3 = subprocess.Popen("rm -r temp*.txt", stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True) + + ########### Collective trace processor ########### + time_tables, bw_tables = coll_trace_processor(coll_trace_lines, device_group_list, coll_table) + for i, (time_table, bw_table) in enumerate(zip(time_tables, bw_tables)): + time_csv = os.path.join(os.path.dirname(os.path.realpath(__file__)), "time_{}.csv".format(i)) + bw_csv = os.path.join(os.path.dirname(os.path.realpath(__file__)), "bw_{}.csv".format(i)) + time_table.to_csv(time_csv) + bw_table.to_csv(bw_csv) + + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument("--rccl-debug-log", type=str, required=True, \ + help="RCCL log after running app with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL RCCL_KERNEL_COLL_TRACE_ENABLE=1 executable") + + parser.add_argument("--cuda", action="store_true", default=False, help="If the application is using CUDA systems or ROCm systems with RCCL 2.8 or below, the topology visualizer will not be enabled.") # + parser.add_argument('--num_devices', type=int, default=1, help="If the application is using CUDA systems or ROCm systems with RCCL 2.8 or below, the number of devices needs to be specified.") # + args = parser.parse_args() + #### Requirement check #### + required = {'pandas'} + installed = {pkg.key for pkg in pkg_resources.working_set} + missing = required - installed + if missing: + python = sys.executable + subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL) + import pandas as pd + rccl_log_process() + diff --git a/deviceIdMapping/Makefile b/deviceIdMapping/Makefile new file mode 100644 index 0000000..48b749d --- /dev/null +++ b/deviceIdMapping/Makefile @@ -0,0 +1,28 @@ +HIP_PATH?= $(wildcard /opt/rocm/hip) + +ifeq (,$(HIP_PATH)) +$(error Please set environment variable HIP_PATH manually.) +endif + +ROCM_PATH?= $(wildcard /opt/rocm) +ifeq (,$(HIP_PATH)) +$(error Please set environment variable ROCM_PATH manually.) +endif + +HIPCC=$(HIP_PATH)/bin/hipcc + + +EXE=hip_rocm_smi_mapping + +files=$(EXE).cpp + +CXXFLAGS=-g -O3 -I$(ROCM_PATH)/rocm_smi/include -lrocm_smi64 -L$(ROCM_PATH)/rocm_smi/lib + +all: $(EXE) + +$(EXE): $(files) + $(HIPCC) $(CXXFLAGS) $^ -o $@ + + +clean: + rm -f *.o $(EXE) \ No newline at end of file diff --git a/deviceIdMapping/README.md b/deviceIdMapping/README.md new file mode 100644 index 0000000..93de978 --- /dev/null +++ b/deviceIdMapping/README.md @@ -0,0 +1,66 @@ +# Device ID Mapping + +The device indices on HIP, ROCM-SMI, and RCCL (cudaDev) are not necessarily matching. The only unique information for devices is PCI bus ID. Therefore, this sub-project is meant to create a debugging tool across different software stacks. + + + +**On ROCM-SMI (--showbus):** +The following outputs are from this tool on a server with 8 MI60 GPUs. +``` +======================= ROCm System Management Interface ======================= +================================== PCI Bus ID ================================== +GPU[0] : PCI Bus: 0000:43:00.0 +GPU[1] : PCI Bus: 0000:23:00.0 +GPU[2] : PCI Bus: 0000:26:00.0 +GPU[3] : PCI Bus: 0000:03:00.0 +GPU[4] : PCI Bus: 0000:E3:00.0 +GPU[5] : PCI Bus: 0000:C3:00.0 +GPU[6] : PCI Bus: 0000:C6:00.0 +GPU[7] : PCI Bus: 0000:83:00.0 +================================================================================ +============================= End of ROCm SMI Log ============================== +``` + + +**Device ID Mapping** + +The following outputs are from this tool on a server with 8 MI60 GPUs. We can use PCI bus id to get device id on HIP by using [hipDeviceGetByPCIBusId](https://rocmdocs.amd.com/en/latest/ROCm_API_References/HIP_API/Initialization-and-Version.html?highlight=hipDeviceGetByPCIBusId#hipdevicegetbypcibusid). + +``` +====================== Number of HIP visible devices: 8 +====================== Number of GPUs on your machine which can be observed by ROCm-SMI: 8 +====== ROCm-SMI device ID ======= PCI bus ID ======= HIP device ID ====== + 0 0000:43:00.0 0 + 1 0000:23:00.0 N/A (cannot map PCI Bus ID: 0000:23:00.0 to a HIP visible device) + 2 0000:26:00.0 N/A (cannot map PCI Bus ID: 0000:26:00.0 to a HIP visible device) + 3 0000:03:00.0 1 + 4 0000:e3:00.0 N/A (cannot map PCI Bus ID: 0000:e3:00.0 to a HIP visible device) + 5 0000:c3:00.0 N/A (cannot map PCI Bus ID: 0000:c3:00.0 to a HIP visible device) + 6 0000:c6:00.0 N/A (cannot map PCI Bus ID: 0000:c6:00.0 to a HIP visible device) + 7 0000:83:00.0 2 + +``` + +As the results shown above, we confirmed that the device (enumeration) indices on HIP and ROCM-SMI are identical when there is no HIP_VISIBLE_DEVICES used. + + +When [HIP_VISIBLE_DEVICES](https://rocmdocs.amd.com/en/latest/Other_Solutions/Other-Solutions.html?highlight=HIP_VISIBLE_DEVICES#hip-environment-variables) or [ROCR_VISIBLE_DEVICES](https://rocmdocs.amd.com/en/latest/ROCm_System_Managment/ROCm-System-Managment.html?highlight=ROCR_VISIBLE_DEVICES#rocr-visible-devices) is used, the indices will not match for HIP and ROCM-SMI. The following outputs are from +``` +HIP_VISIBLE_DEVICES=0,3,7 ./hip_rocm_smi_mapping +or +ROCR_VISIBLE_DEVICES=0,3,7 ./hip_rocm_smi_mapping +``` +``` +====================== Number of HIP visible devices: 3 +====================== Number of GPUs on your machine which can be observed by ROCm-SMI: 8 +=================== ROCm-SMI device ID =================== PCI bus ID =================== HIP device ID =================== +0 ---> 0000:43:00.0 ---> 0 +1 ---> 0000:23:00.0 ---> N/A (cannot map PCI Bus ID: 0000:23:00.0 to a HIP visible device) +2 ---> 0000:26:00.0 ---> N/A (cannot map PCI Bus ID: 0000:26:00.0 to a HIP visible device) +3 ---> 0000:03:00.0 ---> 1 +4 ---> 0000:e3:00.0 ---> N/A (cannot map PCI Bus ID: 0000:e3:00.0 to a HIP visible device) +5 ---> 0000:c3:00.0 ---> N/A (cannot map PCI Bus ID: 0000:c3:00.0 to a HIP visible device) +6 ---> 0000:c6:00.0 ---> N/A (cannot map PCI Bus ID: 0000:c6:00.0 to a HIP visible device) +7 ---> 0000:83:00.0 ---> 2 + +``` \ No newline at end of file diff --git a/deviceIdMapping/hip_rocm_smi_mapping.cpp b/deviceIdMapping/hip_rocm_smi_mapping.cpp new file mode 100644 index 0000000..3033cfa --- /dev/null +++ b/deviceIdMapping/hip_rocm_smi_mapping.cpp @@ -0,0 +1,110 @@ +/* +Copyright (c) 2015-present Advanced Micro Devices, Inc. All rights reserved. + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE. +*/ + +#include +#include +#include "hip/hip_runtime.h" +#include +#include +#include +#include +#include +#include +#include +#include "rocm_smi/rocm_smi.h" + +#define KNRM "\x1B[0m" +#define KRED "\x1B[31m" +#define KGRN "\x1B[32m" +#define KYEL "\x1B[33m" +#define KBLU "\x1B[34m" +#define KMAG "\x1B[35m" +#define KCYN "\x1B[36m" +#define KWHT "\x1B[37m" + +#define failed(...) \ + printf("%serror: ", KRED); \ + printf(__VA_ARGS__); \ + printf("\n"); \ + printf("error: TEST FAILED\n%s", KNRM); \ + exit(EXIT_FAILURE); + +#define HIPCHECK(error) \ + if (error != hipSuccess) { \ + printf("%serror: '%s'(%d) at %s:%d%s\n", KRED, hipGetErrorString(error), error, __FILE__, \ + __LINE__, KNRM); \ + failed("API returned error code."); \ + } + +std::string int2hex(uint64_t num, int digits) { + std::stringstream sstream; + sstream << std::setfill('0') << std::setw(digits) << std::hex << num; + std::string result = sstream.str(); + return result; +} + + +int main(int argc, char* argv[]) { + int deviceCnt; + HIPCHECK(hipGetDeviceCount(&deviceCnt)); + std::cout << "====================== Number of HIP visible devices: " << deviceCnt << std::endl; + /* Example outputs from rocm-smi --showbus + + GPU[0] : PCI Bus: 0000:43:00.0 + GPU[1] : PCI Bus: 0000:23:00.0 + GPU[2] : PCI Bus: 0000:26:00.0 + GPU[3] : PCI Bus: 0000:03:00.0 + GPU[4] : PCI Bus: 0000:E3:00.0 + GPU[5] : PCI Bus: 0000:C3:00.0 + GPU[6] : PCI Bus: 0000:C6:00.0 + GPU[7] : PCI Bus: 0000:83:00.0 + */ + + // ROCm-smi + rsmi_status_t ret; + uint32_t num_devices; + ret = rsmi_init(0); + ret = rsmi_num_monitor_devices(&num_devices); + std::cout << "====================== Number of GPUs on your machine which can be observed by ROCm-SMI: "<< num_devices << std::endl; + std::cout << "====== ROCm-SMI device ID ======= PCI bus ID ======= HIP device ID ======" << std::endl; + for (int i = 0; i < num_devices; i++) { + uint64_t val_ui64; // bdfid in rocm_smi.cc + rsmi_status_t err = rsmi_dev_pci_id_get(i, &val_ui64); + //std::cout << "\t**PCI ID (BDFID): 0x" << std::hex << val_ui64 << std::endl; + auto domain = (val_ui64 >> 32) & 0xffff; + auto bus = (val_ui64 >> 8) & 0xff; + auto device = (val_ui64 >> 3) & 0x1f; + auto function = val_ui64 & 0x7; + std::string pciString = int2hex(domain, 4) + ":" + int2hex(bus, 2) + ":" + int2hex(device, 2) + "." + int2hex(function, 1); + const char* busIdStr = (pciString).c_str(); + int hipDeviceId; + std::cout << " " << i << " "<< pciString << " "; + if (hipDeviceGetByPCIBusId(&hipDeviceId, busIdStr) != hipSuccess) { + std::cout << "N/A (cannot map PCI Bus ID: " << busIdStr << " to a HIP visible device)" << std::endl; + } else std::cout << hipDeviceId << std::endl; + + // pic_id = '{:04X}:{:02X}:{:02X}.{:0X}'.format(domain, bus, device, function) + // + //std::cout << domain << "; "<< bus << "; "<< device << "; "<< function << std::endl; + } + ret = rsmi_shut_down(); +} \ No newline at end of file diff --git a/generate_summary.py b/generate_summary.py index 6412527..272d059 100644 --- a/generate_summary.py +++ b/generate_summary.py @@ -1,124 +1,75 @@ import os import sys import argparse -import re +import pandas as pd -def get_script_commands(script_file): +def rccl_tests_log_processor(script_file): fs = open(script_file, 'r') lines = fs.readlines() fs.close() - - commands = [] - for j in range(len(lines)): - line = lines[j].rstrip() - commands.append(line) - - return commands - -def parse_useful_information(log_file): - fs = open(log_file, 'r') - lines = fs.readlines() - fs.close() - - useful_lines = [] + results = [] + temp = [] for j in range(len(lines)): line = lines[j].rstrip() - if ("time" in line and "algbw" in line and "busbw" in line): + if ('==========================================================' in line and j != 0): + results.append(temp) + temp = [] + elif ("time" in line and "algbw" in line and "busbw" in line): perf_line = lines[j+2] - if ("Avg bus bandwidth" in lines[j+5]): + if ("Avg bus bandwidth" in lines[j + 5]): perf_line = perf_line + lines[j + 5] - elif ("Avg bus bandwidth" in lines[j+4]): - perf_line = perf_line + lines[j+4] - useful_lines.append(perf_line) - return useful_lines - -def parse_nccl_performance(useful_lines, commands): + elif ("Avg bus bandwidth" in lines[j + 4]): + perf_line = perf_line + lines[j + 4] + temp.append(perf_line) - perf_lines = [] - perf_lines.append("sep=|") - header = "size|count|type|redop|root|time-oplace(us)|algbw(gb/s)-oplace|busbw(gb/s)-oplace|error|" + \ - "time-iplace(us)|algbw(gb/s)-iplace|busbw(gb/s)-iplace|error|avg_bus_bw|commands" - #print(header) - num_fields = len(header.split("|")) - perf_lines.append(header) - for j in range(len(useful_lines)): - line = useful_lines[j] - line = line.replace("# Avg bus bandwidth : ", "") - - split_list = line.split() - perf_line = "" - field_index = 0 - for i in range(len(split_list)): - perf_line = perf_line + split_list[i] + "|" - # Some collectives do not involve a redop - if field_index==2 and "reduce" not in commands[j].lower(): - perf_line = perf_line + "|" - field_index = field_index + 1 - # Only broadcast and reduce involve a root - if ( - field_index==3 and - re.search(r'\Wreduce_perf', commands[j]) is None and - re.search(r'\Wbroadcast_perf', commands[j]) is None - ): - perf_line = perf_line + "|" - field_index = field_index + 1 - field_index = field_index + 1 - #print (perf_line + commands[j]) - perf_line = perf_line + commands[j] - assert len(perf_line.split("|")) == num_fields - perf_lines.append(perf_line) + return results - return perf_lines - -def get_counts_from_file(count_file): - fs = open(count_file, 'r') - lines = fs.readlines() - fs.close() - - counts = [] - for j in range(1, len(lines)): - line = lines[j].rstrip() - counts.append(line.split("|")[1]) - return counts - -def update_perf_lines(perf_lines, counts): +def parse_nccl_performance(perf_lines, net_counts_csv, output_file): + size, count_1, datatype, op_type, time_oplace, algbw_gbs_oplace, busbw_oplace, error_oplace, time_iplace, algbw_gbs_iplace, busbw_iplace, error_iplace, avg_busbw, count = [], [], [], [], [], [], [], [], [], [], [], [], [], [] + count_table = pd.read_csv(net_counts_csv) + for i, row in count_table.iterrows(): + for line in perf_lines[i]: + split_list = line.split() + size.append(split_list[0]) + count_1.append(split_list[1]) + datatype.append(split_list[2]) + op_type.append(split_list[3]) + time_oplace.append(split_list[4]) + algbw_gbs_oplace.append(split_list[5]) + busbw_oplace.append(split_list[6]) + error_oplace.append(split_list[7]) + time_iplace.append(split_list[8]) + algbw_gbs_iplace.append(split_list[9]) + busbw_iplace.append(split_list[10]) + error_iplace.append(split_list[11]) + avg_busbw.append(split_list[-1]) + assert row['count'] % len(perf_lines[i]) == 0 + count.append(int(row['count']//len(perf_lines[i]))) - updated_lines = [] - updated_lines.append("sep=|") - updated_lines.append(perf_lines[1] + "|count") - for j in range(2, len(perf_lines)): - perf_line = perf_lines[j] + "|" + counts[j-2] - updated_lines.append(perf_line) + dict_summary = {"size":size, "count_1":count_1, "datatype":datatype, + "op_type":op_type, "time_oplace":time_oplace, "algbw_gbs_oplace":algbw_gbs_oplace, + "busbw_oplace":busbw_oplace, "error_oplace":error_oplace, "time_iplace":time_iplace, + "algbw_gbs_iplace":algbw_gbs_iplace, "busbw_iplace":busbw_iplace, "error_iplace":error_iplace, + "avg_busbw":avg_busbw, "count":count} + table = pd.DataFrame(dict_summary) + table.to_csv(output_file) + print ("INFO: Dumped out the count of each command in a file named: {}".format(output_file)) + return table - return updated_lines - -def generate_output_file(out_file, perf_lines): - fs = open(out_file, 'w') - for j in range(len(perf_lines)): - fs.write(perf_lines[j]) - fs.write('\n') - fs.close() - print ("INFO: Dumped out the performance.") def main(): log_file = os.path.abspath(args.log_file) + count_file = os.path.abspath(args.count_file) out_file = args.output_file_name + ".csv" - script_file = os.path.abspath(args.script_file) - - commands = get_script_commands(script_file) - useful_lines = parse_useful_information(log_file) - perf_lines = parse_nccl_performance(useful_lines, commands) - if args.count_file: - counts = get_counts_from_file(os.path.abspath(args.count_file)) - perf_lines = update_perf_lines(perf_lines, counts) - generate_output_file(out_file, perf_lines) + results = rccl_tests_log_processor(log_file) # net_unique_topo.sh + parse_nccl_performance(results, count_file, out_file) + if __name__ == '__main__': parser = argparse.ArgumentParser() - parser.add_argument("--log-file", type=str, required=True, help="Log file generated while running rccl-tests") + parser.add_argument("--log-file", type=str, required=True, help="Log file generated while running rccl-tests with net_unique_topo.sh") parser.add_argument("--output-file-name", type=str, required=False, default="net_summary") - parser.add_argument("--script-file", type=str, required=True, help="Script file to run NCCL/RCCL Tests") - parser.add_argument("--count-file", type=str, required=False, help="net_count file generated while running unique option in parser.") + parser.add_argument("--count-file", type=str, required=False, default="net_counts.csv", help="net_count file generated while running unique option in parser.") args = parser.parse_args() main() diff --git a/install.sh b/install.sh new file mode 100644 index 0000000..d7567fd --- /dev/null +++ b/install.sh @@ -0,0 +1,7 @@ +#!/bin/bash +sudo apt-get install -y gawk +sudo apt-get install -y graphviz +pip install pandas +pip install networkx +cd deviceIdMapping; make +./hip_rocm_smi_mapping > busId_HIP_map.txt diff --git a/net_unique_topo.sh b/net_unique_topo.sh new file mode 100644 index 0000000..2939cb6 --- /dev/null +++ b/net_unique_topo.sh @@ -0,0 +1,65 @@ +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,4,5,7 ./build/broadcast_perf -d int8 -b 40 -e 40 -o sum -g 4 +HIP_VISIBLE_DEVICES=3,1,0,2 ./build/broadcast_perf -d int8 -b 40 -e 40 -o sum -g 4 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 25952256 -e 25952256 -o sum -g 2 +HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 25952256 -e 25952256 -o sum -g 2 +HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 25952256 -e 25952256 -o sum -g 2 +HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 25952256 -e 25952256 -o sum -g 2 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 2097152 -e 2097152 -o sum -g 2 +HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 2097152 -e 2097152 -o sum -g 2 +HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 2097152 -e 2097152 -o sum -g 2 +HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 2097152 -e 2097152 -o sum -g 2 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 2048 -e 2048 -o sum -g 2 +HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 2048 -e 2048 -o sum -g 2 +HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 2048 -e 2048 -o sum -g 2 +HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 2048 -e 2048 -o sum -g 2 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 1572864 -e 1572864 -o sum -g 2 +HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 1572864 -e 1572864 -o sum -g 2 +HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 1572864 -e 1572864 -o sum -g 2 +HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 1572864 -e 1572864 -o sum -g 2 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 1536 -e 1536 -o sum -g 2 +HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 1536 -e 1536 -o sum -g 2 +HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 1536 -e 1536 -o sum -g 2 +HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 1536 -e 1536 -o sum -g 2 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 524288 -e 524288 -o sum -g 2 +HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 524288 -e 524288 -o sum -g 2 +HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 524288 -e 524288 -o sum -g 2 +HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 524288 -e 524288 -o sum -g 2 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,4,5,7 ./build/broadcast_perf -d int8 -b 32 -e 32 -o sum -g 4 +HIP_VISIBLE_DEVICES=3,1,0,2 ./build/broadcast_perf -d int8 -b 32 -e 32 -o sum -g 4 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,4,5,7 ./build/broadcast_perf -d int8 -b 65600 -e 65600 -o sum -g 4 +HIP_VISIBLE_DEVICES=3,1,0,2 ./build/broadcast_perf -d int8 -b 65600 -e 65600 -o sum -g 4 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_reduce_perf -d half -b 16777216 -e 16777216 -o sum -g 4 +HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_reduce_perf -d half -b 16777216 -e 16777216 -o sum -g 4 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_reduce_perf -d float -b 32768 -e 32768 -o max -g 4 +HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_reduce_perf -d float -b 32768 -e 32768 -o max -g 4 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_reduce_perf -d float -b 32768 -e 32768 -o sum -g 4 +HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_reduce_perf -d float -b 32768 -e 32768 -o sum -g 4 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_gather_perf -d int8 -b 4194304 -e 4194304 -o sum -g 4 +HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_gather_perf -d int8 -b 4194304 -e 4194304 -o sum -g 4 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,2 ./build/all_reduce_perf -d half -b 179429376 -e 179429376 -o sum -g 2 +HIP_VISIBLE_DEVICES=3,7 ./build/all_reduce_perf -d half -b 179429376 -e 179429376 -o sum -g 2 +HIP_VISIBLE_DEVICES=0,4 ./build/all_reduce_perf -d half -b 179429376 -e 179429376 -o sum -g 2 +HIP_VISIBLE_DEVICES=1,5 ./build/all_reduce_perf -d half -b 179429376 -e 179429376 -o sum -g 2 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,2 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 2 +HIP_VISIBLE_DEVICES=3,7 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 2 +HIP_VISIBLE_DEVICES=0,4 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 2 +HIP_VISIBLE_DEVICES=1,5 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 2 +echo '==========================================================' +HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 4 +HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 4 +echo '==========================================================' diff --git a/rccl_nccl_parser.py b/rccl_nccl_parser.py index 8e12a84..4a06b37 100644 --- a/rccl_nccl_parser.py +++ b/rccl_nccl_parser.py @@ -1,6 +1,9 @@ import os import sys import argparse +from collections import defaultdict +import pandas as pd +import networkx as nx coll_op_map = { "Broadcast": "broadcast_perf", @@ -51,49 +54,59 @@ "9" : 2, #"10" : Not sure. } - + +def algobw_factor_times_size(coll_type, nranks, total_bytes): + # n: number of ranks + # n links of Bandwidth B to perform a operation + # def factor_1(n): + # return n + def all_gather_factor(n): + return (n-1)/n + def reduce_scatter_factor(n): + return (n-1)/n + def all_reduce_factor(n): + return 2*(n-1)/n + def all_to_all_factor(n): + return 2*(n-1)/n + + nranks = float(nranks) + if coll_type == "AllGather": + return all_gather_factor(nranks) * float(total_bytes) + elif coll_type == "ReduceScatter": + return reduce_scatter_factor(nranks) * float(total_bytes) + elif coll_type == "AllReduce": + return all_reduce_factor(nranks) * float(total_bytes) + elif coll_type == "AllToAll": + return all_to_all_factor(nranks) * float(total_bytes) + else: + return float(1) * float(total_bytes) + def get_useful_info(log_file): fs = open(log_file, 'r') lines = fs.readlines() fs.close() - useful_lines = [] + coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = [], [], [], [], [], [] for j in range(len(lines)): line = lines[j].rstrip() if ("opCount" in line and "sendbuff" in line): - useful_lines.append(line) - - return useful_lines - -def parse_nccl_log(nccl_lines): - - commands = [] - for j in range(len(nccl_lines)): - line = nccl_lines[j] - split_list = line.split(" ") - comm = split_list[split_list.index("INFO") + 1].replace(":", "") - count = split_list[split_list.index("count") + 1] - datatype = split_list[split_list.index("datatype") + 1] - op_type = split_list[split_list.index("op") + 1] - root = split_list[split_list.index("root") + 1] - nnranks = next(item for item in split_list if 'nranks' in item).split("=")[1].replace("]", "") - - #print (comm) - #print (count) - #print (datatype) - #print (op_type) - #print (root) - #print (nnranks) - - total_bytes = int(count) * data_type_bytes_map[datatype] - - test_cmd = "./build/" + coll_op_map[comm] + " -d " + data_types_map[datatype] + \ - " -b " + str(total_bytes) + " -e " + str(total_bytes) + \ - " -o " + reduction_op_map[op_type] + " -g " + str(nnranks) - #print (test_cmd) - commands.append((test_cmd, int(nnranks))) - - return commands + coll_lines.append(line) + elif ("Channel" in line and "via" in line): + conn_lines.append(line) + elif ("Init COMPLETE" in line and "busId" in line): + comm_lines.append(line) + elif ("NCCL INFO Ring" in line): + ring_lines.append(line) + elif ("NCCL INFO Trees" in line): + tree_lines.append(line) + elif ((" ## " in line) and ("KL HWID" in line or "KE" in line or "CE" in line)): # RCCL From ROCm 4.3 + # Bug: [ 6628.064978] we need to consider the case when there is a spac right after '[' + # Everything with split_list[index] will break. + if "[ " in line: + line = line.replace("[ ", "[") + coll_trace_lines.append(line) + + return coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines def generate_script(commands, output_script): filename = output_script + ".sh" @@ -104,55 +117,299 @@ def generate_script(commands, output_script): fs.close() print("INFO: Dumped out the commands in a script named: {}".format(filename)) -def dump_counts_map(counts_map, output_file): +def dump_counts_map(unique_command_list, counts_list, output_file): ########################### filename = output_file + ".csv" - fs = open(filename, 'w') - fs.write("sep=|") - fs.write("\n") - keys = counts_map.keys() - for key in keys: - fs.write(key + "|" + str(counts_map[key])) - fs.write("\n") + dict_command_count = {'command': unique_command_list, 'count': counts_list} + table = pd.DataFrame(dict_command_count) + table.to_csv(filename) + print ("INFO: Dumped out the count of each command in a file named: {}".format(filename)) + +# ts1-sjc2-27:11585:11585 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f988f200400 recvbuff 0x7f988f200400 count 440 datatype 0 op 0 root 0 comm 0x7f905c000dc0 [nranks=4] stream 0x55aa0a8d78f0 +def coll_table_build(coll_lines): + opCount, coll, count, datatype, op_type, root, comm, nranks, data_size = [], [], [], [], [], [], [], [], [] + for line in coll_lines: + split_list = line.split() + coll.append(split_list[4][:-1]) + opCount.append(int(split_list[6], 16)) + count.append(split_list[12]) + datatype.append(split_list[14]) + op_type.append(split_list[16]) + root.append(split_list[18]) + comm.append(split_list[20]) + nranks.append(int(next(item for item in split_list if 'nranks' in item).split("=")[1].replace("]", ""))) ### + data_size.append(int(split_list[split_list.index("count") + 1]) * data_type_bytes_map[split_list[split_list.index("datatype") + 1]]) + + dict_coll = {'coll': coll, 'opCount': opCount, 'datatype': datatype, 'count':count, 'op_type':op_type, 'root':root, 'comm':comm, 'nranks':nranks, 'data_size':data_size} + table = pd.DataFrame(dict_coll) + table['algobw_factor_times_size'] = table.apply(lambda row: + algobw_factor_times_size(row['coll'], row['nranks'], row['data_size']), axis=1) + table['raw_command'] = coll_lines + return table + +def conn_table_build(conn_lines, legacy_device_grouping): # Only works for RCCL 2.9 or above + def process_string(line): + split_list = line.split("[") + return [split_list[0], split_list[1].split("]")[0]] + + start_rank, start_busid, end_rank, end_busid, connection, comm, nranks = [], [], [], [], [], [], [] + for line in conn_lines: + split_list = line.split(" ") + sr, sb = process_string(split_list[split_list.index(":") + 1]) # first device + er, eb = process_string(split_list[split_list.index("->") + 1]) # second device + start_rank.append(sr) + start_busid.append(sb) + end_rank.append(er) + end_busid.append(eb) + connection.append(split_list[split_list.index("via") + 1]) # if it is direct, it means the connection is done by direct shared memory + if not legacy_device_grouping: + if "comm" not in line: + raise AssertionError("This NCCL/RCCL log is from an older version. Please use RCCL 2.9 or above.") + comm.append(split_list[split_list.index("comm") + 1]) + nranks.append(int(split_list[split_list.index("nRanks") + 1])) + + if not legacy_device_grouping: + dict_conn = {'start_rank': start_rank, 'start_busid': start_busid, 'end_rank': end_rank, 'end_busid': end_busid, + 'connection': connection, 'comm':comm, 'nranks':nranks, 'conn_line':conn_lines} + else: + dict_conn = {'start_rank': start_rank, 'start_busid': start_busid, 'end_rank': end_rank, 'end_busid': end_busid, + 'connection': connection, 'conn_line':conn_lines} + return pd.DataFrame(dict_conn) + +def comm_table_build(comm_lines): + comm, rank, nranks, cudaDev, busId = [], [], [], [], [] + for line in comm_lines: + split_list = line.rstrip().split(" ") + comm.append(split_list[5]) + rank.append(split_list[7]) + nranks.append(int(split_list[9])) + cudaDev.append(split_list[11]) + busId.append(split_list[13]) + dict_comm = {'comm':comm, 'rank':rank, 'nranks':nranks, 'cudaDev':cudaDev, 'busId':busId} + return pd.DataFrame(dict_comm) + + +class DisjointSet(object): # https://stackoverflow.com/questions/3067529/a-set-union-find-algorithm + + def __init__(self): + self.leader = {} # maps a member to the group's leader + self.group = {} # maps a group leader to the group (which is a set) + + def add(self, a, b): + leadera = self.leader.get(a) + leaderb = self.leader.get(b) + if leadera is not None: + if leaderb is not None: + if leadera == leaderb: return # nothing to do + groupa = self.group[leadera] + groupb = self.group[leaderb] + if len(groupa) < len(groupb): + a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa + groupa |= groupb + del self.group[leaderb] + for k in groupb: + self.leader[k] = leadera + else: + self.group[leadera].add(b) + self.leader[b] = leadera + else: + if leaderb is not None: + self.group[leaderb].add(a) + self.leader[a] = leaderb + else: + self.leader[a] = self.leader[b] = a + self.group[a] = set([a, b]) + +def buildGraph(graphs, connectionList): # add an input of connection list + G = nx.DiGraph() + node_pool = [] + for graph in graphs: + if graph[0][0] not in node_pool: + G.add_node(graph[0][0]) + node_pool.append(graph[0][0]) + for node in graph[1]: + if node not in node_pool: + G.add_node(node) + node_pool.append(node) + label = connectionList[(connectionList['start_busid'] == graph[0][0]) & (connectionList['end_busid'] == node)]['connection'].unique()[0] + G.add_edge(graph[0][0], node, label=label) + + to_remove = [] + for edge in G.edges(): + if (G.has_edge(edge[1], edge[0]) == False): to_remove.append([edge[0], edge[1]]) + for pair in to_remove: G.remove_edge(pair[0], pair[1]) + edges = G.to_undirected().edges() + + pos = nx.circular_layout(G) + nx.draw_networkx_nodes(G, pos, node_color='#ffff00') + nx.draw_networkx_labels(G, pos, font_size=8) + nx.draw_networkx_edges(G, pos, edge_color='b', arrows = True) + nx.draw_networkx_edge_labels(G,pos,edge_labels=nx.get_edge_attributes(G,'label')) + + parents = {} + ds = DisjointSet() + for edge in edges: ds.add(edge[0], edge[1]) + + outputs = [] + for k, v in ds.group.items(): + outputs.append(v) + return outputs + + +def hip_busId_mapping(path_to_deviceIdMapping): + def processBusId(busId): + split_list = busId.split(':') + temp = split_list[2].split('.') + return split_list[1].lstrip('0') + temp[0] + temp[1] + + fs = open(path_to_deviceIdMapping, 'r') + lines = fs.readlines() fs.close() - print ("INFO: Dumped out the count of each command in a file named: {}".format(filename)) - -def get_unique_commands(commands_and_nranks): - unique_values = [] - counts_map = {} - nranks_map = {} - for c_and_nr in commands_and_nranks: - cmd = c_and_nr[0] - nranks = c_and_nr[1] - if (cmd not in unique_values): - counts_map[cmd] = 1 - nranks_map[cmd] = nranks - unique_values.append(cmd) + busId_HIP_map = {} + for j in range(len(lines)): + line = lines[j].rstrip() + if "=" not in line: + split_list = line.split() + busId_HIP_map[processBusId(split_list[1])] = split_list[2] + return busId_HIP_map + +def device_grouping(comm_table, conn_table): + groups = [] + for index, row in comm_table.iterrows(): + temp = [row['busId'], list(conn_table[(conn_table['comm'] == row['comm']) & (conn_table['start_busid'] == row['busId'])]['end_busid'].unique())] + groups.append(temp) + nranks = list(comm_table['nranks']) + outputs = [] + rank_outputs = [] + tempRank = None + for id, group in enumerate(groups): + if tempRank == None: + tempRank = nranks[id] + ds = DisjointSet() else: - counts_map[cmd] = counts_map[cmd] + 1 - assert len(counts_map) == len(nranks_map) - for cmd in counts_map.keys(): - assert counts_map[cmd] % nranks_map[cmd] == 0 - counts_map[cmd] = int(counts_map[cmd] / nranks_map[cmd]) - return unique_values, counts_map + if tempRank != nranks[id]: + for _, v in ds.group.items(): + if v not in outputs: + outputs.append(v) + ds = DisjointSet() + tempRank = nranks[id] + for node in group[1]: + ds.add(group[0], node) + + if id == len(groups) - 1: + for _, v in ds.group.items(): + if v not in outputs: + outputs.append(v) + return outputs + +def generate_topo_script(commands, topo_info, busId_HIP_map, output_script): + filename = output_script + ".sh" + fs = open(filename, "w") + for j in range(len(commands)): + fs.write("echo '==========================================================' \n") + for device_set in topo_info[j]: + device_setting = "HIP_VISIBLE_DEVICES=" + for k, device in enumerate(list(device_set)): + if k == len(device_set) - 1: + device_setting = device_setting + str(busId_HIP_map[device]) + " " + else: + device_setting = device_setting + str(busId_HIP_map[device]) + "," + fs.write(device_setting + commands[j]) + fs.write("\n") + fs.write("echo '==========================================================' \n") + fs.close() + print("INFO: Dumped out the commands with device assignment in a script named: {}".format(filename)) + +def generate_topo(busId_HIP_map, command_list, raw_command_list, coll_table, conn_table, comm_table, legacy_device_grouping, output_name): + all_info = pd.merge(coll_table, comm_table, on=['comm','nranks']) + topo_info = [] + if legacy_device_grouping: + for command in raw_command_list: + split_list = command.split() + coll = split_list[4][:-1] + opCount = int(split_list[6], 16) + count = split_list[12] + datatype = split_list[14] + op_type = split_list[16] + nranks = int(next(item for item in split_list if 'nranks' in item).split("=")[1].replace("]", "")) + #### Filter + selected_info = all_info[(all_info['coll'] == coll) & (all_info['opCount'] == opCount) + & (all_info['datatype'] == datatype) & (all_info['count'] == count) + & (all_info['nranks'] == nranks) & (all_info['op_type'] == op_type)] + stage_2 = pd.merge(selected_info, conn_table, left_on=['rank','busId'], right_on=['start_rank','start_busid'],how='left') + graphs = [] + for _ , subgroup in stage_2.groupby(['busId']): + graphs.append([list(subgroup['busId'].unique()), list(subgroup['end_busid'].unique())]) + outputs = buildGraph(graphs, conn_table) + topo_info.append(outputs) + generate_topo_script(command_list, topo_info, busId_HIP_map, output_name) + else: + device_group_list = device_grouping(comm_table, conn_table) + for command in command_list: + split_list = command.split() + nranks = int(split_list[split_list.index("-g") + 1]) + temp = [] + for deviceSet in device_group_list: + if len(deviceSet) == nranks: + temp.append(deviceSet) + topo_info.append(temp) + generate_topo_script(command_list, topo_info, busId_HIP_map, output_name) + +def get_commands(coll_table, unique): + def nccl_rccl_tests_command(row): + test_cmd = "./build/" + coll_op_map[row['coll']] + " -d " + data_types_map[row['datatype']] \ + + " -b " + str(row['data_size']) + " -e " + str(row['data_size']) \ + + " -o " + reduction_op_map[row['op_type']] + " -g " + str(row['nranks']) + return test_cmd + + command_list = [] + raw_command_list = [] + counts_list = [] + if unique: + unique_coll_table = coll_table.drop_duplicates(subset = ['coll','datatype', 'op_type', 'nranks', 'data_size']) + for _, row in unique_coll_table.iterrows(): + test_cmd = nccl_rccl_tests_command(row) + command_list.append(test_cmd) + count = coll_table[(coll_table['coll'] == row['coll']) & (coll_table['datatype'] == row['datatype']) & + (coll_table['op_type'] == row['op_type']) & (coll_table['nranks'] == row['nranks']) & + (coll_table['data_size'] == row['data_size'])].shape[0] + assert count % row['nranks'] == 0 + counts_list.append(int(count / row['nranks'])) + raw_command_list.append(row['raw_command']) + else: + for _, row in coll_table.iterrows(): + test_cmd = nccl_rccl_tests_command(row) + command_list.append(test_cmd) + return command_list, counts_list, raw_command_list + def main(): log_file = os.path.abspath(args.nccl_debug_log) - nccl_lines = get_useful_info(log_file) - commands_and_nranks = parse_nccl_log(nccl_lines) - #generate_script(commands, args.output_script_name) + coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = get_useful_info(log_file) + coll_table = coll_table_build(coll_lines) + conn_table = conn_table_build(conn_lines, args.legacy_device_grouping) + comm_table = comm_table_build(comm_lines) + command_list, counts_list, raw_command_list = get_commands(coll_table, args.unique) + path_to_deviceIdMapping = os.path.join(os.path.dirname(os.path.realpath(__file__)), "deviceIdMapping/busId_HIP_map.txt") + if os.path.exists(path_to_deviceIdMapping) == False: + raise AssertionError("Remember to run 'sh install.sh' before using this tool.") + busId_hip_map = hip_busId_mapping(path_to_deviceIdMapping) if (args.unique): - new_commands, counts_map = get_unique_commands(commands_and_nranks) - generate_script(new_commands, args.output_script_name + "_unique") - dump_counts_map(counts_map, args.output_script_name + "_counts") + dump_counts_map(command_list, counts_list, args.output_script_name + "_counts") + generate_topo(busId_hip_map, command_list, raw_command_list, coll_table, conn_table, + comm_table, args.legacy_device_grouping, args.output_script_name + "_unique_topo") else: - commands = list(zip(*commands_and_nranks))[0] - generate_script(commands, args.output_script_name) + generate_script(command_list, args.output_script_name) if __name__ == '__main__': parser = argparse.ArgumentParser() - parser.add_argument("--nccl-debug-log", type=str, required=True, help="Log from app with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL") + parser.add_argument("--nccl-debug-log", type=str, required=True, help="RCCL log after running app with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL RCCL_KERNEL_COLL_TRACE_ENABLE=1 ") + parser.add_argument("--legacy-device-grouping", action="store_true", default=False, help="If the application is using CUDA systems or ROCm systems with RCCL 2.8 or below.") # parser.add_argument("--output-script-name", type=str, required=False, default="net_nccl_rccl", help="Output command script") parser.add_argument("--unique", action="store_true", default=False, help="Get only the unique commands.") args = parser.parse_args() main() + +# python rccl_nccl_parser_new.py --nccl-debug-log gpt2_rccl_mp4_log.txt --output-script-name net +# python rccl_nccl_parser_new.py --nccl-debug-log gpt2_rccl_mp4_log.txt --output-script-name net --unique --legacy-device-grouping +# python rccl_nccl_parser_new.py --nccl-debug-log gpt2_rccl_mp4_log_newPR.txt --output-script-name net --unique diff --git a/run_parser_and_generate_summary.py b/run_parser_and_generate_summary.py index 76641a4..21f266e 100644 --- a/run_parser_and_generate_summary.py +++ b/run_parser_and_generate_summary.py @@ -6,7 +6,10 @@ def main(): debug_log = os.path.abspath(args.nccl_debug_log) ## Generate a script to run nccl/rccl tests. - gen_cmd = "python rccl_nccl_parser.py --nccl-debug-log " + debug_log + " --output-script-name net --unique" + if args.legacy_device_grouping: + gen_cmd = gen_cmd = "python rccl_nccl_parser.py --nccl-debug-log " + debug_log + " --output-script-name net --unique --legacy-device-grouping" + else: + gen_cmd = "python rccl_nccl_parser.py --nccl-debug-log " + debug_log + " --output-script-name net --unique" if os.system(gen_cmd): print ("ERROR: Failed to parse the log.") sys.exit(1) @@ -14,43 +17,48 @@ def main(): ## change directory to rccl-tests/nccl-tests if args.rocm: rccl_tests_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "rccl-tests") - os.system("cp net_unique.sh " + rccl_tests_path) - #os.chdir(os.path.join(os.path.dirname(os.path.realpath(__file__)), "rccl-tests")) + os.system("cp net_unique_topo.sh " + rccl_tests_path) os.chdir(rccl_tests_path) if os.system("./install.sh > /dev/null 2>&1"): print("ERROR: Failed to install rccl-tests.") sys.exit(1) - os.system("cat net_unique.sh") - run_script_cmd = "HSA_FORCE_FINE_GRAIN_PCIE=1 sh net_unique.sh | tee rccl_perf_log.txt" + os.system("cat net_unique_topo.sh") + run_script_cmd = "HSA_FORCE_FINE_GRAIN_PCIE=1 sh net_unique_topo.sh | tee topo_rccl_tests.txt" + if os.system(run_script_cmd): print ("ERROR: Unable to run rccl-tests properly.") sys.exit(1) - os.system("mv rccl_perf_log.txt ../") + os.system("mv topo_rccl_tests.txt ../") os.chdir(os.path.join(os.path.dirname(os.path.realpath(__file__)), "../")) print (os.getcwd()) - summary_cmd = "python generate_summary.py --log-file rccl_perf_log.txt --script-file net_unique.sh --count-file net_counts.csv" + summary_cmd = "python generate_summary.py --log-file topo_rccl_tests.txt --output-file-name net_summary --count-file net_counts.csv" + os.system(summary_cmd) print ("INFO: Finished dumping all data.") + + if args.cuda: nccl_tests_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "nccl-tests") - os.system("cp net_unique.sh " + nccl_tests_path) + os.system("cp net_unique_topo.sh " + nccl_tests_path) os.chdir(nccl_tests_path) - if os.system("make > /dev/null 2>&1"): - print ("ERROR: Failed to install nccl-unit tests") + if os.system("./install.sh > /dev/null 2>&1"): + print("ERROR: Failed to install nccl-tests.") sys.exit(1) - os.system("cat net_unique.sh") - run_script_cmd = "sh net_unique.sh | tee nccl_perf_log.txt" + os.system("cat net_unique_topo.sh") + run_script_cmd = "HSA_FORCE_FINE_GRAIN_PCIE=1 sh net_unique_topo.sh | tee topo_rccl_tests.txt" if os.system(run_script_cmd): - print ("ERROR: unable to run nccl-tests") + print ("ERROR: Unable to run nccl-tests properly.") sys.exit(1) - os.system("mv nccl_perf_log.txt ../") + os.system("mv topo_rccl_tests.txt ../") os.chdir(os.path.join(os.path.dirname(os.path.realpath(__file__)), "../")) - - summary_cmd = "python generate_summary.py --log-file nccl_perf_log.txt --script-file net_unique.sh --output-file-name nv_net_summary --count-file net_counts.csv" + + print (os.getcwd()) + summary_cmd = "python generate_summary.py --log-file topo_rccl_tests.txt --output-file-name net_summary --count-file net_counts.csv" + os.system(summary_cmd) print ("INFO: Finished dumping all data.") @@ -60,6 +68,7 @@ def main(): help="NCCL/RCCL log after running app with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL") parser.add_argument("--rocm", action="store_true", default=False, help="Run the tests on ROCm using rccl-tests") parser.add_argument("--cuda", action="store_true", default=False, help="Run the tests on CUDA using nccl-tests") + parser.add_argument("--legacy-device-grouping", action="store_true", default=False, help="NCCL/RCCL log after running app with NCCL or RCCL 2.8 or below") args = parser.parse_args() main()