diff --git a/README.md b/README.md
index 8e3baa9..7a242bf 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# nccl-rccl-parser
+# Topology-aware nccl-rccl-parser
 This tool is used for dumping out the rccl-tests/nccl-test commands directly from an application to identify any potential bottlenecks of scaling while using RCCL/NCCL modules when running a distributed applications.
 
 To get started please clone the following repository: 
@@ -11,8 +11,14 @@ To run the tests, we use the following repositories:
 
 # Pre-requisites:
 * RCCL/NCCL installed. 
-* rccl-tests or nccl-tests installed.
-
+* Clone this repo with 
+  ```
+  git clone --recursive https://github.com/lcskrishna/nccl-rccl-parser.git
+  ```
+* Run installation script by 
+  ```
+  sh install.sh
+  ```
 # How to use the tool:
 
 ### Run application and collect RCCL/NCCL Log:**
@@ -29,7 +35,6 @@ NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL <application> |& tee nccl_debug_log.
 HSA_FORCE_FINE_GRAIN_PCIE=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL <application> |& tee nccl_debug_log.txt
 ```
 
-
 ### Automated way:
 
 To gather the performance results once you have the debug log with you. Run the below command. 
@@ -38,25 +43,33 @@ On CUDA devices, use --cuda argument.
 
 On ROCm devices, use --rocm argument.
 
+With NCCL or RCCL 2.8 or below, the argument "--legacy-device-grouping" is required for device grouping in applications. 
+
 Note: If you don't mention the arguments the automated script only dumps out the output data from the parser. 
 
 **On ROCm:**
 
 ```
-python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --rocm
+python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --rocm --legacy-device-grouping
+```
+
+```
+python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_new_log.txt --rocm
 ```
 
 **On CUDA:**
 
 ```
-python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --cuda
+python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --cuda --legacy-device-grouping
 ```
+
 ### To run the tool manually step by step:
 
 **Use Parser to dump out the test commands:**
 
 Once the log is being collected, use the parser to dump out all the rccl/nccl test commands or just the unique commands with their respective counts of the workload.
 Note: To dump out the unique commands use the --unique argument. 
+Note: To dump out the commands for the applications with NCCL or RCCL 2.8 or below use --legacy-device-grouping argument. 
 Optional parameters: output-script-name, unique
 
 Here is the usage of the script
@@ -65,6 +78,8 @@ Here is the usage of the script
 python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net
 (or)
 python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net --unique
+(or)
+python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net --unique --legacy-device-grouping"
 ```
 
 The first command dumps out all the rccl/nccl tests in the order they get executed in the application. (net_rccl_nccl.sh file).
@@ -75,7 +90,7 @@ The second command dumps out a script file with unique commands and a csv file w
 Once you dump out the scripts, make sure to copy the script in nccl-tests/rccl-tests folder and run the script and gather the 
 Inside nccl-tests/rccl-tests repository:
 
-```sh net_unique.sh |& tee rccl_perf_data.txt```
+```sh net_unique_topo.sh |& tee topo_rccl_tests.txt```
 
 Once you run the above script, the performance data of each command is redirected to a text file. 
 
@@ -86,7 +101,7 @@ Now the final step is to use the above performance log and generate a summary in
 To generate the summary, navigate to the tool nccl-rccl-parser:
 
 ```
-python generate_summary.py --log-file rccl_perf_data.txt --output-file-name test_app_data--script-file net_unique.sh 
+python generate_summary.py --log-file topo_rccl_tests.txt --output-file-name net_summary --count-file net_counts.csv
 ```
 This dumps out a csv file with performance data for further analysis. 
 
diff --git a/coll_trace_processor/0.png b/coll_trace_processor/0.png
new file mode 100644
index 0000000..408a293
Binary files /dev/null and b/coll_trace_processor/0.png differ
diff --git a/coll_trace_processor/README.md b/coll_trace_processor/README.md
new file mode 100644
index 0000000..8758b75
--- /dev/null
+++ b/coll_trace_processor/README.md
@@ -0,0 +1,39 @@
+# NCCL/RCCL Log Processor
+
+This tool is used to collect RCCL collective traces, visulize topologies for rings and trees in RCCL, and get device grouping information. 
+
+## Requirement 
+The tool currently works for applications with RCCL 2.9 or above. However, the collective trace processor function works for an application without multiple device groups in RCCL 2.8 or below.
+
+From ROCm 4.3:
+NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL only enable collective API trace. Collective trace mode is enabled separately by RCCL_KERNEL_COLL_TRACE_ENABLE=1 which has the outputs in the new format as below:
+```
+[0] NCCL INFO ## [1703255.821541] [01:00] 000035 KL HWID 4230c540 AllReduceTreeLLSum_f32 nt 256 bi 0 nc 1 busId C3000
+```
+**Run application and collect RCCL/NCCL Log:**
+
+```
+NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL,GRAPH RCCL_KERNEL_COLL_TRACE_ENABLE=1 <application> |& tee nccl_debug_log.txt
+```
+
+## Usage
+For more information about RCCL collective traces, please go to [here](https://confluence.amd.com/display/MLSE/RCCL+Collective+Trace).
+
+Example command lines:
+```shell
+python log_processor.py --rccl-debug-log gpt2_rccl_mp4_log_newPR.txt
+```
+Notice that since NCCL and RCCL 2.8 or below has no sufficient inforamtion for device grouping, "--cuda" flag needs to be specified and the number of devices used in the application is also required.
+```shell
+python log_processor.py --rccl-debug-log base_2.8.log --cuda --num_devices 8
+```
+
+## Example Output
+If ROCm 2.8 or above is used, there will be multiple RCCL topology graphs, time tables for each RCCL operations and devices, bandwidth tables for each RCCL operations and devices, and a text file which contains device grouping information. </br>
+For example, if there are 6 device groups in an application, there will be 12 (=6*2) output tables in csv files. The numbering of the tables is followed by the line number in device_groups.txt.
+
+![image info](0.png) 
+
+
+## Copyright
+All source code and accompanying documentation are copyright (c) 2019-2020 Advanced Micro Devices, Inc. All rights reserved.
diff --git a/coll_trace_processor/extract_topo.awk b/coll_trace_processor/extract_topo.awk
new file mode 100644
index 0000000..e91a232
--- /dev/null
+++ b/coll_trace_processor/extract_topo.awk
@@ -0,0 +1,291 @@
+#!/usr/bin/gawk -f
+# Copyright (c) 2019-2021 Advanced Micro Devices, Inc. All rights reserved.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.# Usage:
+
+BEGIN {
+  max_rank=0
+  rings[""]=0
+  max_ring=0
+  treedns[""]=0
+  max_treedn=0
+  conn[""]=0
+  has_collnet=0
+  max_collnet=0
+  max_collnet_rank=0
+  collnet[""]=0
+  collnet_conn[""]=0
+  collnet_conn_type[""]=0
+  col_start=2
+  col_p1=col_start+1
+  col_p2=col_start+2
+  col_p3=col_start+3
+  col_p4=col_start+4
+  col_p5=col_start+5
+  col_p6=col_start+6
+  col_p7=col_start+7
+  col_p8=col_start+8
+}
+
+{
+  if($3=="NCCL" && $4=="INFO" && col_start==2) {
+    col_start=5
+    col_p1=col_start+1
+    col_p2=col_start+2
+    col_p3=col_start+3
+    col_p4=col_start+4
+    col_p5=col_start+5
+    col_p6=col_start+6
+    col_p7=col_start+7
+    col_p8=col_start+8
+  }
+
+  if($5=="NCCL" && $6=="INFO" && col_start==2) {
+    col_start=7
+    col_p1=col_start+1
+    col_p2=col_start+2
+    col_p3=col_start+3
+    col_p4=col_start+4
+    col_p5=col_start+5
+    col_p6=col_start+6
+    col_p7=col_start+7
+    col_p8=col_start+8
+  }
+
+  if($col_start=="Ring" && $col_p4=="->" && $col_p6=="->") {
+    chan=strtonum($col_p1)
+    rank=strtonum($col_p5)
+    next_rank=strtonum($col_p7)
+    rings[rank "," next_rank "," chan]="1"
+    if(chan>max_ring)
+      max_ring=chan
+    if(rank>max_rank)
+      max_rank=rank
+  }
+
+  if($col_start=="Trees") {
+    col_1=col_start+1
+    col_2=col_start+2
+    do {
+      match($col_1, /\[([0-9]+)\]/, ary)
+      chan=strtonum(ary[1])
+      where = match($col_2, /(\-?[0-9]+)\/(\-?[0-9]+)\/(\-?[0-9]+)\->(\-?[0-9]+)\->(\-?[0-9]+)\|(\-?[0-9]+)\->(\-?[0-9]+)\->(\-?[0-9]+)\/(\-?[0-9]+)\/(\-?[0-9]+)/, ary)
+      if(where != 0) {
+        if(ary[8]!="-1")
+          treedns[ary[7] "," ary[8] "," chan]="1"
+        if(ary[9]!="-1")
+          treedns[ary[7] "," ary[9] "," chan]="1"
+        if(ary[10]!="-1")
+          treedns[ary[7] "," ary[10] "," chan]="1"
+      } else {
+        where = match($col_2, /(\-?[0-9]+)\/(\-?[0-9]+)\/(\-?[0-9]+)\->(\-?[0-9]+)\->(\-?[0-9]+)/, ary)
+        if(where != 0) {
+          if(ary[1]!="-1")
+            treedns[ary[4] "," ary[1] "," chan]="1"
+          if(ary[2]!="-1")
+            treedns[ary[4] "," ary[2] "," chan]="1"
+          if(ary[3]!="-1")
+            treedns[ary[4] "," ary[3] "," chan]="1"
+        }
+      }
+      if(chan>max_treedn)
+        max_treedn=chan
+      col_1=col_1+2
+      col_2=col_2+2
+    } while ($col_1!="")
+  }
+
+  if($col_start=="CollNet" && $col_p1=="Channel") {
+    chan=strtonum($col_p2)
+    rank=strtonum($col_p4)
+    up_rank=strtonum($col_p6)
+    collnet[up_rank "," rank "," chan]="1"
+    if(has_collnet==0)
+      has_collnet=1
+    if(chan>max_collnet)
+      max_collnet=chan
+    if(up_rank>max_collnet_rank)
+      max_collnet_rank=up_rank
+  }
+
+  if($col_start=="Coll" && $col_p2==":") {
+    chan=strtonum($col_p1)
+    rank=strtonum($col_p3)
+    if($col_p4=="[receive]")
+      collnet_conn[rank "," chan]=0
+    else if($col_p4=="[send]")
+      collnet_conn[rank "," chan]=1
+    else
+      printf "Error!\n"
+    collnet_conn_type[rank "," chan]=$col_p6
+  }
+
+  if($col_p6=="via") {
+    match($col_p1, /([0-9]+)/, ary)
+    chan=strtonum(ary[1])
+    match($col_p3, /([0-9]+)\[.*\]/, ary)
+    s=ary[1]
+    match($col_p5, /([0-9]+)\[.*\]/, ary)
+    d=ary[1]
+    conn[s "," d "," chan]=$col_p7
+  }
+
+  if($col_p6=="[receive]" && $col_p7=="via") {
+    match($col_p1, /([0-9]+)/, ary)
+    chan=strtonum(ary[1])
+    match($col_p3, /([0-9]+)\[.*\]/, ary)
+    s=ary[1]
+    match($col_p5, /([0-9]+)\[.*\]/, ary)
+    d=ary[1]
+    conn[s "," d "," chan]=$col_p8
+  }
+}
+
+END {
+  printf "digraph RCCL {\n"
+  for(r=0;r<max_treedn+1;r++) {
+    printf "  subgraph tree_%d {\n", r
+    for(s=0;s<=max_rank;s++) {
+      for(d=0;d<=max_rank;d++) {
+        if ((s "," d "," r) in treedns) {
+          val=conn[s "," d "," r]
+          style="solid"
+          color="red"
+          if (match(val,"NET")) {
+            style="dashed"
+            if (match(val,"GDRDMA"))
+              color="green"
+          }
+          if (match(val,"P2P")) {
+            color="green"
+          }
+          printf "    t%d_%d -> t%d_%d [label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, s, r, d, val, color, style
+        }
+      }
+    }
+    printf "\n"
+    for(s=0;s<=max_rank;s++) {
+      printf "    t%d_%d [label=\"%d\",fontsize=\"28\"];\n", r, s, s
+    }
+    printf "  }\n\n"
+  }
+
+  for(r=0;r<max_ring+1;r++) {
+    remove_ring=0
+    for(s=0;s<=max_rank;s++) {
+      for(d=0;d<=max_rank;d++) {
+        if ((s "," d "," r) in rings && !((s "," d "," r) in conn)) {
+          remove_ring=1
+          break;
+        }
+      }
+      if(d<=max_rank)
+        break;
+    }
+    if (remove_ring!=0)
+      continue
+    printf "  subgraph ring_%d {\n", r
+    for(s=0;s<=max_rank;s++) {
+      for(d=0;d<=max_rank;d++) {
+        if ((s "," d "," r) in rings) {
+          val=conn[s "," d "," r]
+          style="solid"
+          color="red"
+          if (match(val,"NET")) {
+            style="dashed"
+            if (match(val,"GDRDMA"))
+              color="green"
+          }
+          if (match(val,"P2P")) {
+            color="green"
+          }
+          printf "    r%d_%d -> r%d_%d [label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, s, r, d, val, color, style
+        }
+      }
+    }
+    printf "\n"
+    for(s=0;s<=max_rank;s++) {
+      printf "    r%d_%d [label=\"%d\",fontsize=\"28\"];\n", r, s, s
+    }
+    printf "  }\n\n"
+  }
+
+  for(r=0; has_collnet && r<=max_collnet; r++) {
+    printf "  subgraph collnet_%d {\n", r
+    num_top_ranks=0
+    for(s=0;s<max_collnet_rank;s++) {
+      if((max_collnet_rank "," s "," r) in collnet)
+        top_ranks[num_top_ranks++]=s
+    }
+    for(d=0; d<num_top_ranks; d++) {
+      rank=top_ranks[d]
+      send=collnet_conn[rank "," r]
+      val=collnet_conn_type[rank "," r]
+      style="solid"
+      color="red"
+      if (match(val,"COLLNET")) {
+        style="dashed"
+        if (match(val,"GDRDMA"))
+          color="green"
+      }
+      if (match(val,"P2P")) {
+        color="green"
+      }
+      if(send)
+        printf "    c%d_%d -> c%d_%d [dir=back label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, max_collnet_rank, r, rank, val, color, style
+      else
+        printf "    c%d_%d -> c%d_%d [label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, max_collnet_rank, r, rank, val, color, style
+      while(1) {
+        for(s=0;s<max_collnet_rank;s++) {
+          if((rank "," s "," r) in collnet) {
+            if(send)
+              val=conn[s "," rank "," r]
+            else
+              val=conn[rank "," s "," r]
+            style="solid"
+            color="red"
+            if (match(val,"NET")) {
+              style="dashed"
+              if (match(val,"GDRDMA"))
+                color="green"
+            }
+            if (match(val,"P2P")) {
+              color="green"
+            }
+            if(send)
+              printf "    c%d_%d -> c%d_%d [dir=back label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, rank, r, s, val, color, style
+            else
+              printf "    c%d_%d -> c%d_%d [label=\"%s\",color=\"%s\",style=\"%s\",fontname=\"Helvetica\"];\n", r, rank, r, s, val, color, style
+            rank=s
+            break;
+          }
+        }
+        if(s>=max_collnet_rank) {
+          break;
+        }
+      }
+    }
+    printf "\n"
+    for(s=0;s<=max_collnet_rank;s++) {
+      printf "    c%d_%d [label=\"%d\",fontsize=\"28\"];\n", r, s, s
+    }
+    printf "  }\n\n"
+  }
+  printf "}\n"
+}
\ No newline at end of file
diff --git a/coll_trace_processor/log_processor.py b/coll_trace_processor/log_processor.py
new file mode 100644
index 0000000..2f4d7ac
--- /dev/null
+++ b/coll_trace_processor/log_processor.py
@@ -0,0 +1,480 @@
+import os
+import sys
+import subprocess
+import argparse
+import pkg_resources
+import re
+import time
+
+coll_op_map = {
+            "Broadcast": "broadcast_perf",
+            "Reduce": "reduce_perf",
+            "AllGather": "all_gather_perf",
+            "ReduceScatter": "reduce_scatter_perf",
+            "AllReduce": "all_reduce_perf",
+            "Gather": "gather_perf",
+            "Scatter": "scatter_perf",
+            "AllToAll": "alltoall_perf",
+#            "AllToAllv": "alltoallv_perf",
+            "Send": "sendrecv_perf",
+            "Recv": "sendrecv_perf",
+            }
+
+reduction_op_map = {
+                "0" : "sum",
+                "1" : "prod",
+                "2" : "max",
+                "3" : "min",
+                "4" : "all",
+               }
+
+data_types_map = {
+                "0" : "int8",
+                "1" : "uint8",
+                "2" : "int32",
+                "3" : "uint32",
+                "4" : "int64",
+                "5" : "uint64",
+                "6" : "half",
+                "7" : "float",
+                "8" : "double",
+                "9" : "bf16",
+                #"10" : "ncclNumTypes Equivalent?"
+             }
+
+data_type_bytes_map = {
+                    "0" : 1,
+                    "1" : 1,
+                    "2" : 4,
+                    "3" : 4,
+                    "4" : 8,
+                    "5" : 8,
+                    "6" : 2,
+                    "7" : 4,
+                    "8" : 8,
+                    "9" : 2,
+                    #"10" : Not sure.
+                  }
+
+# n: number of ranks 
+# n links of Bandwidth B to perform a operation
+# def factor_1(n): 
+#     return n
+def all_gather_factor(n):
+     return (n-1)/n
+def reduce_scatter_factor(n):
+     return (n-1)/n        
+def all_reduce_factor(n):
+     return 2*(n-1)/n
+def all_to_all_factor(n):
+    return 2*(n-1)/n
+
+def coll_op_algobw_factor(coll_type, nranks):
+    nranks = float(nranks)
+    if coll_type == "AllGather":
+        return float(all_gather_factor(nranks))
+    elif coll_type == "ReduceScatter":
+        return float(reduce_scatter_factor(nranks))
+    elif coll_type == "AllReduce":
+        return float(all_reduce_factor(nranks))
+    elif coll_type == "AllToAll":
+        return float(all_to_all_factor(nranks))
+    else:
+        return float(1)
+    
+def algobw_factor_times_size(coll_type, nranks, total_bytes):
+    nranks = float(nranks)
+    if coll_type == "AllGather":
+        return all_gather_factor(nranks) * float(total_bytes)
+    elif coll_type == "ReduceScatter":
+        return reduce_scatter_factor(nranks) * float(total_bytes)
+    elif coll_type == "AllReduce":
+        return all_reduce_factor(nranks) * float(total_bytes)
+    elif coll_type == "AllToAll":
+        return all_to_all_factor(nranks) * float(total_bytes)
+    else:
+        return float(1) * float(total_bytes)   
+
+class DisjointSet(object): # https://stackoverflow.com/questions/3067529/a-set-union-find-algorithm
+    def __init__(self):
+        self.leader = {} # maps a member to the group's leader
+        self.group = {} # maps a group leader to the group (which is a set)
+
+    def add(self, a, b):
+        leadera = self.leader.get(a)
+        leaderb = self.leader.get(b)
+        if leadera is not None:
+            if leaderb is not None:
+                if leadera == leaderb: return # nothing to do
+                groupa = self.group[leadera]
+                groupb = self.group[leaderb]
+                if len(groupa) < len(groupb):
+                    a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
+                groupa |= groupb
+                del self.group[leaderb]
+                for k in groupb:
+                    self.leader[k] = leadera
+            else:
+                self.group[leadera].add(b)
+                self.leader[b] = leadera
+        else:
+            if leaderb is not None:
+                self.group[leaderb].add(a)
+                self.leader[a] = leaderb
+            else:
+                self.leader[a] = self.leader[b] = a
+                self.group[a] = set([a, b])
+
+                
+def get_useful_info(log_file):
+    fs = open(log_file, 'r')
+    lines = fs.readlines()
+    fs.close()
+
+    coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = [], [], [], [], [], []
+    for j in range(len(lines)):
+        line = lines[j].rstrip()
+        if ("opCount" in line and "sendbuff" in line):
+            coll_lines.append(line)
+        elif ("Channel" in line and "via" in line):
+            conn_lines.append(line)
+        elif ("Init COMPLETE" in line and "busId" in line):
+            comm_lines.append(line)
+        elif ("NCCL INFO Ring" in line):
+            ring_lines.append(line)
+        elif ("NCCL INFO Trees" in line):
+            tree_lines.append(line)
+        elif ((" ## " in line) and ("KL HWID" in line or "KE" in line or "CE" in line)):  # RCCL From ROCm 4.3
+            # Bug: [ 6628.064978] we need to consider the case when there is a spac right after '['
+            #                     Everything with split_list[index] will break.
+            if "[ " in line:
+                line = line.replace("[ ", "[")
+            coll_trace_lines.append(line)
+            
+    return coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines
+
+def coll_table_build(coll_lines):
+    opCount, coll, count, datatype, op_type, root, comm, nranks, data_size = [], [], [], [], [], [], [], [], []
+    for line in coll_lines:
+        split_list = line.split(" ")
+        coll.append(split_list[4][:-1])
+        opCount.append(int(split_list[6], 16))
+        count.append(split_list[12])
+        datatype.append(split_list[14])
+        op_type.append(split_list[16])
+        root.append(split_list[18])
+        comm.append(split_list[20])
+        nranks.append(int(next(item for item in split_list if 'nranks' in item).split("=")[1].replace("]", ""))) ### 
+        data_size.append(int(split_list[split_list.index("count") + 1]) * data_type_bytes_map[split_list[split_list.index("datatype") + 1]])
+        
+    dict_coll = {'coll': coll, 'opCount': opCount, 'datatype': datatype, 'count':count, 'op_type':op_type, 'root':root, 'comm':comm, 'nranks':nranks, 'data_size':data_size}
+    table = pd.DataFrame(dict_coll)    
+    table['algobw_factor_times_size'] = table.apply(lambda row: 
+                                                    algobw_factor_times_size(row['coll'], row['nranks'], row['data_size']), axis=1)
+    table = table[['opCount','nranks','algobw_factor_times_size', 'data_size']].drop_duplicates(subset = ['opCount', 'data_size'])
+    return table
+
+## It works for connection information like XXX via "P2P/direct pointer%s".(useReadStr),       --> pointer%s will not be collected
+##                                                  "P2P/IPC%s".(useReadStr),  
+##                                                  "P2P/indirect/%d[%lx]%s".(intermediateRank,comm->peerInfo[intermediateRank].busId, useReadStr) 
+##                                                  "direct shared memory"                     --> shared memory
+## However, it will not capture the information for CollNet for now.
+
+## xxx [4] NCCL INFO Channel 00 : 0[e3000] -> 1[c3000] via P2P/IPC comm 0x7f53bc000e50 nRanks 04',  #(new output)
+
+def conn_table_build(conn_lines):  # Only works for RCCL 2.9 or above
+    def process_string(line):
+        split_list = line.split("[")
+        return [split_list[0], split_list[1].split("]")[0]]
+
+    start_rank, start_busid, end_rank, end_busid, connection, comm, nranks = [], [], [], [], [], [], []
+    for line in conn_lines:
+        split_list = line.split(" ")
+        sr, sb = process_string(split_list[split_list.index(":") + 1])  # first device
+        er, eb = process_string(split_list[split_list.index("->") + 1]) # second device
+        start_rank.append(sr)
+        start_busid.append(sb)
+        end_rank.append(er)
+        end_busid.append(eb)
+        connection.append(split_list[split_list.index("via") + 1])  # if it is direct, it means the connection is done by direct shared memory
+        comm.append(split_list[split_list.index("comm") + 1])
+        nranks.append(int(split_list[split_list.index("nRanks") + 1]))
+
+    dict_conn = {'start_rank': start_rank, 'start_busid': start_busid, 'end_rank': end_rank, 'end_busid': end_busid, 
+                 'connection': connection, 'comm':comm, 'nranks':nranks, 'conn_line':conn_lines}      
+    return pd.DataFrame(dict_conn)
+
+
+def comm_table_build(comm_lines):
+    comm, rank, nranks, cudaDev, busId = [], [], [], [], []   
+    for line in comm_lines:
+        split_list = line.rstrip().split(" ")
+        comm.append(split_list[5])
+        rank.append(split_list[7])
+        nranks.append(int(split_list[9]))
+        cudaDev.append(split_list[11])
+        busId.append(split_list[13])
+    dict_comm = {'comm':comm, 'rank':rank, 'nranks':nranks, 'cudaDev':cudaDev, 'busId':busId}  
+    return pd.DataFrame(dict_comm)
+
+
+def topo_table_build(topo_lines):
+    comm, nranks, busId, topo_line = [], [], [], []
+    for line in topo_lines:
+        split_list = line.split(" ")
+        comm.append(split_list[split_list.index("comm") + 1])
+        nranks.append(int(split_list[split_list.index("nRanks") + 1]))
+        busId.append(split_list[split_list.index("busId") + 1])
+        topo_line.append(line)
+    dict_table = {'comm':comm, 'nranks':nranks, 'busId':busId, 'topo_line':topo_line}      
+    return pd.DataFrame(dict_table)
+
+def create_table(log_name):
+    log_file = os.path.abspath(log_name)
+    coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = get_useful_info(log_file)
+    return coll_table_build(coll_lines), conn_table_build(conn_lines), comm_table_build(comm_lines), topo_table_build(ring_lines), topo_table_build(tree_lines), coll_trace_lines
+
+
+def device_grouping(comm_table, conn_table):
+    groups = []
+    for index, row in comm_table.iterrows():
+        temp = [row['busId'], list(conn_table[(conn_table['comm'] == row['comm']) & (conn_table['start_busid'] == row['busId'])]['end_busid'].unique())]
+        groups.append(temp)    
+    nranks = list(comm_table['nranks'])
+    outputs = []
+    rank_outputs = []
+    tempRank = None
+    for id, group in enumerate(groups): 
+        if tempRank == None:
+            tempRank = nranks[id]
+            ds = DisjointSet()
+        else:
+            if tempRank != nranks[id]:
+                for _, v in ds.group.items():
+                    if v not in outputs: 
+                        outputs.append(v)
+                ds = DisjointSet()
+                tempRank = nranks[id]
+        for node in group[1]:
+            ds.add(group[0], node)
+
+        if id == len(groups) - 1: 
+            for _, v in ds.group.items():
+                if v not in outputs: 
+                    outputs.append(v)  
+    return outputs
+
+def collect_topo(comm, nranks, ring_table, tree_table, conn_table, lines):
+    ring_lines = ring_table[(ring_table['comm'] == comm) & (ring_table['nranks'] == nranks)]['topo_line'].tolist()
+    tree_lines = tree_table[(tree_table['comm'] == comm) & (tree_table['nranks'] == nranks)]['topo_line'].tolist()
+    conn_lines = conn_table[(conn_table['comm'] == comm) & (conn_table['nranks'] == nranks)]['conn_line'].tolist()
+    for line in (ring_lines + tree_lines + conn_lines):
+        lines.append(line)
+    return lines
+
+def modify_label(in_txt, out_txt, rank_busId_mapping):
+    fs = open(in_txt, 'r')
+    lines = fs.readlines()
+    fs.close()
+    with open(out_txt, 'w') as f:
+        for line in lines:
+            if "fontsize" in line: 
+                for m in re.finditer('"([^"]*)"', line):
+                    rank_quot = line[m.start(0):m.end(0)]
+                    rank = line[m.start(0)+1:m.end(0)-1] # find the rank with no quotation
+                    line = line.replace(rank_quot,'\"'+rank_busId_mapping[rank]+'\"', 1)
+                    f.write("{}".format(line))
+                    break
+            else:
+                f.write("{}".format(line)) 
+
+
+def coll_trace_processor(coll_trace_lines, group_list, coll_table):
+    print(group_list)
+    def process_string(line):
+        split_list = line.split("[")
+        return split_list[1].split("]")[0]
+    # First pass
+    results, func_name = {}, {}
+    busIds = []
+    RCCL_2_8 = False
+    for line in coll_trace_lines:
+        if ((" ## " in line) and ("KL HWID" in line or "KE" in line or "CE" in line)):  # RCCL From ROCm 4.3
+            split_list = line.split(" ")
+            t = float(process_string(split_list[5])) # seconds.microseconds
+            rk, blk = process_string(split_list[6]).split(":") # rank:block_id
+            op = int(split_list[7], 16) # hex to decimal
+            ####### Kernel Launch #######
+            if split_list[8] == 'KL':
+                if "busId" in split_list:
+                    busId = split_list[split_list.index("busId") + 1]
+                    nranks = split_list[split_list.index("nRanks") + 1]
+                    KL_key = str(op) + "," + str(busId) + "," + str(nranks) + ",t0"
+                else:
+                    # RCCL 2.8 or older
+                    busId = int(rk)
+                    RCCL_2_8 = True
+                    KL_key = str(op) + "," + str(busId) + ",t0"
+                if KL_key not in results or results[KL_key] > t:
+                    results[KL_key] = t
+                if busId not in busIds:
+                     busIds.append(busId)
+                if op not in func_name:
+                     func_name[op] = split_list[11] # only work for KL
+                        
+            ####### Kernel End #######
+            elif split_list[8] == 'KE':
+                if "busId" in split_list:
+                    busId = split_list[split_list.index("busId") + 1]
+                    nranks = split_list[split_list.index("nRanks") + 1]
+                    KE_key = str(op) + "," + str(busId) + "," + str(nranks) + ",t1"
+                else:
+                    # RCCL 2.8 or below
+                    busId = int(rk)
+                    RCCL_2_8 = True
+                    KE_key = str(op) + "," + str(busId) + ",t1"
+                if KE_key not in results or results[KE_key] < t:
+                    results[KE_key] = t
+                    
+                    
+            elif split_list[8] == 'CE':
+                if "busId" in split_list:
+                    busId = split_list[split_list.index("busId") + 1]
+                    nranks = split_list[split_list.index("nRanks") + 1]
+                    CE_key = str(op) + "," + str(busId) + "," + str(nranks) + ",t1"
+                else:
+                    # RCCL 2.8 or older
+                    busId = int(rk)
+                    RCCL_2_8 = True
+                    CE_key = str(op) + "," + str(busId) + ",t1"
+                    
+                if CE_key not in results or results[CE_key] < t:
+                    results[CE_key] = t
+
+        elif ((" ## " in line) and ("Abort" in line)):
+            raise AssertionError("Abort")
+    
+    
+    # Second pass
+    bw_tables, time_tables = [], []
+    for i, group in enumerate(group_list):
+        group = list(group)
+        bw_list, time_list = [], []
+        for op in func_name:
+            found = False
+            time_start, time_end = 1e9, 0
+            if RCCL_2_8:
+                for busId in range(len(group)):
+                    key = str(op) + "," + str(busId)
+                    if (key + ",t0") in results:
+                        found = True
+                        if busId == 0:
+                            temp = [op, func_name[op]] # opCount, function name
+                        t_start = results[key + ",t0"]
+                        t_end = results[key + ",t1"]
+                        temp.append(t_end - t_start) 
+                        if t_start < time_start:
+                            time_start = t_start
+                        if t_end > time_end:
+                            time_end = t_end
+            else:
+                for busId in group:
+                    key = str(op) + "," + str(busId) + "," + str(len(group))
+                    if (key + ",t0") in results:
+                        found = True
+                        if busId == group[0]:
+                            temp = [op, func_name[op]] # opCount, function name
+                        t_start = results[key + ",t0"]
+                        t_end = results[key + ",t1"]
+                        temp.append(t_end - t_start)
+                        if t_start < time_start:
+                            time_start = t_start
+                        if t_end > time_end:
+                            time_end = t_end
+            if found:
+                algobw_factor_times_size = coll_table[(coll_table['opCount'] == op) & (coll_table['nranks'] == len(group))]['algobw_factor_times_size'].unique()[0]
+                data_size = coll_table[(coll_table['opCount'] == op) & (coll_table['nranks'] == len(group))]['data_size'].unique()[0]
+                bw = temp[:2]
+                for k in range(len(group)):
+                    bw.append(algobw_factor_times_size / temp[k + 2] /1e9)
+                
+                bw = bw + [data_size, algobw_factor_times_size/(time_end - time_start)/1e9]
+                temp.append(data_size)
+                time_list.append(temp)
+                bw_list.append(bw)
+                
+        time_tables.append(pd.DataFrame(time_list, columns = ['opCount', 'Function Name'] + group + ['data_size']))
+        bw_tables.append(pd.DataFrame(bw_list, columns = ['opCount', 'Function Name'] + group + ['data_size', 'algBW']))
+    return time_tables, bw_tables
+
+def rccl_log_process():
+    debug_log = os.path.abspath(args.rccl_debug_log)
+    if args.cuda:
+        coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = get_useful_info(debug_log)
+        coll_table = coll_table_build(coll_lines)
+        device_group_list = [[i for i in range(args.num_devices)]]
+        print("Using CUDA or ROCm with RCCL 2.8 or below.")
+    else:
+        device_grouping_output = os.path.join(os.path.dirname(os.path.realpath(__file__)), "device_groups.txt")
+        coll_table, conn_table, comm_table, ring_table, tree_table, coll_trace_lines = create_table(debug_log)
+        device_group_list = device_grouping(comm_table, conn_table)
+        with open(device_grouping_output, 'w') as f:
+            for mySet in device_group_list:
+                f.write("%s\n" % str(mySet))    
+
+        for i, group in enumerate(device_group_list):
+            group = list(group)
+            print(group)
+            temp_txt = os.path.join(os.path.dirname(os.path.realpath(__file__)), "temp_{}.txt".format(i))
+            temp1_txt = os.path.join(os.path.dirname(os.path.realpath(__file__)), "temp_1_{}.txt".format(i))
+            temp2_txt = os.path.join(os.path.dirname(os.path.realpath(__file__)), "temp_2_{}.txt".format(i))
+            nranks = len(group)
+            lines = []
+            rank_busId_mapping = {}
+            for device in group:
+                comms = comm_table[(comm_table['nranks'] == nranks) & (comm_table['busId'] == device)]['comm'].unique()
+                assert len(comms) == 1       # if not 1, it means different devices may have same nRanks and busId.
+                lines = collect_topo(comms[0], nranks, ring_table, tree_table, conn_table, lines)
+                rank = comm_table[(comm_table['nranks'] == nranks) & (comm_table['busId'] == device)]['rank'].unique()[0]
+                busId = comm_table[(comm_table['nranks'] == nranks) & (comm_table['busId'] == device)]['busId'].unique()[0]
+                rank_busId_mapping[rank] = busId
+
+            with open(temp_txt, 'w') as f:
+                for line in lines:
+                    f.write("{}\n".format(line))
+
+            p1 = subprocess.Popen("awk -f extract_topo.awk {} > {}".format(temp_txt, temp1_txt), 
+                                  stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True)
+            time.sleep(1)
+            modify_label(temp1_txt, temp2_txt, rank_busId_mapping)
+            p2 = subprocess.Popen("dot -Tpng {} -o '{}.png'".format(temp2_txt, i), stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True)
+            p3 = subprocess.Popen("rm -r temp*.txt", stdin=subprocess.PIPE, stdout=subprocess.PIPE, shell=True)
+    
+    ########### Collective trace processor ###########
+    time_tables, bw_tables = coll_trace_processor(coll_trace_lines, device_group_list, coll_table)
+    for i, (time_table, bw_table) in enumerate(zip(time_tables, bw_tables)):
+        time_csv = os.path.join(os.path.dirname(os.path.realpath(__file__)), "time_{}.csv".format(i))
+        bw_csv = os.path.join(os.path.dirname(os.path.realpath(__file__)), "bw_{}.csv".format(i))
+        time_table.to_csv(time_csv)
+        bw_table.to_csv(bw_csv)
+        
+
+    
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--rccl-debug-log", type=str, required=True, \
+                            help="RCCL log after running app with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL RCCL_KERNEL_COLL_TRACE_ENABLE=1 executable")
+    
+    parser.add_argument("--cuda", action="store_true", default=False, help="If the application is using CUDA systems or ROCm systems with RCCL 2.8 or below, the topology visualizer will not be enabled.") # 
+    parser.add_argument('--num_devices', type=int, default=1, help="If the application is using CUDA systems or ROCm systems with RCCL 2.8 or below, the number of devices needs to be specified.") #
+    args = parser.parse_args()
+    #### Requirement check #### 
+    required = {'pandas'}
+    installed = {pkg.key for pkg in pkg_resources.working_set}
+    missing = required - installed
+    if missing:
+        python = sys.executable
+        subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
+    import pandas as pd
+    rccl_log_process()
+
diff --git a/deviceIdMapping/Makefile b/deviceIdMapping/Makefile
new file mode 100644
index 0000000..48b749d
--- /dev/null
+++ b/deviceIdMapping/Makefile
@@ -0,0 +1,28 @@
+HIP_PATH?= $(wildcard /opt/rocm/hip)
+
+ifeq (,$(HIP_PATH))
+$(error Please set environment variable HIP_PATH manually.)
+endif
+
+ROCM_PATH?= $(wildcard /opt/rocm)
+ifeq (,$(HIP_PATH))
+$(error Please set environment variable ROCM_PATH manually.)
+endif
+
+HIPCC=$(HIP_PATH)/bin/hipcc
+
+
+EXE=hip_rocm_smi_mapping
+
+files=$(EXE).cpp
+
+CXXFLAGS=-g -O3 -I$(ROCM_PATH)/rocm_smi/include -lrocm_smi64 -L$(ROCM_PATH)/rocm_smi/lib
+
+all: $(EXE)
+
+$(EXE): $(files)
+		$(HIPCC) $(CXXFLAGS) $^ -o $@
+
+
+clean:
+		rm -f *.o $(EXE)
\ No newline at end of file
diff --git a/deviceIdMapping/README.md b/deviceIdMapping/README.md
new file mode 100644
index 0000000..93de978
--- /dev/null
+++ b/deviceIdMapping/README.md
@@ -0,0 +1,66 @@
+# Device ID Mapping
+
+The device indices on HIP, ROCM-SMI, and RCCL (cudaDev) are not necessarily matching. The only unique information for devices is PCI bus ID. Therefore, this sub-project is meant to create a debugging tool across different software stacks. 
+
+
+
+**On ROCM-SMI (--showbus):**
+The following outputs are from this tool on a server with 8 MI60 GPUs.
+```
+======================= ROCm System Management Interface =======================
+================================== PCI Bus ID ==================================
+GPU[0]          : PCI Bus: 0000:43:00.0
+GPU[1]          : PCI Bus: 0000:23:00.0
+GPU[2]          : PCI Bus: 0000:26:00.0
+GPU[3]          : PCI Bus: 0000:03:00.0
+GPU[4]          : PCI Bus: 0000:E3:00.0
+GPU[5]          : PCI Bus: 0000:C3:00.0
+GPU[6]          : PCI Bus: 0000:C6:00.0
+GPU[7]          : PCI Bus: 0000:83:00.0
+================================================================================
+============================= End of ROCm SMI Log ==============================
+```
+
+
+**Device ID Mapping**
+
+The following outputs are from this tool on a server with 8 MI60 GPUs. We can use PCI bus id to get device id on HIP by using [hipDeviceGetByPCIBusId](https://rocmdocs.amd.com/en/latest/ROCm_API_References/HIP_API/Initialization-and-Version.html?highlight=hipDeviceGetByPCIBusId#hipdevicegetbypcibusid).
+
+```
+====================== Number of HIP visible devices: 8
+====================== Number of GPUs on your machine which can be observed by ROCm-SMI: 8
+====== ROCm-SMI device ID ======= PCI bus ID ======= HIP device ID ======
+                0                0000:43:00.0              0
+                1                0000:23:00.0              N/A (cannot map PCI Bus ID: 0000:23:00.0 to a HIP visible device)
+                2                0000:26:00.0              N/A (cannot map PCI Bus ID: 0000:26:00.0 to a HIP visible device)
+                3                0000:03:00.0              1
+                4                0000:e3:00.0              N/A (cannot map PCI Bus ID: 0000:e3:00.0 to a HIP visible device)
+                5                0000:c3:00.0              N/A (cannot map PCI Bus ID: 0000:c3:00.0 to a HIP visible device)
+                6                0000:c6:00.0              N/A (cannot map PCI Bus ID: 0000:c6:00.0 to a HIP visible device)
+                7                0000:83:00.0              2
+
+```
+
+As the results shown above, we confirmed that the device (enumeration) indices on HIP and ROCM-SMI are identical when there is no HIP_VISIBLE_DEVICES used.
+
+
+When [HIP_VISIBLE_DEVICES](https://rocmdocs.amd.com/en/latest/Other_Solutions/Other-Solutions.html?highlight=HIP_VISIBLE_DEVICES#hip-environment-variables) or [ROCR_VISIBLE_DEVICES](https://rocmdocs.amd.com/en/latest/ROCm_System_Managment/ROCm-System-Managment.html?highlight=ROCR_VISIBLE_DEVICES#rocr-visible-devices) is used, the indices will not match for HIP and ROCM-SMI. The following outputs are from
+```
+HIP_VISIBLE_DEVICES=0,3,7 ./hip_rocm_smi_mapping
+or
+ROCR_VISIBLE_DEVICES=0,3,7 ./hip_rocm_smi_mapping
+```
+```
+====================== Number of HIP visible devices: 3
+====================== Number of GPUs on your machine which can be observed by ROCm-SMI: 8
+=================== ROCm-SMI device ID =================== PCI bus ID =================== HIP device ID ===================
+0      --->      0000:43:00.0      --->      0
+1      --->      0000:23:00.0      --->      N/A (cannot map PCI Bus ID: 0000:23:00.0 to a HIP visible device)
+2      --->      0000:26:00.0      --->      N/A (cannot map PCI Bus ID: 0000:26:00.0 to a HIP visible device)
+3      --->      0000:03:00.0      --->      1
+4      --->      0000:e3:00.0      --->      N/A (cannot map PCI Bus ID: 0000:e3:00.0 to a HIP visible device)
+5      --->      0000:c3:00.0      --->      N/A (cannot map PCI Bus ID: 0000:c3:00.0 to a HIP visible device)
+6      --->      0000:c6:00.0      --->      N/A (cannot map PCI Bus ID: 0000:c6:00.0 to a HIP visible device)
+7      --->      0000:83:00.0      --->      2
+
+```
\ No newline at end of file
diff --git a/deviceIdMapping/hip_rocm_smi_mapping.cpp b/deviceIdMapping/hip_rocm_smi_mapping.cpp
new file mode 100644
index 0000000..3033cfa
--- /dev/null
+++ b/deviceIdMapping/hip_rocm_smi_mapping.cpp
@@ -0,0 +1,110 @@
+/*
+Copyright (c) 2015-present Advanced Micro Devices, Inc. All rights reserved.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
+*/
+
+#include <iostream>
+#include <iomanip>
+#include "hip/hip_runtime.h"
+#include <cstdio>
+#include <memory>
+#include <stdexcept>
+#include <string>
+#include <array>
+#include <vector>
+#include <stdint.h>
+#include "rocm_smi/rocm_smi.h"
+
+#define KNRM "\x1B[0m"
+#define KRED "\x1B[31m"
+#define KGRN "\x1B[32m"
+#define KYEL "\x1B[33m"
+#define KBLU "\x1B[34m"
+#define KMAG "\x1B[35m"
+#define KCYN "\x1B[36m"
+#define KWHT "\x1B[37m"
+
+#define failed(...)                                                                                \
+    printf("%serror: ", KRED);                                                                     \
+    printf(__VA_ARGS__);                                                                           \
+    printf("\n");                                                                                  \
+    printf("error: TEST FAILED\n%s", KNRM);                                                        \
+    exit(EXIT_FAILURE);
+
+#define HIPCHECK(error)                                                                            \
+    if (error != hipSuccess) {                                                                     \
+        printf("%serror: '%s'(%d) at %s:%d%s\n", KRED, hipGetErrorString(error), error, __FILE__,  \
+               __LINE__, KNRM);                                                                    \
+        failed("API returned error code.");                                                        \
+    }
+
+std::string int2hex(uint64_t num, int digits) {
+    std::stringstream sstream;
+    sstream << std::setfill('0') << std::setw(digits) << std::hex << num;
+    std::string result = sstream.str();
+    return result;
+}
+
+
+int main(int argc, char* argv[]) {
+    int deviceCnt;
+    HIPCHECK(hipGetDeviceCount(&deviceCnt));
+    std::cout << "====================== Number of HIP visible devices: " << deviceCnt << std::endl;
+    /* Example outputs from rocm-smi --showbus
+
+    GPU[0]              : PCI Bus: 0000:43:00.0
+    GPU[1]              : PCI Bus: 0000:23:00.0
+    GPU[2]              : PCI Bus: 0000:26:00.0
+    GPU[3]              : PCI Bus: 0000:03:00.0
+    GPU[4]              : PCI Bus: 0000:E3:00.0
+    GPU[5]              : PCI Bus: 0000:C3:00.0
+    GPU[6]              : PCI Bus: 0000:C6:00.0
+    GPU[7]              : PCI Bus: 0000:83:00.0
+    */
+
+    // ROCm-smi
+    rsmi_status_t ret;
+    uint32_t num_devices;
+    ret = rsmi_init(0);
+    ret = rsmi_num_monitor_devices(&num_devices);
+    std::cout << "====================== Number of GPUs on your machine which can be observed by ROCm-SMI: "<< num_devices << std::endl;
+    std::cout << "====== ROCm-SMI device ID ======= PCI bus ID ======= HIP device ID ======" << std::endl;
+    for (int i = 0; i < num_devices; i++) {
+        uint64_t val_ui64; // bdfid in rocm_smi.cc
+        rsmi_status_t  err = rsmi_dev_pci_id_get(i, &val_ui64);
+        //std::cout << "\t**PCI ID (BDFID): 0x" << std::hex << val_ui64 << std::endl;
+        auto domain = (val_ui64 >> 32) & 0xffff;
+        auto bus = (val_ui64 >> 8) & 0xff;
+        auto device = (val_ui64 >> 3) & 0x1f;
+        auto function = val_ui64 & 0x7;
+        std::string pciString = int2hex(domain, 4) + ":" + int2hex(bus, 2) + ":" + int2hex(device, 2) + "." + int2hex(function, 1);
+        const char* busIdStr = (pciString).c_str();
+        int hipDeviceId;
+        std::cout << "                " << i << "                "<< pciString << "              ";
+        if (hipDeviceGetByPCIBusId(&hipDeviceId, busIdStr) != hipSuccess) {
+            std::cout << "N/A (cannot map PCI Bus ID: " << busIdStr << " to a HIP visible device)" << std::endl;
+        } else std::cout << hipDeviceId << std::endl;
+
+        // pic_id = '{:04X}:{:02X}:{:02X}.{:0X}'.format(domain, bus, device, function)
+        //
+        //std::cout << domain << "; "<< bus << "; "<< device << "; "<< function << std::endl;
+    }
+    ret = rsmi_shut_down();
+}
\ No newline at end of file
diff --git a/generate_summary.py b/generate_summary.py
index 6412527..272d059 100644
--- a/generate_summary.py
+++ b/generate_summary.py
@@ -1,124 +1,75 @@
 import os
 import sys
 import argparse
-import re
+import pandas as pd 
 
-def get_script_commands(script_file):
+def rccl_tests_log_processor(script_file):
     fs = open(script_file, 'r')
     lines = fs.readlines()
     fs.close()
-
-    commands = []
-    for j in range(len(lines)):
-        line = lines[j].rstrip()
-        commands.append(line)
-
-    return commands
-
-def parse_useful_information(log_file):
-    fs = open(log_file, 'r')
-    lines = fs.readlines()
-    fs.close()
-
-    useful_lines = []
+    results = []
+    temp = []
     for j in range(len(lines)):
         line = lines[j].rstrip()
-        if ("time" in line and "algbw" in line and "busbw" in line):
+        if ('==========================================================' in line and j != 0):
+            results.append(temp)
+            temp = []
+        elif ("time" in line and "algbw" in line and "busbw" in line):
             perf_line = lines[j+2]
-            if ("Avg bus bandwidth" in lines[j+5]):
+            if ("Avg bus bandwidth" in lines[j + 5]):
                 perf_line = perf_line + lines[j + 5]
-            elif ("Avg bus bandwidth" in lines[j+4]):
-                perf_line = perf_line + lines[j+4]
-            useful_lines.append(perf_line)
-    return useful_lines
-
-def parse_nccl_performance(useful_lines, commands):
+            elif ("Avg bus bandwidth" in lines[j + 4]):
+                perf_line = perf_line + lines[j + 4]
+            temp.append(perf_line)
     
-    perf_lines = []
-    perf_lines.append("sep=|")
-    header = "size|count|type|redop|root|time-oplace(us)|algbw(gb/s)-oplace|busbw(gb/s)-oplace|error|" + \
-             "time-iplace(us)|algbw(gb/s)-iplace|busbw(gb/s)-iplace|error|avg_bus_bw|commands"
-    #print(header)
-    num_fields = len(header.split("|"))
-    perf_lines.append(header)
-    for j in range(len(useful_lines)):
-        line = useful_lines[j]
-        line = line.replace("# Avg bus bandwidth    : ", "")
-        
-        split_list = line.split()
-        perf_line = ""
-        field_index = 0
-        for i in range(len(split_list)):
-            perf_line = perf_line + split_list[i] + "|"
-            # Some collectives do not involve a redop
-            if field_index==2 and "reduce" not in commands[j].lower():
-                perf_line = perf_line + "|"
-                field_index = field_index + 1
-            # Only broadcast and reduce involve a root
-            if (
-               field_index==3 and
-               re.search(r'\Wreduce_perf', commands[j]) is None and
-               re.search(r'\Wbroadcast_perf', commands[j]) is None
-            ):
-                perf_line = perf_line + "|"
-                field_index = field_index + 1
-            field_index = field_index + 1
-        #print (perf_line + commands[j])
-        perf_line = perf_line + commands[j]
-        assert len(perf_line.split("|")) == num_fields
-        perf_lines.append(perf_line)
+    return results
 
-    return perf_lines
-
-def get_counts_from_file(count_file):
-    fs = open(count_file, 'r')
-    lines = fs.readlines()
-    fs.close()
-
-    counts = []
-    for j in range(1, len(lines)):
-        line = lines[j].rstrip()
-        counts.append(line.split("|")[1])
-    return counts
-
-def update_perf_lines(perf_lines, counts):
+def parse_nccl_performance(perf_lines, net_counts_csv, output_file):
+    size, count_1, datatype, op_type, time_oplace, algbw_gbs_oplace, busbw_oplace, error_oplace, time_iplace, algbw_gbs_iplace, busbw_iplace, error_iplace, avg_busbw, count = [], [], [], [], [], [], [], [], [], [], [], [], [], []
+    count_table = pd.read_csv(net_counts_csv)
+    for i, row in count_table.iterrows():
+        for line in perf_lines[i]:
+            split_list = line.split()
+            size.append(split_list[0])
+            count_1.append(split_list[1])
+            datatype.append(split_list[2])
+            op_type.append(split_list[3])
+            time_oplace.append(split_list[4])
+            algbw_gbs_oplace.append(split_list[5])
+            busbw_oplace.append(split_list[6])
+            error_oplace.append(split_list[7])
+            time_iplace.append(split_list[8])
+            algbw_gbs_iplace.append(split_list[9])
+            busbw_iplace.append(split_list[10])
+            error_iplace.append(split_list[11])
+            avg_busbw.append(split_list[-1])
+            assert row['count'] % len(perf_lines[i]) == 0
+            count.append(int(row['count']//len(perf_lines[i])))
     
-    updated_lines = []
-    updated_lines.append("sep=|")
-    updated_lines.append(perf_lines[1] + "|count")
-    for j in range(2, len(perf_lines)):
-        perf_line = perf_lines[j] + "|" +  counts[j-2]
-        updated_lines.append(perf_line)
+    dict_summary = {"size":size, "count_1":count_1, "datatype":datatype, 
+                          "op_type":op_type, "time_oplace":time_oplace, "algbw_gbs_oplace":algbw_gbs_oplace,
+                          "busbw_oplace":busbw_oplace, "error_oplace":error_oplace, "time_iplace":time_iplace,
+                          "algbw_gbs_iplace":algbw_gbs_iplace, "busbw_iplace":busbw_iplace, "error_iplace":error_iplace,
+                          "avg_busbw":avg_busbw, "count":count}
+    table = pd.DataFrame(dict_summary)
+    table.to_csv(output_file)
+    print ("INFO: Dumped out the count of each command in a file named: {}".format(output_file)) 
+    return table
 
-    return updated_lines
-        
-def generate_output_file(out_file, perf_lines):
-    fs = open(out_file, 'w')
-    for j in range(len(perf_lines)):
-        fs.write(perf_lines[j])
-        fs.write('\n')
-    fs.close()
-    print ("INFO: Dumped out the performance.")
 
 def main():
     log_file = os.path.abspath(args.log_file)
+    count_file = os.path.abspath(args.count_file)
     out_file = args.output_file_name + ".csv"
-    script_file = os.path.abspath(args.script_file)
-
-    commands = get_script_commands(script_file)
-    useful_lines = parse_useful_information(log_file)
-    perf_lines = parse_nccl_performance(useful_lines, commands)
-    if args.count_file:
-        counts = get_counts_from_file(os.path.abspath(args.count_file))
-        perf_lines = update_perf_lines(perf_lines, counts)
-    generate_output_file(out_file, perf_lines)
+    results = rccl_tests_log_processor(log_file) # net_unique_topo.sh
+    parse_nccl_performance(results, count_file, out_file)
+        
 
 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
-    parser.add_argument("--log-file", type=str, required=True, help="Log file generated while running rccl-tests")
+    parser.add_argument("--log-file", type=str, required=True, help="Log file generated while running rccl-tests with net_unique_topo.sh")
     parser.add_argument("--output-file-name", type=str, required=False, default="net_summary")
-    parser.add_argument("--script-file", type=str, required=True, help="Script file to run NCCL/RCCL Tests")
-    parser.add_argument("--count-file", type=str, required=False, help="net_count file generated while running unique option in parser.")
+    parser.add_argument("--count-file", type=str, required=False, default="net_counts.csv", help="net_count file generated while running unique option in parser.")
 
     args = parser.parse_args()
     main()
diff --git a/install.sh b/install.sh
new file mode 100644
index 0000000..d7567fd
--- /dev/null
+++ b/install.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+sudo apt-get install -y gawk
+sudo apt-get install -y graphviz
+pip install pandas
+pip install networkx
+cd deviceIdMapping; make
+./hip_rocm_smi_mapping > busId_HIP_map.txt
diff --git a/net_unique_topo.sh b/net_unique_topo.sh
new file mode 100644
index 0000000..2939cb6
--- /dev/null
+++ b/net_unique_topo.sh
@@ -0,0 +1,65 @@
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,4,5,7 ./build/broadcast_perf -d int8 -b 40 -e 40 -o sum -g 4
+HIP_VISIBLE_DEVICES=3,1,0,2 ./build/broadcast_perf -d int8 -b 40 -e 40 -o sum -g 4
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 25952256 -e 25952256 -o sum -g 2
+HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 25952256 -e 25952256 -o sum -g 2
+HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 25952256 -e 25952256 -o sum -g 2
+HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 25952256 -e 25952256 -o sum -g 2
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 2097152 -e 2097152 -o sum -g 2
+HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 2097152 -e 2097152 -o sum -g 2
+HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 2097152 -e 2097152 -o sum -g 2
+HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 2097152 -e 2097152 -o sum -g 2
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 2048 -e 2048 -o sum -g 2
+HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 2048 -e 2048 -o sum -g 2
+HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 2048 -e 2048 -o sum -g 2
+HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 2048 -e 2048 -o sum -g 2
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 1572864 -e 1572864 -o sum -g 2
+HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 1572864 -e 1572864 -o sum -g 2
+HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 1572864 -e 1572864 -o sum -g 2
+HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 1572864 -e 1572864 -o sum -g 2
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 1536 -e 1536 -o sum -g 2
+HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 1536 -e 1536 -o sum -g 2
+HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 1536 -e 1536 -o sum -g 2
+HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 1536 -e 1536 -o sum -g 2
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,2 ./build/broadcast_perf -d int8 -b 524288 -e 524288 -o sum -g 2
+HIP_VISIBLE_DEVICES=3,7 ./build/broadcast_perf -d int8 -b 524288 -e 524288 -o sum -g 2
+HIP_VISIBLE_DEVICES=0,4 ./build/broadcast_perf -d int8 -b 524288 -e 524288 -o sum -g 2
+HIP_VISIBLE_DEVICES=1,5 ./build/broadcast_perf -d int8 -b 524288 -e 524288 -o sum -g 2
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,4,5,7 ./build/broadcast_perf -d int8 -b 32 -e 32 -o sum -g 4
+HIP_VISIBLE_DEVICES=3,1,0,2 ./build/broadcast_perf -d int8 -b 32 -e 32 -o sum -g 4
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,4,5,7 ./build/broadcast_perf -d int8 -b 65600 -e 65600 -o sum -g 4
+HIP_VISIBLE_DEVICES=3,1,0,2 ./build/broadcast_perf -d int8 -b 65600 -e 65600 -o sum -g 4
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_reduce_perf -d half -b 16777216 -e 16777216 -o sum -g 4
+HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_reduce_perf -d half -b 16777216 -e 16777216 -o sum -g 4
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_reduce_perf -d float -b 32768 -e 32768 -o max -g 4
+HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_reduce_perf -d float -b 32768 -e 32768 -o max -g 4
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_reduce_perf -d float -b 32768 -e 32768 -o sum -g 4
+HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_reduce_perf -d float -b 32768 -e 32768 -o sum -g 4
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_gather_perf -d int8 -b 4194304 -e 4194304 -o sum -g 4
+HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_gather_perf -d int8 -b 4194304 -e 4194304 -o sum -g 4
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,2 ./build/all_reduce_perf -d half -b 179429376 -e 179429376 -o sum -g 2
+HIP_VISIBLE_DEVICES=3,7 ./build/all_reduce_perf -d half -b 179429376 -e 179429376 -o sum -g 2
+HIP_VISIBLE_DEVICES=0,4 ./build/all_reduce_perf -d half -b 179429376 -e 179429376 -o sum -g 2
+HIP_VISIBLE_DEVICES=1,5 ./build/all_reduce_perf -d half -b 179429376 -e 179429376 -o sum -g 2
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,2 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 2
+HIP_VISIBLE_DEVICES=3,7 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 2
+HIP_VISIBLE_DEVICES=0,4 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 2
+HIP_VISIBLE_DEVICES=1,5 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 2
+echo '==========================================================' 
+HIP_VISIBLE_DEVICES=6,4,5,7 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 4
+HIP_VISIBLE_DEVICES=3,1,0,2 ./build/all_reduce_perf -d uint8 -b 1 -e 1 -o max -g 4
+echo '==========================================================' 
diff --git a/rccl_nccl_parser.py b/rccl_nccl_parser.py
index 8e12a84..4a06b37 100644
--- a/rccl_nccl_parser.py
+++ b/rccl_nccl_parser.py
@@ -1,6 +1,9 @@
 import os
 import sys
 import argparse
+from collections import defaultdict
+import pandas as pd
+import networkx as nx
 
 coll_op_map = {
             "Broadcast": "broadcast_perf",
@@ -51,49 +54,59 @@
                     "9" : 2,
                     #"10" : Not sure.
                   }
-                
+
+def algobw_factor_times_size(coll_type, nranks, total_bytes):
+    # n: number of ranks 
+    # n links of Bandwidth B to perform a operation
+    # def factor_1(n): 
+    #     return n
+    def all_gather_factor(n):
+        return (n-1)/n
+    def reduce_scatter_factor(n):
+        return (n-1)/n        
+    def all_reduce_factor(n):
+        return 2*(n-1)/n
+    def all_to_all_factor(n):
+        return 2*(n-1)/n
+        
+    nranks = float(nranks)
+    if coll_type == "AllGather":
+        return all_gather_factor(nranks) * float(total_bytes)
+    elif coll_type == "ReduceScatter":
+        return reduce_scatter_factor(nranks) * float(total_bytes)
+    elif coll_type == "AllReduce":
+        return all_reduce_factor(nranks) * float(total_bytes)
+    elif coll_type == "AllToAll":
+        return all_to_all_factor(nranks) * float(total_bytes)
+    else:
+        return float(1) * float(total_bytes)   
+
 def get_useful_info(log_file):
     fs = open(log_file, 'r')
     lines = fs.readlines()
     fs.close()
 
-    useful_lines = []
+    coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = [], [], [], [], [], []
     for j in range(len(lines)):
         line = lines[j].rstrip()
         if ("opCount" in line and "sendbuff" in line):
-            useful_lines.append(line)
-
-    return useful_lines
-
-def parse_nccl_log(nccl_lines):
-    
-    commands = []
-    for j in range(len(nccl_lines)):
-        line = nccl_lines[j]
-        split_list = line.split(" ")
-        comm = split_list[split_list.index("INFO") + 1].replace(":", "")
-        count = split_list[split_list.index("count") + 1]
-        datatype = split_list[split_list.index("datatype") + 1]
-        op_type = split_list[split_list.index("op") + 1]
-        root = split_list[split_list.index("root") + 1]
-        nnranks = next(item for item in split_list if 'nranks' in item).split("=")[1].replace("]", "")
-
-        #print (comm)
-        #print (count)
-        #print (datatype)
-        #print (op_type)
-        #print (root)
-        #print (nnranks)
-
-        total_bytes = int(count) * data_type_bytes_map[datatype]
-
-        test_cmd = "./build/" + coll_op_map[comm] + " -d " + data_types_map[datatype] + \
-                       " -b " + str(total_bytes) + " -e " + str(total_bytes) + \
-                       " -o " + reduction_op_map[op_type] + " -g " + str(nnranks)
-        #print (test_cmd)
-        commands.append((test_cmd, int(nnranks)))
-
-    return commands
+            coll_lines.append(line)
+        elif ("Channel" in line and "via" in line):
+            conn_lines.append(line)
+        elif ("Init COMPLETE" in line and "busId" in line):
+            comm_lines.append(line)
+        elif ("NCCL INFO Ring" in line):
+            ring_lines.append(line)
+        elif ("NCCL INFO Trees" in line):
+            tree_lines.append(line)
+        elif ((" ## " in line) and ("KL HWID" in line or "KE" in line or "CE" in line)):  # RCCL From ROCm 4.3
+            # Bug: [ 6628.064978] we need to consider the case when there is a spac right after '['
+            #                     Everything with split_list[index] will break.
+            if "[ " in line:
+                line = line.replace("[ ", "[")
+            coll_trace_lines.append(line)
+            
+    return coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines
 
 def generate_script(commands, output_script):
     filename = output_script + ".sh"
@@ -104,55 +117,299 @@ def generate_script(commands, output_script):
     fs.close()
     print("INFO: Dumped out the commands in a script named: {}".format(filename))
 
-def dump_counts_map(counts_map, output_file):
+def dump_counts_map(unique_command_list, counts_list, output_file): ###########################
     filename = output_file + ".csv"
-    fs = open(filename, 'w')
-    fs.write("sep=|")
-    fs.write("\n")
-    keys = counts_map.keys()
-    for key in keys:
-        fs.write(key + "|" + str(counts_map[key]))
-        fs.write("\n")
+    dict_command_count = {'command': unique_command_list, 'count': counts_list}
+    table = pd.DataFrame(dict_command_count)
+    table.to_csv(filename)
+    print ("INFO: Dumped out the count of each command in a file named: {}".format(filename))    
+    
+# ts1-sjc2-27:11585:11585 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f988f200400 recvbuff 0x7f988f200400 count 440 datatype 0 op 0 root 0 comm 0x7f905c000dc0 [nranks=4] stream 0x55aa0a8d78f0
+def coll_table_build(coll_lines):
+    opCount, coll, count, datatype, op_type, root, comm, nranks, data_size = [], [], [], [], [], [], [], [], []
+    for line in coll_lines:
+        split_list = line.split()
+        coll.append(split_list[4][:-1])
+        opCount.append(int(split_list[6], 16))
+        count.append(split_list[12])
+        datatype.append(split_list[14])
+        op_type.append(split_list[16])
+        root.append(split_list[18])
+        comm.append(split_list[20])
+        nranks.append(int(next(item for item in split_list if 'nranks' in item).split("=")[1].replace("]", ""))) ### 
+        data_size.append(int(split_list[split_list.index("count") + 1]) * data_type_bytes_map[split_list[split_list.index("datatype") + 1]])
+
+    dict_coll = {'coll': coll, 'opCount': opCount, 'datatype': datatype, 'count':count, 'op_type':op_type, 'root':root, 'comm':comm, 'nranks':nranks, 'data_size':data_size}
+    table = pd.DataFrame(dict_coll)    
+    table['algobw_factor_times_size'] = table.apply(lambda row: 
+                                                    algobw_factor_times_size(row['coll'], row['nranks'], row['data_size']), axis=1)
+    table['raw_command'] = coll_lines
+    return table
+
+def conn_table_build(conn_lines, legacy_device_grouping):  # Only works for RCCL 2.9 or above
+    def process_string(line):
+        split_list = line.split("[")
+        return [split_list[0], split_list[1].split("]")[0]]
+
+    start_rank, start_busid, end_rank, end_busid, connection, comm, nranks = [], [], [], [], [], [], []
+    for line in conn_lines:
+        split_list = line.split(" ")
+        sr, sb = process_string(split_list[split_list.index(":") + 1])  # first device
+        er, eb = process_string(split_list[split_list.index("->") + 1]) # second device
+        start_rank.append(sr)
+        start_busid.append(sb)
+        end_rank.append(er)
+        end_busid.append(eb)
+        connection.append(split_list[split_list.index("via") + 1])  # if it is direct, it means the connection is done by direct shared memory
+        if not legacy_device_grouping:
+            if "comm" not in line:
+                raise AssertionError("This NCCL/RCCL log is from an older version. Please use RCCL 2.9 or above.")
+            comm.append(split_list[split_list.index("comm") + 1])
+            nranks.append(int(split_list[split_list.index("nRanks") + 1]))
+    
+    if not legacy_device_grouping:
+        dict_conn = {'start_rank': start_rank, 'start_busid': start_busid, 'end_rank': end_rank, 'end_busid': end_busid, 
+                    'connection': connection, 'comm':comm, 'nranks':nranks, 'conn_line':conn_lines}      
+    else:
+        dict_conn = {'start_rank': start_rank, 'start_busid': start_busid, 'end_rank': end_rank, 'end_busid': end_busid, 
+                    'connection': connection, 'conn_line':conn_lines}
+    return pd.DataFrame(dict_conn)
+    
+def comm_table_build(comm_lines):
+    comm, rank, nranks, cudaDev, busId = [], [], [], [], []   
+    for line in comm_lines:
+        split_list = line.rstrip().split(" ")
+        comm.append(split_list[5])
+        rank.append(split_list[7])
+        nranks.append(int(split_list[9]))
+        cudaDev.append(split_list[11])
+        busId.append(split_list[13])
+    dict_comm = {'comm':comm, 'rank':rank, 'nranks':nranks, 'cudaDev':cudaDev, 'busId':busId}  
+    return pd.DataFrame(dict_comm)
+
+
+class DisjointSet(object): # https://stackoverflow.com/questions/3067529/a-set-union-find-algorithm
+
+    def __init__(self):
+        self.leader = {} # maps a member to the group's leader
+        self.group = {} # maps a group leader to the group (which is a set)
+
+    def add(self, a, b):
+        leadera = self.leader.get(a)
+        leaderb = self.leader.get(b)
+        if leadera is not None:
+            if leaderb is not None:
+                if leadera == leaderb: return # nothing to do
+                groupa = self.group[leadera]
+                groupb = self.group[leaderb]
+                if len(groupa) < len(groupb):
+                    a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
+                groupa |= groupb
+                del self.group[leaderb]
+                for k in groupb:
+                    self.leader[k] = leadera
+            else:
+                self.group[leadera].add(b)
+                self.leader[b] = leadera
+        else:
+            if leaderb is not None:
+                self.group[leaderb].add(a)
+                self.leader[a] = leaderb
+            else:
+                self.leader[a] = self.leader[b] = a
+                self.group[a] = set([a, b])
+                
+def buildGraph(graphs, connectionList): # add an input of connection list
+    G = nx.DiGraph()
+    node_pool = []
+    for graph in graphs:
+        if graph[0][0] not in node_pool: 
+            G.add_node(graph[0][0])
+            node_pool.append(graph[0][0])
+        for node in graph[1]:
+            if node not in node_pool: 
+                G.add_node(node)
+                node_pool.append(node)
+            label = connectionList[(connectionList['start_busid'] == graph[0][0]) & (connectionList['end_busid'] == node)]['connection'].unique()[0]
+            G.add_edge(graph[0][0], node, label=label)
+
+    to_remove = []
+    for edge in G.edges():
+        if (G.has_edge(edge[1], edge[0]) == False): to_remove.append([edge[0], edge[1]])
+    for pair in to_remove: G.remove_edge(pair[0], pair[1])
+    edges = G.to_undirected().edges()
+    
+    pos = nx.circular_layout(G)
+    nx.draw_networkx_nodes(G, pos, node_color='#ffff00')
+    nx.draw_networkx_labels(G, pos, font_size=8)
+    nx.draw_networkx_edges(G, pos, edge_color='b', arrows = True)
+    nx.draw_networkx_edge_labels(G,pos,edge_labels=nx.get_edge_attributes(G,'label'))
+    
+    parents = {}
+    ds = DisjointSet()
+    for edge in edges: ds.add(edge[0], edge[1])
+    
+    outputs = []
+    for k, v in ds.group.items():
+        outputs.append(v)
+    return outputs
+        
+    
+def hip_busId_mapping(path_to_deviceIdMapping):
+    def processBusId(busId):
+        split_list = busId.split(':')
+        temp = split_list[2].split('.')
+        return split_list[1].lstrip('0') + temp[0] + temp[1]
+        
+    fs = open(path_to_deviceIdMapping, 'r')
+    lines = fs.readlines()
     fs.close()
-    print ("INFO: Dumped out the count of each command in a file named: {}".format(filename))
-
-def get_unique_commands(commands_and_nranks):
-    unique_values = []
-    counts_map = {}
-    nranks_map = {}
-    for c_and_nr in commands_and_nranks:
-        cmd = c_and_nr[0]
-        nranks = c_and_nr[1]
-        if (cmd not in unique_values):
-            counts_map[cmd] = 1
-            nranks_map[cmd] = nranks
-            unique_values.append(cmd)
+    busId_HIP_map = {}
+    for j in range(len(lines)):
+        line = lines[j].rstrip()
+        if "=" not in line:
+            split_list = line.split()
+            busId_HIP_map[processBusId(split_list[1])] = split_list[2]
+    return busId_HIP_map
+
+def device_grouping(comm_table, conn_table):
+    groups = []
+    for index, row in comm_table.iterrows():
+        temp = [row['busId'], list(conn_table[(conn_table['comm'] == row['comm']) & (conn_table['start_busid'] == row['busId'])]['end_busid'].unique())]
+        groups.append(temp)    
+    nranks = list(comm_table['nranks'])
+    outputs = []
+    rank_outputs = []
+    tempRank = None
+    for id, group in enumerate(groups): 
+        if tempRank == None:
+            tempRank = nranks[id]
+            ds = DisjointSet()
         else:
-            counts_map[cmd] = counts_map[cmd] + 1
-    assert len(counts_map) == len(nranks_map)
-    for cmd in counts_map.keys():
-        assert counts_map[cmd] % nranks_map[cmd] == 0
-        counts_map[cmd] = int(counts_map[cmd] / nranks_map[cmd])
-    return unique_values, counts_map
+            if tempRank != nranks[id]:
+                for _, v in ds.group.items():
+                    if v not in outputs: 
+                        outputs.append(v)
+                ds = DisjointSet()
+                tempRank = nranks[id]
+        for node in group[1]:
+            ds.add(group[0], node)
+
+        if id == len(groups) - 1: 
+            for _, v in ds.group.items():
+                if v not in outputs: 
+                    outputs.append(v)  
+    return outputs
+
+def generate_topo_script(commands, topo_info, busId_HIP_map, output_script):
+    filename = output_script + ".sh"
+    fs = open(filename, "w")
+    for j in range(len(commands)):
+        fs.write("echo '==========================================================' \n")
+        for device_set in topo_info[j]:
+            device_setting = "HIP_VISIBLE_DEVICES="
+            for k, device in enumerate(list(device_set)):
+                if k == len(device_set) - 1:
+                    device_setting = device_setting + str(busId_HIP_map[device]) + " "
+                else:                   
+                    device_setting = device_setting + str(busId_HIP_map[device]) + ","
+            fs.write(device_setting + commands[j])
+            fs.write("\n")
+    fs.write("echo '==========================================================' \n")
+    fs.close()
+    print("INFO: Dumped out the commands with device assignment in a script named: {}".format(filename))
+
+def generate_topo(busId_HIP_map, command_list, raw_command_list, coll_table, conn_table, comm_table, legacy_device_grouping, output_name):
+    all_info = pd.merge(coll_table, comm_table, on=['comm','nranks'])
+    topo_info = []
+    if legacy_device_grouping:
+        for command in raw_command_list:
+            split_list = command.split()
+            coll = split_list[4][:-1]
+            opCount = int(split_list[6], 16)
+            count = split_list[12]
+            datatype = split_list[14]
+            op_type = split_list[16]
+            nranks = int(next(item for item in split_list if 'nranks' in item).split("=")[1].replace("]", ""))        
+            #### Filter
+            selected_info = all_info[(all_info['coll'] == coll) & (all_info['opCount'] == opCount)
+                                    & (all_info['datatype'] == datatype) & (all_info['count'] == count) 
+                                    & (all_info['nranks'] == nranks) & (all_info['op_type'] == op_type)] 
+            stage_2 = pd.merge(selected_info, conn_table, left_on=['rank','busId'], right_on=['start_rank','start_busid'],how='left')
+            graphs = []
+            for _ , subgroup in stage_2.groupby(['busId']):
+                graphs.append([list(subgroup['busId'].unique()), list(subgroup['end_busid'].unique())])
+            outputs = buildGraph(graphs, conn_table)
+            topo_info.append(outputs)
+        generate_topo_script(command_list, topo_info, busId_HIP_map, output_name)
+    else:
+        device_group_list = device_grouping(comm_table, conn_table)
+        for command in command_list:
+            split_list = command.split()
+            nranks = int(split_list[split_list.index("-g") + 1])
+            temp = []
+            for deviceSet in device_group_list:
+                if len(deviceSet) == nranks:
+                    temp.append(deviceSet)
+            topo_info.append(temp)
+        generate_topo_script(command_list, topo_info, busId_HIP_map, output_name)
+
+def get_commands(coll_table, unique):
+    def nccl_rccl_tests_command(row):
+        test_cmd = "./build/" + coll_op_map[row['coll']] + " -d " + data_types_map[row['datatype']] \
+                    + " -b " + str(row['data_size']) + " -e " + str(row['data_size']) \
+                    + " -o " + reduction_op_map[row['op_type']] + " -g " + str(row['nranks'])
+        return test_cmd
+    
+    command_list = []
+    raw_command_list = []
+    counts_list = []
+    if unique:
+        unique_coll_table = coll_table.drop_duplicates(subset = ['coll','datatype', 'op_type', 'nranks', 'data_size'])
+        for _, row in unique_coll_table.iterrows():
+            test_cmd = nccl_rccl_tests_command(row)
+            command_list.append(test_cmd)
+            count = coll_table[(coll_table['coll'] == row['coll']) & (coll_table['datatype'] == row['datatype']) & 
+                               (coll_table['op_type'] == row['op_type']) & (coll_table['nranks'] == row['nranks']) & 
+                               (coll_table['data_size'] == row['data_size'])].shape[0]
+            assert count % row['nranks'] == 0
+            counts_list.append(int(count / row['nranks']))
+            raw_command_list.append(row['raw_command'])
+    else:
+        for _, row in coll_table.iterrows():
+            test_cmd = nccl_rccl_tests_command(row)
+            command_list.append(test_cmd)
+    return command_list, counts_list, raw_command_list
+
 
 def main():
     log_file = os.path.abspath(args.nccl_debug_log)
-    nccl_lines = get_useful_info(log_file)
-    commands_and_nranks = parse_nccl_log(nccl_lines)
-    #generate_script(commands, args.output_script_name)
+    coll_lines, conn_lines, comm_lines, ring_lines, tree_lines, coll_trace_lines = get_useful_info(log_file)
+    coll_table = coll_table_build(coll_lines)
+    conn_table = conn_table_build(conn_lines, args.legacy_device_grouping)
+    comm_table = comm_table_build(comm_lines)
+    command_list, counts_list, raw_command_list = get_commands(coll_table, args.unique)
+    path_to_deviceIdMapping = os.path.join(os.path.dirname(os.path.realpath(__file__)), "deviceIdMapping/busId_HIP_map.txt")
+    if os.path.exists(path_to_deviceIdMapping) == False:
+        raise AssertionError("Remember to run 'sh install.sh' before using this tool.")
+    busId_hip_map = hip_busId_mapping(path_to_deviceIdMapping)
     if (args.unique):
-        new_commands, counts_map = get_unique_commands(commands_and_nranks)
-        generate_script(new_commands, args.output_script_name + "_unique")
-        dump_counts_map(counts_map, args.output_script_name + "_counts")
+        dump_counts_map(command_list, counts_list, args.output_script_name + "_counts")
+        generate_topo(busId_hip_map, command_list, raw_command_list, coll_table, conn_table, 
+                            comm_table, args.legacy_device_grouping, args.output_script_name + "_unique_topo")
     else:
-        commands = list(zip(*commands_and_nranks))[0]
-        generate_script(commands, args.output_script_name)
+        generate_script(command_list, args.output_script_name)
 
 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
-    parser.add_argument("--nccl-debug-log", type=str, required=True, help="Log from app with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL")
+    parser.add_argument("--nccl-debug-log", type=str, required=True, help="RCCL log after running app with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL RCCL_KERNEL_COLL_TRACE_ENABLE=1 <executable>")
+    parser.add_argument("--legacy-device-grouping", action="store_true", default=False, help="If the application is using CUDA systems or ROCm systems with RCCL 2.8 or below.") # 
     parser.add_argument("--output-script-name", type=str, required=False, default="net_nccl_rccl", help="Output command script")
     parser.add_argument("--unique", action="store_true", default=False, help="Get only the unique commands.")
 
     args = parser.parse_args()
     main()
+
+# python rccl_nccl_parser_new.py --nccl-debug-log gpt2_rccl_mp4_log.txt --output-script-name net
+# python rccl_nccl_parser_new.py --nccl-debug-log gpt2_rccl_mp4_log.txt --output-script-name net --unique --legacy-device-grouping
+# python rccl_nccl_parser_new.py --nccl-debug-log gpt2_rccl_mp4_log_newPR.txt --output-script-name net --unique 
diff --git a/run_parser_and_generate_summary.py b/run_parser_and_generate_summary.py
index 76641a4..21f266e 100644
--- a/run_parser_and_generate_summary.py
+++ b/run_parser_and_generate_summary.py
@@ -6,7 +6,10 @@ def main():
     debug_log = os.path.abspath(args.nccl_debug_log)
     
     ## Generate a script to run nccl/rccl tests.
-    gen_cmd = "python rccl_nccl_parser.py --nccl-debug-log " + debug_log + " --output-script-name net --unique"
+    if args.legacy_device_grouping:
+        gen_cmd = gen_cmd = "python rccl_nccl_parser.py --nccl-debug-log " + debug_log + " --output-script-name net --unique --legacy-device-grouping"
+    else:
+        gen_cmd = "python rccl_nccl_parser.py --nccl-debug-log " + debug_log + " --output-script-name net --unique"
     if os.system(gen_cmd):
         print ("ERROR: Failed to parse the log.")
         sys.exit(1)
@@ -14,43 +17,48 @@ def main():
     ## change directory to rccl-tests/nccl-tests
     if args.rocm:
         rccl_tests_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "rccl-tests")
-        os.system("cp net_unique.sh " + rccl_tests_path)
-        #os.chdir(os.path.join(os.path.dirname(os.path.realpath(__file__)), "rccl-tests"))
+        os.system("cp net_unique_topo.sh " + rccl_tests_path)
         os.chdir(rccl_tests_path)
         if os.system("./install.sh > /dev/null 2>&1"):
             print("ERROR: Failed to install rccl-tests.")
             sys.exit(1)
         
-        os.system("cat net_unique.sh")
-        run_script_cmd = "HSA_FORCE_FINE_GRAIN_PCIE=1 sh net_unique.sh | tee rccl_perf_log.txt"
+        os.system("cat net_unique_topo.sh")
+        run_script_cmd = "HSA_FORCE_FINE_GRAIN_PCIE=1 sh net_unique_topo.sh | tee topo_rccl_tests.txt"
+
         if os.system(run_script_cmd):
             print ("ERROR: Unable to run rccl-tests properly.")
             sys.exit(1)
-        os.system("mv rccl_perf_log.txt ../")
+        os.system("mv topo_rccl_tests.txt ../")
         os.chdir(os.path.join(os.path.dirname(os.path.realpath(__file__)), "../"))
 
         print (os.getcwd())
-        summary_cmd = "python generate_summary.py --log-file rccl_perf_log.txt --script-file net_unique.sh --count-file net_counts.csv"
+        summary_cmd = "python generate_summary.py --log-file topo_rccl_tests.txt --output-file-name net_summary --count-file net_counts.csv"
+                
         os.system(summary_cmd)
         print ("INFO: Finished dumping all data.")
+        
+        
 
     if args.cuda:
         nccl_tests_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "nccl-tests")
-        os.system("cp net_unique.sh " + nccl_tests_path)
+        os.system("cp net_unique_topo.sh " + nccl_tests_path)
         os.chdir(nccl_tests_path)
-        if os.system("make > /dev/null 2>&1"):
-            print ("ERROR: Failed to install nccl-unit tests")
+        if os.system("./install.sh > /dev/null 2>&1"):
+            print("ERROR: Failed to install nccl-tests.")
             sys.exit(1)
         
-        os.system("cat net_unique.sh")
-        run_script_cmd = "sh net_unique.sh | tee nccl_perf_log.txt"
+        os.system("cat net_unique_topo.sh")
+        run_script_cmd = "HSA_FORCE_FINE_GRAIN_PCIE=1 sh net_unique_topo.sh | tee topo_rccl_tests.txt"
         if os.system(run_script_cmd):
-            print ("ERROR: unable to run nccl-tests")
+            print ("ERROR: Unable to run nccl-tests properly.")
             sys.exit(1)
-        os.system("mv nccl_perf_log.txt ../")
+        os.system("mv topo_rccl_tests.txt ../")
         os.chdir(os.path.join(os.path.dirname(os.path.realpath(__file__)), "../"))
-        
-        summary_cmd = "python generate_summary.py --log-file nccl_perf_log.txt --script-file net_unique.sh --output-file-name nv_net_summary --count-file net_counts.csv"
+
+        print (os.getcwd())
+        summary_cmd = "python generate_summary.py --log-file topo_rccl_tests.txt --output-file-name net_summary --count-file net_counts.csv"
+                
         os.system(summary_cmd)
         print ("INFO: Finished dumping all data.")
 
@@ -60,6 +68,7 @@ def main():
                             help="NCCL/RCCL log after running app with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL")
     parser.add_argument("--rocm", action="store_true", default=False, help="Run the tests on ROCm using rccl-tests")
     parser.add_argument("--cuda", action="store_true", default=False, help="Run the tests on CUDA using nccl-tests")
+    parser.add_argument("--legacy-device-grouping", action="store_true", default=False, help="NCCL/RCCL log after running app with NCCL or RCCL 2.8 or below")
 
     args = parser.parse_args()
     main()