
Commit 6ee9426

Add files via upload
1 parent 473e15a commit 6ee9426

File tree

2 files changed, +270 -0 lines changed

README.md

+51
@@ -0,0 +1,51 @@
# Community Detection In Graphs

## Objective

In this project, we implement the Girvan-Newman algorithm to detect communities in graphs, where each community is a set of users with a similar business taste, using the Yelp dataset. We identify these communities using MapReduce, Spark RDDs and standard Python libraries.
## Environment Setup

Python Version = 3.6

Spark Version = 2.3.3
## Dataset
12+
The dataset `sample_data.csv` is a sub-dataset which has been generated from the [Yelp review dataset](https://www.yelp.com/dataset). Each line in this dataset contains a user_id and business_id.
13+
14+
## Code Execution
15+
Run the python code using the below command:
16+
17+
spark-submit community_detection.py <filter_threshold> <input_file_path> <betweenness_output_file_path> <community_output_file_path>
18+
19+
- filter_threshold: the filter threshold to generate edges between user nodes.
20+
- input_file_path: the path to the input file including path, file name and extension.
21+
- betweenness_output_file_path: the path to the betweenness output file including path, file name and extension.
22+
- community_output_file_path: the path to the community output file including path, file name and extension.
23+
## Approach

(Reference: 'Mining of Massive Datasets' by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman)
### Graph Construction

We first construct the social network graph, where each node represents a user. An edge exists between two nodes if the number of businesses that both users have reviewed is greater than or equal to the filter threshold. For example, suppose user1 reviews [business1, business2, business3] and user2 reviews [business2, business3, business4, business5]. If the threshold is 2, then there is an edge between user1 and user2. If a user node has no edges, we do not include that node in the graph.
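The edge construction can be sketched in plain Python as shown below (the actual implementation in `community_detection.py` uses Spark RDDs; `user_businesses` is a hypothetical dict mapping each user_id to the set of business_ids that user reviewed):

```python
from itertools import combinations

def build_edges(user_businesses, threshold):
    """Return sorted (user1, user2) edges for pairs sharing >= threshold businesses."""
    edges = []
    for u1, u2 in combinations(sorted(user_businesses), 2):
        # Count the businesses reviewed by both users.
        if len(user_businesses[u1] & user_businesses[u2]) >= threshold:
            edges.append(tuple(sorted((u1, u2))))
    return edges

# The example from the paragraph above: with threshold 2, user1 and user2
# share business2 and business3, so they are connected by an edge.
user_businesses = {
    "user1": {"business1", "business2", "business3"},
    "user2": {"business2", "business3", "business4", "business5"},
}
print(build_edges(user_businesses, threshold=2))  # [('user1', 'user2')]
```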
### Betweenness Calculation

In this part, we calculate the betweenness of each edge in the original graph and save the result in a txt file. The betweenness of an edge (a, b) is defined as the number of pairs of nodes x and y such that the edge (a, b) lies on the shortest path between x and y. We use the [Girvan-Newman Algorithm](https://en.wikipedia.org/wiki/Girvan%E2%80%93Newman_algorithm) to calculate the number of shortest paths going through each edge. In this algorithm, we visit each node X once and compute the number of shortest paths from X to each of the other nodes that go through each of the edges, using the following steps:

1. First, perform a breadth-first search (BFS) of the graph, starting at node X.
2. Next, label each node by the number of shortest paths that reach it from the root. Start by labelling the root 1. Then, from the top down, label each node Y by the sum of the labels of its parents.
3. Finally, calculate for each edge e the sum, over all nodes Y, of the fraction of shortest paths from the root X to Y that go through e.

To complete the betweenness calculation, we repeat this computation with every node as the root and sum the contributions. Finally, we divide by 2 to get the true betweenness, since every shortest path is discovered twice, once for each of its endpoints.
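These three steps correspond to the helper functions `breadth_first_search`, `generate_node_weights` and `generate_edge_weights` in `community_detection.py` below. As a minimal sketch of a single root's contribution, on a small hypothetical graph (importing the module also pulls in its `pyspark` dependency):

```python
from community_detection import (breadth_first_search,
                                 generate_node_weights,
                                 generate_edge_weights)

# A small hypothetical undirected graph, in the adjacency-dict-of-sets
# format that the helper functions expect.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

bfs_levels = breadth_first_search("A", adjacency)                         # step 1
path_counts = generate_node_weights(bfs_levels, adjacency)                # step 2
edge_credits = generate_edge_weights(path_counts, bfs_levels, adjacency)  # step 3
# e.g. edge ('A', 'B') receives credit 2.0 from root "A": the shortest paths
# to B and to D both pass through it. Summing these credits over every root
# and halving the totals gives the edge betweenness.
```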
### Community Detection

We use [modularity](https://en.wikipedia.org/wiki/Modularity_(networks)) to identify the communities: starting from the graph with all its edges, we repeatedly remove the edge with the highest betweenness until the graph has broken into a suitable number of connected components, and we keep the division of the graph into communities that reaches the highest global modularity.

The formula for modularity is shown below:

Q = ∑(s∈S) [(# edges within group s) − (expected # edges within group s)]

Q(G, S) = (1/(2m)) * ∑(s∈S) ∑(i∈s) ∑(j∈s) (A_ij − (k_i * k_j)/(2m))

According to the Girvan-Newman algorithm, after removing an edge we re-compute the betweenness. In the formula, m is the number of edges in the original graph, and A is the adjacency matrix of the original graph, where A_ij is 1 if i connects to j and 0 otherwise. k_i is the degree of node i. In each removal step, m, A, k_i and k_j are not changed. If a community has only one user node, we still regard it as a valid community. We save the results to an output text file.
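A minimal sketch of this formula in plain Python (independent of the Spark implementation below; `communities`, `edges` and `degrees` are hypothetical stand-ins for a community division, the original edge set and the original node degrees):

```python
def modularity(communities, edges, degrees):
    """Q(G, S) = (1/2m) * sum over s in S, i in s, j in s of (A_ij - k_i*k_j/(2m))."""
    m = len(edges)
    q = 0.0
    for community in communities:
        for i in community:
            for j in community:
                # A_ij is 1 if (i, j) is an edge of the original graph, else 0.
                a_ij = 1.0 if tuple(sorted((i, j))) in edges else 0.0
                q += a_ij - (degrees[i] * degrees[j]) / (2.0 * m)
    return q / (2.0 * m)

# Example: a path graph A-B-C-D split into two 2-node communities.
edges = {("A", "B"), ("B", "C"), ("C", "D")}
degrees = {"A": 1, "B": 2, "C": 2, "D": 1}
print(modularity([["A", "B"], ["C", "D"]], edges, degrees))  # ≈ 0.1667
```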
## Output file format

The betweenness calculation output is saved in the path specified by the `<betweenness_output_file_path>` parameter of the execution command. This output is a txt file which contains the betweenness of each edge in the original graph. The format of each line is `('user_id1', 'user_id2'), betweenness value`. The results are sorted first by betweenness value in descending order and then by the first user_id in the tuple in lexicographical order (the user_id is of type string). The two user_ids in each tuple are also in lexicographical order.
The community detection output is saved in the path specified by the `<community_output_file_path>` parameter of the execution command. The resulting communities are saved in a txt file, with each line representing one community in the format `'user_id1', 'user_id2', 'user_id3', 'user_id4', ...`. The results are sorted first by community size in ascending order and then by the first user_id in the community in lexicographical order (the user_id is of type string). The user_ids in each community are also in lexicographical order. If there is only one node in a community, we still regard it as a valid community.
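For illustration, with hypothetical user ids and betweenness values, a betweenness output file could look like:

```
('userA', 'userB'), 12.0
('userB', 'userC'), 7.5
```

and a community output file could look like:

```
'userD'
'userA', 'userB', 'userC'
```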

community_detection.py

+219
@@ -0,0 +1,219 @@
from pyspark import SparkConf, SparkContext
import argparse
import time
from itertools import combinations
from collections import deque
import copy

# spark-submit community_detection.py 7 ~/Downloads/sample_data.csv ~/Downloads/output_betweenness.txt ~/Downloads/output_community.txt

def create_adjacency_graph(user_edges_rdd):
    """Build an undirected adjacency RDD of (node, set of neighbors) pairs from an edge RDD."""
    rdd_1 = user_edges_rdd.groupByKey().mapValues(set)
    rdd_2 = user_edges_rdd.map(lambda x: (x[1], x[0])).groupByKey().mapValues(set)
    return rdd_1.union(rdd_2).reduceByKey(lambda x, y: x.union(y))

def breadth_first_search(start_node, user_adjacency_dict):
    """
    Step 1 of the Girvan-Newman Algorithm.
    :return: a dictionary mapping each BFS level to the list of nodes on that level,
    e.g. {0: ['B'], 1: ['A', 'K', 'L', 'M'], 2: ['C', 'Y'], 3: ['D', 'E', 'F'], 4: ['Z']}
    """
    visited_nodes = {start_node}  # a set keeps membership checks O(1)
    nodes_current_level_list = [start_node]
    level_index = 0
    bfs_graph_dict = {}

    while len(nodes_current_level_list) != 0:
        bfs_graph_dict[level_index] = nodes_current_level_list
        level_index += 1
        nodes_next_level_list = []
        for current_node in nodes_current_level_list:
            current_neighbors = user_adjacency_dict[current_node]
            for neighbor in current_neighbors:
                if neighbor not in visited_nodes:
                    visited_nodes.add(neighbor)
                    nodes_next_level_list.append(neighbor)
        nodes_current_level_list = nodes_next_level_list
    return bfs_graph_dict

def generate_node_weights(bfs_graph_dict, user_adjacency_dict):
    """
    Step 2 of the Girvan-Newman Algorithm.
    :return: a dictionary mapping each node to the number of shortest paths
    that reach it from the root.
    """
    node_weights = {}
    levels = len(bfs_graph_dict)
    for root_level_node in bfs_graph_dict[0]:
        node_weights[root_level_node] = 1.0

    for current_level in range(1, levels):
        previous_level_nodes = set(bfs_graph_dict[current_level - 1])
        current_level_nodes = set(bfs_graph_dict[current_level])
        for node in current_level_nodes:
            node_neighbors = set(user_adjacency_dict[node])
            # A node's parents are its neighbors on the previous BFS level.
            parent_nodes = previous_level_nodes.intersection(node_neighbors)
            weight_sum = 0.0
            for parent in parent_nodes:
                weight_sum += node_weights[parent]
            node_weights[node] = weight_sum
    return node_weights

def generate_edge_weights(node_weights, bfs_graph_dict, user_adjacency_dict):
    """
    Step 3 of the Girvan-Newman Algorithm.
    :return: a dictionary mapping each (sorted) edge tuple to its credit, i.e.
    the share of shortest paths from the root that pass through that edge.
    """
    edge_weights = {}
    node_credits = {}
    levels = len(bfs_graph_dict)
    # Every bottom-level node starts with a credit of 1.
    for last_level_node in bfs_graph_dict[levels - 1]:
        node_credits[last_level_node] = 1

    # Walk the BFS levels bottom-up, distributing each child's credit to its
    # parents in proportion to the parents' shortest-path counts.
    for current_level in range(levels - 2, -1, -1):
        current_level_nodes = set(bfs_graph_dict[current_level])
        next_level_nodes = set(bfs_graph_dict[current_level + 1])
        for node in current_level_nodes:
            node_neighbors = set(user_adjacency_dict[node])
            child_nodes = next_level_nodes.intersection(node_neighbors)
            # Each non-root node also gets a credit of 1 for itself.
            credit_sum = 1.0 if current_level != 0 else 0.0
            for child in child_nodes:
                value = (node_credits[child] / node_weights[child]) * node_weights[node]
                edge_weights_sort_key = tuple(sorted([node, child]))
                edge_weights[edge_weights_sort_key] = value
                credit_sum += value
            node_credits[node] = credit_sum
    return edge_weights

def calculate_betweenness(start_node, user_adjacency_dict):
    """Run steps 1-3 of Girvan-Newman from a single root and return its edge credits."""
    bfs_graph_dict = breadth_first_search(start_node, user_adjacency_dict)
    node_weights = generate_node_weights(bfs_graph_dict, user_adjacency_dict)
    edge_weights = generate_edge_weights(node_weights, bfs_graph_dict, user_adjacency_dict)
    return edge_weights.items()

def fetch_connected_communities():
    """BFS over the (global) community_user_adjacency_dict to collect the
    connected components that remain after edge removals."""
    visited = set()
    connected_components = []
    for start_node in community_user_adjacency_dict.keys():
        detected_community = []
        queue = deque([start_node])
        while queue:
            node = queue.popleft()
            if node not in visited:
                visited.add(node)
                detected_community.append(node)
                node_neighbors = community_user_adjacency_dict[node]
                for neighbor in node_neighbors:
                    queue.append(neighbor)
        if len(detected_community) != 0:
            detected_community.sort()
            connected_components.append(detected_community)
    return connected_components

def calculate_modularity(community_list):
    """Compute the modularity of a division of the graph into communities, using
    the edges and node degrees of the original graph (m, A, k_i and k_j stay
    fixed across removal steps)."""
    modularity_sum = 0
    for community in community_list:
        if len(community) > 1:
            for node_i_index in range(0, len(community)):
                for node_j_index in range(node_i_index, len(community)):
                    node_i = community[node_i_index]
                    node_j = community[node_j_index]
                    modularity_sort_key = tuple(sorted([node_i, node_j]))
                    # A_ij: 1 if the edge exists in the original graph, else 0.
                    adjacent_matrix_value = 1.0 if modularity_sort_key in original_user_edges_set else 0.0
                    value = adjacent_matrix_value - (node_degrees_dict[node_i] * node_degrees_dict[node_j] * formula_first_part)
                    modularity_sum += value
    modularity_sum = modularity_sum * formula_first_part
    return modularity_sum

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("filter_threshold", type=int,
                        help="Enter the filter threshold to generate edges between user nodes")
    parser.add_argument("input_file_path", type=str, help="Enter the input file path")
    parser.add_argument("betweenness_output", type=str, help="Enter the path of the betweenness output file")
    parser.add_argument("community_output", type=str, help="Enter the path of the community output file")
    args = parser.parse_args()

    start = time.time()
    conf = SparkConf().setAppName("h4_task1").setMaster("local[*]").set("spark.driver.memory", "4g")\
        .set("spark.executor.memory", "4g")
    sc = SparkContext(conf=conf)

    threshold = args.filter_threshold

    # Read the input CSV, drop the header and map each user to the set of
    # businesses they reviewed.
    input_data_rdd = sc.textFile(args.input_file_path)
    header_line = input_data_rdd.first()
    input_data_rdd = input_data_rdd.filter(lambda x: x != header_line).map(lambda y: y.split(","))
    users_rdd = input_data_rdd.map(lambda x: (x[0], x[1])).groupByKey().mapValues(set)
    users_rdd.persist()
    users_dict = dict(users_rdd.collect())
    distinct_users = users_rdd.keys().collect()

    # Create an edge between two users if they have reviewed at least
    # `threshold` businesses in common.
    original_user_edges_list = []
    for temp_user in combinations(distinct_users, 2):
        if len(users_dict[temp_user[0]].intersection(users_dict[temp_user[1]])) >= threshold:
            edge_sort_key = tuple(sorted([temp_user[0], temp_user[1]]))
            original_user_edges_list.append(edge_sort_key)
    original_user_edges_set = set(original_user_edges_list)  # O(1) lookups in calculate_modularity

    num_user_edges_original_graph = len(original_user_edges_list)

    user_edges_rdd = sc.parallelize(original_user_edges_list).persist()
    user_adjacency_rdd = create_adjacency_graph(user_edges_rdd)
    user_adjacency_dict = user_adjacency_rdd.collectAsMap()
    node_degrees_dict = user_adjacency_rdd.map(lambda x: (x[0], len(x[1]))).collectAsMap()

    # Betweenness of each edge: sum the per-root edge credits, halve them (each
    # shortest path is found twice), then sort by descending betweenness and
    # ascending edge tuple.
    edge_betweenness_rdd = user_adjacency_rdd.keys().flatMap(lambda x: calculate_betweenness(x, user_adjacency_dict))\
        .reduceByKey(lambda a, b: a + b).mapValues(lambda y: y / 2.0).sortBy(lambda z: (-z[1], z[0]), ascending=True)

    edge_betweenness_values = edge_betweenness_rdd.collect()
    with open(args.betweenness_output, 'w') as file:
        for line in edge_betweenness_values:
            line_write = str(line[0]) + ", " + str(line[1]) + "\n"
            file.write(line_write)

    community_edges = deque(edge_betweenness_values)
    # Work on a copy of the adjacency dict so the original graph is preserved.
    community_user_adjacency_dict = copy.deepcopy(user_adjacency_dict)

    formula_first_part = (1 / (2 * num_user_edges_original_graph))
    global_maximum_modularity = -1.0
    final_communities = []

    # Girvan-Newman loop: remove the edge with the highest betweenness, compute
    # the modularity of the resulting components, keep the best division seen
    # so far, then re-compute the betweenness on the reduced graph.
    while len(community_edges) != 0:
        removed_edge = community_edges.popleft()[0]
        community_user_adjacency_dict[removed_edge[0]].remove(removed_edge[1])
        community_user_adjacency_dict[removed_edge[1]].remove(removed_edge[0])
        connected_communities = fetch_connected_communities()
        community_modularity = calculate_modularity(connected_communities)
        if community_modularity > global_maximum_modularity:
            global_maximum_modularity = community_modularity
            final_communities = connected_communities
        community_user_adjacency_rdd = sc.parallelize(community_user_adjacency_dict.items())
        community_edges_betweenness = community_user_adjacency_rdd.keys().\
            flatMap(lambda x: calculate_betweenness(x, community_user_adjacency_dict)).reduceByKey(lambda a, b: a + b).\
            mapValues(lambda y: y / 2.0).sortBy(lambda z: (-z[1], z[0]), ascending=True).collect()
        community_edges = deque(community_edges_betweenness)

    # Sort communities by size, then lexicographically, and write one per line.
    final_communities = sorted(final_communities, key=lambda x: (len(x), x))
    with open(args.community_output, 'w') as file:
        for community in final_communities:
            value = str(community).replace('[', '').replace(']', '') + "\n"
            file.write(value)
    end = time.time()
    # print("Duration: ", end - start)
