Note that EDGE_LIST is a list of lists with the format: [[timestamp1,['hashtagX1','hashtagY1']], [timestamp2,['hashtagX2','hashtagY2']], ...etc]
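As a small illustration of that layout, here is a sketch with made-up values. The hashtags are sample data, and storing timestamps as integer epoch seconds is an assumption, not something this README specifies.

```python
# Hypothetical sample data showing the EDGE_LIST layout:
# each entry is [timestamp, [hashtag_x, hashtag_y]].
# Timestamps are assumed to be integer epoch seconds.
EDGE_LIST = [
    [1446141061, ['apache', 'spark']],
    [1446141070, ['hadoop', 'spark']],
]

# Unpack one entry at a time.
for timestamp, (hashtag_x, hashtag_y) in EDGE_LIST:
    print(timestamp, hashtag_x, hashtag_y)
```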
-
Get tweets:
-
If a line is a rate-limiting message rather than a tweet, ignore it
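A minimal sketch of this step, assuming the input is one JSON object per line and that a rate-limiting message is an object with a top-level `limit` key. The function name `parse_tweets` and the sample lines are hypothetical.

```python
import json

def parse_tweets(lines):
    """Yield decoded tweets, skipping rate-limiting messages."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: a rate-limiting message has a top-level "limit" key.
        if 'limit' in record:
            continue
        yield record

# Hypothetical sample input: one tweet and one rate-limiting message.
sample = [
    '{"created_at": "Thu Oct 29 17:51:01 +0000 2015", "text": "hello"}',
    '{"limit": {"track": 5}}',
]
tweets = list(parse_tweets(sample))
```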
-
Check timestamp:
-
If the timestamp is more than 60 seconds older than newest_timestamp, discard the tweet and jump straight to calc_average_degree()
-
If the timestamp is newer than newest_timestamp, update newest_timestamp
-
Delete edges from EDGE_LIST whose timestamps are more than 60 seconds older than newest_timestamp
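The timestamp check could be sketched as follows. This is only an illustration: `apply_window` and its parameters are hypothetical names, timestamps are assumed to be numeric seconds, and treating an age of exactly 60 seconds as still inside the window is an assumption.

```python
def apply_window(edge_list, newest_timestamp, tweet_timestamp, window=60):
    """Return (edge_list, newest_timestamp, accepted) after the timestamp check."""
    # A tweet that is too old is rejected and changes nothing.
    if newest_timestamp - tweet_timestamp > window:
        return edge_list, newest_timestamp, False
    # Advance the newest timestamp if this tweet is newer.
    newest_timestamp = max(newest_timestamp, tweet_timestamp)
    # Keep only edges that are still inside the window.
    kept = [edge for edge in edge_list if newest_timestamp - edge[0] <= window]
    return kept, newest_timestamp, True

# Hypothetical sample data.
edges = [[100, ['a', 'b']], [150, ['b', 'c']]]
kept, newest, accepted = apply_window(edges, 150, 170)
```

When `accepted` is false, the caller would skip the hashtag steps and go directly to calc_average_degree().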
-
Find hashtags:
-
If the tweet has 2 or more hashtags, check for and remove all duplicates
- If only 0 or 1 hashtag remains, discard the tweet and jump to calc_average_degree()
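A possible sketch of the hashtag step, assuming each tweet is a decoded JSON dict that lists its hashtags under `entities.hashtags`, each with a `text` field. The helper name `unique_hashtags` is hypothetical, and the sketch does no case normalization because these notes do not mention any.

```python
def unique_hashtags(tweet):
    """Return the tweet's distinct hashtags in alphabetical order."""
    # Assumption: hashtags live under tweet['entities']['hashtags'][i]['text'].
    hashtags = tweet.get('entities', {}).get('hashtags', [])
    return sorted({tag['text'] for tag in hashtags})

# Hypothetical sample tweet with a repeated hashtag.
tweet = {'entities': {'hashtags': [
    {'text': 'Spark'}, {'text': 'Apache'}, {'text': 'Spark'},
]}}
tags = unique_hashtags(tweet)

# Fewer than 2 distinct hashtags means the tweet creates no edges.
creates_edges = len(tags) >= 2
```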
-
Create edge entries (only if the tweet has 2 or more valid hashtags):
-
Use combinations from the itertools package that was imported. E.g.: list(combinations(['hashtag1','hashtag2','hashtag3'],2)). This outputs a list of tuples.
- Sort each edge entry alphabetically so that we don't have to check the reverse order. Do this by converting each tuple into a list and sorting it
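The two points above can be sketched in a few lines. `make_edges` is a hypothetical helper name. `sorted()` applied to a tuple returns a sorted list, which covers the conversion and the sort in one call.

```python
from itertools import combinations

def make_edges(hashtags):
    """Return every pair of hashtags as an alphabetically sorted list."""
    return [sorted(pair) for pair in combinations(hashtags, 2)]

# Hypothetical sample hashtags.
edges = make_edges(['spark', 'apache', 'hadoop'])
# edges == [['apache', 'spark'], ['hadoop', 'spark'], ['apache', 'hadoop']]
```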
-
Insert each new edge entry into EDGE_LIST:
-
Check whether the edge already exists. If it does, update the timestamp of that edge instead of adding a duplicate (no need to check for reverse order, because each edge entry is already sorted)
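One way the insert-or-update step might look is shown below. `upsert_edge` is a hypothetical name, and keeping the later of the two timestamps when the edge already exists is an assumption made here to handle tweets that arrive out of order.

```python
def upsert_edge(edge_list, timestamp, edge):
    """Add edge to edge_list, or refresh its timestamp if it already exists."""
    for entry in edge_list:
        if entry[1] == edge:
            # Assumption: keep the later timestamp for an existing edge.
            entry[0] = max(entry[0], timestamp)
            return
    edge_list.append([timestamp, edge])

# Hypothetical sample data.
edge_list = [[100, ['apache', 'spark']]]
upsert_edge(edge_list, 120, ['apache', 'spark'])   # existing edge: timestamp refreshed
upsert_edge(edge_list, 125, ['hadoop', 'spark'])   # new edge: appended
```

The linear scan follows the list-of-lists layout of EDGE_LIST. A dict keyed by the edge would make the lookup faster, at the cost of a different data structure.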
-
Call calc_average_degree():
-
Concatenate the 2 columns of nodes in the EDGE_LIST, and count the entries (each appearance of a node adds 1 to its degree, so this will be the sum of degrees)
-
Concatenate the 2 columns of nodes in the EDGE_LIST, remove duplicates, and count the entries (this will be the total number of nodes)
-
Divide the sum of degrees by the total number of nodes to get the average degree
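The three steps of the calculation can be sketched as below. The function name comes from these notes, but the signature, the return value of 0.0 for an empty graph, and the use of Python 3 true division are assumptions.

```python
def calc_average_degree(edge_list):
    """Return the average degree of the graph described by edge_list."""
    # Concatenate both node columns: each appearance adds 1 to a node's degree.
    endpoints = [node for _, edge in edge_list for node in edge]
    if not endpoints:
        return 0.0  # assumption: an empty graph has average degree 0
    total_degree = len(endpoints)
    total_nodes = len(set(endpoints))
    return total_degree / total_nodes

# Hypothetical sample data: 2 edges over 3 nodes.
sample_edges = [[100, ['apache', 'spark']], [110, ['hadoop', 'spark']]]
average = calc_average_degree(sample_edges)
```

For the sample, the sum of degrees is 4 and there are 3 nodes, so the average is 4/3.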
The following packages are used/imported:
- import time - needed to deal with timestamps
- import sys - for reading the arguments of the run.sh command
- import json - for processing json
- import os - for checking if output.txt already exists, and deleting it if it does
- from itertools import combinations - used to generate combinations (order doesn't matter). Taken from: https://rosettacode.org/wiki/Combinations#Python
Note that this repo started off as a clone of https://github.com/InsightDataScience/coding-challenge
For testing, call "./run_tests.sh" from within the insight_testsuite directory