gylu/insight_coding_challenge_spring_2016

Insight_coding_challenge

The solution to this coding challenge is in src/average_degree.py

The solution is written in Python 3, so Python 3 must be installed in the environment

Overview of what src/average_degree.py does:

Note that steps 2 through 7 are performed for each tweet
Note that EDGE_LIST is a list of lists with the format: [[timestamp1,['hashtagX1','hashtagY1']], [timestamp2,['hashtagX2','hashtagY2']], ...etc]
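
As a small illustration of the EDGE_LIST format described above, the sketch below builds a list with two entries and reads it back. The hashtag values are made up, and representing timestamps as epoch seconds is an assumption; the README does not say how src/average_degree.py stores them.

```python
# Hypothetical sample data: each entry is [timestamp, [hashtag_a, hashtag_b]].
# Epoch-second timestamps are an assumption made for this illustration.
EDGE_LIST = [
    [1458841870, ["apache", "spark"]],
    [1458841880, ["hadoop", "storm"]],
]

# Each entry unpacks into a timestamp and the two nodes of the edge.
for timestamp, (node_a, node_b) in EDGE_LIST:
    print(timestamp, node_a, node_b)

# The "columns" of the list can be pulled apart with comprehensions.
timestamps = [entry[0] for entry in EDGE_LIST]
edges = [entry[1] for entry in EDGE_LIST]
```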
  1. Get tweets:
    • If it's a rate-limiting message, ignore it
  2. Check timestamp:
    • If the timestamp is more than 60 seconds older than newest_timestamp, discard the tweet and jump to the calc_average_degree() call at the end
    • If the timestamp is newer than newest_timestamp, update newest_timestamp
  3. Delete edges that are older than 60 seconds
  4. Find hashtags:
    • If the tweet has 2 or more hashtags, check for and remove all duplicates
    • If only 0 or 1 hashtag remains, discard the tweet and jump to the calc_average_degree() call
  5. Create edge entries (if the tweet has 2 or more valid hashtags):
    • Use combinations from itertools. E.g.: list(combinations(['hashtag1','hashtag2','hashtag3'],2)). This outputs a list of tuples.
    • Sort each edge entry alphabetically so that we don't have to check the reverse order. Do this by converting each tuple into a list and sorting it
  6. Insert each new edge entry into EDGE_LIST:
    • Check whether the edge already exists; if it does, update the timestamp of that edge instead (no need to check for reverse order, because each edge entry is already sorted)
  7. Call calc_average_degree():
    • Concatenate the 2 columns of nodes in EDGE_LIST and count the entries (this is the sum of degrees)
    • Concatenate the 2 columns of nodes in EDGE_LIST, remove duplicates, and count the entries (this is the total number of nodes)
    • Divide the sum of degrees by the total number of nodes to get the average degree
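
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/average_degree.py: the helper names make_edges, evict_old_edges, insert_edges and process_tweet are hypothetical, timestamps are assumed to be plain seconds, and an edge that is exactly 60 seconds old is assumed to stay in the window. Only calc_average_degree, EDGE_LIST and newest_timestamp are names taken from the description.

```python
from itertools import combinations

WINDOW_SECONDS = 60  # window length from the steps above


def make_edges(hashtags):
    """Remove duplicate hashtags and return each pair as a sorted list."""
    unique = sorted(set(hashtags))
    if len(unique) < 2:
        return []
    # The input is already sorted, so every pair comes out sorted too.
    return [list(pair) for pair in combinations(unique, 2)]


def evict_old_edges(edge_list, newest_timestamp):
    """Keep edges that are at most WINDOW_SECONDS old (boundary is an assumption)."""
    return [e for e in edge_list if newest_timestamp - e[0] <= WINDOW_SECONDS]


def insert_edges(edge_list, timestamp, edges):
    """Add new edges; for an existing edge, refresh its timestamp instead."""
    for edge in edges:
        for entry in edge_list:
            if entry[1] == edge:
                # Assumption: never move a timestamp backwards.
                entry[0] = max(entry[0], timestamp)
                break
        else:
            edge_list.append([timestamp, edge])


def calc_average_degree(edge_list):
    """Sum of degrees divided by the number of distinct nodes."""
    endpoints = [node for _, edge in edge_list for node in edge]
    if not endpoints:
        return 0.0
    return len(endpoints) / len(set(endpoints))


def process_tweet(edge_list, newest_timestamp, timestamp, hashtags):
    """Run steps 2 through 7 for one tweet and return the updated state."""
    if newest_timestamp - timestamp > WINDOW_SECONDS:
        # Tweet is outside the window: skip straight to the average.
        return edge_list, newest_timestamp, calc_average_degree(edge_list)
    newest_timestamp = max(newest_timestamp, timestamp)
    edge_list = evict_old_edges(edge_list, newest_timestamp)
    insert_edges(edge_list, timestamp, make_edges(hashtags))
    return edge_list, newest_timestamp, calc_average_degree(edge_list)


# Usage with made-up tweets, given as (timestamp, hashtags).
EDGE_LIST, newest_timestamp = [], 0
for ts, tags in [(100, ["spark", "apache"]),
                 (110, ["apache", "hadoop", "storm"]),
                 (165, ["flink", "spark"])]:
    EDGE_LIST, newest_timestamp, average = process_tweet(
        EDGE_LIST, newest_timestamp, ts, tags)
    print(average)  # prints 1.0, then 2.0, then 1.6
```

In the third tweet the edge from timestamp 100 is 65 seconds old and is dropped, which is why the average falls from 2.0 to 1.6 (8 edge endpoints over 5 distinct nodes).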
    

src/average_degree.py already imports all the packages it needs.

The following packages are used/imported:

  • import time - needed to deal with timestamps
  • import sys - for reading the arguments of the run.sh command
  • import json - for processing JSON
  • import os - for checking if output.txt already exists, and deleting it if it does
  • from itertools import combinations - used to generate combinations (order doesn't matter). Taken from: https://rosettacode.org/wiki/Combinations#Python
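
To show how these imports might fit together, here is a hedged sketch of reading one input line and preparing the output file. The function names parse_line and reset_output are hypothetical, and the tweet layout (a created_at string plus entities.hashtags, with any record lacking created_at treated as a rate-limit message) is an assumption about the input format rather than something this README states. calendar.timegm is an extra standard-library call used here to get UTC seconds; it is not in the import list above.

```python
import calendar  # assumption: not in the README's import list
import json
import os
import sys
import time

# Assumed timestamp layout, e.g. "Thu Mar 24 17:51:10 +0000 2016".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def parse_line(line):
    """Return (timestamp_seconds, hashtags), or None for a non-tweet record."""
    record = json.loads(line)
    if "created_at" not in record:
        # Assumption: rate-limit messages carry no created_at field.
        return None
    parsed = time.strptime(record["created_at"], CREATED_AT_FORMAT)
    timestamp = calendar.timegm(parsed)  # struct_time (UTC) to epoch seconds
    hashtags = [tag["text"]
                for tag in record.get("entities", {}).get("hashtags", [])]
    return timestamp, hashtags


def reset_output(output_path):
    """Delete a leftover output file so results are not appended to old ones."""
    if os.path.exists(output_path):
        os.remove(output_path)


def main(argv):
    """Hypothetical entry point: argv[1] is the input file, argv[2] the output."""
    input_path, output_path = argv[1], argv[2]
    reset_output(output_path)
    with open(input_path) as tweets:
        for line in tweets:
            parsed = parse_line(line)
            if parsed is not None:
                print(parsed)
```

A script built this way would call main(sys.argv) at the bottom, with the two paths supplied by run.sh.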

Note that this repo started off as a clone of https://github.com/InsightDataScience/coding-challenge

For testing, call "./run_tests.sh" from within the insight_testsuite directory

About

Insight Data Science Coding Challenge - old, spring 2016
