forked from nursnaaz/Deeplearning-and-NLP
Commit
Showing 47 changed files with 1,448,948 additions and 0 deletions.
Binary file modified: BIN +2 KB (120%) 06 - Day - 6 Advanced NLP and its Applications/06 - NLP Applications/.DS_Store
Binary file not shown.
Binary file added: BIN +8 KB ...dvanced NLP and its Applications/06 - NLP Applications/06 - Text Classification/.DS_Store
Binary file not shown.
File renamed without changes.
File renamed without changes.
7 changes: 7 additions & 0 deletions
...Advanced NLP and its Applications/06 - NLP Applications/07 - twitter-scrapping/.gitignore
@@ -0,0 +1,7 @@
auth.py
*~
*.pyc
*.json
*.tsv
*.csv
tweepy/
339 changes: 339 additions & 0 deletions
... 6 Advanced NLP and its Applications/06 - NLP Applications/07 - twitter-scrapping/LICENSE
Large diffs are not rendered by default.
4 changes: 4 additions & 0 deletions
... 6 Advanced NLP and its Applications/06 - NLP Applications/07 - twitter-scrapping/LOG.txt
@@ -0,0 +1,4 @@

DONE! Completed Successfully

DONE! Completed Successfully
83 changes: 83 additions & 0 deletions
...NLP and its Applications/06 - NLP Applications/07 - twitter-scrapping/README.md
@@ -0,0 +1,83 @@
twitter-python
==============

Simple example scripts for Twitter data collection with [Tweepy](http://www.github.com/tweepy/tweepy) in Python

# Getting started
To collect data you need a Twitter account and a Twitter application. Assuming you already have a Twitter account, use the following instructions to create a Twitter application.

## Twitter application

1. Open a web browser and go to https://apps.twitter.com/app/new
2. Sign in with your normal Twitter username and password if you are not already signed in.
3. Enter a name, description, and temporary website (e.g. http://coming-soon.com).
4. Read and accept the terms and conditions – note principally that you agree not to distribute any of the raw tweet data and to delete tweets from your collection if they are deleted from Twitter in the future.
5. Click "Create your Twitter application".
6. Click on the "API Keys" tab and then click "Create my access token".
7. Wait a minute or two and press your browser's refresh button (or ctrl+r / cmd+r).
8. You should now see new fields labeled "Access token" and "Access token secret" at the bottom of the page.
9. You now have a Twitter application that can act on behalf of your Twitter user to read data from Twitter.

## Connect your Twitter application to these scripts
1. Download the code in the repository if you haven't already.
2. Open the auth_example.py file in a text editor (gedit, kate, notepad, textmate, etc.).
3. Update the following lines with the information displayed in the web browser for your application:

```python
consumer_key="..." #Note this is now called API key
consumer_secret="..." #Note this is now called API secret
access_token="..."
access_token_secret="...."
#Replace the … with whatever values are shown in your web browser. Be sure to keep the quotation marks.
```

4. Save the file as auth.py (not the same name as before).

You are now ready to run a simple example.
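For reference, here is a minimal sketch of how scripts like streaming_simple.py typically wire these credentials into Tweepy (the exact code in this repository may differ; `TwitterAuth` mirrors the class in auth_example.py):

```python
import tweepy
from auth import TwitterAuth  # the auth.py file you just saved

# OAuth 1a user authentication with the four values from your application page
auth = tweepy.OAuthHandler(TwitterAuth.consumer_key, TwitterAuth.consumer_secret)
auth.set_access_token(TwitterAuth.access_token, TwitterAuth.access_token_secret)
api = tweepy.API(auth)

print(api.verify_credentials().screen_name)  # prints your username if auth works
```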
## First run (streaming_simple.py)
1. Open your terminal/console and go to the folder where you saved the files (e.g. cd "/absolute path/to/my/files").
2. Run streaming_simple.py by typing
   python streaming_simple.py
3. You should see the text of tweets appearing. Press ctrl+c to stop collecting tweets.

If you do not see tweets appearing, check that you have [Tweepy](http://www.github.com/tweepy/tweepy) installed correctly and that the information in auth.py is correct.
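If Tweepy is missing, it can usually be installed with pip:

```
pip install tweepy
```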
In addition to printing out the text of tweets, streaming_simple.py also saves the results to output.json.

## Convert json to spreadsheet (data2spreadsheet.py)
Twitter supplies tweets in JSON format. See the [Twitter documentation](https://dev.twitter.com/docs/platform-objects/tweets) for the fields that are available. To create a spreadsheet of collected tweets, we select certain fields and include these; this is what the data2spreadsheet.py file does. Note that this file does not include every possible field in a tweet. You may wish to modify the file if you need a particular field that is not currently included.
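As a rough illustration of what selecting fields means, the sketch below pulls a few common fields from one line of output.json (field names follow the classic tweet object; adjust to taste):

```python
import json

with open("output.json") as fh:
    tweet = json.loads(fh.readline())  # one tweet per line

# Pick out a handful of the available fields
print(tweet["user"]["screen_name"], tweet["created_at"], tweet["text"])
```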
The following steps assume you've run either streaming_simple.py or streaming.py and have file(s) of tweets with one tweet per line.

1. Open the terminal/console and go to the folder with the files.
2. Run the following (assuming output.json is the name of the file with tweets):
```
python data2spreadsheet.py output.json
```
3. This will produce ``output/overall_1234.tsv``, where 1234 is the unix timestamp when the file was created (the number of seconds since January 1, 1970). The file can be opened in LibreOffice Calc, Excel, etc. If prompted, select that columns (fields) are separated with a tab character.

## Production (streaming.py)
streaming.py is a more production-ready script. It does not print tweets as they are received, but simply stores them in a file named for the day they are received, starting a new file at midnight every day. It also has additional error checking / recovery code.
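streaming.py itself is not shown in this diff, but the daily-rotation idea can be sketched roughly as follows (the names here are illustrative, not the script's actual ones):

```python
import datetime

def daily_filename(output_dir):
    # One file per calendar day, e.g. output/2014-05-01.json;
    # at midnight the date changes, so a new file begins.
    return "%s/%s.json" % (output_dir, datetime.date.today().isoformat())
```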
1. Create a directory (folder) to store the tweets in.
2. Within the directory, create a file called FILTER (all uppercase).
3. Open FILTER in a text editor and enter the terms you wish to track, one per line (an example follows below). Save and close it.
4. Open streaming.py and set the name of the directory you created.
5. Copy streaming.py, the output directory (outputDir), tweepy, and anything else needed to a server that is always on and connected.
6. Start collecting tweets with
```
nohup python streaming.py >> logfile 2>> errorfile
```
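For example, the FILTER file included in this commit's outputDir contains a single tracking term:

```
IPL
```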
## Reference
If you use this code in support of an academic publication, please cite:

Hale, S. A. (2014) Global Connectivity and Multilinguals in the Twitter Network. In Proceedings of the 2014 ACM Annual Conference on Human Factors in Computing Systems, ACM (Montreal, Canada).

This code is released under the [GPLv2 license](http://www.gnu.org/licenses/gpl-2.0.html). Please [contact me](http://www.scotthale.net/blog/?page_id=9) if you wish to use the code in ways that the GPLv2 license does not permit.

More details, related code, and the original academic paper using this code are available at http://www.scotthale.net/pubs/?chi2014 .
11 changes: 11 additions & 0 deletions
...ced NLP and its Applications/06 - NLP Applications/07 - twitter-scrapping/auth_example.py
@@ -0,0 +1,11 @@

class TwitterAuth:
    # Go to http://dev.twitter.com and create an app.
    # The consumer key and secret will be generated for you after
    consumer_key="E74RnPRDRDH7jh4S7z9iV7iBV"
    consumer_secret="xC3weX8mdgOFZQgTjFvmqjnyZWs1P5T65oBiWozvq55Ij3Vs47"

    # After the step above, you will be redirected to your app's page.
    # Create an access token under the "Your access token" section
    access_token="1380496082132631554-D28qYe0UiFobIdZhHJAjPWQF3ldC39"
    access_token_secret="Bbc3jBrfjqaOk2r3oD9G2gcIaNzcO647fK9GyBVAZE1JY"
82 changes: 82 additions & 0 deletions
...Applications/06 - NLP Applications/07 - twitter-scrapping/data2metions_retweet_network.py
@@ -0,0 +1,82 @@
#Twitter json file(s) to mentions/retweets network with networkx

import networkx as nx
try:
    import json
except ImportError:
    import simplejson as json
import codecs
import time
import os
import sys

outputDir = "output/" #Output directory
os.system("mkdir -p %s"%(outputDir)) #Create directory if it doesn't exist

fhLog = codecs.open("LOG.txt",'a','UTF-8')
def logPrint(s):
    fhLog.write("%s\n"%s)
    print(s)

def parse(graph,tweet):
    author=tweet["user"]["screen_name"]
    followers=tweet["user"]["followers_count"]
    friends=tweet["user"]["friends_count"]
    location=tweet["user"]["location"] if tweet["user"]["location"] else ""
    timezone=tweet["user"]["time_zone"] if tweet["user"]["time_zone"] else ""
    utc=tweet["user"]["utc_offset"] if tweet["user"]["utc_offset"] else ""

    other_users=[]
    if tweet.get("in_reply_to_screen_name"): #skip tweets where this field is null
        other_users.append(tweet["in_reply_to_screen_name"])

    #tweet["entities"]["user_mentions"][*]["screen_name"]
    if "entities" in tweet and "user_mentions" in tweet["entities"]:
        users=tweet["entities"]["user_mentions"]
        for u in users:
            sn=u["screen_name"]
            if not sn in other_users:
                other_users.append(sn)

    try:
        graph.nodes[author]["tweets"]+=1
    except KeyError: #author not yet in graph
        graph.add_node(author,followers=followers,friends=friends,location=location,timezone=timezone,utc_offset=utc,tweets=1)

    for target in other_users:
        try:
            graph[author][target]["weight"]+=1
        except KeyError: #edge not yet in graph
            graph.add_edge(author,target,weight=1)

graph=nx.DiGraph()

for file in sys.argv[1:]:
    print(file)
    fhb = codecs.open(file,"r")

    firstLine=fhb.readline()

    j=json.loads(firstLine)
    if "statuses" in j:
        #We have the search API. The first (and only) line is a json object
        for tweet in j["statuses"]:
            parse(graph,tweet)
    else:
        #We have the streaming API: each line is a json object
        parse(graph,j)
        for line in fhb:
            parse(graph,json.loads(line))
    fhb.close()

filename=outputDir+"overall_%s.graphml"%int(time.time())
print("Writing graphml file to {0}...".format(filename))
nx.write_graphml(graph,filename,prettyprint=True)
print("Done.")
152 changes: 152 additions & 0 deletions
...NLP and its Applications/06 - NLP Applications/07 - twitter-scrapping/data2spreadsheet.py
@@ -0,0 +1,152 @@
try:
    import json
except ImportError:
    import simplejson as json
import codecs
import time
import datetime
import os
import sys

#Just used to highlight matches in tweets. This file does not query anything from Twitter!
queries=["term1","term2"]

outputDir = "output/" #Output directory
os.system("mkdir -p %s"%(outputDir)) #Create directory if it doesn't exist

fhLog = codecs.open("LOG.txt",'a','UTF-8')
def logPrint(s):
    fhLog.write("%s\n"%s)
    print(s)

class Tweet:
    def __init__(self):
        self.keywords=[]
        self.links=["","",""]
        self.lang=""
        self.langConf=""

    @staticmethod
    def csvHeader():
        row = "\"\t\"".join(("URL", "Keywords", "Keyword Count", "DateTime", "Favorite Count", "Retweet", "Lang", "LinkCount", "Link1", "Link2", "Link3", "Author", "Text", "Followers", "Friends", "Location", "Timezone", "UTC Offset"))
        row = "\"%s\"\n"%row
        return row

    def csvRow(self):
        row = "\"\t\"".join((
            str(self.url),
            ",".join(self.keywords),
            str(len(self.keywords)),
            #str(datetime.datetime.fromtimestamp(self.date).strftime('%Y-%m-%d %H:%M:%S')),
            str(self.date),
            str(self.favorite),
            str(self.retweet),
            str(self.lang),
            str(self.urlCount),
            self.links[0],
            self.links[1],
            self.links[2],
            self.author,
            self.clean_text(),
            str(self.followers),
            str(self.friends),
            self.location,
            self.timezone,
            str(self.utc)
        ))
        row = "\"%s\"\n"%row
        return row

    def clean_text(self):
        t = self.text.replace("\"","")
        t = t.replace("\n"," ") #chain on t so the quote removal is kept
        return t

    def parse(self,data):
        self.url="http://twitter.com/{0}/status/{1}".format(data["user"]["id_str"],data["id_str"])
        self.date=data["created_at"]
        self.favorite=data["favorite_count"]
        self.retweet=data["retweet_count"]
        self.author=data["user"]["screen_name"]
        #"%s - Twitter"%data["trackback_author_name"] #This is the retweet author's name
        self.text=data["text"]

        self.lang=data["lang"]

        self.followers=data["user"]["followers_count"]
        self.friends=data["user"]["friends_count"]
        self.location=data["user"]["location"] if data["user"]["location"] else ""
        self.timezone=data["user"]["time_zone"] if data["user"]["time_zone"] else ""
        self.utc=data["user"]["utc_offset"] if data["user"]["utc_offset"] else ""

        #Links
        text = self.text
        self.urlCount = text.count("http://") + text.count("https://")

        count=0
        words=text.split()
        for w in words:
            if w.count("http://") or w.count("https://"):
                w=w[(w.find("http")):]
                w=w.strip("():!?. \t\n\r")
                if count>2:
                    self.links[2]=self.links[2]+","+w
                else:
                    self.links[count]=w
                count=count+1

    def __hash__(self):
        return hash((self.url, self.location)) #hash takes a single (tuple) argument

    def __eq__(self, other):
        return self.url==other.url

allTweets={}
def parse(tweet):
    tw=Tweet()
    tw.parse(tweet)

    if not (tw.url in allTweets):
        txt=tw.text.lower()
        for query in queries:
            if query in txt:
                tw.keywords.append(query)
        #if len(tw.keywords)>0:
        allTweets[tw.url]=tw

fhOverall=None

for file in sys.argv[1:]:
    print(file)
    fhb = codecs.open(file,"r")

    firstLine=fhb.readline()

    j=json.loads(firstLine)
    if "statuses" in j:
        #We have the search API. The first (and only) line is a json object
        for tweet in j["statuses"]:
            parse(tweet)
    else:
        #We have the streaming API: each line is a json object
        parse(j)
        for line in fhb:
            parse(json.loads(line))
    fhb.close()

fhOverall=codecs.open(outputDir+"overall_%s.tsv"%int(time.time()),"w","UTF-8")
fhOverall.write(Tweet.csvHeader())
for url in allTweets:
    tweet=allTweets[url]
    fhOverall.write(tweet.csvRow())

fhOverall.close()

logPrint("\nDONE! Completed Successfully")

fhLog.close()
2 changes: 2 additions & 0 deletions
...plications/06 - NLP Applications/07 - twitter-scrapping/output/overall_1617995504.graphml
@@ -0,0 +1,2 @@
<?xml version='1.0' encoding='utf-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"><graph edgedefault="directed"></graph></graphml>
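Assuming networkx is installed, a quick sketch for inspecting such a file (this particular graph is empty, since no nodes or edges were added):

```python
import networkx as nx

g = nx.read_graphml("output/overall_1617995504.graphml")
print(g.number_of_nodes(), g.number_of_edges())  # 0 0 for this file
```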
1 change: 1 addition & 0 deletions
...ed NLP and its Applications/06 - NLP Applications/07 - twitter-scrapping/outputDir/FILTER
@@ -0,0 +1 @@
IPL