modified: twitterscraper/tweet.py #100
base: master
class Tweet:
-    def __init__(self, user, fullname, id, url, timestamp, text, replies, retweets, likes, html):
+    def __init__(self, user, fullname, id, url, timestamp, text, reply_to_id, replies, retweets, likes, html):
Placing a new argument at this position breaks backward compatibility. I suggest you move it to the end of the argument list.
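A minimal sketch of that suggestion, assuming the new field is appended last and given a default. The default value `"0"` is an assumption made here for illustration (the review only asks for the argument to move to the end); without some default, callers written against the old ten-argument signature would still fail.

```python
class Tweet:
    """Sketch of the constructor with the new field appended last."""

    def __init__(self, user, fullname, id, url, timestamp, text,
                 replies, retweets, likes, html, reply_to_id="0"):
        # The first ten parameters keep their original order, so existing
        # positional callers are unaffected.
        self.user = user
        self.fullname = fullname
        self.id = id
        self.url = url
        self.timestamp = timestamp
        self.text = text
        self.replies = replies
        self.retweets = retweets
        self.likes = likes
        self.html = html
        # New field; the "0" default is an assumption, not the PR's choice.
        self.reply_to_id = reply_to_id


# A call written against the old signature still works.
old_style = Tweet("alice", "Alice", "1", "/alice/status/1", None,
                  "hello", "0", "0", "0", "<p>hello</p>")

# New callers can pass the extra field by keyword.
new_style = Tweet("bob", "Bob", "2", "/bob/status/2", None,
                  "hi alice", "0", "0", "0", "<p>hi alice</p>",
                  reply_to_id="1")
```

Passing the new field by keyword also keeps call sites readable if further optional fields are added later.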
The newly implemented 'reply_to_user' is not passed to the Tweet class and hence will not appear in the output.
twitterscraper/tweet.py
Outdated
@@ -38,7 +39,8 @@ def from_soup(cls, tweet):
'span', 'ProfileTweet-action--favorite u-hiddenVisually').find(
'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0',
html=str(tweet.find('p', 'tweet-text')) or "",
)
reply_to_id = tweet.findChildren()[0]['data-conversation-id'] or '0',
This can also be achieved with
reply_to_id = tweet.find('div', 'tweet')['data-conversation-id'] or '0'
twitterscraper/tweet.py
Outdated
@@ -17,6 +17,7 @@ def __init__(self, user, fullname, id, url, timestamp, text, replies, retweets,
self.retweets = retweets
self.likes = likes
self.html = html
self.reply_to_id = reply_to_id
self.reply_to_id = 0 if id == reply_to_id else reply_to_id
sets it to zero if it is equal to the tweet id, i.e. if the tweet is not a reply to anyone. Giving reply_to_id a value even when the tweet is not a reply is misleading: people would have to compare id and reply_to_id before they can be sure it is a reply.
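The rule in that one-liner can be sketched as a small standalone helper. The function name `normalize_reply_to_id` is hypothetical, and returning the string `"0"` (rather than the integer `0`) is an assumption chosen to match the `or '0'` fallbacks used elsewhere in the diff.

```python
def normalize_reply_to_id(tweet_id, reply_to_id):
    """Return "0" when the tweet is not a reply, else the id it replies to.

    Hypothetical helper: a tweet whose reply_to_id equals its own id is
    treated as "not a reply", as described in the review comment.
    """
    # Compare as strings so "123" and 123 are treated as the same id.
    if str(tweet_id) == str(reply_to_id):
        return "0"
    return str(reply_to_id)


standalone = normalize_reply_to_id("1001", "1001")  # not a reply
reply = normalize_reply_to_id("1002", "1001")       # reply to tweet 1001
```

With this convention, consumers can test `reply_to_id != "0"` directly instead of comparing two fields.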
I know it is possible to retrieve the contents of a tweet if you know the username and id with "https://twitter.com//status/". One way to find out the username belonging to the original tweet is with the following command: This retrieves a JSON list containing, among other things, the username, screen_name and id_str of everyone who has participated in the conversation. If it is not possible to retrieve a tweet by id only, I suggest you also include the username of the original tweet.
Your suggestions are very useful! I store tweets in my database, so I had not considered the case where we need to fetch tweets online by id alone. I'll fix it.
if html_response:
    html = response.text
else:
-    json_resp = response.json()
+    json_resp = ujson.loads(response.text)
What is the difference between json.loads() and ujson.loads()? If there is no clear reason for using ujson instead of json, I prefer json.
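For context, ujson is a third-party package that offers a `loads` function with the same basic call shape as the standard library's, usually chosen for speed. If it were kept at all, one hedged option is to treat it as optional and fall back to `json`. The helper name `parse_payload` and the alias `fast_json` are hypothetical and not part of the project.

```python
import json

try:
    # Optional third-party dependency; only used when it is installed.
    import ujson as fast_json
except ImportError:
    # The standard library is always available and is the safe default.
    fast_json = json


def parse_payload(text):
    """Parse a JSON response body into Python objects (hypothetical helper)."""
    return fast_json.loads(text)


payload = parse_payload('{"items_html": "<li>tweet</li>", "has_more_items": true}')
```

For simple payloads like this one, both parsers return the same dictionary, so the standard library alone is sufficient unless profiling shows parsing to be a bottleneck.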
    limit_per_pool = roundup(limit, poolsize)
else:
-    limit_per_pool = None
+    limit_per_pool = limit
This change will result in twitterscraper scraping approximately P*limit tweets (where P is the poolsize) instead of the given limit, because every worker in the pool would receive the full limit. Please remove this change.
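The arithmetic behind this objection can be sketched as follows. The `roundup` below is a stand-in ceiling division written for this example, not necessarily the project's own helper, and `total_requested` is a hypothetical function that only models how many tweets the whole pool would ask for.

```python
def roundup(numerator, denominator):
    """Ceiling division; a stand-in for the project's helper of the same name."""
    return -(-numerator // denominator)


def total_requested(limit, poolsize, split_limit):
    """Model the number of tweets requested across all workers.

    split_limit=True  : each worker gets roundup(limit, poolsize)
    split_limit=False : each worker gets the full limit
    """
    per_worker = roundup(limit, poolsize) if split_limit else limit
    return per_worker * poolsize


limit, poolsize = 100, 20
with_split = total_requested(limit, poolsize, split_limit=True)
without_split = total_requested(limit, poolsize, split_limit=False)
```

With a limit of 100 and a pool of 20, splitting asks each worker for 5 tweets (100 in total), while handing every worker the full limit asks for 2000.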
@@ -38,7 +39,9 @@ def from_soup(cls, tweet):
'span', 'ProfileTweet-action--favorite u-hiddenVisually').find(
'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0',
html=str(tweet.find('p', 'tweet-text')) or "",
)
reply_to_id = tweet.find('div', 'tweet')['data-conversation-id'] or '0',
reply_to_user = tweet.find('div', 'tweet')['data-mentions'] or "",
This is already implemented in PR #98. It may be best to remove it here.
See added comments.
I added a new field in Tweet: reply_to_id.
If tweet A is a reply to tweet B, then reply_to_id = B.id;
If tweet A doesn't reply to any tweet, then reply_to_id = A.id.
This field lets us construct the reply tree of tweets.
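As a rough illustration of that last point, here is a minimal sketch of how such a tree could be assembled, assuming the semantics described above (a tweet whose reply_to_id equals its own id is a root). The function name `build_reply_tree` and the `(id, reply_to_id)` pair format are hypothetical.

```python
from collections import defaultdict


def build_reply_tree(tweets):
    """Group tweets into roots and a parent-id -> child-ids mapping.

    `tweets` is an iterable of (tweet_id, reply_to_id) pairs. A tweet whose
    reply_to_id equals its own id is treated as a root (not a reply).
    """
    children = defaultdict(list)
    roots = []
    for tweet_id, reply_to_id in tweets:
        if tweet_id == reply_to_id:
            roots.append(tweet_id)
        else:
            children[reply_to_id].append(tweet_id)
    return roots, dict(children)


# Tweet 2 replies to tweet 1, tweet 3 replies to tweet 2, tweet 4 stands alone.
sample = [("1", "1"), ("2", "1"), ("3", "2"), ("4", "4")]
roots, children = build_reply_tree(sample)
```

Walking `children` from each root then yields the full thread for that conversation.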