modified: twitterscraper/tweet.py #100

Open
wants to merge 4 commits into
base: master

Conversation

@hengruo hengruo commented Mar 4, 2018

I added a new field to Tweet: reply_to_id.
If tweet A is a reply to tweet B, then reply_to_id = B.id.
If tweet A is not a reply to any tweet, then reply_to_id = A.id.

With this field we can reconstruct the reply tree of a set of tweets.
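
A minimal sketch of how such a reply tree might be built from the new field. It assumes tweets are available as (id, reply_to_id) pairs and follows the convention described above, where a tweet that is not a reply carries its own id. The function name `build_reply_tree` is hypothetical and not part of twitterscraper.

```python
from collections import defaultdict


def build_reply_tree(tweets):
    """Group tweet ids by the id they reply to.

    `tweets` is an iterable of (id, reply_to_id) pairs. A tweet whose
    reply_to_id equals its own id is treated as a root (not a reply).
    """
    children = defaultdict(list)
    roots = []
    for tweet_id, reply_to_id in tweets:
        if tweet_id == reply_to_id:
            roots.append(tweet_id)
        else:
            children[reply_to_id].append(tweet_id)
    return roots, dict(children)


# Illustrative ids only: tweets "2" and "3" reply to "1"; "4" stands alone.
roots, children = build_reply_tree(
    [("1", "1"), ("2", "1"), ("3", "1"), ("4", "4")]
)
```

Here `roots` holds the tweets that start a thread and `children` maps each parent id to the ids of its replies.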

 class Tweet:
-    def __init__(self, user, fullname, id, url, timestamp, text, replies, retweets, likes, html):
+    def __init__(self, user, fullname, id, url, timestamp, text, reply_to_id, replies, retweets, likes, html):
Owner

Placing a new argument at this location breaks backward compatibility for any caller that passes the arguments positionally. I suggest you move it to the end of the argument list.
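
A sketch of what the suggested signature could look like. The default value of None is an assumption added for illustration; the review comment only asks for the argument to be moved to the end.

```python
class Tweet:
    def __init__(self, user, fullname, id, url, timestamp, text,
                 replies, retweets, likes, html, reply_to_id=None):
        self.user = user
        self.fullname = fullname
        self.id = id
        self.url = url
        self.timestamp = timestamp
        self.text = text
        self.replies = replies
        self.retweets = retweets
        self.likes = likes
        self.html = html
        # New field goes last (default is an assumption), so existing
        # positional calls keep binding the old arguments correctly.
        self.reply_to_id = reply_to_id


# A call written against the old ten-argument signature still works.
old_style = Tweet("user", "Full Name", "1", "/user/status/1", "2018-03-04",
                  "hello", "0", "0", "0", "<p>hello</p>")
```

With the argument inserted in the middle instead, the same call would shift `replies`, `retweets`, `likes` and `html` one position to the left and then fail because one argument is missing.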

Owner

The newly implemented 'reply_to_user' is not passed to the Tweet class and hence will not appear in the output.

@@ -38,7 +39,8 @@ def from_soup(cls, tweet):
'span', 'ProfileTweet-action--favorite u-hiddenVisually').find(
'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0',
html=str(tweet.find('p', 'tweet-text')) or "",
)
+ reply_to_id = tweet.findChildren()[0]['data-conversation-id'] or '0',
Owner

This can also be achieved with
reply_to_id = tweet.find('div', 'tweet')['data-conversation-id'] or '0'

@@ -17,6 +17,7 @@ def __init__(self, user, fullname, id, url, timestamp, text, replies, retweets,
self.retweets = retweets
self.likes = likes
self.html = html
self.reply_to_id = reply_to_id
Owner

self.reply_to_id = 0 if id == reply_to_id else reply_to_id
sets it to zero if it is equal to the tweet id, i.e. if the tweet is not a reply to anyone. Giving reply_to_id a value even when the tweet is not a reply is misleading: people would have to compare id and reply_to_id before they can be sure it is a reply.
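
The same rule written as a small standalone helper, as a sketch only. The name `normalize_reply_to_id` is hypothetical, and the example assumes both ids arrive as strings, since they are read from HTML attributes.

```python
def normalize_reply_to_id(tweet_id, reply_to_id):
    """Return 0 when the tweet is not a reply, otherwise the id it replies to."""
    # Both values are assumed to be strings, so == compares like with like.
    return 0 if tweet_id == reply_to_id else reply_to_id


not_a_reply = normalize_reply_to_id("970000000000000001", "970000000000000001")
a_reply = normalize_reply_to_id("970000000000000002", "970000000000000001")
```

A caller can then test `if tweet.reply_to_id:` without comparing against the tweet's own id.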

@taspinar
Owner

taspinar commented Mar 6, 2018

I know it is possible to retrieve the contents of a tweet if you know the username and id with "https://twitter.com//status/".
And I was wondering if it is possible to retrieve the contents of a tweet by id only. I have looked on the internet and I could not find a good answer.
If this is not possible, and the original tweet is not in the list of scraped tweets, the reply_to_id is not very useful.

One way in which you can find out the username belonging to the original tweet is with the following command:
reply_to_users = json.loads(tweet.find('div', 'tweet')['data-reply-to-users-json']) or []

This retrieves a JSON list containing, among other things, the username, screen_name and id_str of everyone who has participated in the conversation.
If the list has a length of 1, the tweet was not a reply to anyone and the list only contains information about the current tweet.
If the list contains more than 1 element, the tweet was a reply, and the last element in the list contains information about the user of the original tweet.
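
The rule above can be sketched as follows. The sample JSON and its exact key names are assumptions made up for illustration, and `original_poster` is a hypothetical helper, not part of twitterscraper.

```python
import json


def original_poster(reply_to_users_json):
    """Return the entry for the user being replied to, or None if not a reply."""
    users = json.loads(reply_to_users_json) or []
    if len(users) <= 1:
        # Only the current tweet's author is listed: not a reply.
        return None
    # Per the description above, the last element is the original author.
    return users[-1]


# Illustrative attribute values; real attributes may carry more keys.
reply_attr = (
    '[{"id_str": "2", "screen_name": "replier", "name": "Replier"},'
    ' {"id_str": "1", "screen_name": "author", "name": "Author"}]'
)
single_attr = '[{"id_str": "2", "screen_name": "replier", "name": "Replier"}]'

poster = original_poster(reply_attr)
nobody = original_poster(single_attr)
```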

If it is not possible to retrieve a tweet by id only, I suggest you also include the username of the original tweet.

@hengruo
Author

hengruo commented Mar 7, 2018

Your suggestions are very useful! I store tweets in my own database, so I didn't consider the case where we need to fetch a tweet online by id alone. I'll fix it.

 if html_response:
     html = response.text
 else:
-    json_resp = response.json()
+    json_resp = ujson.loads(response.text)
Owner

What is the difference between json.loads() and ujson.loads()? If there is no clear reason for using ujson instead of json, I prefer using json.

     limit_per_pool = roundup(limit, poolsize)
 else:
-    limit_per_pool = None
+    limit_per_pool = limit
Owner

This change will result in twitterscraper scraping approximately P*limit tweets (where P is the poolsize) instead of the given limit. Please remove this change.
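
A small worked example of the arithmetic behind this comment. The `roundup` below is a stand-in that assumes ceiling division, and `max_tweets` is a hypothetical helper; neither is copied from twitterscraper.

```python
def roundup(numerator, denominator):
    # Assumed behaviour: ceiling division of the limit over the pools.
    return -(-numerator // denominator)


def max_tweets(limit, poolsize, split_limit):
    """Upper bound on scraped tweets when every pool reaches its own limit."""
    per_pool = roundup(limit, poolsize) if split_limit else limit
    return per_pool * poolsize


# limit=100 over 20 pools: each pool stops at 5 tweets.
with_split = max_tweets(100, 20, split_limit=True)
# Giving every pool the full limit lets each of the 20 pools scrape 100.
without_split = max_tweets(100, 20, split_limit=False)
```

Under these assumptions the split keeps the total at the requested 100 tweets, while passing the full limit to every pool allows up to 2000.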

@@ -38,7 +39,9 @@ def from_soup(cls, tweet):
'span', 'ProfileTweet-action--favorite u-hiddenVisually').find(
'span', 'ProfileTweet-actionCount')['data-tweet-stat-count'] or '0',
html=str(tweet.find('p', 'tweet-text')) or "",
)
+ reply_to_id = tweet.find('div', 'tweet')['data-conversation-id'] or '0',
+ reply_to_user = tweet.find('div', 'tweet')['data-mentions'] or "",
Owner

This is already implemented in PR #98. It is probably best to remove it here.

Owner

@taspinar taspinar left a comment

See added comments.

3 participants