src/content/blog/2024-12-26-anomaly.md
10 additions & 4 deletions
@@ -11,7 +11,7 @@ We believe it was caused by the banning of a massive number of subreddits in 202
## Finding the anomaly
-
When we started trying out Reddit's API, we simply generated comments between two IDs, one corresponding to a comment at the beginning of our time range and the other corresponding to a comment at the end of our time range. At the time, we had to find these comments manually, though we now have a script to do it automatically (TODO reference other blog post here).
+
When we started trying out Reddit's API, we simply generated comments between two IDs, one corresponding to a comment at the beginning of our time range and the other corresponding to a comment at the end of our time range. At the time, we had to find these comments manually, though we now have a script to do it automatically.
The code below uses this approach to obtain a list of IDs for successfully requested comments and a corresponding list of their timestamps. It also creates a list of IDs for inaccessible comments.
@@ -95,15 +95,21 @@ When these subreddits were banned, their comments became inaccessible through Re
After 2020, Reddit seems to have eased off on banning subreddits. Although they banned three times as many subreddits in 2021 as they did in 2020, most of these bans were for unmoderated subreddits [^transparency_2021]. There was a decrease in subreddit bans for hateful content and harassment. This explains why the miss rate came back down in 2021.
-
Why did the miss rate only start spiking around 2019 despite many of the banned subreddits existing well before then? This is probably because most of them did not become popular until 2019. We can observe this using [Subreddit Stats](https://subredditstats.com). As some cherry-picked examples, take r/ChapoTrapHouse, r/DarkHumorAndMemes, r/GenderCritical, r/soyboys, and r/wojak. Although Subreddit Stats does not show comment data before 2019, it does show the number of subscribers to each subreddit over time. In terms of subscribers, all of these subreddits didn't really take off until around 2019, and presumably, many/most of the comments in these subreddits were also posted around 2019.
+
It looks like most of these banned subreddits were not particularly popular until around 2019, which explains why the miss rate only started spiking around 2019. We can observe this using [Subreddit Stats](https://subredditstats.com). As some cherry-picked examples, take r/ChapoTrapHouse, r/DarkHumorAndMemes, r/GenderCritical, r/soyboys, and r/wojak. Although Subreddit Stats doesn't show comment data before 2019, it does show the number of subscribers to each subreddit over time. In terms of subscribers, none of these subreddits really took off until around 2019, even though most of them were created well before then. Presumably, many, if not most, of the comments in these subreddits were also posted around 2019.
The users in these subreddits are another thing to consider. 15.6% of users from banned subreddits left Reddit after the ban [^great_ban], and these users may have mass-deleted their most recent comments in protest before leaving. This would again contribute to a higher miss rate around 2018-2021. This is just a theory, however.
-
Ultimately, the cause of the anomaly isn't as important as the way we handle it.
+
Ultimately, the cause of the anomaly isn't as important as how we handle it.
## Dealing with the anomaly
-
todo write this
+
We haven't yet figured out exactly how we will sample comments.
+
+
We'll be splitting up the entire time range under study (around 2010-2023) into time periods of a few months. The simplest method would be to sample the same number of comments from each time period, e.g. 100 comments from January-March 2010, 100 from April-June 2010, and so on. This accounts for deleted comments but assumes that the Internet as a whole hasn't grown significantly over the years, even if Reddit has. That is not an assumption we can make.
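
As a rough illustration of the fixed-count method, the sketch below splits a range of years into three-month periods and draws up to `n` comments from each. The names `quarter_periods` and `sample_fixed_count` are hypothetical, and the sketch assumes the accessible comments for each period have already been fetched into a dictionary; it is not the sampling code we will necessarily end up using.

```python
import random
from datetime import date


def quarter_periods(start_year, end_year):
    """Return (start, end) date pairs, end exclusive, one per 3-month period."""
    periods = []
    for year in range(start_year, end_year + 1):
        for month in (1, 4, 7, 10):
            start = date(year, month, 1)
            # The last quarter of a year ends on January 1 of the next year.
            end = date(year + 1, 1, 1) if month == 10 else date(year, month + 3, 1)
            periods.append((start, end))
    return periods


def sample_fixed_count(comments_by_period, n, seed=0):
    """Draw up to n comments from each period (hypothetical input mapping)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        period: rng.sample(comments, min(n, len(comments)))
        for period, comments in comments_by_period.items()
    }
```

With `n = 100`, every period contributes the same number of comments regardless of how many IDs it contains, which is exactly the assumption questioned above.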
+
+
Another approach we can take is to assume that the popularity of the Internet is tied to the popularity of Reddit. This way, rather than sampling the same number of comments from each time period, we can sample the same percentage of IDs from each time period. Note that this considers the total number of IDs in a time period, not the number of undeleted comments. This would again account for deleted comments.
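
A minimal sketch of this percentage-based method follows. It assumes comment IDs are base-36 integers that increase over time, so that a time period corresponds to a contiguous ID range bounded by its first and last comment. The function names and the example IDs are hypothetical, and some sampled IDs will point to deleted or otherwise inaccessible comments.

```python
import random

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n):
    """Convert a non-negative integer to a lowercase base-36 string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))


def sample_ids(first_id, last_id, fraction, seed=0):
    """Sample a fixed fraction of all IDs between first_id and last_id, inclusive.

    The count is based on the total number of IDs in the range,
    not on how many of those comments are still accessible.
    """
    lo, hi = int(first_id, 36), int(last_id, 36)
    total = hi - lo + 1
    k = round(total * fraction)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [to_base36(i) for i in rng.sample(range(lo, hi + 1), k)]
```

Calling `sample_ids` with the same `fraction` for every period makes the sample size scale with the number of IDs issued in that period, whether or not the comments behind them still exist.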
+
+
Lastly, a more complicated approach would involve quantifying how popular the Internet and Reddit each are. When deciding the number of comments to get from each time period, we would account for the portion of the Internet that Reddit makes up.