Commit 94f282c

blog(anomaly): Finish post
1 parent 764e8e7 commit 94f282c

1 file changed

+10
-4
lines changed

src/content/blog/anomaly.md.shadow renamed to src/content/blog/2024-12-26-anomaly.md

Lines changed: 10 additions & 4 deletions
@@ -11,7 +11,7 @@ We believe it was caused by the banning of a massive number of subreddits in 202
1111

1212
## Finding the anomaly
1313

14-
When we started trying out Reddit's API, we simply generated comments between two IDs, one corresponding to a comment at the beginning of our time range and the other corresponding to a comment at the end of our time range. At the time, we had to find these comments manually, though we now have a script to do it automatically (TODO reference other blog post here).
14+
When we started trying out Reddit's API, we simply generated comments between two IDs, one corresponding to a comment at the beginning of our time range and the other corresponding to a comment at the end of our time range. At the time, we had to find these comments manually, though we now have a script to do it automatically.
1515

1616
The code below uses this approach to obtain a list of IDs for successfully requested comments and a corresponding list of their timestamps. It also creates a list of IDs for inaccessible comments.
1717

@@ -95,15 +95,21 @@ When these subreddits were banned, their comments became inaccessible through Re
9595

9696
After 2020, Reddit seems to have eased off on banning subreddits. Although they did ban thrice as many subreddits in 2021 as they did in 2020, most of these bans were for unmoderated subreddits [^transparency_2021]. There was a decrease in subreddit bans for hateful content and harassment. This explains why the miss rate came back down in 2021.
9797

98-
Why did the miss rate only start spiking around 2019 despite many of the banned subreddits existing well before then? This is probably because most of them did not become popular until 2019. We can observe this using [Subreddit Stats](https://subredditstats.com). As some cherry-picked examples, take r/ChapoTrapHouse, r/DarkHumorAndMemes, r/GenderCritical, r/soyboys, and r/wojak. Although Subreddit Stats does not show comment data before 2019, it does show the number of subscribers to each subreddit over time. In terms of subscribers, all of these subreddits didn't really take off until around 2019, and presumably, many/most of the comments in these subreddits were also posted around 2019.
98+
It looks like most of these banned subreddits were not particularly popular until around 2019, which explains why the miss rate only started spiking around 2019. We can observe this using [Subreddit Stats](https://subredditstats.com). As some cherry-picked examples, take r/ChapoTrapHouse, r/DarkHumorAndMemes, r/GenderCritical, r/soyboys, and r/wojak. Although Subreddit Stats doesn't show comment data before 2019, it does show the number of subscribers to each subreddit over time. In terms of subscribers, none of these subreddits really took off until around 2019, even though most of them were created well before then. Presumably, many, if not most, of the comments in these subreddits were also posted around 2019.
9999

100100
The users in these subreddits are another thing to consider. 15.6% of users from banned subreddits left Reddit after the ban [^great_ban], and these users may have mass-deleted their most recent comments in protest before leaving. This would again contribute to a higher miss rate around 2018-2021. This is just a theory, however.
101101

102-
Ultimately, the cause of the anomaly isn't as important as the way we handle it.
102+
Ultimately, the cause of the anomaly isn't as important as how we handle it.
103103

104104
## Dealing with the anomaly
105105

106-
todo write this
106+
We haven't yet figured out exactly how we will sample comments.
107+
108+
We'll be splitting the entire time range under study (around 2010-2023) into time periods of a few months. The simplest method would be to sample the same number of comments from each time period, e.g. 100 comments from January-March 2010, 100 from April-June 2010, and so on. This accounts for deleted comments, but it assumes that the Internet as a whole hasn't grown significantly over the years, even if Reddit has. That is not an assumption we can make.
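
As a rough illustration of the fixed-count method, here is a minimal sketch. It is not the code we actually use: the function name `sample_fixed_count` and the `is_accessible` callback are hypothetical, comment IDs are treated as plain integers, and the callback stands in for whatever API request decides whether a comment can still be retrieved. The idea is to keep drawing random IDs from a period's ID range until enough accessible comments have been found, which is why deleted comments don't reduce the sample size.

```python
import random


def sample_fixed_count(start_id, end_id, n, is_accessible, rng=None, max_tries=100_000):
    """Draw up to n distinct accessible comment IDs from [start_id, end_id].

    is_accessible is a hypothetical callback (e.g. a wrapper around an API
    request) that returns True if the comment with that ID can be fetched.
    """
    rng = rng or random.Random()
    found = set()
    tried = set()
    total = end_id - start_id + 1
    # Stop once we have enough comments, or once every ID in the range
    # (or max_tries of them) has been attempted.
    while len(found) < n and len(tried) < min(total, max_tries):
        cid = rng.randint(start_id, end_id)
        if cid in tried:
            continue
        tried.add(cid)
        if is_accessible(cid):
            found.add(cid)
    return sorted(found)
```

If a period contains fewer than `n` accessible comments, the function simply returns everything it found, so the caller would need to decide how to handle short periods.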
109+
110+
Another approach is to assume that the popularity of the Internet is tied to the popularity of Reddit. Then, rather than sampling the same number of comments from each time period, we can sample the same percentage of IDs from each time period. Note that this considers the total number of IDs in a time period, not the number of undeleted comments, so it would again account for deleted comments.
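
The percentage-based method could look something like the sketch below. Again, this is only an illustration under some assumptions: `sample_fixed_fraction` and `to_base36` are hypothetical helper names, and the sketch relies on Reddit comment IDs being base-36 strings, so a period's ID range can be converted to integers with `int(s, 36)`. The number of IDs drawn depends only on the size of the ID range, not on how many of those comments are still accessible.

```python
import random

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n):
    """Convert a non-negative integer back to a base-36 ID string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))


def sample_fixed_fraction(start_id, end_id, fraction, rng=None):
    """Draw a fixed fraction of all integer IDs in [start_id, end_id]."""
    rng = rng or random.Random()
    total = end_id - start_id + 1
    k = round(total * fraction)
    # random.sample accepts a range, so the full ID list is never materialized.
    return sorted(rng.sample(range(start_id, end_id + 1), k))
```

For example, `sample_fixed_fraction(int("a0000", 36), int("azzzz", 36), 0.001)` would pick 0.1% of the IDs between two made-up boundary IDs. What to do with sampled IDs that turn out to be inaccessible is a separate decision that this sketch leaves open.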
111+
112+
Lastly, a more complicated approach would involve quantifying how popular the Internet and Reddit each are. When deciding how many comments to get from each time period, we would account for the portion of the Internet that Reddit makes up.
107113

108114
## References
109115
