# Scraping Reddit via their JSON API
Reddit have long had an unofficial (I think) API where you can add `.json` to the end of any URL to get back the data for that page as JSON.
I wanted to track new posts on Reddit that mention my domain `simonwillison.net`.
https://www.reddit.com/domain/simonwillison.net/new/ shows recent posts from a specific domain.
https://www.reddit.com/domain/simonwillison.net/new.json is that data as JSON, which looks like this:
```json
{
  "kind": "Listing",
  "data": {
    "modhash": "la6xmexs8u301d6d105d24f94cdaa4457a00a1ea042c95f6e2",
    "dist": 25,
    "children": [
      {
        "kind": "t3",
        "data": {
          "approved_at_utc": null,
          "subreddit": "programming",
          "selftext": "",
          "author_fullname": "t2_2ks9",
          "saved": false,
          "mod_reason_title": null,
          "gilded": 0,
          "clicked": false,
          "title": "Joining CSV and JSON data with an in-memory SQLite database",
          "link_flair_richtext": [],
          "subreddit_name_prefixed": "r/programming"
```
Attempting to fetch this data with `curl` returns an error:
```
$ curl 'https://www.reddit.com/domain/simonwillison.net/new.json'
{"message": "Too Many Requests", "error": 429}
```
It turns out this rate limiting is [based on user-agent](https://www.reddit.com/r/redditdev/comments/3qbll8/429_too_many_requests/), so to avoid it, set a custom user-agent:
```
$ curl --user-agent 'simonw/fetch-reddit' 'https://www.reddit.com/domain/simonwillison.net/new.json'
{"kind": "Listing", "data": ...
```
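
`--user-agent` is just a shortcut for setting the `User-Agent` header directly, so the same fix works with any HTTP client that can send custom headers. The equivalent invocation:

```
$ curl -H 'User-Agent: simonw/fetch-reddit' 'https://www.reddit.com/domain/simonwillison.net/new.json'
```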
I used `jq` to tidy this up like so:
```jq
[.data.children[] | .data | {
  id: .id,
  subreddit: .subreddit,
  url: .url,
  created_utc: .created_utc | todate,
  permalink: .permalink,
  num_comments: .num_comments
}]
```
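
The `todate` filter converts a Unix timestamp like Reddit's `created_utc` into an ISO 8601 string, for example:

```
$ echo '1624148751' | jq 'todate'
"2021-06-20T00:25:51Z"
```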
Combined:
```
$ curl \
  --user-agent 'simonw/fetch-reddit' \
  'https://www.reddit.com/domain/simonwillison.net/new.json' \
  | jq '[.data.children[] | .data | {
    id: .id,
    subreddit: .subreddit,
    url: .url,
    created_utc: .created_utc | todate,
    permalink: .permalink,
    num_comments: .num_comments
  }]' > simonwillison-net.json
```
Output looks like this:
```json
[
  {
    "id": "o3tjsx",
    "subreddit": "programming",
    "url": "https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/",
    "created_utc": "2021-06-20T00:25:51Z",
    "permalink": "/r/programming/comments/o3tjsx/joining_csv_and_json_data_with_an_inmemory_sqlite/",
    "num_comments": 10
  },
  {
    "id": "nnsww6",
    "subreddit": "patient_hackernews",
    "url": "https://til.simonwillison.net/bash/finding-bom-csv-files-with-ripgrep",
    "created_utc": "2021-05-29T18:04:38Z",
    "permalink": "/r/patient_hackernews/comments/nnsww6/finding_csv_files_that_start_with_a_bom_using/",
    "num_comments": 1
  }
]
```
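
To actually track new posts over time, one option is to run the fetch on a schedule and only report post IDs that haven't been seen before. A minimal sketch in shell, assuming `jq` and `comm` are available (the `ids-*.txt` filenames are arbitrary):

```
#!/bin/bash
# Fetch the IDs of the most recent posts mentioning the domain, sorted for comm
curl -s --user-agent 'simonw/fetch-reddit' \
  'https://www.reddit.com/domain/simonwillison.net/new.json' \
  | jq -r '.data.children[].data.id' | sort > ids-current.txt

# Make sure the "seen" file exists on the first run
touch ids-seen.txt

# comm -13 prints lines unique to the second file: IDs we haven't seen before
comm -13 ids-seen.txt ids-current.txt

# Merge the two lists so the next run only reports genuinely new posts
sort -u ids-seen.txt ids-current.txt -o ids-seen.txt
```

Run from cron, each invocation prints only newly appeared post IDs; the permalink or title could be pulled out the same way with `jq`.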

0 commit comments

Comments
 (0)