# Scraping Reddit via their JSON API

Reddit have long had an unofficial (I think) API where you can add `.json` to the end of any URL to get back the data for that page as JSON.

I wanted to track new posts on Reddit that mention my domain `simonwillison.net`.

https://www.reddit.com/domain/simonwillison.net/new/ shows recent posts from a specific domain.

https://www.reddit.com/domain/simonwillison.net/new.json is that data as JSON, which looks like this:

```json
{
  "kind": "Listing",
  "data": {
    "modhash": "la6xmexs8u301d6d105d24f94cdaa4457a00a1ea042c95f6e2",
    "dist": 25,
    "children": [
      {
        "kind": "t3",
        "data": {
          "approved_at_utc": null,
          "subreddit": "programming",
          "selftext": "",
          "author_fullname": "t2_2ks9",
          "saved": false,
          "mod_reason_title": null,
          "gilded": 0,
          "clicked": false,
          "title": "Joining CSV and JSON data with an in-memory SQLite database",
          "link_flair_richtext": [],
          "subreddit_name_prefixed": "r/programming"
```
Attempting to fetch this data with `curl` shows an error:
```
$ curl 'https://www.reddit.com/domain/simonwillison.net/new.json'
{"message": "Too Many Requests", "error": 429}
```
It turns out this rate limiting is [based on user-agent](https://www.reddit.com/r/redditdev/comments/3qbll8/429_too_many_requests/), so to avoid it set a custom user-agent:

```
$ curl --user-agent 'simonw/fetch-reddit' 'https://www.reddit.com/domain/simonwillison.net/new.json'
{"kind": "Listing", "data": ...
```
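The same request works from Python too. Here's a minimal sketch using only the standard library, sending a custom user-agent (any string that identifies your script should do):

```python
import json
import urllib.request

URL = "https://www.reddit.com/domain/simonwillison.net/new.json"

# Send a custom user-agent to avoid the 429 described above
request = urllib.request.Request(URL, headers={"User-Agent": "simonw/fetch-reddit"})
with urllib.request.urlopen(request) as response:
    listing = json.load(response)

# Posts live at data.children[].data in the listing
for child in listing["data"]["children"]:
    post = child["data"]
    print(post["subreddit"], post["url"])
```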
I used `jq` to tidy this up like so:

```jq
[.data.children[] | .data | {
  id: .id,
  subreddit: .subreddit,
  url: .url,
  created_utc: .created_utc | todate,
  permalink: .permalink,
  num_comments: .num_comments
}]
```
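If you're staying in Python rather than piping through `jq`, the equivalent reshaping looks something like this (continuing from the sketch above, where `listing` holds the parsed response):

```python
import json
from datetime import datetime, timezone

# Mirror the jq filter: keep a handful of fields from data.children[].data
posts = []
for child in listing["data"]["children"]:
    post = child["data"]
    posts.append({
        "id": post["id"],
        "subreddit": post["subreddit"],
        "url": post["url"],
        # jq's todate formats the Unix timestamp as an ISO 8601 UTC string
        "created_utc": datetime.fromtimestamp(
            post["created_utc"], tz=timezone.utc
        ).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "permalink": post["permalink"],
        "num_comments": post["num_comments"],
    })

print(json.dumps(posts, indent=2))
```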
Combining the `curl` and `jq` steps into a single command:
```
$ curl \
  --user-agent 'simonw/fetch-reddit' \
  'https://www.reddit.com/domain/simonwillison.net/new.json' \
  | jq '[.data.children[] | .data | {
    id: .id,
    subreddit: .subreddit,
    url: .url,
    created_utc: .created_utc | todate,
    permalink: .permalink,
    num_comments: .num_comments
  }]' > simonwillison-net.json
```
Output looks like this:
```json
[
  {
    "id": "o3tjsx",
    "subreddit": "programming",
    "url": "https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/",
    "created_utc": "2021-06-20T00:25:51Z",
    "permalink": "/r/programming/comments/o3tjsx/joining_csv_and_json_data_with_an_inmemory_sqlite/",
    "num_comments": 10
  },
  {
    "id": "nnsww6",
    "subreddit": "patient_hackernews",
    "url": "https://til.simonwillison.net/bash/finding-bom-csv-files-with-ripgrep",
    "created_utc": "2021-05-29T18:04:38Z",
    "permalink": "/r/patient_hackernews/comments/nnsww6/finding_csv_files_that_start_with_a_bom_using/",
    "num_comments": 1
  }
]
```
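To get from here to actually tracking new posts, one option is to compare each fresh fetch against the previously saved file. This is just a sketch of that idea, reusing the `simonwillison-net.json` filename from above and the `posts` list from the Python sketches:

```python
import json
from pathlib import Path

saved_file = Path("simonwillison-net.json")

# Post IDs we already saw on the previous run (empty the first time)
seen_ids = set()
if saved_file.exists():
    seen_ids = {post["id"] for post in json.loads(saved_file.read_text())}

# `posts` is the freshly fetched, tidied list from the sketches above
for post in posts:
    if post["id"] not in seen_ids:
        print(f"New post in r/{post['subreddit']}: {post['url']}")

# Save the latest snapshot for the next run
saved_file.write_text(json.dumps(posts, indent=2))
```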