4 changes: 2 additions & 2 deletions README.md
@@ -417,8 +417,8 @@ We provide multiple datasets, both as starting points for each of the competitio
- The raw dataset pools for the competition scales can be found by downloading the respective paths within [data/competition_pools/preextracted/](data/competition_pools/preextracted/). This folder contains `.txt` files which list the S3 paths corresponding to the input data pools for each of our [400m-1x](data/competition_pools/preextracted/400m-1x.txt), [1b-1x](data/competition_pools/preextracted/1b-1x.txt), [3b-1x](data/competition_pools/preextracted/3b-1x.txt), [7b-1x](data/competition_pools/preextracted/7b-1x.txt), [7b-2x](data/competition_pools/preextracted/7b-2x.txt) scales. All of these are subsets of our entire raw pool, [DCLM-pool](https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/index.html), which is available via the CommonCrawl S3 bucket. Each subset consists of raw data (that has only undergone text extraction) and can be processed with the steps outlined above. To download any one of these pools, we recommend using [`s5cmd run cmds.txt`](https://github.com/peak/s5cmd/blob/master/README.md#run-multiple-commands-in-parallel) where `cmds.txt` is a commands file that you generate based on your desired destination path `<YOUR_DEST_PATH>` (example shown below for the first two shards in the 400m-1x pool).

```bash
-cp s3://commoncrawl/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2013-20/1368696382185/CC-MAIN-20130516092622-00084-ip-10-60-113-184.ec2.internal.jsonl.gz <YOUR_DEST_PATH>/crawl=CC-MAIN-2013-20/1368696382185/CC-MAIN-20130516092622-00084-ip-10-60-113-184.ec2.internal.jsonl.gz
-cp s3://commoncrawl/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2013-20/1368696382892/CC-MAIN-20130516092622-00074-ip-10-60-113-184.ec2.internal.jsonl.gz <YOUR_DEST_PATH>/crawl=CC-MAIN-2013-20/1368696382892/CC-MAIN-20130516092622-00074-ip-10-60-113-184.ec2.internal.jsonl.gz
+cp s3://commoncrawl/contrib/datacomp/DCLM-pool/jsonl/crawl=CC-MAIN-2013-20/1368696382185/CC-MAIN-20130516092622-00084-ip-10-60-113-184.ec2.internal.jsonl.gz <YOUR_DEST_PATH>/crawl=CC-MAIN-2013-20/1368696382185/CC-MAIN-20130516092622-00084-ip-10-60-113-184.ec2.internal.jsonl.gz
+cp s3://commoncrawl/contrib/datacomp/DCLM-pool/jsonl/crawl=CC-MAIN-2013-20/1368696382892/CC-MAIN-20130516092622-00074-ip-10-60-113-184.ec2.internal.jsonl.gz <YOUR_DEST_PATH>/crawl=CC-MAIN-2013-20/1368696382892/CC-MAIN-20130516092622-00074-ip-10-60-113-184.ec2.internal.jsonl.gz
```
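
A commands file in the format above could be generated from one of the pool listings with a short script. The following is only a minimal sketch, assuming each non-empty line of the `.txt` listing is a full `s3://` object path. The helper names `build_s5cmd_commands` and `write_commands_file` are hypothetical and not part of this repository. The pool prefix is left as a parameter so you can pass whichever prefix your listing actually uses.

```python
from typing import Iterable, List


def build_s5cmd_commands(s3_paths: Iterable[str], pool_prefix: str, dest_root: str) -> List[str]:
    """Build one `cp <src> <dest>` line per S3 path (hypothetical helper).

    Assumes every path starts with `pool_prefix`; the remainder of the path
    is reused under `dest_root` so the directory layout is preserved.
    """
    prefix = pool_prefix.rstrip("/") + "/"
    dest = dest_root.rstrip("/")
    commands = []
    for raw in s3_paths:
        path = raw.strip()
        if not path:
            continue  # skip blank lines in the listing
        if not path.startswith(prefix):
            raise ValueError(f"path does not start with {prefix!r}: {path}")
        relative = path[len(prefix):]
        commands.append(f"cp {path} {dest}/{relative}")
    return commands


def write_commands_file(listing_path: str, out_path: str, pool_prefix: str, dest_root: str) -> int:
    """Read a listing file and write the commands file; returns the line count."""
    with open(listing_path, "r", encoding="utf-8") as f:
        commands = build_s5cmd_commands(f, pool_prefix, dest_root)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(commands) + "\n")
    return len(commands)
```

Calling `write_commands_file` with a listing such as `data/competition_pools/preextracted/400m-1x.txt`, an output path of `cmds.txt`, and your destination path would produce a file that can then be passed to `s5cmd run cmds.txt`.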
