--- average length and occupied storage of WARC records by MIME type
+--
+-- Calculate the average length and the occupied storage of WARC records by MIME type.
+--
+-- Update Dec 2019: add histogram counting reasons for payload truncation
+-- Content payload in Common Crawl archives is truncated if the content exceeds a limit of
+--   * 1 MiB in WARC files since 2013
+--   * 500 KiB in the 2008 – 2012 ARC files
+-- The truncation is required to keep the crawl archives at a limited size and ensure
+-- that a broad sample of web pages is covered. It also keeps the archives from being filled
+-- with accidentally captured video or audio streams. The crawler needs to buffer the content
+-- temporarily, and a limit ensures that this is possible with a limited amount of RAM for
+-- many parallel connections. See also
+--   https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated
+-- The column `content_truncated` was added to the URL indexes in November 2019 (CC-MAIN-2019-47)
+-- so that truncated captures can be skipped instantly. Here the column is used to measure
+-- the impact of the truncation on various document formats (MIME types).
+--
 SELECT COUNT(*) as n_pages,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as perc_pages,
        AVG(warc_record_length) as avg_warc_record_length,
        SUM(warc_record_length) as sum_warc_record_length,
        SUM(warc_record_length) * 100.0 / SUM(SUM(warc_record_length)) OVER() as perc_warc_storage,
-       content_mime_detected
+       content_mime_detected,
+       histogram(content_truncated)
 FROM "ccindex"."ccindex"
-WHERE crawl = 'CC-MAIN-2018-17'
+WHERE crawl = 'CC-MAIN-2019-47'
   AND subset = 'warc'
 GROUP BY content_mime_detected
 ORDER BY n_pages DESC;
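
The header comment describes using `content_truncated` to measure the impact of truncation per MIME type. A minimal sketch of a companion query that reports the share of truncated captures directly, rather than a histogram of reasons, might look like the following. It assumes that `content_truncated` is NULL for captures that were not truncated; that assumption is not stated in the commit and should be checked against the index schema. The output column names `n_truncated` and `perc_truncated` are illustrative only.

```sql
-- Sketch only, not part of the commit.
-- Assumption: content_truncated is NULL when the payload was not truncated,
-- so COUNT(content_truncated) counts only the truncated captures.
SELECT content_mime_detected,
       COUNT(*) AS n_pages,
       COUNT(content_truncated) AS n_truncated,
       COUNT(content_truncated) * 100.0 / COUNT(*) AS perc_truncated
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2019-47'
  AND subset = 'warc'
GROUP BY content_mime_detected
ORDER BY n_truncated DESC;
```

Sorting by `n_truncated` puts the MIME types that lose the most captures to the size limit first; sorting by `perc_truncated` instead would highlight formats that are rare but almost always truncated.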