Skip to content

Commit 525ab8b

Browse files
Add/update example SQL queries
- get home pages for a given list of domain names - add truncation histogram to metrics "WARC record size by MIME type"
1 parent 15e333f commit 525ab8b

File tree

3 files changed

+60
-4
lines changed

3 files changed

+60
-4
lines changed

README.md

+2-1
Original file line numberDiff line numberDiff line change
@@ -76,8 +76,9 @@ A couple of sample queries are also provided (for the flat schema):
7676
- a single domain: [get-records-of-domain.sql](src/sql/examples/cc-index/get-records-of-domain.sql)
7777
- a specific MIME type: [get-records-of-mime-type.sql](src/sql/examples/cc-index/get-records-of-mime-type.sql)
7878
- a specific language (e.g., Icelandic): [get-records-for-language.sql](src/sql/examples/cc-index/get-records-for-language.sql)
79+
- home pages of a given list of domains: [get-records-home-pages.sql](src/sql/examples/cc-index/get-records-home-pages.sql)
7980
- find similar domain names by Levenshtein distance (few characters changed): [similar-domains.sql](src/sql/examples/cc-index/similar-domains.sql)
80-
- average length and occupied storage of WARC records by MIME type: [average-warc-record-length-by-mime-type.sql](src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql)
81+
- average length, occupied storage and payload truncation of WARC records by MIME type: [average-warc-record-length-by-mime-type.sql](src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql)
8182
- count pairs of top-level domain and content language: [count-language-tld.sql](src/sql/examples/cc-index/count-language-tld.sql)
8283
- find correlations between TLD and content language using the log-likelihood ratio: [loglikelihood-language-tld.sql](src/sql/examples/cc-index/loglikelihood-language-tld.sql)
8384
- ... and similar for correlations between content language and character encoding: [correlation-language-charset.sql](src/sql/examples/cc-index/correlation-language-charset.sql)
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,29 @@
1-
-- average length and occupied storage of WARC records by MIME type
1+
--
2+
-- Calculate the average length and the occupied storage of WARC records by MIME type.
3+
--
4+
-- Update Dec 2019: add histogram counting reasons for payload truncation
5+
-- Content payload in Common Crawl archives is truncated if the content exceeds a limit of
6+
-- * 1 MiB in WARC files since 2013
7+
-- * 500 kiB in the 2008 – 2012 ARC files
8+
-- The truncation is required to keep the crawl archives at a limited size and ensure
9+
-- that a broad sample of web pages is covered. It also avoids that the archives are filled
10+
-- by accidentally captured video or audio streams. The crawler needs to buffer the content
11+
-- temporarily and a limit ensures that this is possible with a limited amount of RAM for
12+
-- many parallel connections. See also
13+
-- https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated
14+
-- The column `content_truncated` has been added in November 2019 (CC-MAIN-2019-47)
15+
-- to the URL indexes to skip over truncated captures instantly. Here the column is used to measure
16+
-- the impact of the truncation on various document formats (MIME types).
17+
--
218
SELECT COUNT(*) as n_pages,
319
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as perc_pages,
420
AVG(warc_record_length) as avg_warc_record_length,
521
SUM(warc_record_length) as sum_warc_record_length,
622
SUM(warc_record_length) * 100.0 / SUM(SUM(warc_record_length)) OVER() as perc_warc_storage,
7-
content_mime_detected
23+
content_mime_detected,
24+
histogram(content_truncated)
825
FROM "ccindex"."ccindex"
9-
WHERE crawl = 'CC-MAIN-2018-17'
26+
WHERE crawl = 'CC-MAIN-2019-47'
1027
AND subset = 'warc'
1128
GROUP BY content_mime_detected
1229
ORDER BY n_pages DESC;
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,38 @@
1+
--
2+
-- Select homepage records for a given list of domains
3+
--
4+
-- * join with domain list table
5+
-- (here Alexa top 1 million ranks are used,
6+
-- see count-domains-alexa-top-1m.sql how to create
7+
-- the table `alexa`)
8+
-- * filter home pages by
9+
-- * a simple pattern on URL path
10+
-- * no/empty URL query
11+
-- * exclude subdomains, i.e. allow only host names
12+
-- * same as domain name or
13+
-- * prefixed by `www.`
14+
-- Note: substr() "positions start with 1",
15+
-- see https://prestodb.io/docs/current/functions/string.html
16+
-- * extract WARC record locations for later processing
17+
-- of home pages
18+
-- * and redirect locations (since CC-MAIN-2019-47)
19+
-- to be able to "follow" redirects
20+
--
21+
SELECT alexa.site,
22+
alexa.rank,
23+
cc.url,
24+
cc.fetch_time,
25+
cc.warc_filename,
26+
cc.warc_record_offset,
27+
cc.warc_record_length,
28+
cc.fetch_redirect
29+
FROM "ccindex"."ccindex" AS cc
30+
RIGHT OUTER JOIN "ccindex"."alexa_top_1m" AS alexa
31+
ON alexa.site = cc.url_host_registered_domain
32+
WHERE cc.crawl = 'CC-MAIN-2019-51'
33+
AND cc.subset = 'warc'
34+
AND regexp_like(cc.url_path, '^/?(?:index\.(?:html?|php))?$')
35+
AND cc.url_query is NULL
36+
AND (length(cc.url_host_name) = length(cc.url_host_registered_domain)
37+
OR (length(cc.url_host_name) = (length(cc.url_host_registered_domain)+4)
38+
AND substr(cc.url_host_name, 1, 4) = 'www.'))

0 commit comments

Comments
 (0)