Add/update example SQL queries

sebastian-nagel · sebastian-nagel · commit 525ab8b16fee · 2019-12-23T14:49:57.000+01:00
- get home pages for a given list of domain names
- add truncation histogram to metrics "WARC record size by MIME type"
diff --git a/README.md b/README.md
@@ -76,8 +76,9 @@ A couple of sample queries are also provided (for the flat schema):
   - a single domain: [get-records-of-domain.sql](src/sql/examples/cc-index/get-records-of-domain.sql)
   - a specific MIME type: [get-records-of-mime-type.sql](src/sql/examples/cc-index/get-records-of-mime-type.sql)
   - a specific language (e.g., Icelandic): [get-records-for-language.sql](src/sql/examples/cc-index/get-records-for-language.sql)
+  - home pages of a given list of domains: [get-records-home-pages.sql](src/sql/examples/cc-index/get-records-home-pages.sql)
 - find similar domain names by Levenshtein distance (few characters changed): [similar-domains.sql](src/sql/examples/cc-index/similar-domains.sql)
-- average length and occupied storage of WARC records by MIME type: [average-warc-record-length-by-mime-type.sql](src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql)
+- average length, occupied storage and payload truncation of WARC records by MIME type: [average-warc-record-length-by-mime-type.sql](src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql)
 - count pairs of top-level domain and content language: [count-language-tld.sql](src/sql/examples/cc-index/count-language-tld.sql)
 - find correlations between TLD and content language using the log-likelihood ratio: [loglikelihood-language-tld.sql](src/sql/examples/cc-index/loglikelihood-language-tld.sql)
 - ... and similar for correlations between content language and character encoding: [correlation-language-charset.sql](src/sql/examples/cc-index/correlation-language-charset.sql)
diff --git a/src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql b/src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql
@@ -1,12 +1,29 @@
--- average length and occupied storage of WARC records by MIME type
+--
+-- Calculate the average length and the occupied storage of WARC records by MIME type.
+--
+-- Update Dec 2019: add histogram counting reasons for payload truncation
+-- Content payload in Common Crawl archives is truncated if the content exceeds a limit of
+--  * 1 MiB in WARC files since 2013
+--  * 500 KiB in the 2008 – 2012 ARC files
+-- Truncation is required to keep the crawl archives at a limited size and to ensure
+-- that a broad sample of web pages is covered. It also prevents the archives from being
+-- filled with accidentally captured video or audio streams. The crawler needs to buffer the
+-- content temporarily, and a limit ensures that this is possible with a limited amount of
+-- RAM for many parallel connections. See also
+--   https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated
+-- The column `content_truncated` was added to the URL indexes in November 2019
+-- (CC-MAIN-2019-47) so that truncated captures can be skipped instantly. Here the column is
+-- used to measure the impact of truncation on various document formats (MIME types).
+--
 SELECT COUNT(*) as n_pages,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as perc_pages,
        AVG(warc_record_length) as avg_warc_record_length,
        SUM(warc_record_length) as sum_warc_record_length,
        SUM(warc_record_length) * 100.0 / SUM(SUM(warc_record_length)) OVER() as perc_warc_storage,
-       content_mime_detected
+       content_mime_detected,
+       histogram(content_truncated)
 FROM "ccindex"."ccindex"
-WHERE crawl = 'CC-MAIN-2018-17'
+WHERE crawl = 'CC-MAIN-2019-47'
   AND subset = 'warc'
 GROUP BY content_mime_detected
 ORDER BY n_pages DESC;
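The histogram column comes back as a map from truncation reason to count, one map per MIME type. A minimal Python sketch of turning one result row into a truncation percentage follows. The function name `truncation_share` and the sample numbers are made up for illustration, and the sketch assumes that non-truncated records have a NULL `content_truncated` and are therefore absent from the map.

```python
def truncation_share(n_pages, truncation_histogram):
    """Percentage of records with a truncated payload for one MIME type.

    Assumption: the histogram only counts non-NULL values of
    content_truncated, so records that were not truncated do not
    appear in the map.
    """
    if n_pages == 0:
        return 0.0
    n_truncated = sum(truncation_histogram.values())
    return 100.0 * n_truncated / n_pages


# Hypothetical row: 200 records, 8 cut at the length limit, 2 by disconnect.
share = truncation_share(200, {'length': 8, 'disconnect': 2})
```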
diff --git a/src/sql/examples/cc-index/get-records-home-pages.sql b/src/sql/examples/cc-index/get-records-home-pages.sql
@@ -0,0 +1,38 @@
+--
+-- Select home page records for a given list of domains
+--
+-- * join with domain list table
+--   (here the Alexa top 1 million ranks are used;
+--    see count-domains-alexa-top-1m.sql for how to create
+--    the table `alexa_top_1m`, aliased as `alexa` below)
+-- * filter home pages by
+--    * a simple pattern on URL path
+--    * no/empty URL query
+-- * exclude subdomains, i.e. allow only host names
+--    * identical to the registered domain name or
+--    * the registered domain name prefixed by `www.`
+--      Note: substr() "positions start with 1",
+--        see https://prestodb.io/docs/current/functions/string.html
+-- * extract WARC record locations for later processing
+--   of home pages
+-- * and redirect locations (since CC-MAIN-2019-47)
+--   to be able to "follow" redirects
+--
+SELECT alexa.site,
+       alexa.rank,
+       cc.url,
+       cc.fetch_time,
+       cc.warc_filename,
+       cc.warc_record_offset,
+       cc.warc_record_length,
+       cc.fetch_redirect
+FROM "ccindex"."ccindex" AS cc
+  RIGHT OUTER JOIN "ccindex"."alexa_top_1m" AS alexa
+  ON alexa.site = cc.url_host_registered_domain
+WHERE cc.crawl = 'CC-MAIN-2019-51'
+  AND cc.subset = 'warc'
+  AND regexp_like(cc.url_path, '^/?(?:index\.(?:html?|php))?$')
+  AND cc.url_query IS NULL
+  AND (length(cc.url_host_name) = length(cc.url_host_registered_domain)
+       OR (length(cc.url_host_name) = (length(cc.url_host_registered_domain)+4)
+           AND substr(cc.url_host_name, 1, 4) = 'www.'))
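The filter conditions of the home page query can be checked locally before running it on the full index. The sketch below mirrors the `WHERE` clause in plain Python and also shows how a record location might be turned into an HTTP `Range` value for later retrieval. The function names `is_home_page` and `http_range` are hypothetical. Like the SQL, the host check compares lengths only and so assumes that the registered domain is always a suffix of the host name.

```python
import re

# Same pattern as in the SQL query: empty path, "/", or an index page.
HOME_PAGE_PATH = re.compile(r'^/?(?:index\.(?:html?|php))?$')


def is_home_page(url_host_name, url_host_registered_domain, url_path, url_query):
    """Return True if a record passes the path, query and host conditions."""
    if not HOME_PAGE_PATH.search(url_path):
        return False
    if url_query is not None:
        return False
    # Host must be the registered domain itself ...
    if len(url_host_name) == len(url_host_registered_domain):
        return True
    # ... or the registered domain prefixed by "www." (4 characters).
    return (len(url_host_name) == len(url_host_registered_domain) + 4
            and url_host_name[:4] == 'www.')


def http_range(warc_record_offset, warc_record_length):
    """Build an HTTP Range header value covering exactly one WARC record."""
    last_byte = warc_record_offset + warc_record_length - 1
    return 'bytes={}-{}'.format(warc_record_offset, last_byte)
```

The byte range is inclusive on both ends, which is why the last byte is `offset + length - 1`.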