README.md (+13 -12)
@@ -49,18 +49,19 @@ First, the table needs to be imported into Athena:
 2. make Athena recognize the data partitions on `s3://`: `MSCK REPAIR TABLE 'ccindex';` (do not forget to adapt the table name). This step needs to be done again after new data partitions have been added.

 A couple of sample queries are also provided:
-- page/host/domain counts per top-level domain: [count-by-tld-page-host-domain.sql](src/sql/examples/count-by-tld-page-host-domain.sql)
+- count captures over partitions (crawls and subsets), get a quick overview how many pages are contained in the monthly crawl archives (and are also indexed in the table): [count-by-partition.sql](src/sql/examples/cc-index/count-by-partition.sql)
+- page/host/domain counts per top-level domain: [count-by-tld-page-host-domain.sql](src/sql/examples/cc-index/count-by-tld-page-host-domain.sql)
 - "word" count of
-  - host name elements (split host name at `.` into words): [count-hostname-elements.sql](src/sql/examples/count-hostname-elements.sql)
-  - URL path elements (separated by `/`): [count-url-path-elements.sql](src/sql/examples/count-url-path-elements.sql)
-- count HTTP status codes: [count-fetch-status.sql](src/sql/examples/count-fetch-status.sql)
-- count the domains of a specific top-level domain: [count-domains-of-tld.sql](src/sql/examples/count-domains-of-tld.sql)
-- compare document MIME types (Content-Type in HTTP response header vs. MIME type detected by [Tika](http://tika.apache.org/): [compare-mime-type-http-vs-detected.sql](src/sql/examples/compare-mime-type-http-vs-detected.sql)
-- distribution/histogram of host name lengths: [host_length_distrib.sql](src/sql/examples/host_length_distrib.sql)
-- count URL paths to robots.txt files [count-robotstxt-url-paths.sql](src/sql/examples/count-robotstxt-url-paths.sql)
+  - host name elements (split host name at `.` into words): [count-hostname-elements.sql](src/sql/examples/cc-index/count-hostname-elements.sql)
+  - URL path elements (separated by `/`): [count-url-path-elements.sql](src/sql/examples/cc-index/count-url-path-elements.sql)
+- count HTTP status codes: [count-fetch-status.sql](src/sql/examples/cc-index/count-fetch-status.sql)
+- count the domains of a specific top-level domain: [count-domains-of-tld.sql](src/sql/examples/cc-index/count-domains-of-tld.sql)
+- compare document MIME types (Content-Type in HTTP response header vs. MIME type detected by [Tika](http://tika.apache.org/): [compare-mime-type-http-vs-detected.sql](src/sql/examples/cc-index/compare-mime-type-http-vs-detected.sql)
+- distribution/histogram of host name lengths: [host_length_distrib.sql](src/sql/examples/cc-index/host_length_distrib.sql)
+- count URL paths to robots.txt files [count-robotstxt-url-paths.sql](src/sql/examples/cc-index/count-robotstxt-url-paths.sql)
 - export WARC record specs (file, offset, length) for
-  - a single domain: [get-records-of-domain.sql](src/sql/examples/get-records-of-domain.sql)
-  - a specific MIME type: [get-records-of-mime-type.sql](src/sql/examples/get-records-of-mime-type.sql)
-- find multi-lingual domains by analyzing URL paths: [get_language_translations_url_path.sql](src/sql/examples/get_language_translations_url_path.sql)
-- find similar domain names by Levenshtein distance (few characters changed): [similar-domains.sql](src/sql/examples/similar-domains.sql)
+  - a single domain: [get-records-of-domain.sql](src/sql/examples/cc-index/get-records-of-domain.sql)
+  - a specific MIME type: [get-records-of-mime-type.sql](src/sql/examples/cc-index/get-records-of-mime-type.sql)
+- find multi-lingual domains by analyzing URL paths: [get_language_translations_url_path.sql](src/sql/examples/cc-index/get_language_translations_url_path.sql)
+- find similar domain names by Levenshtein distance (few characters changed): [similar-domains.sql](src/sql/examples/cc-index/similar-domains.sql)
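
The newly linked count-by-partition query counts captures per crawl and subset. The idea can be sketched roughly as below. This is not the content of `count-by-partition.sql`: it runs a plain `GROUP BY` against an in-memory SQLite table as a stand-in for Athena. The column names `crawl` and `subset`, the sample rows, and the helper names `build_demo_table` and `count_by_partition` are assumptions made for illustration only.

```python
import sqlite3


def build_demo_table():
    """Create an in-memory stand-in for the index table (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema: one row per capture, plus the two partition columns.
    conn.execute("CREATE TABLE ccindex (url TEXT, crawl TEXT, subset TEXT)")
    rows = [
        ("https://example.com/", "CC-MAIN-2018-05", "warc"),
        ("https://example.com/a", "CC-MAIN-2018-05", "warc"),
        ("https://example.com/robots.txt", "CC-MAIN-2018-05", "robotstxt"),
        ("https://example.org/", "CC-MAIN-2018-09", "warc"),
    ]
    conn.executemany("INSERT INTO ccindex VALUES (?, ?, ?)", rows)
    return conn


def count_by_partition(conn):
    """Return (crawl, subset, n_captures) tuples, one per partition."""
    query = """
        SELECT crawl, subset, COUNT(*) AS n_captures
        FROM ccindex
        GROUP BY crawl, subset
        ORDER BY crawl, subset
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for crawl, subset, n_captures in count_by_partition(build_demo_table()):
        print(crawl, subset, n_captures)
```

On Athena the same kind of aggregation would run against the imported table. The table name `ccindex` has to be adapted there as well, as noted for `MSCK REPAIR TABLE`.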