README.md (+35 -6)
@@ -43,12 +43,13 @@ Columns are defined described in the table schema ([flat](src/main/resources/sch
 
 ## Query the table in AWS Athena
 
-First, the table needs to be imported into Athena:
+First, the table needs to be imported into [AWS Athena](). In the Athena Query Editor:
 
-1. edit the "create table" statement ([flat](src/sql/athena/cc-index-create-table-flat.sql) or [nested](src/sql/athena/cc-index-create-table-nested.sql)) and add the correct table name and path to the Parquet/ORC data on `s3://`. Execute the "create table" query.
-2. make Athena recognize the data partitions on `s3://`: `MSCK REPAIR TABLE 'ccindex';` (do not forget to adapt the table name). This step needs to be done again after new data partitions have been added.
+1. create a database `ccindex`: `CREATE DATABASE ccindex` and make sure that it's selected as "DATABASE"
+2. edit the "create table" statement ([flat](src/sql/athena/cc-index-create-table-flat.sql) or [nested](src/sql/athena/cc-index-create-table-nested.sql)) and add the correct table name and path to the Parquet/ORC data on `s3://`. Execute the "create table" query.
+3. make Athena recognize the data partitions on `s3://`: `MSCK REPAIR TABLE ccindex` (do not forget to adapt the table name). This step needs to be repeated every time new data partitions have been added.
 
-A couple of sample queries are also provided:
+A couple of sample queries are also provided (for the flat schema):
 - count captures over partitions (crawls and subsets), get a quick overview how many pages are contained in the monthly crawl archives (and are also indexed in the table): [count-by-partition.sql](src/sql/examples/cc-index/count-by-partition.sql)
 - page/host/domain counts per top-level domain: [count-by-tld-page-host-domain.sql](src/sql/examples/cc-index/count-by-tld-page-host-domain.sql)
 - "word" count of
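
As a rough sketch, the three setup steps added in the hunk above could be run as separate queries in the Athena Query Editor, one statement at a time. The database and table name `ccindex` come from the README text. The "create table" statement is not reproduced because it lives in the linked SQL files, and the final `SHOW PARTITIONS` check is an illustrative addition, not part of the README.

```sql
-- Step 1: create the database, then select it as "DATABASE" in the editor
CREATE DATABASE ccindex;

-- Step 2: run the edited "create table" statement from
-- src/sql/athena/cc-index-create-table-flat.sql (or the nested variant),
-- after adapting the table name and the s3:// path. Not reproduced here.

-- Step 3: make Athena discover the partitions on s3://
-- (repeat whenever new data partitions have been added)
MSCK REPAIR TABLE ccindex;

-- Optional check (assumption, not from the README): list known partitions
SHOW PARTITIONS ccindex;
```

Note that the updated step 3 drops the single quotes around the table name that the removed line used (`'ccindex'`).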
@@ -57,11 +58,39 @@ A couple of sample queries are also provided:
 - count HTTP status codes: [count-fetch-status.sql](src/sql/examples/cc-index/count-fetch-status.sql)
 - count the domains of a specific top-level domain: [count-domains-of-tld.sql](src/sql/examples/cc-index/count-domains-of-tld.sql)
 - compare document MIME types (Content-Type in HTTP response header vs. MIME type detected by [Tika](http://tika.apache.org/): [compare-mime-type-http-vs-detected.sql](src/sql/examples/cc-index/compare-mime-type-http-vs-detected.sql)
-- distribution/histogram of host name lengths: [host_length_distrib.sql](src/sql/examples/cc-index/host_length_distrib.sql)
+- distribution/histogram of host name lengths: [host-length-distrib.sql](src/sql/examples/cc-index/host-length-distrib.sql)
 - count URL paths to robots.txt files [count-robotstxt-url-paths.sql](src/sql/examples/cc-index/count-robotstxt-url-paths.sql)
 - export WARC record specs (file, offset, length) for
   - a single domain: [get-records-of-domain.sql](src/sql/examples/cc-index/get-records-of-domain.sql)
   - a specific MIME type: [get-records-of-mime-type.sql](src/sql/examples/cc-index/get-records-of-mime-type.sql)
-- find multi-lingual domains by analyzing URL paths: [get_language_translations_url_path.sql](src/sql/examples/cc-index/get_language_translations_url_path.sql)
 - find similar domain names by Levenshtein distance (few characters changed): [similar-domains.sql](src/sql/examples/cc-index/similar-domains.sql)
+- find multi-lingual domains by analyzing URL paths: [get-language-translations-url-path.sql](src/sql/examples/cc-index/get-language-translations-url-path.sql)
+
+Athena creates results in CSV format. E.g., for the last example, the mining of multi-lingual domains we get: