Commit 2fd83de

Complete description how to register table in Athena,
add sample output of one query
1 parent 00330f5 commit 2fd83de

3 files changed, +43 -7 lines changed

README.md

+35-6
@@ -43,12 +43,13 @@ Columns are defined described in the table schema ([flat](src/main/resources/sch
 
 ## Query the table in AWS Athena
 
-First, the table needs to be imported into Athena:
+First, the table needs to be imported into [AWS Athena](). In the Athena Query Editor:
 
-1. edit the "create table" statement ([flat](src/sql/athena/cc-index-create-table-flat.sql) or [nested](src/sql/athena/cc-index-create-table-nested.sql)) and add the correct table name and path to the Parquet/ORC data on `s3://`. Execute the "create table" query.
-2. make Athena recognize the data partitions on `s3://`: `MSCK REPAIR TABLE 'ccindex';` (do not forget to adapt the table name). This step needs to be done again after new data partitions have been added.
+1. create a database `ccindex`: `CREATE DATABASE ccindex` and make sure that it's selected as "DATABASE"
+2. edit the "create table" statement ([flat](src/sql/athena/cc-index-create-table-flat.sql) or [nested](src/sql/athena/cc-index-create-table-nested.sql)) and add the correct table name and path to the Parquet/ORC data on `s3://`. Execute the "create table" query.
+3. make Athena recognize the data partitions on `s3://`: `MSCK REPAIR TABLE ccindex` (do not forget to adapt the table name). This step needs to be repeated every time new data partitions have been added.
 
-A couple of sample queries are also provided:
+A couple of sample queries are also provided (for the flat schema):
 - count captures over partitions (crawls and subsets), get a quick overview how many pages are contained in the monthly crawl archives (and are also indexed in the table): [count-by-partition.sql](src/sql/examples/cc-index/count-by-partition.sql)
 - page/host/domain counts per top-level domain: [count-by-tld-page-host-domain.sql](src/sql/examples/cc-index/count-by-tld-page-host-domain.sql)
 - "word" count of
@@ -57,11 +58,39 @@ A couple of sample queries are also provided:
 - count HTTP status codes: [count-fetch-status.sql](src/sql/examples/cc-index/count-fetch-status.sql)
 - count the domains of a specific top-level domain: [count-domains-of-tld.sql](src/sql/examples/cc-index/count-domains-of-tld.sql)
 - compare document MIME types (Content-Type in HTTP response header vs. MIME type detected by [Tika](http://tika.apache.org/): [compare-mime-type-http-vs-detected.sql](src/sql/examples/cc-index/compare-mime-type-http-vs-detected.sql)
-- distribution/histogram of host name lengths: [host_length_distrib.sql](src/sql/examples/cc-index/host_length_distrib.sql)
+- distribution/histogram of host name lengths: [host-length-distrib.sql](src/sql/examples/cc-index/host-length-distrib.sql)
 - count URL paths to robots.txt files [count-robotstxt-url-paths.sql](src/sql/examples/cc-index/count-robotstxt-url-paths.sql)
 - export WARC record specs (file, offset, length) for
 - a single domain: [get-records-of-domain.sql](src/sql/examples/cc-index/get-records-of-domain.sql)
 - a specific MIME type: [get-records-of-mime-type.sql](src/sql/examples/cc-index/get-records-of-mime-type.sql)
-- find multi-lingual domains by analyzing URL paths: [get_language_translations_url_path.sql](src/sql/examples/cc-index/get_language_translations_url_path.sql)
 - find similar domain names by Levenshtein distance (few characters changed): [similar-domains.sql](src/sql/examples/cc-index/similar-domains.sql)
+- find multi-lingual domains by analyzing URL paths: [get-language-translations-url-path.sql](src/sql/examples/cc-index/get-language-translations-url-path.sql)
+
+Athena creates results in CSV format. E.g., for the last example, the mining of multi-lingual domains we get:
+
+domain |n_lang | n_pages | lang_counts
+--------------------------|-------|----------|------------------
+vatican.va | 40 | 42795 | {de=3147, ru=20, be=1, fi=3, pt=4036, bg=11, lt=1, hr=395, fr=5677, hu=79, uc=2, uk=17, sk=20, sl=4, sp=202, sq=5, mk=1, ge=204, sr=2, sv=3, or=2243, sw=5, el=5, mt=2, en=7650, it=10776, es=5360, zh=5, iw=2, cs=12, ar=184, vi=1, th=4, la=1844, pl=658, ro=9, da=2, tr=5, nl=57, po=141}
+iubilaeummisericordiae.va | 7 | 2916 | {de=445, pt=273, en=454, it=542, fr=422, pl=168, es=612}
+osservatoreromano.va | 7 | 1848 | {de=284, pt=42, en=738, it=518, pl=62, fr=28, es=176}
+cultura.va | 3 | 1646 | {en=373, it=1228, es=45}
+annusfidei.va | 6 | 833 | {de=51, pt=92, en=171, it=273, fr=87, es=159}
+pas.va | 2 | 689 | {en=468, it=221}
+photogallery.va | 6 | 616 | {de=90, pt=86, en=107, it=130, fr=83, es=120}
+im.va | 6 | 325 | {pt=2, en=211, it=106, pl=1, fr=3, es=2}
+museivaticani.va | 5 | 266 | {de=63, en=54, it=47, fr=37, es=65}
+laici.va | 4 | 243 | {en=134, it=5, fr=51, es=53}
+radiovaticana.va | 3 | 220 | {en=5, it=214, fr=1}
+casinapioiv.va | 2 | 213 | {en=125, it=88}
+vaticanstate.va | 5 | 193 | {de=25, en=76, it=24, fr=25, es=43}
+laityfamilylife.va | 5 | 163 | {pt=21, en=60, it=3, fr=78, es=1}
+camposanto.va | 1 | 156 | {de=156}
+synod2018.va | 3 | 113 | {en=24, it=67, fr=22}
+
+
+
+## Process the Table with Spark
+
+tbd.
+
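The sample output added to the README renders the `lang_counts` column as a map of the form `{en=468, it=221}`. For readers who post-process the exported CSV, a minimal Python sketch for turning one such cell into a dictionary could look as follows. It assumes the cell looks exactly like the sample rows (comma-separated `key=value` pairs inside braces, with no commas or equals signs inside keys or values); the function name `parse_lang_counts` is hypothetical and not part of this repository.

```python
from typing import Dict


def parse_lang_counts(cell: str) -> Dict[str, int]:
    """Parse a map cell such as '{en=468, it=221}' into a dict.

    Assumption: the cell uses the 'key=value, key=value' rendering
    shown in the sample output; other renderings are rejected.
    """
    text = cell.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not a map cell: {cell!r}")
    inner = text[1:-1].strip()
    if not inner:
        return {}
    counts: Dict[str, int] = {}
    for pair in inner.split(","):
        # Split each 'lang=count' pair at the first equals sign.
        key, _, value = pair.strip().partition("=")
        counts[key] = int(value)
    return counts


# The pas.va row from the sample output: the per-language counts
# add up to the n_pages column of that row.
pas = parse_lang_counts("{en=468, it=221}")
print(pas)                # {'en': 468, 'it': 221}
print(sum(pas.values()))  # 689
```

When reading a real CSV export, the cell would first be extracted with a CSV reader (for example Python's `csv` module) so that the commas inside the map are not mistaken for column separators.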

src/sql/examples/cc-index/get-language-translations-url-path.sql

+8-1
@@ -9,7 +9,14 @@
 -- - number of pages per language as map/histogram
 -- - only output domains with at least 100 pages and
 --   at least one language code in the URL path
-SELECT url_host_registered_domain,
+--
+-- The idea was taken from
+-- - Resnik/Smith 2003: The Web as a Parallel Corpus,
+--   http://www.aclweb.org/anthology/J03-3002.pdf
+-- - Buck 2015: Corpus Acquisition from the Interwebs,
+--   http://mt-class.org/jhu-2015/slides/lecture-crawling.pdf
+--
+SELECT url_host_registered_domain AS domain,
        COUNT(DISTINCT(url_path_lang)) as n_lang,
        COUNT(*) as n_pages,
        histogram(url_path_lang) as lang_counts
