Lab 009
- Big data applications have spread into many corners of everyday life; the movie "Moneyball" is one example. Sean Lahman publishes a set of baseball statistics in CSV format on his website under the Creative Commons Attribution-ShareAlike 3.0 Unported License, so we will use this dataset as our example for learning how to import CSV data into Hive.
- First, download the 2012 season statistics and unzip the archive.
user@master ~ $ mkdir baseball
user@master ~ $ cd baseball/
user@master ~/baseball $ wget http://seanlahman.com/files/database/lahman2012-csv.zip
user@master ~/baseball $ unzip lahman2012-csv.zip
- Next, we need to upload the data to HDFS. Let's create a directory named baseball on HDFS and upload all the CSV files into it.
user@master ~/baseball $ hadoop fs -mkdir baseball
user@master ~/baseball $ hadoop fs -put *.csv baseball
user@master ~/baseball $ hadoop fs -ls baseball
Found 24 items
-rw------- 3 user user 198529 2013-12-23 16:27 baseball/AllstarFull.csv
-rw------- 3 user user 5730747 2013-12-23 16:28 baseball/Appearances.csv
-rw------- 3 user user 7304 2013-12-23 16:28 baseball/AwardsManagers.csv
-rw------- 3 user user 240867 2013-12-23 16:28 baseball/AwardsPlayers.csv
-rw------- 3 user user 16719 2013-12-23 16:28 baseball/AwardsShareManagers.csv
... (output omitted) ...
- Although each CSV file has its own column definitions, Hive cannot automatically infer the column data types, so we still need to define the Hive table schema ourselves, based on the documentation that comes with the data.
- To find out which columns each CSV file contains, consult the readme file readme 2012.txt. Sites such as Open Source Sports publish updated statistics for each season every year. The readme also mentions that an MS Access version is available; if needed, you can download that version instead and convert it from MS Access to MS SQL Server and then to Hive tables, which saves you the step of defining the schema yourself.
MS Access Versions: lahman2012.mdb 2012readme.txt
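- If you want to see what the raw values look like next to the readme's column descriptions, a quick local peek at the first few lines of a file can help. This is a minimal sketch using the standard head tool; the exact output depends on the dataset version you downloaded.
user@master ~/baseball $ head -3 Master.csv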
- The workflow for importing CSV data into Hive is:
- Create a database
- Create a table
- Import the data into the table
- Verify that the table contents are correct
- First, let's learn the syntax for creating a database in Hive. Like other database systems, Hive provides an interactive shell. At the command line, run
hive
to enter Hive's interactive query interface.
user@master ~/baseball $ hive
Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-0.8.1-cdh4.0.1.jar!/hive-log4j.properties
Hive history file=/tmp/user/hive_job_log_user_201312231714_241463960.txt
hive>
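- Optionally, you can ask the Hive CLI to print column headers above query results. This is a per-session setting; double-check that the Hive version you are running honors the property.
hive> SET hive.cli.print.header=true;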
- In the steps that follow, any command shown after the
hive>
prompt is executed inside the Hive interactive shell.
- Hive's query language is called HiveQL. The HiveQL syntax for creating a database is:
create database <database name>
hive> create database baseball;
OK
Time taken: 5.983 seconds
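- As a quick sanity check (assuming Hive 0.7 or later), you can ask Hive where the new database lives on HDFS:
hive> DESCRIBE DATABASE baseball;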
- Next, let's pick a CSV file to analyze, for example Master.csv, and first look at the columns and data types of the Master file. The documentation says:
MASTER - Player names, DOB, and biographical info
--- 2.1 MASTER table ---
lahmanID       Unique number assigned to each player
playerID       A unique code asssigned to each player. The playerID links the data in this file with records in the other files.
managerID      An ID for individuals who served as managers
hofID          An ID for individuals who are in teh baseball Hall of Fame
birthYear      Year player was born
birthMonth     Month player was born
birthDay       Day player was born
birthCountry   Country where player was born
birthState     State where player was born
birthCity      City where player was born
deathYear      Year player died
deathMonth     Month player died
deathDay       Day player died
deathCountry   Country where player died
deathState     State where player died
deathCity      City where player died
nameFirst      Player's first name
nameLast       Player's last name
nameNote       Note about player's name (usually signifying that they changed their name or played under two differnt names)
nameGiven      Player's given name (typically first and middle)
nameNick       Player's nickname
weight         Player's weight in pounds
height         Player's height in inches
bats           Player's batting hand (left, right, or both)
throws         Player's throwing hand (left or right)
debut          Date that player made first major league appearance
finalGame      Date that player made first major league appearance (blank if still active)
college        College attended
lahman40ID     ID used in Lahman Database version 4.0
lahman45ID     ID used in Lahman database version 4.5
retroID        ID used by retrosheet
holtzID        ID used by Sean Holtz's Baseball Almanac
bbrefID        ID used by Baseball Reference website
- The HiveQL syntax for creating a table is:
create table <database name>.<table name> ( column1 type1, ..., columnN typeN )
hive> create table baseball.Master (
        lahmanID INT, playerID INT, managerID INT, hofID INT,
        birthyear INT, birthMonth INT, birthDay INT,
        birthCountry STRING, birthState STRING, birthCity STRING,
        deathYear INT, deathMonth INT, deathDay INT,
        deathCountry STRING, deathState STRING, deathCity STRING,
        nameFirst STRING, nameLast STRING, nameNote STRING,
        nameGiven STRING, nameNick STRING,
        weight INT, height INT, bats STRING, throws STRING,
        debut INT, finalGame INT, college STRING,
        lahman40ID INT, lahman45ID INT, retroID INT, holtzID INT, bbrefID INT
      ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
- <Note> For the data types supported by Hive, see the official documentation, and check whether the Hive version you are running supports the data types you need.
Decimal datatype was introduced in Hive 0.11.0 (HIVE-2693) and revised in Hive 0.13.0 (HIVE-3976).
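- To confirm that the table was created with the schema you intended, you can describe it; this is standard HiveQL and lists each column with its type.
hive> DESCRIBE baseball.Master;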
- The HiveQL syntax for importing data from a CSV file into a Hive table is `LOAD DATA LOCAL INPATH <file path> [OVERWRITE] INTO TABLE <table name>`.
hive> LOAD DATA LOCAL INPATH "Master.csv" OVERWRITE INTO TABLE baseball.Master;
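- If you are curious where the loaded file ends up, LOAD DATA LOCAL copies it into the table's directory under the Hive warehouse. Assuming the default warehouse location /user/hive/warehouse (your installation may use a different path), you can check with:
user@master ~/baseball $ hadoop fs -ls /user/hive/warehouse/baseball.db/master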
- First, let's use HiveQL to check that the database was created correctly. The syntax is:
SHOW DATABASES;
hive> SHOW DATABASES;
OK
baseball
default
Time taken: 0.088 seconds
- Next, let's switch the current database to the baseball database we just created. The syntax is:
USE <database name>;
hive> USE baseball;
OK
Time taken: 0.014 seconds
- We can use SQL-like syntax to list the tables in the current database. The syntax is:
SHOW TABLES;
hive> SHOW TABLES;
OK
master
Time taken: 0.086 seconds
- Next, let's check whether the contents of the newly created baseball.master table look correct. Because HiveQL is highly compatible with SQL-92, we can use the standard
SELECT * FROM <table name> WHERE <condition>;
form to query the data.
hive> SELECT * FROM Master;
hive> SELECT lahmanID FROM Master;
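- A plain SELECT * dumps every row, which can be a lot of output. If you only want a quick look at a few rows, you can add a LIMIT clause:
hive> SELECT * FROM Master LIMIT 10;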
- After running the queries above, you may have noticed that the more conditions a query contains, the longer Hive takes to run it. That is because Hive has to translate your query into MapReduce jobs.
hive> SELECT lahmanID FROM Master WHERE birthyear > 1900;
hive> SELECT COUNT(*) FROM Master;
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_201312211330_0022
No encryption was performed by peer.
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Accumulative CPU: 2.9 sec   HDFS Read: 0 HDFS Write: 0 SUCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
18126
Time taken: 97.89 seconds
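- If you want to see how Hive turns a query into MapReduce stages before running it, EXPLAIN prints the execution plan without executing the query:
hive> EXPLAIN SELECT COUNT(*) FROM Master;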
- To leave the Hive shell, type:
quit;
hive> quit;
- Sometimes, if you only need to run a short query, you can execute it directly with
hive -e <HiveQL statement>
Adding the -S flag suppresses the MapReduce progress output that Hive prints to standard error (STDERR).
user@master ~/baseball $ hive -S -e "use baseball; SELECT birthyear, lahmanID, nameFirst FROM master WHERE birthyear > 1900 ;"
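- For longer sequences of statements, you can also put the HiveQL in a file and run it with hive -f. The file name below is just an example.
user@master ~/baseball $ echo "USE baseball; SELECT COUNT(*) FROM master;" > count_rows.hql
user@master ~/baseball $ hive -S -f count_rows.hql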
- "Build a data library with Hive", By Peter J. Jamack, Big Data Analytics Consultant, 21 May 2013, IBM developerWorks