This article walks through Hive's composite data structures (array, map, struct) and the usage of several related functions, with concrete examples.
Hive currently supports the following composite data types:
map(key1, value1, key2, value2, …): creates a map with the given key/value pairs
struct(val1, val2, val3, …): creates a struct with the given field values; the struct field names will be col1, col2, …
named_struct(name1, val1, name2, val2, …): creates a struct with the given field names and values (as of Hive 0.8.0)
array(val1, val2, …): creates an array with the given elements
create_union(tag, val1, val2, …): creates a union type with the value that is being pointed to by the tag parameter
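As a quick illustration of these constructors, here is a minimal sketch; it assumes a Hive version that allows SELECT without a FROM clause (roughly 0.13 and later), and the aliases c1 through c4 are made up for the example:
-- build one value of each composite type directly in a SELECT
select array(1, 2, 3)                        as c1,
       map('k1', 1, 'k2', 2)                 as c2,
       struct(1, 'two')                      as c3,
       named_struct('id', 1, 'name', 'two')  as c4;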
Part 1: Using the three types map, struct and array
1. Using array
Create a table with an array-typed column:
create table person(name string, work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
Sample data to load (person.txt):
biansutao beijing,shanghai,tianjin,hangzhou
linan changchu,chengdu,wuhan
LOAD DATA LOCAL INPATH '/home/hadoop/person.txt' OVERWRITE INTO TABLE person;
hive> select * from person;
biansutao ["beijing","shanghai","tianjin","hangzhou"]
linan ["changchu","chengdu","wuhan"]
Time taken: 0.355 seconds
hive> select name from person;
linan
biansutao
Time taken: 12.397 seconds
hive> select work_locations[0] from person;
changchu
beijing
Time taken: 13.214 seconds
hive> select work_locations from person;
["changchu","chengdu","wuhan"]
["beijing","shanghai","tianjin","hangzhou"]
Time taken: 13.755 seconds
hive> select work_locations[3] from person;
hangzhou
Time taken: 12.722 seconds
hive> select work_locations[4] from person;
Time taken: 15.958 seconds
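Besides indexing by position, the built-in collection functions also work on the array column; a minimal sketch against the person table above (size() and array_contains() are standard Hive built-ins, output omitted):
-- size() returns the number of elements in the array
select name, size(work_locations) from person;
-- array_contains() tests whether a value is an element of the array
select name from person where array_contains(work_locations, 'beijing');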
2. Using map
Create the table:
create table score(name string, score map<string,int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
Data to load (score.txt):
biansutao 數學:80,語文:89,英語:95
jobs 語文:60,數學:80,英語:99
LOAD DATA LOCAL INPATH '/home/hadoop/score.txt' OVERWRITE INTO TABLE score;
hive> select * from score;
biansutao {"數學":80,"語文":89,"英語":95}
jobs {"語文":60,"數學":80,"英語":99}
Time taken: 0.665 seconds
hive> select name from score;
biansutao
Time taken: 19.778 seconds
hive> select t.score from score t;
{"語文":60,"數學":80,"英語":99}
{"數學":80,"語文":89,"英語":95}
Time taken: 19.353 seconds
hive> select t.score['語文'] from score t;
Time taken: 13.054 seconds
hive> select t.score['英語'] from score t;
Time taken: 13.769 seconds
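Map columns can be inspected in a similar way; a small sketch against the score table above, using the built-ins size(), map_keys() and map_values():
-- size() counts the entries; map_keys()/map_values() return arrays of keys/values
select name, size(score) from score;
select name, map_keys(score), map_values(score) from score;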
3. Using struct
Create the table:
CREATE TABLE test(id int, course struct<course:string,score:int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
Sample data (test.txt):
1 english,80
2 math,89
3 chinese,95
LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' OVERWRITE INTO TABLE test;
hive> select * from test;
1 {"course":"english","score":80}
2 {"course":"math","score":89}
3 {"course":"chinese","score":95}
Time taken: 0.275 seconds
hive> select course from test;
{"course":"english","score":80}
{"course":"math","score":89}
{"course":"chinese","score":95}
Time taken: 44.968 seconds
hive> select t.course.course from test t;
english
chinese
Time taken: 15.827 seconds
hive> select t.course.score from test t;
Time taken: 13.235 seconds
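Struct fields can also be referenced in a WHERE clause; a minimal sketch against the test table above:
-- filter on a field nested inside the struct column
select id, course.course from test where course.score >= 89;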
4. Combining data types (complex types nested inside other complex types are not supported)
create table test1(id int, a map<string, array<string>>)
row format delimited fields terminated by '\t'
collection items terminated by ','
MAP KEYS TERMINATED BY ':';
Sample data (test1.txt):
1 english:80,90,70
2 math:89,78,86
3 chinese:99,100,82
LOAD DATA LOCAL INPATH '/home/hadoop/test1.txt' OVERWRITE INTO TABLE test1;
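If the nested type does load on your Hive version, the value would in principle be addressed by chaining the map and array accessors; this is only a sketch, since, as noted above, this kind of nested combination is not reliably supported and the result depends on how the delimiters are interpreted:
-- index the map by key first, then index the resulting array
select id, a['english'][0] from test1;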
Part 2: Some less common Hive functions
We won't dwell on the common functions, which behave much like standard SQL; the functions discussed below are essentially specific to HQL.
Hive functions fall roughly into the following groups: Built-in, Misc., UDF, UDTF, UDAF.
We will pick a few that standard SQL does not have but that are commonly used in Hive SQL for statistical analysis.
1. array_contains (Collection Functions)
This is a built-in function for operating on collections. Example usage:
create EXTERNAL table IF NOT EXISTS userInfo (id int, sex string, age int, name string, email string, sd string, ed string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/hive/dw';
select * from userinfo where sex='male' and (id!=1 and id!=2 and id!=3 and id!=4 and id!=5) and age < 30;
select * from (select * from userinfo where sex='male' and !array_contains(split('1,2,3,4,5', ','), cast(id as string))) tb1 where tb1.age < 30;
The test data for this table can be generated automatically with the script at the link below:
http://my.oschina.net/leejun2005/blog/76631
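A quick way to see what array_contains does with the literal values used above (a minimal sketch; it assumes a Hive version that allows SELECT without a FROM clause):
-- returns true, because '3' is one of the elements produced by split()
select array_contains(split('1,2,3,4,5', ','), '3');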
2. get_json_object (Misc. Functions)
Test data:
first {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} third
first {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":91,"type":"pear"}],"bicycle":{"price":19.952,"color":"red2"}},"email":"amy@only_for_json_udf_test.net","owner":"amy2"} third
first {"store":{"fruit":[{"weight":10,"type":"apple"},{"weight":911,"type":"pear"}],"bicycle":{"price":19.953,"color":"red3"}},"email":"amy@only_for_json_udf_test.net","owner":"amy3"} third
create external table if not exists t_json(f1 string, f2 string, f3 string) row format delimited fields TERMINATED BY '\t' location '/test/json';
select get_json_object(t_json.f2, '$.owner') from t_json;
SELECT * from t_json where get_json_object(t_json.f2, '$.store.fruit[0].weight') = 9;
SELECT get_json_object(t_json.f2, '$.non_exist_key') FROM t_json;
Pay particular attention here to the UDTF issue; the official documentation explains:
json_tuple
A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. In any case where a single JSON string would be parsed more than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.
For example,
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appenvets, '$.eventname') from log a;
should be changed to
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;
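Applied to the t_json table created earlier, the same pattern looks roughly like this (a sketch; owner and email are simply two keys that happen to exist in the test data):
-- parse each JSON string once and extract two keys in a single pass
select f1, b.owner, b.email
from t_json lateral view json_tuple(f2, 'owner', 'email') b as owner, email;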
UDTFs (User-Defined Table-Generating Functions) address the need for "one row in, many rows out" (one-to-many mapping).
LATERAL VIEW makes it easy to join the rows produced by a UDTF back with the rest of the query, because using a UDTF directly in SELECT is restricted: the SELECT may contain only that single expression. Not only can you not use several UDTFs at once, you cannot even combine a single UDTF with other columns; Hive complains that only a single expression is supported with a UDTF. For example:
hive> select my_test("abcef:aa") as qq, 'abcd' from sunwg01;
FAILED: Error in semantic analysis: Only a single expression in the SELECT clause is supported with UDTF’s
LATERAL VIEW makes the above possible. Its syntax is:
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
hive> create table sunwg (a array<int>, b array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
OK
Time taken: 1.145 seconds
hive> load data local inpath '/home/hjl/sunwg/sunwg.txt' overwrite into table sunwg;
Copying data from file:/home/hjl/sunwg/sunwg.txt
Loading data to table sunwg
OK
Time taken: 0.162 seconds
hive> select * from sunwg;
OK
[10,11] ["tom","mary"]
[20,21] ["kate","tim"]
Time taken: 0.069 seconds
hive> SELECT a, name
FROM sunwg LATERAL VIEW explode(b) r1 AS name;
OK
[10,11] tom
[10,11] mary
[20,21] kate
[20,21] tim
Time taken: 8.497 seconds
hive> SELECT id, name
FROM sunwg LATERAL VIEW explode(a) r1 AS id
LATERAL VIEW explode(b) r2 AS name;
OK
10 tom
10 mary
11 tom
11 mary
20 kate
20 tim
21 kate
21 tim
Time taken: 9.687 seconds
3. parse_url_tuple
Test data:
url1 http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
url3 https://www.google.com.hk/#hl=zh-CN&newwindow=1&safe=strict&q=hive+translate+example&oq=hive+translate+example&gs_l=serp.3…10174.11861.6.12051.8.8.0.0.0.0.132.883.0j7.7.0…0.0…1c.1j4.8.serp.0B9C1T_n0Hs&bav=on.2,or.&bvm=bv.44770516,d.aGc&fp=e13e41a6b9dab3f6&biw=1241&bih=589
create external table if not exists t_url(f1 string, f2 string) row format delimited fields TERMINATED BY '\t' location '/test/url';
SELECT f1, b.* FROM t_url LATERAL VIEW parse_url_tuple(f2, 'HOST', 'PATH', 'QUERY', 'QUERY:k1') b as host, path, query, query_id;
Result:
url1 facebook.com /path2/p.php k1=v1&k2=v2 v1
url2 cwiki.apache.org /confluence/display/Hive/LanguageManual+UDF NULL NULL
url3 www.google.com.hk / NULL NULL
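When only a single URL part is needed, the scalar parse_url function is enough and no LATERAL VIEW is required; a minimal sketch against the same t_url table:
-- extract just the host portion of each URL
select f1, parse_url(f2, 'HOST') from t_url;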
4. explode
explode is a built-in Hive table-generating function (Built-in Table-Generating Functions, UDTF). It mainly solves the 1-to-N problem: it splits one input row into multiple output rows, for example turning each element of an array into its own row, emitted as a virtual table. A few things to keep in mind:
Using the syntax SELECT udtf(col) AS colAlias... has a few limitations:
No other expressions are allowed in SELECT
SELECT pageid, explode(adid_list) AS myCol... is not supported
UDTFs can't be nested
SELECT explode(explode(adid_list)) AS myCol... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
From the semantics and syntax above we can see that:
a UDTF cannot be mixed with other non-UDTF columns in the SELECT list,
UDTFs cannot be nested,
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported,
and a UDTF in SELECT must be given a column alias, otherwise an error is raised:
SELECT explode(myCol) AS myNewCol FROM myTable;
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;
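As a concrete illustration of posexplode (available in newer Hive releases, roughly 0.13 and later), here is a sketch against the sunwg table created earlier, returning each array element together with its position:
-- pos is the zero-based index of each element of array column b
SELECT pos, name
FROM sunwg LATERAL VIEW posexplode(b) r1 AS pos, name;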
5. lateral view
lateral view is the construct Hive provides to work in conjunction with UDTFs; it solves the problem that a UDTF cannot have additional SELECT columns next to it. A typical case: we split some column of a Hive table and want to turn it into a 1-to-N shape, i.e. one row becomes many rows. Hive does not allow any other SELECT expression outside the UDTF call.
For example, suppose the ids of users who logged in to a game are stored in a single field user_ids, and we want to apply a UDTF to each row and output multiple rows:
select game_id, explode(split(user_ids, '\\[\\[\\[')) as user_id from login_game_log where dt='2014-05-15';
FAILED: Error in semantic analysis: UDTF's are not supported outside the SELECT clause, nor nested in expressions
The semantic analysis fails: a UDTF is not supported alongside other SELECT expressions. So what if we need exactly that? This is where LATERAL VIEW comes in.
LATERAL VIEW exists precisely to be used together with UDTFs such as explode. It puts the rows generated by the UDTF into a virtual table and then joins that virtual table (1 to N) back to the input row (here, each game_id), so that columns outside the UDTF can be selected as well (the source row and the rows of the expanded virtual table are joined row by row, 1 join N). This is also why LATERAL VIEW udtf(expression) must be followed by a table alias and column aliases.
Lateral View Syntax
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
As the grammar shows, a LATERAL VIEW clause appears in two places:
directly before the udtf(expression) call it wraps,
and after the base table in the FROM clause.
For example, given:
pageid adid_list
front_page [1, 2, 3]
contact_page [3, 4, 5]
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;
pageid adid
front_page 1
front_page 2
front_page 3
contact_page 3
contact_page 4
contact_page 5
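A common follow-up is to aggregate over the exploded rows; note that the GROUP BY restriction mentioned earlier applies only to a bare UDTF in SELECT, not to a LATERAL VIEW. A sketch on the same pageAds data:
-- count how many pages each ad appears on
SELECT adid, count(1)
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid
GROUP BY adid;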
A FROM clause can be followed by multiple LATERAL VIEWs.
A FROM clause can have multiple LATERAL VIEW clauses. Subsequent LATERAL VIEWS can reference columns from any of the tables appearing to the left of the LATERAL VIEW.
Given the data:
Array<int> col1    Array<string> col2
[1, 2]             ["a", "b", "c"]
[3, 4]             ["d", "e", "f"]
Goal:
expand the first and second columns at the same time, similar to a Cartesian product.
We can write it like this:
SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;
There is one more case: what if the array produced by the UDTF is empty?
Hive 0.12 adds the OUTER keyword. By default, if the UDTF result is empty, the row is simply dropped from the output.
With OUTER, the behaviour is like a LEFT OUTER JOIN: the selected columns are still output, and the UDTF output columns are NULL.
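A minimal sketch of the OUTER behaviour (exploding an empty array would otherwise produce no rows at all, so without OUTER the source row disappears):
-- with OUTER the pageid row is kept and adid comes back as NULL
SELECT pageid, adid
FROM pageAds LATERAL VIEW OUTER explode(array()) adTable AS adid;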
To summarise: LATERAL VIEW usually appears together with a UDTF, to work around the rule that a UDTF cannot be combined with other SELECT columns.
Multiple LATERAL VIEWs can produce something like a Cartesian product.
The OUTER keyword turns UDTF results that would otherwise be dropped into NULL output, so no source rows are lost.