p. Space string function: space
Syntax: space(int n)
Return type: string
Description: returns a string consisting of n spaces
hive> select space(10) from tableName;
hive> select length(space(10)) from tableName;
10
q. Repeat string function: repeat
Syntax: repeat(string str, int n)
Return type: string
Description: returns str repeated n times
hive> select repeat('abc',5) from tableName;
abcabcabcabcabc
r. First-character ASCII function: ascii
Syntax: ascii(string str)
Return type: int
Description: returns the ASCII code of the first character of str
hive> select ascii('abcde') from tableName;
97
s. Left-pad function: lpad
Syntax: lpad(string str, int len, string pad)
Return type: string
Description: left-pads str with pad until the result is len characters long
hive> select lpad('abc',10,'td') from tableName;
tdtdtdtabc
Note: unlike Greenplum and Oracle, the pad argument has no default value and must be supplied
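A common use of lpad is zero-padding a numeric string to a fixed width, e.g. to build sortable keys; a minimal sketch against the same placeholder table used above:
hive> select lpad('7',3,'0') from tableName;
007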
t. Right-pad function: rpad
Syntax: rpad(string str, int len, string pad)
Return type: string
Description: right-pads str with pad until the result is len characters long
hive> select rpad('abc',10,'td') from tableName;
abctdtdtdt
u. String split function: split
Syntax: split(string str, string pat)
Return type: array
Description: splits str around matches of pat and returns the resulting array of strings
hive> select split('abtcdtef','t') from tableName;
["ab","cd","ef"]
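Note that pat is treated as a regular expression, so regex metacharacters have to be escaped; a minimal sketch splitting an IP-style string on a literal dot:
hive> select split('192.168.1.1','\\.') from tableName;
["192","168","1","1"]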
v. Set lookup function: find_in_set
Syntax: find_in_set(string str, string strList)
Return type: int
Description: returns the position (1-based) of the first occurrence of str in strList, where strList is a comma-separated string; returns 0 if str is not found
hive> select find_in_set('ab','ef,ab,de') from tableName;
2
hive> select find_in_set('at','ef,ab,de') from tableName;
0
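In practice find_in_set is handy for filtering on a comma-separated, tag-style column; a minimal sketch, assuming a hypothetical table user_tags(name string, tags string) where tags holds comma-separated values:
hive> select name from user_tags where find_in_set('vip',tags) > 0;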
Building complex types
- Building a map: map
Syntax: map(key1, value1, key2, value2, ...)
Description: builds a map from the given key/value pairs
hive> create table mapTable as select map('100','tom','200','mary') as t from tableName;
hive> describe mapTable;
t       map<string,string>
hive> select t from mapTable;
{"100":"tom","200":"mary"}
b. Building a struct: struct
Syntax: struct(val1, val2, val3, ...)
Description: builds a struct from the given values
hive> create table struct_table as select struct('tom','mary','tim') as t from tableName;
hive> describe struct_table;
t       struct<col1:string,col2:string,col3:string>
hive> select t from struct_table;
{"col1":"tom","col2":"mary","col3":"tim"}
c. Building an array: array
Syntax: array(val1, val2, ...)
Description: builds an array from the given values
hive> create table arr_table as select array("tom","mary","tim") as t from tableName;
hive> describe arr_table;
t       array<string>
hive> select t from arr_table;
["tom","mary","tim"]
Accessing complex types
- Accessing an array: A[n]
Syntax: A[n]
Operand types: A is an array, n is an int
Description: returns the n-th element of array A, with indexing starting at 0. For example, if A is an array with the value ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
hive> create table arr_table2 as select array("tom","mary","tim") as t from tableName;
hive> select t[0],t[1] from arr_table2;
tom     mary
b. Accessing a map: M[key]
Syntax: M[key]
Operand types: M is a map, key is a key of that map
Description: returns the value of map M whose key equals the given key. For example, if M is a map with the value {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
hive> create table map_table2 as select map('100','tom','200','mary') as t from tableName;
hive> select t['200'],t['100'] from map_table2;
mary    tom
c. Accessing a struct: S.x
Syntax: S.x
Operand type: S is a struct
Description: returns field x of struct S. For example, for a struct declared as struct foobar {int foo, int bar}, foobar.foo returns the foo field.
hive> create table str_table2 as select struct('tom','mary','tim') as t from tableName;
hive> describe str_table2;
t       struct<col1:string,col2:string,col3:string>
hive> select t.col1,t.col3 from str_table2;
tom     tim
Size functions for complex types
- Map size function: size(Map<K.V>)
Syntax: size(Map<K.V>)
Return type: int
Description: returns the number of entries in the map
hive> select size(t) from map_table2;
2
b. Array size function: size(Array<T>)
Syntax: size(Array<T>)
Return type: int
Description: returns the number of elements in the array
hive> select size(t) from arr_table2;
3
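size also works directly on map and array literals, and on most Hive versions it returns -1 rather than 0 when the collection is NULL (worth verifying on your version); a minimal sketch against the placeholder table:
hive> select size(map('k1','v1','k2','v2')), size(array(1,2,3)) from tableName;
2       3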
c. Type conversion function ***
Type conversion function: cast
Syntax: cast(expr as <type>)
Return type: the target <type>
Description: returns expr converted to the specified data type
hive> select cast('1' as bigint) from tableName;
1
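Note that when the value cannot be converted, Hive's cast returns NULL rather than raising an error; a minimal sketch:
hive> select cast('abc' as int) from tableName;
NULL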
lateral view, explode, reflect, and window functions in Hive
Using explode to split Map and Array columns of a Hive table
lateral view is designed to be used together with UDTFs such as split and explode. It turns one row of data into multiple rows, and aggregations can then be run over the expanded rows. lateral view first calls the UDTF for every row of the base table; the UDTF splits each row into one or more rows; lateral view then combines the results, producing a virtual table that supports column aliases.
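As a minimal sketch of this pattern (assuming a hypothetical table pageAds(pageid string, adid_list array<int>)), the rows produced by explode can be fed straight into an ordinary GROUP BY:
hive> select adid, count(1) as cnt
    > from pageAds lateral view explode(adid_list) adTable as adid
    > group by adid;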
explode on its own can also be used to split a complex array or map column of a Hive table into multiple rows.
Requirement: we have data in the following format:
zhangsan    child1,child2,child3,child4    k1:v1,k2:v2
lisi        child5,child6,child7,child8    k3:v3,k4:v4
Fields are separated by \t. The requirement is to split every child value out into a single column:
+----------+--+
| mychild  |
+----------+--+
| child1   |
| child2   |
| child3   |
| child4   |
| child5   |
| child6   |
| child7   |
| child8   |
+----------+--+
We also want to split the map keys and values apart, producing the following result:
+-----------+-------------+--+
| mymapkey  | mymapvalue  |
+-----------+-------------+--+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+--+
- Create a Hive database
Create the Hive database:
hive (default)> create database hive_explode;
hive (default)> use hive_explode;
b. Create a Hive table, then use explode to split the map and array
hive (hive_explode)> create table t3(name string,children array<string>,address Map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' stored as textFile;
c. Load the data
Create the data file on node03:
mkdir -p /export/servers/hivedatas/
cd /export/servers/hivedatas/
vim maparray
with the following content:
zhangsan    child1,child2,child3,child4    k1:v1,k2:v2
lisi        child5,child6,child7,child8    k3:v3,k4:v4
Load the data into the Hive table:
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/maparray' into table t3;
d. Use explode to split the data
Split the array column:
hive (hive_explode)> SELECT explode(children) AS myChild FROM t3;
Split the map column:
hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;
Using explode to split a JSON string
Requirement: we now have some data in the following format:
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
The fields are separated by |.
We want to parse out all of the monthSales values, one per row, as the following column:
4900
2090
6987
- Create the Hive table
hive (hive_explode)> create table explode_lateral_view
                   > (`area` string,
                   > `goods_id` string,
                   > `sale_info` string)
                   > ROW FORMAT DELIMITED
                   > FIELDS TERMINATED BY '|'
                   > STORED AS textfile;
b. Prepare and load the data
Prepare the data:
cd /export/servers/hivedatas
vim explode_json
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
Load the data into the Hive table:
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/explode_json' overwrite into table explode_lateral_view;
c. Use explode to split the Array
hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;
d. Use explode to split the Map
hive (hive_explode)> select explode(split(area,',')) as area from explode_lateral_view;
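Strictly speaking, the query above only splits the area string into one "key:value" string per row; to actually separate keys from values, one option is to index into split's result in an outer query. A minimal sketch (the aliases kv and t are arbitrary):
hive (hive_explode)> select split(kv,':')[0] as area_key, split(kv,':')[1] as area_value
                   > from (select explode(split(area,',')) as kv from explode_lateral_view) t;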
e. Split the JSON field
hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as sale_info from explode_lateral_view;
Next we try to use get_json_object to pull out the value of the monthSales key:
hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as sale_info from explode_lateral_view;
and it fails with:
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
The UDTF explode cannot be wrapped inside another function. Likewise, if you try to select a second column, for example
select explode(split(area,',')) as area, good_id from explode_lateral_view;
it fails with:
FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'
A UDTF on its own supports only a single expression in the SELECT clause, which is exactly where LATERAL VIEW comes in.
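For reference, combining the explode above with LATERAL VIEW (covered next) already answers the original requirement of extracting just the monthSales column; a minimal sketch following the same pattern as the full query at the end of this section:
hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales
                   > from explode_lateral_view
                   > LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) sale_info as sale_info_1;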
Using explode together with LATERAL VIEW
Querying multiple fields with lateral view:
hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2;
Here LATERAL VIEW explode(split(goods_id,',')) goods acts as a virtual table that is joined, Cartesian-product style, with the original table explode_lateral_view.
Lateral views can also be chained:
hive (hive_explode)> select goods_id2,sale_info,area2 from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2 LATERAL VIEW explode(split(area,',')) area as area2;
The result is likewise the Cartesian product of the three tables.
Finally, the statement below turns this one row of JSON-formatted data completely into a two-dimensional table:
hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source, get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales, get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount, get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) sale_info as sale_info_1;