Hive ships with many built-in functions, such as case conversion and substring extraction, all documented in the official documentation. But the built-ins cannot cover every need, so Hive also lets us define our own functions.
1. How to test Hive's built-in functions
MySQL and Oracle provide a pseudo-table (dual) for this, but Hive does not, so we can build one ourselves:
1) Create a table dual: create table dual(id string)
2) Create a local file dual.data containing just a space or a single blank line
3) Load dual.data into the dual table
Then test a function, for example substring extraction:
0: jdbc:hive2://localhost:10000> select substr('sichuan',1,3) from dual;
+------+--+
| _c0  |
+------+--+
| sic  |
+------+--+
Of course you can also run select substr('sichuan',1,3) directly, with no FROM clause, but using from dual is the familiar habit.
2. Writing custom functions (UDFs)
Add the Maven dependencies:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-metastore -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-common -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-common</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-service -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-service</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
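Incidentally, with hive-jdbc on the classpath the same tests can be run from Java instead of beeline. A minimal sketch, assuming an unsecured HiveServer2 at localhost:10000 as in the transcripts below; the user name and empty password are assumptions for a test cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Driver class provided by the hive-jdbc dependency
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // "root"/"" are placeholder credentials for an unsecured test cluster
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "root", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select substr('sichuan',1,3)")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // prints: sic
            }
        }
    }
}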
1) Uppercase to lowercase
First create a Java class that extends UDF and overloads the evaluate method.
package com.scu.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Convert uppercase to lowercase
 * @author 12706
 */
public class UpperToLowerCase extends UDF {
    /*
     * Overloaded evaluate method;
     * it must be public, or Hive cannot call it
     */
    public String evaluate(String word) {
        if (word == null) {
            return null; // Hive passes NULLs in, so guard against them
        }
        return word.toLowerCase();
    }
}
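Before packaging, the class can be exercised like any plain Java method. A quick hypothetical test, not part of the original project:

public class UpperToLowerCaseTest {
    public static void main(String[] args) {
        UpperToLowerCase udf = new UpperToLowerCase();
        System.out.println(udf.evaluate("BABY")); // baby
        System.out.println(udf.evaluate(null));   // null
    }
}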
Package it as a jar and upload it to the Hadoop cluster (the jar here is named hive.jar).
0: jdbc:hive2://localhost:10000> select * from t5;
+--------+-----------+--+
| t5.id  |  t5.name  |
+--------+-----------+--+
| 13     | BABY      |
| 1      | zhangsan  |
| 2      | lisi      |
| 3      | wangwu    |
| 4      | furong    |
| 5      | fengjie   |
| 6      | aaa       |
| 7      | bbb       |
| 8      | ccc       |
| 9      | ddd       |
| 10     | eee       |
| 11     | fff       |
| 12     | ggg       |
+--------+-----------+--+
13 rows selected (0.221 seconds)
Put the jar on Hive's classpath:
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
Create a temporary function, specifying the fully qualified class name:
0: jdbc:hive2://localhost:10000> create temporary function tolower as 'com.scu.hive.UpperToLowerCase';
At this point the custom temporary function tolower() is ready to use; test it by printing the name column of t5 in lowercase:
0: jdbc:hive2://localhost:10000> select id,tolower(name) from t5;
+-----+-----------+--+
| id  |    _c1    |
+-----+-----------+--+
| 13  | baby      |
| 1   | zhangsan  |
| 2   | lisi      |
| 3   | wangwu    |
| 4   | furong    |
| 5   | fengjie   |
| 6   | aaa       |
| 7   | bbb       |
| 8   | ccc       |
| 9   | ddd       |
| 10  | eee       |
| 11  | fff       |
| 12  | ggg       |
+-----+-----------+--+
2) Mapping a phone number to its home region
The Java class:
package com.scu.hive;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Look up the home region from the first three digits of a phone number
 * @author 12706
 */
public class PhoneNumParse extends UDF {

    static HashMap<String, String> phoneMap = new HashMap<String, String>();

    static {
        phoneMap.put("136", "beijing");
        phoneMap.put("137", "shanghai");
        phoneMap.put("138", "shenzhen");
    }

    public static String evaluate(int phoneNum) {
        String num = String.valueOf(phoneNum);
        String province = phoneMap.get(num.substring(0, 3));
        return province == null ? "foreign" : province;
    }

    // local test
    public static void main(String[] args) {
        String string = evaluate(136666);
        System.out.println(string);
    }
}
Package the project and upload it to Linux. Note: if the new jar keeps the same name as before, you have to reconnect to the Hive server, because an already-added jar will not be overwritten; it is safer to give each build a different name.
Edit the data file with vi prov.data
Create the table flow(phonenum int, flow int)
Load the file into the flow table
[root@mini1 ~]# vi prov.data
1367788,1
1367788,10
1377788,80
1377788,97
1387788,98
1387788,99
1387788,100
1555118,99
0: jdbc:hive2://localhost:10000> create table flow(phonenum int,flow int)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.143 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/prov.data' into table flow;
INFO  : Loading data to table myhive3.flow from file:/root/prov.data
INFO  : Table myhive3.flow stats: [numFiles=1, totalSize=88]
No rows affected (0.316 seconds)
0: jdbc:hive2://localhost:10000> select * from flow;
+----------------+------------+--+
| flow.phonenum  | flow.flow  |
+----------------+------------+--+
| 1367788        | 1          |
| 1367788        | 10         |
| 1377788        | 80         |
| 1377788        | 97         |
| 1387788        | 98         |
| 1387788        | 99         |
| 1387788        | 100        |
| 1555118        | 99         |
+----------------+------------+--+
Add the jar to the classpath, create the temporary function, and test:
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
INFO  : Added [/root/hive.jar] to class path
INFO  : Added resources: [/root/hive.jar]
No rows affected (0.236 seconds)
0: jdbc:hive2://localhost:10000> create temporary function getprovince as 'com.scu.hive.PhoneNumParse';
No rows affected (0.038 seconds)
0: jdbc:hive2://localhost:10000> select phonenum,getprovince(phonenum),flow from flow;
+-----------+-----------+-------+--+
| phonenum  |    _c1    | flow  |
+-----------+-----------+-------+--+
| 1367788   | beijing   | 1     |
| 1367788   | beijing   | 10    |
| 1377788   | shanghai  | 80    |
| 1377788   | shanghai  | 97    |
| 1387788   | shenzhen  | 98    |
| 1387788   | shenzhen  | 99    |
| 1387788   | shenzhen  | 100   |
| 1555118   | foreign   | 99    |
+-----------+-----------+-------+--+
3) Developing a JSON-parsing UDF
We have a file, part of which is shown below, consisting entirely of JSON strings; we need to load it into a table and parse each line into its 4 fields.
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"} {"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"} {"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"} {"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"} {"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"} {"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
The Java classes:
package com.scu.hive;

// fastjson (com.alibaba:fastjson) must also be on the classpath
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.TypeReference;

import org.apache.hadoop.hive.ql.exec.UDF;

public class JsonParse extends UDF {
    // in:  {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
    // out: 1193    5    978300760    1   (tab-separated)
    public static String evaluate(String line) {
        MovieRateBean movieRateBean = JSON.parseObject(line, new TypeReference<MovieRateBean>() {});
        return movieRateBean.toString();
    }
}
public class MovieRateBean {

    private String movie;
    private String rate;       // rating
    private String timeStamp;
    private String uid;

    @Override
    public String toString() {
        return this.movie + "\t" + this.rate + "\t" + this.timeStamp + "\t" + this.uid;
    }

    // getters and setters omitted (fastjson needs them to populate the fields)
}
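As with the phone-number UDF, this can be sanity-checked locally before packaging. A hypothetical quick test, not part of the original project:

public class JsonParseTest {
    public static void main(String[] args) {
        String line = "{\"movie\":\"1193\",\"rate\":\"5\",\"timeStamp\":\"978300760\",\"uid\":\"1\"}";
        // expected output (tab-separated): 1193  5  978300760  1
        System.out.println(JsonParse.evaluate(line));
    }
}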
Package the project and upload it to Linux.
Create the table json:
create table json(line string);
Load the file into the json table:
load data local inpath '/root/json.data' into table json;
0: jdbc:hive2://localhost:10000> select * from json limit 10;
+----------------------------------------------------------------+--+
|                           json.line                            |
+----------------------------------------------------------------+--+
| {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}  |
| {"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}   |
| {"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}   |
| {"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}  |
| {"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}  |
| {"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}  |
| {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}  |
| {"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}  |
| {"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}   |
| {"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}   |
+----------------------------------------------------------------+--+
0: jdbc:hive2://localhost:10000> add jar /root/hive3.jar;
INFO  : Added [/root/hive3.jar] to class path
INFO  : Added resources: [/root/hive3.jar]
No rows affected (0.023 seconds)
0: jdbc:hive2://localhost:10000> create temporary function parsejson as 'com.scu.hive.JsonParse';
No rows affected (0.07 seconds)
0: jdbc:hive2://localhost:10000> select parsejson(line) from json limit 10;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 1193 5 978300760 1  |
| 661 3 978302109 1   |
| 914 3 978301968 1   |
| 3408 4 978300275 1  |
| 2355 5 978824291 1  |
| 1197 3 978302268 1  |
| 1287 5 978302039 1  |
| 2804 5 978300719 1  |
| 594 4 978302268 1   |
| 919 4 978301368 1   |
+---------------------+--+
One shortcoming remains: the output has no field names. We can fix this by splitting the UDF's output inside a create-table-as-select, naming each column.
0: jdbc:hive2://localhost:10000> create table t_rating as
0: jdbc:hive2://localhost:10000> select split(parsejson(line),'\t')[0] as movieid,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[1] as rate,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[2] as timestring,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[3] as uid
0: jdbc:hive2://localhost:10000> from json limit 10;
0: jdbc:hive2://localhost:10000> select * from t_rating;
+-------------------+----------------+----------------------+---------------+--+
| t_rating.movieid  | t_rating.rate  | t_rating.timestring  | t_rating.uid  |
+-------------------+----------------+----------------------+---------------+--+
| 919               | 4              | 978301368            | 1             |
| 594               | 4              | 978302268            | 1             |
| 2804              | 5              | 978300719            | 1             |
| 1287              | 5              | 978302039            | 1             |
| 1197              | 3              | 978302268            | 1             |
| 2355              | 5              | 978824291            | 1             |
| 3408              | 4              | 978300275            | 1             |
| 914               | 3              | 978301968            | 1             |
| 661               | 3              | 978302109            | 1             |
| 1193              | 5              | 978300760            | 1             |
+-------------------+----------------+----------------------+---------------+--+
3. Using the TRANSFORM keyword
Requirement: create a new table with the same content as t_rating, except that the third field, the timestamp, is converted to the day of the week.
Hive's TRANSFORM keyword lets SQL call a script you write yourself.
It suits cases where you need functionality Hive lacks but don't want to write a UDF.
1. Write a Python script (check first that the machine has Python) that converts the timestamps to a day of the week
2. Add the script file to the Hive session
3. Create the new table, whose field values are t_rating's rows piped through the script
[root@mini1 ~]# python
Python 2.6.6 (r266:84292, Feb 21 2013, 23:54:59)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'hello'
hello
>>> quit()
[root@mini1 ~]# vi weekday_mapper.py
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
Back in the Hive client:
0: jdbc:hive2://localhost:10000> add FILE /root/weekday_mapper.py;
0: jdbc:hive2://localhost:10000> create TABLE u_data_new as
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000>   TRANSFORM (movieid, rate, timestring, uid)
0: jdbc:hive2://localhost:10000>   USING 'python weekday_mapper.py'
0: jdbc:hive2://localhost:10000>   AS (movieid, rate, weekday, uid)
0: jdbc:hive2://localhost:10000> FROM t_rating;
...
0: jdbc:hive2://localhost:10000> select * from u_data_new;
+---------------------+------------------+---------------------+-----------------+--+
| u_data_new.movieid  | u_data_new.rate  | u_data_new.weekday  | u_data_new.uid  |
+---------------------+------------------+---------------------+-----------------+--+
| 919                 | 4                | 1                   | 1               |
| 594                 | 4                | 1                   | 1               |
| 2804                | 5                | 1                   | 1               |
| 1287                | 5                | 1                   | 1               |
| 1197                | 3                | 1                   | 1               |
| 2355                | 5                | 7                   | 1               |
| 3408                | 4                | 1                   | 1               |
| 914                 | 3                | 1                   | 1               |
| 661                 | 3                | 1                   | 1               |
| 1193                | 5                | 1                   | 1               |
+---------------------+------------------+---------------------+-----------------+--+
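For comparison, the same weekday conversion could also be written as a Java UDF in the style of the earlier examples, trading the external script for a class you must package and register. A minimal sketch; the class name and logic are hypothetical, not from the original project:

package com.scu.hive;

import java.util.Calendar;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * UDF equivalent of weekday_mapper.py:
 * unix timestamp in seconds -> ISO weekday (Monday=1 .. Sunday=7)
 */
public class TimestampToWeekday extends UDF {
    public Integer evaluate(String unixtime) {
        if (unixtime == null) {
            return null;
        }
        Calendar cal = Calendar.getInstance(); // uses the server's default timezone
        cal.setTimeInMillis(Long.parseLong(unixtime.trim()) * 1000L);
        int dow = cal.get(Calendar.DAY_OF_WEEK); // Calendar: Sunday=1 .. Saturday=7
        return dow == Calendar.SUNDAY ? 7 : dow - 1;
    }
}

After add jar and create temporary function weekday as 'com.scu.hive.TimestampToWeekday', the same table could be built with a plain select movieid, rate, weekday(timestring), uid from t_rating.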