0. 前言
Data Lake Analytics是Serverless化的云上交互式查询分析服务。用户可以使用标准的SQL语句,对存储在OSS、TableStore上的数据无需移动,直接进行查询分析。
目前该产品已经正式登陆阿里云,欢迎大家申请试用,体验更便捷的数据分析服务。
请参考 进行产品开通服务申请。
在上一篇教程中,我们介绍了如何分析CSV格式的TPC-H数据集。除了纯文本文件(例如,CSV,TSV等),用户存储在OSS上的其他格式的数据文件,也可以使用Data Lake Analytics进行查询分析,包括ORC, PARQUET, JSON, RCFILE, AVRO甚至ESRI规范的地理JSON数据,还可以用正则表达式匹配的文件等。
本文详细介绍如何根据存储在OSS上的文件格式使用Data Lake Analytics (下文简称 DLA)进行分析。DLA内置了各种处理文件数据的SerDe(Serialize/Deserilize的简称,目的是用于序列化和反序列化)实现,用户无需自己编写程序,基本上能选用DLA中的一款或多款SerDe来匹配您OSS上的数据文件格式。如果还不能满足您特殊文件格式的处理需求,请联系我们,尽快为您实现。
1. 存储格式与SerDe
用户可以依据存储在OSS上的数据文件进行建表,通过STORED AS 指定数据文件的格式。
例如,
CREATE EXTERNAL TABLE nation (
N_NATIONKEY INT,
N_NAME STRING,
N_REGIONKEY INT,
N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';
建表成功后可以使用SHOW CREATE TABLE语句查看原始建表语句。
mysql> show create table nation;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Result |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| CREATE EXTERNAL TABLE `nation`(
`n_nationkey` int,
`n_name` string,
`n_regionkey` int,
`n_comment` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS `TEXTFILE`
LOCATION
'oss://test-bucket-julian-1/tpch_100m/nation'|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.81 sec)
下表中列出了目前DLA已经支持的文件格式,当针对下列格式的文件建表时,可以直接使用STORED AS,DLA会选择合适的SERDE/INPUTFORMAT/OUTPUTFORMAT。
在指定了STORED AS 的同时,还可以根据具体文件的特点,指定SerDe (用于解析数据文件并映射到DLA表),特殊的列分隔符等。
后面的部分会做进一步的讲解。
2. 示例
2.1 CSV文件
CSV文件,本质上还是纯文本文件,可以使用STORED AS TEXTFILE。
列与列之间以逗号分隔,可以通过ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 表示。
普通CSV文件
例如,数据文件oss://bucket-for-testing/oss/text/cities/city.csv的内容为
Beijing,China,010
ShangHai,China,021
Tianjin,China,022
建表语句可以为
CREATE EXTERNAL TABLE city (
city STRING,
country STRING,
code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities';
文件的编码格式为GBK
数据文件的编码格式为非UTF-8时,需要在建表语句中使用指定SERDE和编码格式,即
SERDE 和 WITH SERDEPROPERTIES ('serialization.encoding'='gbk')。
例如,编码格式为GBK,则建表语句为
CREATE EXTERNAL TABLE `gbk_test` (
`col_1` string,
`col_2` string,
`col_3` string
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES('serialization.encoding'='gbk', 'field.delim'=',')
STORED AS TEXTFILE
LOCATION 'oss://mybucket/datasets/test/test_create/create_table/gbk/gbk.csv'
TBLPROPERTIES (
'skip.header.line.count' = '0'
)
使用OpenCSVSerde处理引号引用的字段
OpenCSVSerde在使用时需要注意以下几点:
- 用户可以为行的字段指定字段分隔符、字段内容引用符号和转义字符,例如:WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\" );
- 不支持字段内嵌入的行分割符;
- 所有字段定义STRING类型;
- 其他数据类型的处理,可以在SQL中使用函数进行转换。
例如,CREATE EXTERNAL TABLE test_csv_opencsvserde ( id STRING, name STRING, location STRING, create_date STRING, create_timestamp STRING, longitude STRING, latitude STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties( 'separatorChar'=',', 'quoteChar'='"', 'escapeChar'='\\' ) STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
自定义分隔符
需要自定义列分隔符(FIELDS TERMINATED BY),转义字符(ESCAPED BY),行结束符(LINES TERMINATED BY)。
需要在建表语句中指定
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
忽略CSV文件中的HEADER
在csv文件中,有时会带有HEADER信息,需要在数据读取时忽略掉这些内容。这时需要在建表语句中定义skip.header.line.count。
例如,数据文件oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl的内容如下:
N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
5|ETHIOPIA|0|ven packages wake quickly. regu|
相应的建表语句为:
CREATE EXTERNAL TABLE nation_header (
N_NATIONKEY INT,
N_NAME STRING,
N_REGIONKEY INT,
N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl'
TBLPROPERTIES ("skip.header.line.count"="1");
skip.header.line.count的取值x和数据文件的实际行数n有如下关系:
- 当x<=0时,DLA在读取文件时,不会过滤掉任何信息,即全部读取;
- 当0<x<n时,DLA在读取文件时,会过滤掉前x行,从第x+1行开始读取;
- 当x>=n时,DLA在读取文件时,会过滤掉所有的文件内容。
2.2 TSV文件
与CSV文件类似,TSV格式的文件也是纯文本文件,列与列之间的分隔符为Tab。
例如,数据文件oss://bucket-for-testing/oss/text/cities/city.tsv的内容为
Beijing China 010
ShangHai China 021
Tianjin China 022
建表语句可以为
CREATE EXTERNAL TABLE city (
city STRING,
country STRING,
code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities';
2.3 多字符数据字段分割符文件
假设您的数据字段的分隔符包含多个字符,可采用如下示例建表语句,其中每行的数据字段分割符为“||”,可以替换为您具体的分割符字符串。
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties(
"field.delim"="||"
)
示例:
CREATE EXTERNAL TABLE test_csv_multidelimit (
id STRING,
name STRING,
location STRING,
create_date STRING,
create_timestamp STRING,
longitude STRING,
latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties(
"field.delim"="||"
)
STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/oss/text/cities/';
2.4 JSON文件
DLA可以处理的JSON文件通常以纯文本的格式存储,数据文件的编码方式需要是UTF-8。
在JSON文件中,每行必须是一个完整的JSON对象。
例如,下面的文件格式是不被接受的
{
"id": 123, "name": "jack",
"c3": "2001-02-03 12:34:56"}
{
"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{
"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{
"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
需要改写成:
{
"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"}
{
"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{
"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{
"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
不含嵌套的JSON数据
建表语句可以写
CREATE EXTERNAL TABLE t1 (id int, name string, c3 timestamp)
STORED AS JSON
LOCATION 'oss://path/to/t1/directory';
含有嵌套的JSON文件
使用struct和array结构定义嵌套的JSON数据。
例如,用户原始数据(注意:无论是否嵌套,一条完整的JSON数据都只能放在一行上,才能被Data Lake Analytics处理):
{
"DocId": "Alibaba", "User_1": {
"Id": 1234, "Username": "bob1234", "Name": "Bob", "ShippingAddress": {
"Address1": "969 Wenyi West St.", "Address2": null, "City": "Hangzhou", "Province": "Zhejiang" }, "Orders": [{
"ItemId": 6789, "OrderDate": "11/11/2017" }, {
"ItemId": 4352, "OrderDate": "12/12/2017" } ] } }
使用在线JSON格式化工具格式化后,数据内容如下:
{
"DocId": "Alibaba",
"User_1": {
"Id": 1234,
"Username": "bob1234",
"Name": "Bob",
"ShippingAddress": {
"Address1": "969 Wenyi West St.",
"Address2": null,
"City": "Hangzhou",
"Province": "Zhejiang"
},
"Orders": [
{
"ItemId": 6789,
"OrderDate": "11/11/2017"
},
{
"ItemId": 4352,
"OrderDate": "12/12/2017"
}
]
}
}
则建表语句可以写成如下(注意:LOCATION中指定的路径必须是JSON数据文件所在的目录,该目录下的所有JSON文件都能被识别为该表的数据):
CREATE EXTERNAL TABLE json_table_1 (
docid string,
user_1 struct<
id:INT,
username:string,
name:string,
shippingaddress:struct<
address1:string,
address2:string,
city:string,
province:string
>,
orders:array<
struct<
itemid:INT,
orderdate:string
>
>
>
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/';
对该表进行查询:
select * from json_table_1;
+---------+----------------------------------------------------------------------------------------------------------------+
| docid | user_1 |
+---------+----------------------------------------------------------------------------------------------------------------+
| Alibaba | [1234, bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352, 12/12/2017]]] |
+---------+----------------------------------------------------------------------------------------------------------------+
对于struct定义的嵌套结构,可以通过“.”进行层次对象引用,对于array定义的数组结构,可以通过“[数组下标]”(注意:数组下标从1开始)进行对象引用。
select DocId,
User_1.Id,
User_1.ShippingAddress.Address1,
User_1.Orders[1].ItemId
from json_table_1
where User_1.Username = 'bob1234'
and User_1.Orders[2].OrderDate = '12/12/2017';
+---------+------+--------------------+-------+
| DocId | id | address1 | _col3 |
+---------+------+--------------------+-------+
| Alibaba | 1234 | 969 Wenyi West St. | 6789 |
+---------+------+--------------------+-------+
使用JSON函数处理数据
例如,把“value_string”的嵌套JSON值作为字符串存储:
{
"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"张三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}
使用在线JSON格式化工具格式化后,数据内容如下:
{
"data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com",
"ts": 1524550275112,
"value_string": "{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"张三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"
}
建表语句为
CREATE external TABLE json_table_2 (
data_key string,
ts bigint,
value_string string
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/';
表建好后,可进行查询:
select * from json_table_2;
+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| data_key | ts | value_string |
+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {
"appName":"","apps":[],"checksum":"50fa0540b430904ee78dff07c7350e1c","clusterMap":{
"DEFAULT":{
"defCkport":80,"defIPPort":80,"healthCheckTask":null,"healthChecker":{
"checkCode":200,"curlHost":"","curlPath":"/status.taobao","type":"HTTP"},"name":"DEFAULT","nodegroup":"","sitegroup":"","submask":"0.0.0.0/0","syncConfig":{
"appName":"trade-ma","nodegroup":"tradema","pubLevel":"publish","role":"","site":""},"useIPPort4Check":true}},"disabledSites":[],"enableArmoryUnit":false,"enableClientBeat":false,"enableHealthCheck":true,"enabled":true,"envAndSites":"","invalidThreshold":0.6,"ipDeleteTimeout":1800000,"lastModifiedMillis":1524550275107,"localSiteCall":true,"localSiteThreshold":0.8,"name":"biz.alibaba.com","nodegroup":"","owners":["junlan.zx","张三","李四","cui.yuanc"],"protectThreshold":0,"requireSameEnv":false,"resetWeight":false,"symmetricCallType":null,"symmetricType":"warehouse","tagName":"ipGroup","tenantId":"","tenants":[],"token":"1cf0ec0c771321bb4177182757a67fb0","useSpecifiedURL":false} |
+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
下面SQL示例json_parse,json_extract_scalar,json_extract等常用JSON函数的使用方式:
mysql> select json_extract_scalar(json_parse(value), '$.owners[1]') from json_table_2;
+--------+
| _col0 |
+--------+
| 张三 |
+--------+
mysql> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
from (
select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2
) json_obj
where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
+-----------+
| _col0 |
+-----------+
| 0.0.0.0/0 |
+-----------+
mysql> with json_obj as (select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2)
select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
from json_obj
where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
+-----------+
| _col0 |
+-----------+
| 0.0.0.0/0 |
+-----------+
2.5 ORC文件
Optimized Row Columnar(ORC)是Apache开源项目Hive支持的一种优化的列存储文件格式。与CSV文件相比,不仅可以节省存储空间,还可以得到更好的查询性能。
对于ORC文件,只需要在建表时指定 STORED AS ORC。
例如,
CREATE EXTERNAL TABLE orders_orc_date (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE DATE,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING
)
STORED AS ORC
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';
2.6 PARQUET文件
Parquet是Apache开源项目Hadoop支持的一种列存储的文件格式。
使用DLA建表时,需要指定STORED AS PARQUET即可。
例如,
CREATE EXTERNAL TABLE orders_parquet_date (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE DATE,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING
)
STORED AS PARQUET
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';
2.7 RCFILE文件
Record Columnar File (RCFile), 列存储文件,可以有效地将关系型表结构存储在分布式系统中,并且可以被高效地读取和处理。
DLA在建表时,需要指定STORED AS RCFILE。
例如,
CREATE EXTERNAL TABLE lineitem_rcfile_date (
L_ORDERKEY INT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INT,
L_QUANTITY DOUBLE,
L_EXTENDEDPRICE DOUBLE,
L_DISCOUNT DOUBLE,
L_TAX DOUBLE,
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING
)
STORED AS RCFILE
LOCATION 'oss://bucke-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile'
2.8 AVRO文件
DLA针对AVRO文件建表时,需要指定STORED AS AVRO,并且定义的字段需要符合AVRO文件的schema。
如果不确定可以通过使用Avro提供的工具,获得schema,并根据schema建表。
在Apache Avro官网下载avro-tools-.jar到本地,执行下面的命令获得Avro文件的schema:
java -jar avro-tools-1.8.2.jar getschema /path/to/your/doctors.avro
{
"type" : "record",
"name" : "doctors",
"namespace" : "testing.hive.avro.serde",
"fields" : [ {
"name" : "number",
"type" : "int",
"doc" : "Order of playing the role"
}, {
"name" : "first_name",
"type" : "string",
"doc" : "first name of actor playing role"
}, {
"name" : "last_name",
"type" : "string",
"doc" : "last name of actor playing role"
} ]
}
建表语句如下,其中fields中的name对应表中的列名,type需要参考本文档中的表格转成DLA支持的类型
CREATE EXTERNAL TABLE doctors(
number int,
first_name string,
last_name string)
STORED AS AVRO
LOCATION 'oss://mybucket-for-testing/directory/to/doctors';
大多数情况下,Avro的类型可以直接转换成DLA中对应的类型。如果该类型在DLA不支持,则会转换成接近的类型。具体请参照下表:
Avro类型 | 对应DLA类型 |
---|---|
null | void |
boolean | boolean |
int | int |
long | bigint |
float | float |
double | double |
bytes | binary |
string | string |
record | struct |
map | map |
list | array |
union | union |
enum | string |
fixed | binary |
2.9 可以用正则表达式匹配的文件
通常此类型的文件是以纯文本格式存储在OSS上的,每一行代表表中的一条记录,并且每行可以用正则表达式匹配。
例如,Apache WebServer日志文件就是这种类型的文件。
某日志文件的内容为:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"
每行文件可以用下面的正则表达式表示,列之间使用空格分隔:
([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?
针对上面的文件格式,建表语句可以表示为:
CREATE EXTERNAL TABLE serde_regex(
host STRING,
identity STRING,
userName STRING,
time STRING,
request STRING,
status STRING,
size INT,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/datasets/serde/regex';
查询结果
mysql> select * from serde_regex;
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| host | identity | userName | time | request | status | size | referer | agent |
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| 127.0.0.1 | - | frank | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0" | 200 | 2326 | NULL | NULL |
| 127.0.0.1 | - | - | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200 | 5864 | - | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19" |
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
2.10 Esri ArcGIS的地理JSON数据文件
DLA支持Esri ArcGIS的地理JSON数据文件的SerDe处理,关于这种地理JSON数据格式说明,可以参考:https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats
示例:
CREATE EXTERNAL TABLE IF NOT EXISTS california_counties
(
Name string,
BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://test_bucket/datasets/geospatial/california-counties/'
3. 总结
通过以上例子可以看出,DLA可以支持大部分开源存储格式的文件。对于同一份数据,使用不同的存储格式,在OSS中存储文件的大小,DLA的查询分析速度上会有较大的差别。推荐使用ORC格式进行文件的存储和查询。
为了获得更快的查询速度,DLA还在不断的优化中,后续也会支持更多的数据源,为用户带来更好的大数据分析体验。
(本文由 @金络 @君澜 共同完成,有问题可随时联系,欢迎骚扰)。