截止目前,Data Lake Analytics(下面简称 DLA)已经和 OSS 有很多深度的集成,详情可以参见 DLA 的官方文档:DLA 和 OSS 整合。
为了能够提供更好的使用体验,目前可以通过 DLA 自动探测 OSS 上的多种文件格式,并自动生成建表语句,大大减少建表的使用成本。之前我们支持探测的文件格式包括 JSON、AVRO、Parquet, ORC 等,具体使用可以参见 Data Lake Analytics: 基于OSS文件自动推断建表。为了满足更多客户的需求,我们新加了一种文件格式的探测,也就是我们常见的 CSV 格式。本文将介绍如何使用这个功能。
在介绍之前,我们需要先创建一个基于 OSS 的 Database,如下:
CREATE SCHEMA dla_oss_test with DBPROPERTIES(
catalog='oss',
location= 'oss://dla-test/'
);
我们这里用到的 OSS Bucket 名称为 dla-test,大家在具体使用的时候换成自己对于的 OSS Bucket 即可。
通过 DLA 创建基于 OSS 表并探测文件格式的建表语句基本语法为
CREATE EXTERNAL TABLE table_name LIKE MAPPING ('oss://path/to/target/dir')
下面我们通过几个例子来说明这个功能的使用。
例子1
很多时候,我们的 CSV 文件是没有文件头的,针对这种情况,DLA 自动探测能够识别这种情况,并自动生成每列的字段名。比如我们准备以下的 CSV 文件(1.csv):
1,dla,28,beijing
2,spark on dla,30,hangzhou
3,presto,40,shenzhen
4,aliyun,30,shanghai
我们把 1.csv 文件上传到 dla-test Bucket 的 oss://dla-test/dla_csv_oss/dla_csv_noheader 路径中,然后我们的建表语句如下:
CREATE EXTERNAL TABLE dla_oss_test.create_table_like_csv like mapping('oss://dla-test/dla_csv_oss/dla_csv_noheader/');
执行完之后, DLA 结合 OSS 的路径,自动推断出这个路径下的 CSV 文件的格式,以及文件的类型,我们使用 show create table dla_oss_test.create_table_like_csv
语句就可以看到推断出来的建表语句:
show create table dla_oss_test.create_table_like_csv
CREATE EXTERNAL TABLE `dla_oss_test`.`create_table_like_csv` (
`_col1` tinyint,
`_col2` string,
`_col3` tinyint,
`_col4` string
)
STORED AS `TEXTFILE`
LOCATION 'oss://dla-test/dla_csv_oss/dla_csv_noheader'
TBLPROPERTIES (
'create.table.like.file' = 'oss://dla-test/dla_csv_oss/dla_csv_noheader/',
'escape.delim' = '\\',
'field.delim' = ','
)
可以看出,DLA 已经自动为我们推断出 CSV 文件有四列,因为这个 CSV 文件没有头部信息,所以使用默认的列名当做表的字段名称,同时结合数据,推断出每列的数据类型。在 TBLPROPERTIES 里面,推断出 CSV 文件的分隔符以及转义字符等。
现在我们可以查看下这张表的数据:
select * from dla_oss_test.create_table_like_csv
结果如下:
可以看到,已经正确读出我们要的数据。
例子2
如果 CSV 文件有文件头部信息,DLA 也可以自动推断出来,比如我们有如下内容的 CSV 文件(2.csv)
show_id,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,type
81193313,Chocolate,,"Ha Ji-won, Yoon Kye-sang, Jang Seung-jo, Kang Bu-ja, Lee Jae-ryong, Min Jin-woong, Kim Won-hae, Yoo Teo",South Korea,"November 30, 2019",2019,TV-14,1 Season,"International TV Shows, Korean TV Shows, Romantic TV Shows","Brought together by meaningful meals in the past and present, a doctor and a chef are reacquainted when they begin working at a hospice ward.",TV Show
81197050,Guatemala: Heart of the Mayan World,"Luis Ara, Ignacio Jaunsolo",Christian Morales,,"November 30, 2019",2019,TV-G,67 min,"Documentaries, International Movies","From Sierra de las Minas to Esquipulas, explore Guatemala's cultural and geological wealth, including ancient Mayan cities and other natural wonders.",Movie
81213894,The Zoya Factor,Abhishek Sharma,"Sonam Kapoor, Dulquer Salmaan, Sanjay Kapoor, Sikander Kher, Angad Bedi, Koel Purie, Pooja Bhamrah, Manu Rishi Chadha",India,"November 30, 2019",2019,TV-14,135 min,"Comedies, Dramas, International Movies","A goofy copywriter unwittingly convinces the Indian cricket team that she’s their lucky mascot, to the dismay of their superstition-shunning captain.",Movie
81082007,Atlantics,Mati Diop,"Mama Sane, Amadou Mbow, Ibrahima Traore, Nicole Sougou, Amina Kane, Mariama Gassama, Coumba Dieng, Ibrahima Mbaye, Diankou Sembene","France, Senegal, Belgium","November 29, 2019",2019,TV-14,106 min,"Dramas, Independent Movies, International Movies","Arranged to marry a rich man, young Ada is crushed when her true love goes missing at sea during a migration attempt – until a miracle reunites them.",Movie
80213643,Chip and Potato,,"Abigail Oliver, Andrea Libman, Briana Buckmaster, Brian Dobson, Chance Hurstfield, Dominic Good, Emma Jayne Maas, Evan Byarushengo, Scotia Anderson, Alessandro Juliani","Canada, United Kingdom",,2019,TV-Y,2 Seasons,Kids' TV,"Lovable pug Chip starts kindergarten, makes new friends and tries new things – with a little help from Potato, her secret mouse pal.",TV Show
81172754,Crazy people,Moses Inwang,"Ramsey Nouah, Chigul, Sola Sobowale, Ireti Doyle, Ben Touitou, Francis Onwochei, Ememobong Nkana, Emem Inwang, Patrick Onyeke",Nigeria,"November 29, 2019",2018,TV-14,107 min,"Comedies, International Movies, Thrillers",Nollywood star Ramsey Nouah learns that someone is impersonating him and breaks out of a mental institution to expose the imposter.,Movie
81120982,I Lost My Body,Jérémy Clapin,"Hakim Faris, Victoire Du Bois, Patrick d'Assumçao, Dev Patel, Alia Shawkat, George Wendt",France,"November 29, 2019",2019,TV-MA,81 min,"Dramas, Independent Movies, International Movies","Romance, mystery and adventure intertwine as a young man falls in love and a severed hand scours Paris for its owner in this mesmerizing animated film.",Movie
81227195,Kalushi: The Story of Solomon Mahlangu,Mandla Dube,"Thabo Rametsi, Thabo Malema, Welile Nzuza, Jafta Mamabolo, Louw Venter, Pearl Thusi",South Africa,"November 29, 2019",2016,TV-MA,107 min,"Dramas, International Movies","The life and times of iconic South African liberation fighter Solomon Mahlangu, who battled the forces of apartheid, come into focus.",Movie
70205672,La Reina del Sur,,"Kate del Castillo, Cristina Urgel, Alberto Jiménez, Juan José Arjona, Humberto Zurita, Dagoberto Gama, Christian Tappán, Miguel de Miguel, Salvador Zerboni, Carmen Navarro, Santiago Meléndez, Juan Carlos Solarte","United States, Spain, Colombia, Mexico",,2019,TV-14,2 Seasons,"Crime TV Shows, International TV Shows, Spanish-Language TV Shows","This compelling show tells the story of the legendary Teresa Mendoza, a courageous woman who is perceived as conquering the world of drug trafficking.",TV Show
81172841,Lagos Real Fake Life,Mike Ezuruonye,"Nonso Diobi, Mike Ezuruonye, Mercy Aigbe, Rex Nosa, Annie Macaulay Idibia, Ik Ogbonna, Nedu Wazobia, Uzee Usman, Odunlade Adekola, Mr Jollof, Efe Irele, Josh 2 Funny, Haillie Sumney, Emmanuella, MC Lively",,"November 29, 2019",2018,TV-14,118 min,"Comedies, International Movies","Two mooching friends vie for the attention of wealthy, beautiful women only to discover that their lavish lifestyles are bogus.",Movie
81172899,Payday,Cheta Chukwu,"Baaj Adebule, Ebiye Victor, Meg Otanwa, Bisola Aiyeola, Mawuli Gavor, Zack Orji, Segilola Ogidan",Nigeria,"November 29, 2019",2018,TV-MA,110 min,"Comedies, Independent Movies, International Movies","After an expensive night out, two flatmates get tangled in an overnight misadventure to recover their rent money to pay their dead landlord's daughter.",Movie
我们把这个文件上传到 OSS 的 oss://dla-test/dla_csv_oss/dla_csv_header/ 路径下,然后建表语句如下:
CREATE EXTERNAL TABLE dla_oss_test.create_table_like_header like mapping('oss://dla-test/dla_csv_oss/dla_csv_header/');
执行这个语句的时候,DLA 能够自动推断出文件的 header 信息,以及类型。执行完这个语句之后,DLA 给我们推断出来的建表语句如下:
CREATE EXTERNAL TABLE `dla_oss_test`.`create_table_like_header` (
`show_id` string COMMENT 'from deserializer',
`title` string COMMENT 'from deserializer',
`director` string COMMENT 'from deserializer',
`cast` string COMMENT 'from deserializer',
`country` string COMMENT 'from deserializer',
`date_added` string COMMENT 'from deserializer',
`release_year` string COMMENT 'from deserializer',
`rating` string COMMENT 'from deserializer',
`duration` string COMMENT 'from deserializer',
`listed_in` string COMMENT 'from deserializer',
`description` string COMMENT 'from deserializer',
`type` string COMMENT 'from deserializer'
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://dla-test/dla_csv_oss/dla_csv_header'
TBLPROPERTIES (
'create.table.like.file' = 'oss://dla-test/dla_csv_oss/dla_csv_header/',
'escapeChar' = '"',
'quoteChar' = '"',
'separatorChar' = ',',
'skip.header.line.count' = '1'
)
可以看出,表的列名使用 CSV 文件中的 Header 信息,我们可以查看一下这张表的数据:
小结
DLA 自动推断 OSS 表上的常见文件格式可以大大减少我们的使用成本,在建表之前,我们不需要事先去找业务方去了解文件的格式,字段名等等,DLA 自动为我们做了这些事。
当然,目前自动推断支持的文件格式比较少,后期我们会根据具体情况,会增加更多的文件格式。
Data Lake Analytics简介
Data Lake Analytics(简称DLA)是阿里云上一款Serverless化的低成本大数据处理和分析服务。主打高性价比、打通和联合云上数据生态,以及高易用性的BI兼容能力。
- 高性价比:世界领先的并行SQL计算引擎能力,为用户提供千亿级数据的交互式分析能力和万亿级数据的批量处理能力,并按量付费,零资源维护和保有成本;
- 云上数据生态:提供阿里云上覆盖最广泛的数据联邦、融合分析、处理、回流能力,用统一的数据视角盘活、打通OSS、Table Store、日志服务数据湖投递、DBS数据库的数据湖备份、关系型数据库和NoSQL数据库的分析、处理生态;
- 易用的BI兼容能力:以MySQL用户接口的形态兼容几乎所有具备MySQL连接能力的BI套件和可视化分析工具,让用户轻松点击即可分析、洞见云数据价值。
- 产品链接:https://www.aliyun.com/product/datalakeanalytics
- 新客户1元试用:https://common-buy.aliyun.com/?commodityCode=openanalytics_post
欢迎关注数据湖技术社区
数据湖开发者社区由 阿里云开发者社区 与 阿里云Data Lake Analytics团队 共同发起,致力于推广数据湖相关技术,包括hudi、delta、spark、presto、oss、元数据、存储加速、格式发现等,学习如何构建数据湖分析系统,打造适合业务的数据架构。