Querying OSS Data with Apache Impala (CDH6)

Summary: The latest CDH release is currently 6.0.1, which is based on Hadoop 3.0.0. This article describes how to enable the CDH6 components (Hadoop/Hive/Spark/Impala, etc.) to read from and write to OSS. OSS support for CDH5 is covered in a separate article (see the references).

CDH6

The latest CDH release is currently 6.0.1, which is based on Hadoop 3.0.0. This article describes how to enable the CDH6 components (Hadoop/Hive/Spark/Impala, etc.) to read from and write to OSS. OSS support for CDH5 is covered in a separate article (see the references).

Enabling CDH6 to read and write OSS

Setting up a CDH cluster

First, set up a CDH6 cluster by following the official documentation.

Adding the OSS configuration

Add the configuration through Cloudera Manager (CM). For clusters not managed by CM, the same result can be achieved by editing core-site.xml directly; a core-site.xml sketch is shown after the table below. Taking CM as the example, add the following configuration:

Configuration item               | Value                                               | Description
fs.oss.endpoint                  | e.g. oss-cn-zhangjiakou-internal.aliyuncs.com       | The OSS endpoint to connect to
fs.oss.accessKeyId               | (your AccessKey ID)                                 | Access key ID
fs.oss.accessKeySecret           | (your AccessKey secret)                             | Access key secret
fs.oss.impl                      | org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem | The Hadoop FileSystem implementation class for OSS; currently fixed to this value
fs.oss.buffer.dir                | /tmp/oss                                            | Directory for temporary buffer files
fs.oss.connection.secure.enabled | false                                               | Whether to enable HTTPS; set as needed (enabling HTTPS hurts performance)
fs.oss.connection.maximum        | 2048                                                | Maximum number of connections to OSS; set as needed

Explanations of these parameters can be found in the hadoop-aliyun documentation (see the references).
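
For clusters without CM, the same settings go into core-site.xml on every node. A minimal sketch, using the property names from the table above; the values in braces are placeholders you must replace:

<!-- core-site.xml: OSS access for Hadoop -->
<property>
  <name>fs.oss.endpoint</name>
  <value>oss-cn-zhangjiakou-internal.aliyuncs.com</value>
</property>
<property>
  <name>fs.oss.accessKeyId</name>
  <value>{your-access-key-id}</value>
</property>
<property>
  <name>fs.oss.accessKeySecret</name>
  <value>{your-access-key-secret}</value>
</property>
<property>
  <name>fs.oss.impl</name>
  <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
<!-- fs.oss.buffer.dir, fs.oss.connection.secure.enabled and
     fs.oss.connection.maximum follow the same pattern -->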

Restart the cluster and verify OSS reads and writes

After adding the configuration, restart the cluster as prompted by CM. Once it is back up, you can test:

# test write
hadoop fs -mkdir oss://{your-bucket-name}/hadoop-test
# test read
hadoop fs -ls oss://{your-bucket-name}/
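
You can also round-trip a small file to confirm the full data path works; a sketch, using /etc/hosts only as a convenient local file:

# upload a local file, then read its contents back from OSS
hadoop fs -put /etc/hosts oss://{your-bucket-name}/hadoop-test/
hadoop fs -cat oss://{your-bucket-name}/hadoop-test/hosts
# remove the test directory when done
hadoop fs -rm -r oss://{your-bucket-name}/hadoop-test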

Querying OSS data with Apache Impala

For an introduction to Apache Impala, see the official documentation. This section covers querying OSS from Apache Impala and running the TPC-DS queries against it.

Although CDH6 supports OSS natively, its Impala component does not include the OSS support jars on its CLASSPATH by default, so we have to add them ourselves. The following steps must be performed on every Impala node. Go to the $CDH_HOME/lib/impala directory and run:

[root@cdh-master impala]# cd lib/
[root@cdh-master lib]# ln -s ../../../jars/hadoop-aliyun-3.0.0-cdh6.0.1.jar hadoop-aliyun.jar
[root@cdh-master lib]# ln -s ../../../jars/aliyun-sdk-oss-2.8.3.jar aliyun-sdk-oss-2.8.3.jar
[root@cdh-master lib]# ln -s ../../../jars/jdom-1.1.jar jdom-1.1.jar
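
With many nodes, the symlinks can be created over ssh instead of logging in to each machine. A sketch; the host names are hypothetical placeholders for your Impala nodes, and it assumes a parcel installation where $CDH_HOME is /opt/cloudera/parcels/CDH on every node (substitute your real path otherwise):

for host in cdh-slave01 cdh-slave02 cdh-slave03; do  # hypothetical host list
  # create the three symlinks on each Impala node; -f makes the step idempotent
  ssh root@$host 'cd /opt/cloudera/parcels/CDH/lib/impala/lib &&
    ln -sf ../../../jars/hadoop-aliyun-3.0.0-cdh6.0.1.jar hadoop-aliyun.jar &&
    ln -sf ../../../jars/aliyun-sdk-oss-2.8.3.jar aliyun-sdk-oss-2.8.3.jar &&
    ln -sf ../../../jars/jdom-1.1.jar jdom-1.1.jar'
done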

Next, go to the $CDH_HOME/bin directory and edit the three files impalad, statestored, and catalogd, adding the following line just before the exec command on the last line of each file:

export CLASSPATH=$CLASSPATH:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-2.8.3.jar:${IMPALA_HOME}/lib/jdom-1.1.jar
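
The same edit can be scripted; a sketch using GNU sed, assuming the final command of each script is a line starting with "exec" (check first, and keep backups):

cd $CDH_HOME/bin
for f in impalad statestored catalogd; do
  cp $f $f.bak  # keep a backup before editing
  # insert the export line immediately before the line beginning with "exec"
  sed -i '/^exec /i export CLASSPATH=$CLASSPATH:${IMPALA_HOME}/lib/hadoop-aliyun.jar:${IMPALA_HOME}/lib/aliyun-sdk-oss-2.8.3.jar:${IMPALA_HOME}/lib/jdom-1.1.jar' $f
done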

Restart the Impala processes on all nodes; after that, Impala can query data stored on OSS.

Running the TPC-DS queries

The TPC-DS queries we use are from https://github.com/hortonworks/hive-testbench/tree/hive14.
The benchmark targets Hive, but since Hive QL is largely compatible with Impala SQL, it works well for this experiment (a few of the queries cannot run because of compatibility issues).

Run the following commands to generate the sample data into OSS:

[root@cdh-master ~]# git clone https://github.com/hortonworks/hive-testbench.git
[root@cdh-master ~]# cd hive-testbench
[root@cdh-master hive-testbench]# git checkout hive14
[root@cdh-master hive-testbench]# ./tpcds-build.sh
[root@cdh-master hive-testbench]# FORMAT=textfile ./tpcds-setup.sh 50 oss://{your-bucket-name}/
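
When generation completes, it is worth checking that the data landed where the DDL below expects it; the 50 in the path is the scale factor passed to tpcds-setup.sh:

# list the generated table directories on OSS
hadoop fs -ls oss://{your-bucket-name}/50/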

The benchmark's CREATE TABLE statements are compatible with Apache Impala's DDL: copy the statements from ddl-tpcds/text/alltables.sql and substitute ${LOCATION}. For example:

[root@cdh-master hive-testbench]# impala-shell -i cdh-slave01 -d default
Starting Impala Shell without Kerberos authentication
Connected to cdh-slave01:21000
Server version: impalad version 3.0.0-cdh6.0.1 RELEASE (build 9a74a5053de5f7b8dd983802e6d75e58d31472db)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v3.0.0-cdh6.0.1 (9a74a50) built on Wed Sep 19 11:27:37 PDT 2018)

Want to know what version of Impala you're connected to? Run the VERSION command to
find out!
***********************************************************************************
Query: use `default`
[cdh-slave01:21000] default> create external table call_center(
                           >       cc_call_center_sk         bigint
                           > ,     cc_call_center_id         string
                           > ,     cc_rec_start_date        string
                           > ,     cc_rec_end_date          string
                           > ,     cc_closed_date_sk         bigint
                           > ,     cc_open_date_sk           bigint
                           > ,     cc_name                   string
                           > ,     cc_class                  string
                           > ,     cc_employees              int
                           > ,     cc_sq_ft                  int
                           > ,     cc_hours                  string
                           > ,     cc_manager                string
                           > ,     cc_mkt_id                 int
                           > ,     cc_mkt_class              string
                           > ,     cc_mkt_desc               string
                           > ,     cc_market_manager         string
                           > ,     cc_division               int
                           > ,     cc_division_name          string
                           > ,     cc_company                int
                           > ,     cc_company_name           string
                           > ,     cc_street_number          string
                           > ,     cc_street_name            string
                           > ,     cc_street_type            string
                           > ,     cc_suite_number           string
                           > ,     cc_city                   string
                           > ,     cc_county                 string
                           > ,     cc_state                  string
                           > ,     cc_zip                    string
                           > ,     cc_country                string
                           > ,     cc_gmt_offset             double
                           > ,     cc_tax_percentage         double
                           > )
                           > row format delimited fields terminated by '|'
                           > location 'oss://{your-bucket-name}/50/call_center';
Query: create external table call_center(
      cc_call_center_sk         bigint
,     cc_call_center_id         string
,     cc_rec_start_date        string
,     cc_rec_end_date          string
,     cc_closed_date_sk         bigint
,     cc_open_date_sk           bigint
,     cc_name                   string
,     cc_class                  string
,     cc_employees              int
,     cc_sq_ft                  int
,     cc_hours                  string
,     cc_manager                string
,     cc_mkt_id                 int
,     cc_mkt_class              string
,     cc_mkt_desc               string
,     cc_market_manager         string
,     cc_division               int
,     cc_division_name          string
,     cc_company                int
,     cc_company_name           string
,     cc_street_number          string
,     cc_street_name            string
,     cc_street_type            string
,     cc_suite_number           string
,     cc_city                   string
,     cc_county                 string
,     cc_state                  string
,     cc_zip                    string
,     cc_country                string
,     cc_gmt_offset             double
,     cc_tax_percentage         double
)
row format delimited fields terminated by '|'
location 'oss://{your-bucket-name}/50/call_center'
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 4.10s

Note: replace {your-bucket-name} in the DDL with your own bucket name. Rather than pasting each statement by hand, you can rewrite ${LOCATION} for the whole file at once, as sketched below.
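
A sketch that substitutes the location in alltables.sql and feeds the result to impala-shell in one pass (the target path mirrors the table locations shown above):

# rewrite ${LOCATION} and create all tables in one go
sed 's|${LOCATION}|oss://{your-bucket-name}/50|g' ddl-tpcds/text/alltables.sql > /tmp/alltables-oss.sql
impala-shell -i cdh-slave01 -d default -f /tmp/alltables-oss.sql
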
Once all the tables are created, we can list them:

[cdh-slave01:21000] default> show tables;
Query: show tables
+------------------------+
| name                   |
+------------------------+
| call_center            |
| catalog_page           |
| catalog_returns        |
| catalog_sales          |
| customer               |
| customer_address       |
| customer_demographics  |
| date_dim               |
| household_demographics |
| income_band            |
| inventory              |
| item                   |
| promotion              |
| reason                 |
| ship_mode              |
| store                  |
| store_returns          |
| store_sales            |
| time_dim               |
| warehouse              |
| web_page               |
| web_returns            |
| web_sales              |
| web_site               |
+------------------------+
Fetched 24 row(s) in 0.03s

Now we can run the TPC-DS queries on Impala. The query files live in the sample-queries-tpcds directory:

[root@cdh-master hive-testbench]# ls sample-queries-tpcds
query12.sql  query20.sql  query27.sql  query39.sql  query46.sql  query54.sql  query64.sql  query71.sql  query7.sql   query87.sql  query93.sql  README.md
query13.sql  query21.sql  query28.sql  query3.sql   query48.sql  query55.sql  query65.sql  query72.sql  query80.sql  query88.sql  query94.sql  testbench.settings
query15.sql  query22.sql  query29.sql  query40.sql  query49.sql  query56.sql  query66.sql  query73.sql  query82.sql  query89.sql  query95.sql  testbench-withATS.settings
query17.sql  query24.sql  query31.sql  query42.sql  query50.sql  query58.sql  query67.sql  query75.sql  query83.sql  query90.sql  query96.sql
query18.sql  query25.sql  query32.sql  query43.sql  query51.sql  query60.sql  query68.sql  query76.sql  query84.sql  query91.sql  query97.sql
query19.sql  query26.sql  query34.sql  query45.sql  query52.sql  query63.sql  query70.sql  query79.sql  query85.sql  query92.sql  query98.sql
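
To run the whole suite and keep only each query's timing summary, a loop like the following can be used (a sketch; the last line of impala-shell's output is the "Fetched N row(s) in Xs" summary):

# run every query and print just the final timing line
for q in sample-queries-tpcds/query*.sql; do
  echo "== $q =="
  impala-shell -i cdh-slave01 -d default -f "$q" 2>&1 | tail -n 1
done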

As an example, let's run query13.sql:

[root@cdh-master hive-testbench]# impala-shell -i cdh-slave01 -d default -f sample-queries-tpcds/query13.sql
Starting Impala Shell without Kerberos authentication
Connected to cdh-slave01:21000
Server version: impalad version 3.0.0-cdh6.0.1 RELEASE (build 9a74a5053de5f7b8dd983802e6d75e58d31472db)
Query: use `default`
Query: select avg(ss_quantity)
       ,avg(ss_ext_sales_price)
       ,avg(ss_ext_wholesale_cost)
       ,sum(ss_ext_wholesale_cost)
 from store_sales
     ,store
     ,customer_demographics
     ,household_demographics
     ,customer_address
     ,date_dim
 where store.s_store_sk = store_sales.ss_store_sk
 and  store_sales.ss_sold_date_sk = date_dim.d_date_sk and date_dim.d_year = 2001
 and((store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk
  and customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
  and customer_demographics.cd_marital_status = 'M'
  and customer_demographics.cd_education_status = '4 yr Degree'
  and store_sales.ss_sales_price between 100.00 and 150.00
  and household_demographics.hd_dep_count = 3
     )or
     (store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk
  and customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk
  and customer_demographics.cd_marital_status = 'D'
  and customer_demographics.cd_education_status = 'Primary'
  and store_sales.ss_sales_price between 50.00 and 100.00
  and household_demographics.hd_dep_count = 1
     ) or
     (store_sales.ss_hdemo_sk=household_demographics.hd_demo_sk
  and customer_demographics.cd_demo_sk = ss_cdemo_sk
  and customer_demographics.cd_marital_status = 'U'
  and customer_demographics.cd_education_status = 'Advanced Degree'
  and store_sales.ss_sales_price between 150.00 and 200.00
  and household_demographics.hd_dep_count = 1
     ))
 and((store_sales.ss_addr_sk = customer_address.ca_address_sk
  and customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('KY', 'GA', 'NM')
  and store_sales.ss_net_profit between 100 and 200
     ) or
     (store_sales.ss_addr_sk = customer_address.ca_address_sk
  and customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('MT', 'OR', 'IN')
  and store_sales.ss_net_profit between 150 and 300
     ) or
     (store_sales.ss_addr_sk = customer_address.ca_address_sk
  and customer_address.ca_country = 'United States'
  and customer_address.ca_state in ('WI', 'MO', 'WV')
  and store_sales.ss_net_profit between 50 and 250
     ))
Query submitted at: 2018-10-30 11:44:47 (Coordinator: http://cdh-slave01:25000)
Query progress can be monitored at: http://cdh-slave01:25000/query_plan?query_id=ff4b3157eddfc3c4:8a31c6500000000
+-------------------+-------------------------+----------------------------+----------------------------+
| avg(ss_quantity)  | avg(ss_ext_sales_price) | avg(ss_ext_wholesale_cost) | sum(ss_ext_wholesale_cost) |
+-------------------+-------------------------+----------------------------+----------------------------+
| 30.87106918238994 | 2352.642327044025       | 2162.600911949685          | 687707.09                  |
+-------------------+-------------------------+----------------------------+----------------------------+
Fetched 1 row(s) in 353.16s

This query involves six tables (store, store_sales, customer_demographics, household_demographics, customer_address, and date_dim). Let's look at the stats of each:

[cdh-slave01:21000] default> show table STATS store;
Query: show table STATS store
+-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+
| #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                  |
+-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+
| -1    | 1      | 37.56KB | NOT CACHED   | NOT CACHED        | TEXT   | false             | oss://{your-bucket-name}/50/store |
+-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------+
Fetched 1 row(s) in 0.01s
[cdh-slave01:21000] default> show table STATS store_sales;
Query: show table STATS store_sales
+-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+
| #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                        |
+-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+
| -1    | 50     | 18.75GB | NOT CACHED   | NOT CACHED        | TEXT   | false             | oss://{your-bucket-name}/50/store_sales |
+-------+--------+---------+--------------+-------------------+--------+-------------------+-------------------------------------------------+
Fetched 1 row(s) in 0.01s
[cdh-slave01:21000] default> show table STATS customer_demographics;
Query: show table STATS customer_demographics
+-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
| #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                                  |
+-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
| -1    | 50     | 76.92MB | NOT CACHED   | NOT CACHED        | TEXT   | false             | oss://{your-bucket-name}/50/customer_demographics |
+-------+--------+---------+--------------+-------------------+--------+-------------------+-----------------------------------------------------------+
Fetched 1 row(s) in 0.01s
[cdh-slave01:21000] default> show table STATS household_demographics;
Query: show table STATS household_demographics
+-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+
| #Rows | #Files | Size     | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                                   |
+-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+
| -1    | 1      | 148.10KB | NOT CACHED   | NOT CACHED        | TEXT   | false             | oss://{your-bucket-name}/50/household_demographics |
+-------+--------+----------+--------------+-------------------+--------+-------------------+------------------------------------------------------------+
Fetched 1 row(s) in 0.01s
[cdh-slave01:21000] default> show table STATS customer_address;
Query: show table STATS customer_address
+-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+
| #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                             |
+-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+
| -1    | 1      | 40.54MB | NOT CACHED   | NOT CACHED        | TEXT   | false             | oss://{your-bucket-name}/50/customer_address |
+-------+--------+---------+--------------+-------------------+--------+-------------------+------------------------------------------------------+
Fetched 1 row(s) in 0.01s
[cdh-slave01:21000] default> show table STATS date_dim;
Query: show table STATS date_dim
+-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                     |
+-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+
| -1    | 1      | 9.84MB | NOT CACHED   | NOT CACHED        | TEXT   | false             | oss://{your-bucket-name}/50/date_dim |
+-------+--------+--------+--------------+-------------------+--------+-------------------+----------------------------------------------+
Fetched 1 row(s) in 0.01s
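
Note that #Rows is -1 for every table: table statistics have not been computed, which deprives Impala's planner of row-count information for join ordering. For serious benchmarking it is worth computing stats first. COMPUTE STATS is a standard Impala statement; a sketch that runs it over the six tables above:

# compute table and column statistics for the tables used by query13
for t in store store_sales customer_demographics household_demographics customer_address date_dim; do
  impala-shell -i cdh-slave01 -d default -q "compute stats $t"
done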

References

https://yq.aliyun.com/articles/658473?spm=a2c4e.11155435.0.0.2f8b33125xbe9H
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aliyun/src/site/markdown/tools/hadoop-aliyun/index.md
https://yq.aliyun.com/articles/658471?spm=a2c4e.11155435.0.0.2f8b33125xbe9H
https://yq.aliyun.com/articles/292792?spm=a2c4e.11155435.0.0.7ccba82fbDwfhK
