▐ DQL
- SELECT语句
1. SELECT语法
[with <cte>[, ...] ]SELECT [all | distinct] <SELECT_expr>[, <except_expr>][, <replace_expr>] ... from <table_reference> [where <where_condition>] [group by {<col_list>|rollup(<col_list>)}] [having <having_condition>] [window <window_clause>] [order by <order_condition>] [distribute by <distribute_condition> [sort by <sort_condition>]|[ cluster by <cluster_condition>] ] [limit <number>]
下面将介绍SELECT命令格式及如何实现嵌套查询、分组查询、排序等操作。
2. SELECT语序
--语法顺序from <table_reference>[where <where_condition>][group by <col_list>][having <having_condition>][window <window_name> AS (<window_definition>)][qualify <expression>]select [all | distinct] <select_expr>, <select_expr>, ...[order by <order_condition>][distribute by <distribute_condition> [sort by <sort_condition>] ][limit <number>]
场景1:from->where->group by->having->select->order by->limit
场景2:from->where->select->distribute by->sort by
3. WITH子句
with A as (SELECT 1 as C),B as (SELECT * from A) SELECT * from B;
在同一WITH子句中的CTE必须具有唯一的名字。
在WITH子句中定义的CTE仅对在同一WITH子句中的其他CTE可以使用。
4. 列表达式
------------------------------------------------------------------------------例子 --读取表xxxx_detail的列shop_nameSELECT xxxx_name from xxxx_detail;--查询表xxxx_detail中region列数据,如果有重复值时仅显示一条。SELECT distinct region from xxxx_detail;--选出xxxx_detail表中列名不为xxxx_name的所有列SELECT `(xxxx_name)?+.+` from xxxx_detail;--去重多列时,distinct的作用域是SELECT的列集合,不是单个列。SELECT distinct region, xxxx_date from xxxx_detail;
用列名指定要读取的列。
用星号(*)代表查询所有的列。
可以使用正则表达式。
在选取的列名前可以使用distinct去掉重复字段,只返回去重后的值。
5. 排除列
--读取xxxx_detail表的数据,并排除region列的数据。------------------------------------------------------------------------------例子SELECT * except(region) from xxxx_detail;
当希望读取表内大多数列的数据,同时要排除表中少数列的数据时。
表示读取表数据时会排除指定列(col1、col2)的数据。
6. WHERE
--配合关系运算符,筛选满足指定条件的数据。关系运算符包含: >、<、=、>=、<=、<> like、rlike in、not in between…and------------------------------------------------------------------------------例子SELECT * from xxxx_detailwhere xxxx_date >= '2008' and xxxx_date <= '2014';--等价于如下语句。SELECT * from xxxx_detail where xxxx_date between '2008' and '2014';
where子句为过滤条件。如果表是分区表,可以实现列裁剪。
7. GROUP BY
------------------------------------------------------------------------------例子 --直接使用输入表列名region作为group by的列,即以region值分组SELECT region from xxxx_detail group by region; --以region值分组,返回每一组的销售额总量。SELECT sum(xxxx_price) from xxxx_detail group by region; --以region值分组,返回每一组的region值(组内唯一)及销售额总量。SELECT region, sum (xxxx_price) from xxxx_detail group by region;
group by操作优先级高于SELECT操作,因此group by的取值是SELECT输入表的列名或由输入表的列构成的表达式。需要注意的是:
group by取值为正则表达式时,必须使用列的完整表达式。
SELECT语句中没有使用聚合函数的列必须出现在GROUP BY中。
8. HAVING
------------------------------------------------------------------------------例子--为直观展示数据呈现效果,向sale_detail表中追加数据。insert into sale_detail partition (sale_date='2014', region='shanghai') values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);--使用having子句配合聚合函数实现过滤。SELECT region,sum(total_price) from sale_detail group by region having sum(total_price)<305;
通常HAVING子句与聚合函数一起使用,实现过滤。
9. ORDER BY
------------------------------------------------------------------------------例子 --查询表xxxx_detail的信息,并按照xxxx_price升序排列前2条。SELECT * from xxxx_detail order by xxxx_price limit 2; --将表xxx_detail按照xxxx_price升序排序后,输出从第3行开始的3行数据。SELECT xxxx_id,xxxx_price from xxxx_detail order by xxxx_price limit 3 offset 2;
默认对数据进行升序排序,如果降序排序,需要使用desc关键字。
order by默认要求带limit数据行数限制,没有limit会返回报错。
10. DISTRIBUTE BY哈希分片
------------------------------------------------------------------------------例子 --查询表xxxx_detail中的列region值并按照region值进行哈希分片。SELECT region from xxxx_detail distribute by region;--等价于如下语句。SELECT region as r from xxxx_detail distribute by region;SELECT region as r from xxxx_detail distribute by r;
distribute by控制Map(读数据)的输出在Reducer中是如何划分的,如果不希望Reducer的内容存在重叠,或需要对同一分组的数据一起处理,可以使用distribute by来保证同组数据分发到同一个Reducer中。
11. SORT BY局部排序
------------------------------------------------------------------------------例子 --查询表xxxx_detail中的列region和xxxx_price的值并按照region值进行哈希分片,然后按照xxxx_price对哈希分片结果进行局部升序排序。SELECT region,xxxx_price from xxxx_detail distribute by region sort by xxxx_price; --查询表xxxx_detail中的列region和xxxx_price的值并按照region值进行哈希分片,然后按照xxxx_price对哈希分片结果进行局部降序排序。SELECT region,xxxx_price from xxxx_detail distribute by region sort by xxxx_price desc; --如果sort by语句前没有distribute by,sort by会对每个Reduce中的数据进行局部排序。SELECT region,xxxx_price from xxxx_detail sort by xxxx_price desc;
sort by默认对数据进行升序排序,如果降序排序,需要使用desc关键字。
如果sort by语句前有distribute by,sort by会对distribute by的结果按照指定的列进行排序。
12. LIMIT限制输出行数
SELECT * FROM xxxxx.xxxx_xxxx_xxxxWHERE ds = 20240520LIMIT 100;
limit 中的number是常数,用于限制输出行数,取值范围为int32位取值范围。
- 子查询
1. 基础子查询
--格式1select <select_expr> from (<select_statement>) [<sq_alias_name>]; --格式2select (<select_statement>) from <table_name>;
普通查询操作的对象是目标表,但是查询的对象也可以是另一个select语句,这种查询为子查询。在from子句中,子查询可以被当作一张表,与其他表或子查询进行join操作。
2. IN SUBQUERY
--in subquery与left semi join用法类似--格式一select<select_expr1>from<table_name1>where<select_expr2> in(select<select_expr3>from<table_name2>);--等效于leftsemijoin如下语句。select<select_expr1>from<table_name1><alias_name1>leftsemijoin<table_name2><alias_name2> on<alias_name1>.<select_expr2>=<alias_name2>.<select_expr3>; --格式二select<select_expr1>from<table_name1>where<select_expr2> in(select<select_expr3>from<table_name2>where <table_name1>.<col_name>=<table_name2>.<col_name>); ------------------------------------------------------------------------------例子set odps.sql.allow.fullscan=true;select * from xxxx_detail where xxxx_price in (select xxxx_price from shop); set odps.sql.allow.fullscan=true;select * from xxxx_detail where xxxx_price in (select xxxx_price from shop where xxxx_id = shop.xxxx_id);
select_expr1:必填。格式为col1_name, col2_name, 正则表达式,...,表示待查询的普通列、分区列或正则表达式。
table_name1、table_name2:必填。表的名称。
select_expr2、select_expr3:必填。表示table_name1和table_name2互相映射的列名。
col_name:必填。表的列名。
3. NOT IN SUBQUERY
--如果查询目标表的指定列名中有任意一行为NULL,则not in表达式值为NULL,导致where条件不成立,无数据返回select <select_expr1> from <table_name1> where <select_expr2> not in (select <select_expr2> from <table_name2>);--等效于left anti join如下语句。select <select_expr1> from <table_name1> <alias_name1> left anti join <table_name2> <alias_name2> on <alias_name1>.<select_expr1> = <alias_name2>.<select_expr2>; ------------------------------------------------------------------------------例子--创建一张新表shop1并追加数据。create table shop1 as select xxxx_name,xxxx_id,xxxx_price from xxxx_detail;insert into shop1 values ('s8','c1',100.1);select * from shop1 where xxxx_name not in (select xxxx_name from xxxx_detail); set odps.sql.allow.fullscan=true;select * from shop1 where xxxx_name not in (select xxxx_name from xxxx_detail where xxxx_id = shop1.xxxx_id);
select_expr1:必填。格式为col1_name, col2_name, 正则表达式,...,表示待查询的普通列、分区列或正则表达式。
table_name1、table_name2:必填。表的名称。
select_expr2、select_expr3:必填。表示table_name1和table_name2互相映射的列名。
col_name:必填。表的列名。
4. EXISTS SUBQUERY
--使用exists subquery时,当子查询中有至少一行数据时,返回True,否则返回False。select <select_expr> from <table_name1> where exists (select <select_expr> from <table_name2> where <table_name2_colname> = <table_name1>.<colname> ); ------------------------------------------------------------------------------例子set odps.sql.allow.fullscan=true;select * from xxxx_detail where exists (select * from shop where customer_id = xxxx_detail.xxxx_id);--等效于以下语句。select * from xxxx_detail a left semi join shop b on a.xxxx_id = b.xxxx_id;
select_expr:必填。格式为col1_name, col2_name, 正则表达式,...,表示待查询的普通列、分区列或正则表达式。
table_name1、table_name2:必填。表的名称。
col_name:必填。表的列名。
5. NOT EXISTS SUBQUERY
--当子查询中无数据时,返回True,否则返回Falseselect <select_expr> from <table_name1> where not exists (select <select_expr> from <table_name2> where <table_name2_colname> = <table_name1>.<colname>); ------------------------------------------------------------------------------例子set odps.sql.allow.fullscan=true;select * from xxxx_detail where not exists (select * from shop where xxxx_name = xxxx_detail.xxxx_name);--等效于以下语句。select * from xxxx_detail a left anti join shop b on a.shop_name = b.xxxx_name;
select_expr:必填。格式为col1_name, col2_name, 正则表达式,...,表示待查询的普通列、分区列或正则表达式。
table_name1、table_name2:必填。表的名称。
col_name:必填。表的列名。
6. SCALAR SUBQUERY
--当子查询的输出结果为单行单列时,可以做为标量使用,即可以参与标量运算。select <select_expr> from <table_name1> where (<select count(*) from <table_name2> where <table_name2_colname> = <table_name1>.<colname>) <标量运算符> <scalar_value>; ------------------------------------------------------------------------------例子set odps.sql.allow.fullscan=true;select * from shop where (select count(*) from xxxx_detail where xxxx_detail.xxxx_name = shop.xxxx_name) >= 1;
select_expr:必填。格式为col1_name, col2_name, 正则表达式。
table_name1、table_name2:必填。表的名称。
col_name:必填。表的列名。
标量运算符:必填。例如大于(>)、小于(<)、等于(=)。
scalar_value:必填。标量值
- 交集,并集和补集
1. 交集
--取交集不去重。<select_statement1> intersect all <select_statement2>;--取交集并去重。intersect效果等同于intersect distinct。<select_statement1> intersect [distinct] <select_statement2>; ------------------------------------------------------------------------------例子select * from values (1, 2), (1, 2), (3, 4), (5, 6) t(a, b) intersect all select * from values (1, 2), (1, 2), (3, 4), (5, 7) t(a, b);--结果+------------+------------+| a | b |+------------+------------+| 1 | 2 || 3 | 4 |+------------+------------+
select_statement1、select_statement2:必填。
distinct:可选。对两个数据集取交集的结果去重。
2. 并集
--取并集不去重。<select_statement1> union all <select_statement2>;--取并集并去重。<select_statement1> union [distinct] <select_statement2>; ------------------------------------------------------------------------------例子select * from values (1, 2), (1, 2), (3, 4) t(a, b) union all select * from values (1, 2), (1, 4) t(a, b);--结果+------------+------------+| a | b |+------------+------------+| 1 | 2 || 1 | 2 || 3 | 4 || 1 | 2 || 1 | 4 |+------------+------------+
select_statement1、select_statement2:必填。select语句。
distinct:可选。对两个数据集取并集的结果去重。
3. 补集
--取补集不去重。<select_statement1> except all <select_statement2>;<select_statement1> minus all <select_statement2>;--取补集并去重。<select_statement1> except [distinct] <select_statement2>;<select_statement1> minus [distinct] <select_statement2>; ------------------------------------------------------------------------------例子select * from values (1, 2), (1, 2), (3, 4), (3, 4), (5, 6), (7, 8) t(a, b)except all select * from values (3, 4), (5, 6), (5, 6), (9, 10) t(a, b);--结果+------------+------------+| a | b |+------------+------------+| 1 | 2 || 1 | 2 || 3 | 4 || 7 | 8 |+------------+------------+
select_statement1、select_statement2:必填。select语句。
distinct:可选。对取补集的结果去重。