ODPS开发大全：入门篇（2）-阿里云开发者社区

▐ DQL

SELECT语句

1. SELECT语法

[with <cte>[, ...] ]SELECT [all | distinct] <SELECT_expr>[, <except_expr>][, <replace_expr>] ...       from <table_reference>       [where <where_condition>]       [group by {<col_list>|rollup(<col_list>)}]       [having <having_condition>]       [window <window_clause>]       [order by <order_condition>]       [distribute by <distribute_condition> [sort by <sort_condition>]|[ cluster by <cluster_condition>] ]       [limit <number>]

下面将介绍SELECT命令格式及如何实现嵌套查询、分组查询、排序等操作。

2. SELECT语序

--语法顺序from <table_reference>[where <where_condition>][group by <col_list>][having <having_condition>][window <window_name> AS (<window_definition>)][qualify <expression>]select [all | distinct] <select_expr>, <select_expr>, ...[order by <order_condition>][distribute by <distribute_condition> [sort by <sort_condition>] ][limit <number>]

场景1：from->where->group by->having->select->order by->limit

场景2：from->where->select->distribute by->sort by

3. WITH子句

with A as (SELECT 1 as C),B as (SELECT * from A) SELECT * from B;

在同一WITH子句中的CTE必须具有唯一的名字。

在WITH子句中定义的CTE仅对在同一WITH子句中的其他CTE可以使用。

4. 列表达式

------------------------------------------------------------------------------例子 --读取表xxxx_detail的列shop_nameSELECT xxxx_name from xxxx_detail;--查询表xxxx_detail中region列数据，如果有重复值时仅显示一条。SELECT distinct region from xxxx_detail;--选出xxxx_detail表中列名不为xxxx_name的所有列SELECT `(xxxx_name)?+.+` from xxxx_detail;--去重多列时，distinct的作用域是SELECT的列集合，不是单个列。SELECT distinct region, xxxx_date from xxxx_detail;

用列名指定要读取的列。

用星号（*）代表查询所有的列。

可以使用正则表达式。

在选取的列名前可以使用distinct去掉重复字段，只返回去重后的值。

5. 排除列

--读取xxxx_detail表的数据，并排除region列的数据。------------------------------------------------------------------------------例子SELECT * except(region) from xxxx_detail;

当希望读取表内大多数列的数据，同时要排除表中少数列的数据时。

表示读取表数据时会排除指定列（col1、col2）的数据。

6. WHERE

--配合关系运算符，筛选满足指定条件的数据。关系运算符包含：  >、<、=、>=、<=、<>  like、rlike  in、not in  between…and------------------------------------------------------------------------------例子SELECT * from xxxx_detailwhere xxxx_date >= '2008' and xxxx_date <= '2014';--等价于如下语句。SELECT * from xxxx_detail where xxxx_date between '2008' and '2014';

where子句为过滤条件。如果表是分区表，可以实现列裁剪。

7. GROUP BY

------------------------------------------------------------------------------例子
--直接使用输入表列名region作为group by的列，即以region值分组SELECT region from xxxx_detail group by region;
--以region值分组，返回每一组的销售额总量。SELECT sum(xxxx_price) from xxxx_detail group by region;
--以region值分组，返回每一组的region值（组内唯一）及销售额总量。SELECT region, sum (xxxx_price) from xxxx_detail group by region;

group by操作优先级高于SELECT操作，因此group by的取值是SELECT输入表的列名或由输入表的列构成的表达式。需要注意的是：

group by取值为正则表达式时，必须使用列的完整表达式。

SELECT语句中没有使用聚合函数的列必须出现在GROUP BY中。

8. HAVING

------------------------------------------------------------------------------例子--为直观展示数据呈现效果，向sale_detail表中追加数据。insert into sale_detail partition (sale_date='2014', region='shanghai')   values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);--使用having子句配合聚合函数实现过滤。SELECT region,sum(total_price) from sale_detail group by region having sum(total_price)<305;

通常HAVING子句与聚合函数一起使用，实现过滤。

9. ORDER BY

------------------------------------------------------------------------------例子
--查询表xxxx_detail的信息，并按照xxxx_price升序排列前2条。SELECT * from xxxx_detail order by xxxx_price limit 2;
--将表xxx_detail按照xxxx_price升序排序后，输出从第3行开始的3行数据。SELECT xxxx_id,xxxx_price from xxxx_detail order by xxxx_price limit 3 offset 2;

默认对数据进行升序排序，如果降序排序，需要使用desc关键字。

order by默认要求带limit数据行数限制，没有limit会返回报错。

10. DISTRIBUTE BY哈希分片

------------------------------------------------------------------------------例子
--查询表xxxx_detail中的列region值并按照region值进行哈希分片。SELECT region from xxxx_detail distribute by region;--等价于如下语句。SELECT region as r from xxxx_detail distribute by region;SELECT region as r from xxxx_detail distribute by r;

distribute by控制Map（读数据）的输出在Reducer中是如何划分的，如果不希望Reducer的内容存在重叠，或需要对同一分组的数据一起处理，可以使用distribute by来保证同组数据分发到同一个Reducer中。

11. SORT BY局部排序

------------------------------------------------------------------------------例子
--查询表xxxx_detail中的列region和xxxx_price的值并按照region值进行哈希分片，然后按照xxxx_price对哈希分片结果进行局部升序排序。SELECT region,xxxx_price from xxxx_detail distribute by region sort by xxxx_price;
--查询表xxxx_detail中的列region和xxxx_price的值并按照region值进行哈希分片，然后按照xxxx_price对哈希分片结果进行局部降序排序。SELECT region,xxxx_price from xxxx_detail distribute by region sort by xxxx_price desc;
--如果sort by语句前没有distribute by，sort by会对每个Reduce中的数据进行局部排序。SELECT region,xxxx_price from xxxx_detail sort by xxxx_price desc;

sort by默认对数据进行升序排序，如果降序排序，需要使用desc关键字。

如果sort by语句前有distribute by，sort by会对distribute by的结果按照指定的列进行排序。

12. LIMIT限制输出行数

SELECT * FROM xxxxx.xxxx_xxxx_xxxxWHERE ds = 20240520LIMIT 100;

limit 中的number是常数，用于限制输出行数，取值范围为int32位取值范围。

子查询

1. 基础子查询

--格式1select <select_expr> from (<select_statement>) [<sq_alias_name>];
--格式2select (<select_statement>) from <table_name>;

普通查询操作的对象是目标表，但是查询的对象也可以是另一个select语句，这种查询为子查询。在from子句中，子查询可以被当作一张表，与其他表或子查询进行join操作。

2. IN SUBQUERY

--in subquery与left semi join用法类似--格式一select<select_expr1>from<table_name1>where<select_expr2>  in(select<select_expr3>from<table_name2>);--等效于leftsemijoin如下语句。select<select_expr1>from<table_name1><alias_name1>leftsemijoin<table_name2><alias_name2>  on<alias_name1>.<select_expr2>=<alias_name2>.<select_expr3>;
--格式二select<select_expr1>from<table_name1>where<select_expr2>  in(select<select_expr3>from<table_name2>where     <table_name1>.<col_name>=<table_name2>.<col_name>);
------------------------------------------------------------------------------例子set odps.sql.allow.fullscan=true;select * from xxxx_detail where xxxx_price in (select xxxx_price from shop);
set odps.sql.allow.fullscan=true;select * from xxxx_detail where xxxx_price   in (select xxxx_price from shop where xxxx_id = shop.xxxx_id);

select_expr1：必填。格式为col1_name, col2_name, 正则表达式,...，表示待查询的普通列、分区列或正则表达式。