一、简述
Apache Doris中运行EXPLAIN + SQL就可以得到SQL对应的Query Plan,再结合Apche Doris的Profile可以了解Doris是如何处理SQL语句,用于分析查询语句或是结构的性能瓶颈,从而帮助选择更好的索引和写出更优化的查询语句。
二、Plan分析
2.1 sql准备
tpcds query96.sql为例
explain -- explain graph 生成对应执行计划图表 select count(*) from store_sales ,household_demographics ,time_dim , store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and household_demographics.hd_dep_count = 5 and store.s_store_name = 'ese' order by count(*) limit 100;
2.2 explain结果分析
Query Plan可以分为逻辑执行计划(Logical Query Plan)和物理执行计划(Physical Query Plan),当前讲述的Query Plan默认指逻辑执行计划;tpcds query96.sql对应的Query Plan展示如下。
-- graph ┌───────────────┐ │[8: ResultSink]│ │[Fragment: 4] │ │RESULT SINK │ └───────────────┘ │ │ ┌─────────────┐ │[8: TOP-N] │ │[Fragment: 4]│ └─────────────┘ │ │ ┌────────────────────────────────┐ │[13: AGGREGATE (merge finalize)]│ │[Fragment: 4] │ └────────────────────────────────┘ │ │ ┌──────────────┐ │[12: EXCHANGE]│ │[Fragment: 4] │ └──────────────┘ │ │ ┌────────────────────┐ │[12: DataStreamSink]│ │[Fragment: 0] │ │STREAM DATA SINK │ │ EXCHANGE ID: 12 │ │ UNPARTITIONED │ └────────────────────┘ │ │ ┌─────────────────────────────────┐ │[7: AGGREGATE (update serialize)]│ │[Fragment: 0] │ └─────────────────────────────────┘ │ │ ┌───────────────────────────────┐ │[6: HASH JOIN] │ │[Fragment: 0] │ │join op: INNER JOIN (BROADCAST)│ └───────────────────────────────┘ ┌───────────┴─────────────────────────────────────┐ │ │ ┌───────────────────────────────┐ ┌──────────────┐ │[4: HASH JOIN] │ │[11: EXCHANGE]│ │[Fragment: 0] │ │[Fragment: 0] │ │join op: INNER JOIN (BROADCAST)│ └──────────────┘ └───────────────────────────────┘ │ ┌───────────────┴─────────────────────┐ │ │ │ ┌────────────────────┐ ┌───────────────────────────────┐ ┌──────────────┐ │[11: DataStreamSink]│ │[2: HASH JOIN] │ │[10: EXCHANGE]│ │[Fragment: 3] │ │[Fragment: 0] │ │[Fragment: 0] │ │STREAM DATA SINK │ │join op: INNER JOIN (BROADCAST)│ └──────────────┘ │ EXCHANGE ID: 11 │ └───────────────────────────────┘ │ │ UNPARTITIONED │ ┌─────────┴──────────┐ │ └────────────────────┘ │ │ ┌────────────────────┐ ┌┘ ┌──────────────────┐ ┌─────────────┐ │[10: DataStreamSink]│ │ │[0: OlapScanNode] │ │[9: EXCHANGE]│ │[Fragment: 2] │ ┌─────────────────┐ │[Fragment: 0] │ │[Fragment: 0]│ │STREAM DATA SINK │ │[5: OlapScanNode]│ │TABLE: store_sales│ └─────────────┘ │ EXCHANGE ID: 10 │ │[Fragment: 3] │ └──────────────────┘ │ │ UNPARTITIONED │ │TABLE: store │ │ └────────────────────┘ └─────────────────┘ ┌───────────────────┐ │ │[9: DataStreamSink]│ │ │[Fragment: 1] │ ┌─────────────────────────────┐ │STREAM DATA SINK │ │[3: OlapScanNode] │ │ EXCHANGE ID: 09 │ │[Fragment: 2] │ │ UNPARTITIONED │ │TABLE: household_demographics│ └───────────────────┘ └─────────────────────────────┘ │ │ ┌─────────────────┐ │[1: OlapScanNode]│ │[Fragment: 1] │ │TABLE: time_dim │ └─────────────────┘
-- 非graph PLAN FRAGMENT 0 OUTPUT EXPRS:<slot 11> <slot 10> count(*) PARTITION: UNPARTITIONED RESULT SINK 8:TOP-N | order by: <slot 11> <slot 10> count(*) ASC | offset: 0 | limit: 100 | 13:AGGREGATE (merge finalize) | output: count(<slot 10> count(*)) | group by: | cardinality=-1 | 12:EXCHANGE PLAN FRAGMENT 1 OUTPUT EXPRS: PARTITION: HASH_PARTITIONED: `default_cluster:tpcds`.`store_sales`.`ss_item_sk`, `default_cluster:tpcds`.`store_sales`.`ss_ticket_number` STREAM DATA SINK EXCHANGE ID: 12 UNPARTITIONED 7:AGGREGATE (update serialize) | output: count(*) | group by: | cardinality=1 | 6:HASH JOIN | join op: INNER JOIN (BROADCAST) | hash predicates: | colocate: false, reason: Tables are not in the same group | equal join conjunct: `ss_store_sk` = `s_store_sk` | runtime filters: RF000[in] <- `s_store_sk` | cardinality=2880403 | |----11:EXCHANGE | 4:HASH JOIN | join op: INNER JOIN (BROADCAST) | hash predicates: | colocate: false, reason: Tables are not in the same group | equal join conjunct: `ss_hdemo_sk` = `household_demographics`.`hd_demo_sk` | runtime filters: RF001[in] <- `household_demographics`.`hd_demo_sk` | cardinality=2880403 | |----10:EXCHANGE | 2:HASH JOIN | join op: INNER JOIN (BROADCAST) | hash predicates: | colocate: false, reason: Tables are not in the same group | equal join conjunct: `ss_sold_time_sk` = `time_dim`.`t_time_sk` | runtime filters: RF002[in] <- `time_dim`.`t_time_sk` | cardinality=2880403 | |----9:EXCHANGE | 0:OlapScanNode TABLE: store_sales PREAGGREGATION: OFF. Reason: conjunct on `ss_sold_time_sk` which is StorageEngine value column PREDICATES: `default_cluster:tpcds.store_sales`.`__DORIS_DELETE_SIGN__` = 0 runtime filters: RF000[in] -> `ss_store_sk`, RF001[in] -> `ss_hdemo_sk`, RF002[in] -> `ss_sold_time_sk` partitions=1/1 rollup: store_sales tabletRatio=3/3 tabletList=20968,20972,20976 cardinality=2880403 avgRowSize=67.95811 numNodes=3 PLAN FRAGMENT 2 OUTPUT EXPRS: PARTITION: HASH_PARTITIONED: `default_cluster:tpcds`.`store`.`s_store_sk` STREAM DATA SINK EXCHANGE ID: 11 UNPARTITIONED 5:OlapScanNode TABLE: store PREAGGREGATION: OFF. Reason: null PREDICATES: `store`.`s_store_name` = 'ese', `default_cluster:tpcds.store`.`__DORIS_DELETE_SIGN__` = 0 partitions=1/1 rollup: store tabletRatio=3/3 tabletList=20773,20777,20781 cardinality=23 avgRowSize=1798.8695 numNodes=3 PLAN FRAGMENT 3 OUTPUT EXPRS: PARTITION: HASH_PARTITIONED: `default_cluster:tpcds`.`household_demographics`.`hd_demo_sk` STREAM DATA SINK EXCHANGE ID: 10 UNPARTITIONED 3:OlapScanNode TABLE: household_demographics PREAGGREGATION: OFF. Reason: null PREDICATES: `household_demographics`.`hd_dep_count` = 5, `default_cluster:tpcds.household_demographics`.`__DORIS_DELETE_SIGN__` = 0 partitions=1/1 rollup: household_demographics tabletRatio=3/3 tabletList=20848,20852,20856 cardinality=14399 avgRowSize=2.8781166 numNodes=3 PLAN FRAGMENT 4 OUTPUT EXPRS: PARTITION: HASH_PARTITIONED: `default_cluster:tpcds`.`time_dim`.`t_time_sk` STREAM DATA SINK EXCHANGE ID: 09 UNPARTITIONED 1:OlapScanNode TABLE: time_dim PREAGGREGATION: OFF. Reason: null PREDICATES: `time_dim`.`t_hour` = 8, `time_dim`.`t_minute` >= 30, `default_cluster:tpcds.time_dim`.`__DORIS_DELETE_SIGN__` = 0 partitions=1/1 rollup: time_dim tabletRatio=3/3 tabletList=20713,20717,20721 cardinality=172799 avgRowSize=11.671202 numNodes=3
2.2.1 常见属性说明
Colocate Join 适合几张表按照相同字段分桶,并高频根据相同字段 Join 的场景,比如电商的不少应用都按照商家 Id 分桶,并高频按照商家 Id 进行 Join。
2.2.2 plan分析
- Query96的Query Plan分为五个Plan Fragment,编号从0~4
- 分析Query Plan可以采用自底向上的方式进行,逐个进行分析
- 最底部的Plan Fragment为Fragment 4分析
- 主要负责扫描time_dim表,并提前执行相关查询条件,即谓词下推
- 对于聚合表(Aggregate Key),doris会根据不同查询选择是否开启PREAGGREGATION,上图中time_dim的预聚合为关闭状态,关闭状态之下会读取time_dim的全部维度列,当表中维度列多的时候,这个可能会成为影响性能的一个关键因素
- 如果time_dim表有选择Range Partition进行数据划分,Query Plan中的partitions会表征查询命中几个分区,无关分区被自动过滤会有效减少扫描数据量
- 如果有物化视图,doris会根据查询去自动选择物化视图,如果没有物化视图,那么查询自动命中base table,也就是上图中展示的rollup: time_dim,可参考doris测试物化视图
- 当time_dim数据扫描完成之后,Fragment 4的执行过程也就随之结束,此时它将扫描得到的数据传递给你其它Fragment,EXCHANGE ID : 09表示数据传递给了标号为9的接收节点,可通过graph查看
- 对于Query96的Query Plan而言,Fragment 2, 3, 4功能类似,只是负责扫描的表不同;具体到查询中的Order/Aggregation/Join算子,都在Fragment 1中进行,着重分析Fragment 1
- Fragment 1集成了三个Join算子的执行,采用默认的BROADCAST方式进行执行,也就是小表向大表广播的方式进行,如果两个Join的表都是大表,建议采用SHUFFLE的方式进行
- 目前doris只支持HASH JOIN,也就是采用哈希算法进行Join
- 其中有一个colocate字段,这个用来表述两张Join表采用同样的分区/分桶方式,如此执行Join的过程中可以直接在本地执行,不用进行数据的移动
- Join执行完成之后,就是执行上成的Aggregation, Order by和TOP-N的算子
三、Doris-Profile简述
可通过8030页面的QueryProfile模块查看任务执行详情,以下为query96.sql实际执行的QueryProfile部分内容,各指标名详情可参考:Apache Doris查询分析
Query: Summary: - Query ID: 7dd4ba245012441c-b0aadbed39f80f20 - Start Time: 2022-04-15 15:52:22 - End Time: 2022-04-15 15:52:22 - Total: 611ms - Query Type: Query - Query State: EOF - Doris Version: 0.15.0-rc04 - User: root - Default Db: default_cluster:tpcds - Sql Statement: /* ApplicationName=DBeaver Enterprise 7.0.0 - SQLEditor <20220321常用命令-doris.sql> */ select count(*) from store_sales ,household_demographics ,time_dim , store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and household_demographics.hd_dep_count = 5 and store.s_store_name = 'ese' order by count(*) limit 100 - Is Cached: No Execution Summary: - Analysis Time: 636.648us - Plan Time: 19.230ms - Schedule Time: 125.121ms - Wait and Fetch Result Time: 466.30ms Execution Profile 7dd4ba245012441c-b0aadbed39f80f20:(Active: 611.44ms, % non-child: 100.00%) Fragment 0: Instance 7dd4ba245012441c-b0aadbed39f80f2d (host=TNetworkAddress(hostname:10.192.119.70, port:9060)):(Active: 586.950ms, % non-child: 0.00%) - FragmentCpuTime: 756.962us - MemoryLimit: 2.00 GB - PeakMemoryUsage: 48.01 KB - PeakReservation: 0.00 - PeakUsedReservation: 0.00 - RowsProduced: 1 BlockMgr: - BlockWritesOutstanding: 0 - BlocksCreated: 0 - BlocksRecycled: 0 - BufferedPins: 0 - BytesWritten: 0.00 - MaxBlockSize: 8.00 MB - TotalBufferWaitTime: 0ns - TotalEncryptionTime: 0ns - TotalIntegrityCheckTime: 0ns - TotalReadBlockTime: 0ns DataBufferSender (dst_fragment_instance_id=7dd4ba245012441c-b0aadbed39f80f2d): - AppendBatchTime: 124.481us - ResultSendTime: 119.257us - TupleConvertTime: 4.217us - NumSentRows: 1 SORT_NODE (id=8):(Active: 587.36ms, % non-child: 0.01%) - PeakMemoryUsage: 16.00 KB - RowsReturned: 1 - RowsReturnedRate: 1 AGGREGATION_NODE (id=13):(Active: 586.958ms, % non-child: 0.10%) - Probe Method: HashTable Linear Probing - BuildTime: 10.533us - GetResultsTime: 0ns - HTResize: 0 - HTResizeTime: 0ns - HashBuckets: 0 - HashCollisions: 0 - HashFailedProbe: 0 - HashFilledBuckets: 0 - HashProbe: 0 - HashTravelLength: 0 - LargestPartitionPercent: 0 - MaxPartitionLevel: 0 - NumRepartitions: 0 - PartitionsCreated: 0 - PeakMemoryUsage: 28.00 KB - RowsProcessed: 0 - RowsRepartitioned: 0 - RowsReturned: 1 - RowsReturnedRate: 1 - SpilledPartitions: 0 EXCHANGE_NODE (id=12):(Active: 586.364ms, % non-child: 95.96%) - BytesReceived: 32.00 B - ConvertRowBatchTime: 7.320us - DataArrivalWaitTime: 586.282ms - DeserializeRowBatchTimer: 22.637us - FirstBatchArrivalWaitTime: 349.530ms - PeakMemoryUsage: 12.01 KB - RowsReturned: 3 - RowsReturnedRate: 5 - SendersBlockedTotalTimer(*): 0ns Fragment 1: Instance 7dd4ba245012441c-b0aadbed39f80f23 (host=TNetworkAddress(hostname:10.192.119.68, port:9060)):(Active: 472.511ms, % non-child: 0.10%) - FragmentCpuTime: 5.714ms - MemoryLimit: 2.00 GB - PeakMemoryUsage: 610.00 KB - PeakReservation: 0.00 - PeakUsedReservation: 0.00 - RowsProduced: 1 BlockMgr: - BlockWritesOutstanding: 0 - BlocksCreated: 0 - BlocksRecycled: 0 - BufferedPins: 0 - BytesWritten: 0.00 - MaxBlockSize: 8.00 MB - TotalBufferWaitTime: 0ns - TotalEncryptionTime: 0ns - TotalIntegrityCheckTime: 0ns - TotalReadBlockTime: 0ns DataStreamSender (dst_id=12, dst_fragments=[7dd4ba245012441c-b0aadbed39f80f2d]):(Active: 186.357us, % non-child: 0.03%) - BytesSent: 16.00 B - IgnoreRows: 0 - LocalBytesSent: 0.00 - OverallThroughput: 83.84375 KB/sec - PeakMemoryUsage: 16.00 KB - SerializeBatchTime: 7.0us - UncompressedRowBatchSize: 16.00 B AGGREGATION_NODE (id=7):(Active: 471.713ms, % non-child: 0.14%) - Probe Method: HashTable Linear Probing - BuildTime: 45.223us - GetResultsTime: 0ns - HTResize: 0 - HTResizeTime: 0ns - HashBuckets: 0 - HashCollisions: 0 - HashFailedProbe: 0 - HashFilledBuckets: 0 - HashProbe: 0 - HashTravelLength: 0 - LargestPartitionPercent: 0 - MaxPartitionLevel: 0 - NumRepartitions: 0 - PartitionsCreated: 0 - PeakMemoryUsage: 280.00 KB - RowsProcessed: 0 - RowsRepartitioned: 0 - RowsReturned: 1 - RowsReturnedRate: 2 - SpilledPartitions: 0 HASH_JOIN_NODE (id=6):(Active: 470.881ms, % non-child: 0.08%) - ExecOption: Hash Table Built Asynchronously - BuildBuckets: 1.024K (1024) - BuildRows: 1 - BuildTime: 1.129ms - HashTableMaxList: 1 - HashTableMinList: 1 - LoadFactor: 4562146422526312400.00 - PeakMemoryUsage: 308.00 KB - ProbeRows: 341 - ProbeTime: 34.697us - PushDownComputeTime: 156.171us - PushDownTime: 4.423us - RowsReturned: 341 - RowsReturnedRate: 724
- Active:表示该节点(包含其所有子节点)的执行时间
- BuildTime:扫描右表并构建hash表的时间
- ProbeTime:获取左表并搜索hashtable进行匹配并输出的时间