Doris提供了一个图形化的命令帮助用户方便分析一个具体的查询或者导入操作,在使用过程中的性能问题,本文主要介绍如何使用改功能
1. 查询计划树
SQL 是一个描述性语言,用户通过一个 SQL 来描述想获取的数据。而一个 SQL 的具体执行方式依赖于数据库的实现。而查询规划器就是用来决定数据库如何具体执行一个 SQL 的。 比如用户指定了一个 Join 算子,则查询规划器需要决定具体的 Join 算法,比如是 Hash Join,还是 Merge Sort Join;是使用 Shuffle 还是 Broadcast;Join 顺序是否需要调整以避免笛卡尔积;以及确定最终的在哪些节点执行等等。
Doris 的查询规划过程是先将一个 SQL 语句转换成一个单机执行计划树。
之后,查询规划器会根据具体的算子执行方式、数据的具体分布,将单机查询计划转换为分布式查询计划。分布式查询计划是由多个 Fragment 组成的,每个 Fragment 负责查询计划的一部分,各个 Fragment 直接会通过 ExchangeNode 算子进行数据的传输。
如上图,我们将单机计划分成了两个 Fragment:F1 和 F2。两个 Fragment 之间通过一个 ExchangeNode 节点传输数据。
而一个 Fragment 会进一步的划分为多个 Instance。Instance 是最终具体的执行实例。划分成多个 Instance 有助于充分利用机器资源,提升一个 Fragment 的执行并发度。
查看查询计划
可以通过以下两种命令查看一个 SQL 的执行计划。
EXPLAIN GRAPH select ...; EXPLAIN select ...;
其中第一个命令以图形化的方式展示一个查询计划,这个命令可以比较直观的展示查询计划的树形结构,以及 Fragment 的划分情况:
mysql> desc graph SELECT SUM(lo_revenue), d_year, p_brand -> FROM lineorder, date, part, supplier -> WHERE lo_orderdate = d_datekey -> AND lo_partkey = p_partkey -> AND lo_suppkey = s_suppkey -> AND p_brand BETWEEN 'MFGR#2221' -> AND 'MFGR#2228' -> AND s_region = 'ASIA' -> GROUP BY d_year, p_brand -> ORDER BY d_year, p_brand; +------------------------------------------------------------------------------------------------------------+ | Explain String | +------------------------------------------------------------------------------------------------------------+ | ┌────────────────┐ | | │[12: ResultSink]│ | | │[Fragment: 4] │ | | │RESULT SINK │ | | └────────────────┘ | | │ | | │ | | ┌──────────────────────┐ | | │[12: MERGING-EXCHANGE]│ | | │[Fragment: 4] │ | | └──────────────────────┘ | | │ | | │ | | ┌────────────────────┐ | | │[12: DataStreamSink]│ | | │[Fragment: 0] │ | | │STREAM DATA SINK │ | | │ EXCHANGE ID: 12 │ | | │ UNPARTITIONED │ | | └────────────────────┘ | | │ | | │ | | ┌─────────────┐ | | │[8: TOP-N] │ | | │[Fragment: 0]│ | | └─────────────┘ | | │ | | │ | | ┌────────────────────────────────┐ | | │[7: AGGREGATE (update finalize)]│ | | │[Fragment: 0] │ | | └────────────────────────────────┘ | | │ | | │ | | ┌───────────────────────────────┐ | | │[6: HASH JOIN] │ | | │[Fragment: 0] │ | | │join op: INNER JOIN (BROADCAST)│ | | └───────────────────────────────┘ | | ┌───────────┴────────────────────────────────┐ | | │ │ | | ┌───────────────────────────────┐ ┌──────────────┐ | | │[4: HASH JOIN] │ │[11: EXCHANGE]│ | | │[Fragment: 0] │ │[Fragment: 0] │ | | │join op: INNER JOIN (BROADCAST)│ └──────────────┘ | | └───────────────────────────────┘ │ | | ┌──────────┴─────────────────────┐ │ | | │ │ ┌────────────────────┐ | | ┌───────────────────────────────┐ ┌──────────────┐ │[11: DataStreamSink]│ | | │[2: HASH JOIN] │ │[10: EXCHANGE]│ │[Fragment: 3] │ | | │[Fragment: 0] │ │[Fragment: 0] │ │STREAM DATA SINK │ | | │join op: INNER JOIN (BROADCAST)│ └──────────────┘ │ EXCHANGE ID: 11 │ | | └───────────────────────────────┘ │ │ UNPARTITIONED │ | | ┌──────────┴─────────┐ │ └────────────────────┘ | | │ │ ┌────────────────────┐ ┌┘ | | ┌─────────────────┐ ┌─────────────┐ │[10: DataStreamSink]│ │ | | │[0: OlapScanNode]│ │[9: EXCHANGE]│ │[Fragment: 2] │ ┌─────────────────┐ | | │[Fragment: 0] │ │[Fragment: 0]│ │STREAM DATA SINK │ │[5: OlapScanNode]│ | | │TABLE: lineorder │ └─────────────┘ │ EXCHANGE ID: 10 │ │[Fragment: 3] │ | | └─────────────────┘ │ │ UNPARTITIONED │ │TABLE: supplier │ | | │ └────────────────────┘ └─────────────────┘ | | ┌───────────────────┐ ┌┘ | | │[9: DataStreamSink]│ │ | | │[Fragment: 1] │ ┌─────────────────┐ | | │STREAM DATA SINK │ │[3: OlapScanNode]│ | | │ EXCHANGE ID: 09 │ │[Fragment: 2] │ | | │ UNPARTITIONED │ │TABLE: part │ | | └───────────────────┘ └─────────────────┘ | | │ | | │ | | ┌─────────────────┐ | | │[1: OlapScanNode]│ | | │[Fragment: 1] │ | | │TABLE: date │ | | └─────────────────┘ | +------------------------------------------------------------------------------------------------------------+ 75 rows in set (0.04 sec)
从图中可以看出,查询计划树被分为了5个 Fragment:0、1、2、3、4。如 OlapScanNode
节点上的 [Fragment: 0]
表示这个节点属于 Fragment 0。每个Fragment之间都通过 DataStreamSink 和 ExchangeNode 进行数据传输。
图形命令仅展示简化后的节点信息,如果需要查看更具体的节点信息,如下推到节点上的过滤条件等,则需要通过第二个命令查看更详细的文字版信息:
mysql> desc SELECT SUM(lo_revenue), d_year, p_brand FROM lineorder, date, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228' AND s_region = 'ASIA' GROUP BY d_year, p_brand ORDER BY d_year, p_brand; +--------------------------------------------------------------------------------------------------------------+ | Explain String | +--------------------------------------------------------------------------------------------------------------+ | PLAN FRAGMENT 0 | | OUTPUT EXPRS:<slot 15> <slot 12> sum(`lo_revenue`) | <slot 13> <slot 10> `d_year` | <slot 14> <slot 11> `p_brand` | | PARTITION: UNPARTITIONED | | | | RESULT SINK | | | | 12:MERGING-EXCHANGE | | limit: 65535 | | | | PLAN FRAGMENT 1 | | OUTPUT EXPRS: | | PARTITION: RANDOM | | | | STREAM DATA SINK | | EXCHANGE ID: 12 | | UNPARTITIONED | | | | 8:TOP-N | | | order by: <slot 13> <slot 10> `d_year` ASC, <slot 14> <slot 11> `p_brand` ASC | | | offset: 0 | | | limit: 65535 | | | | | 7:AGGREGATE (update finalize) | | | output: sum(`lo_revenue`) | | | group by: `d_year`, `p_brand` | | | cardinality=-1 | | | | | 6:HASH JOIN | | | join op: INNER JOIN (BROADCAST) | | | hash predicates: | | | colocate: false, reason: Tables are not in the same group | | | equal join conjunct: `lo_suppkey` = `s_suppkey` | | | runtime filters: RF000[in_or_bloom] <- `s_suppkey` | | | cardinality=0 | | | | | |----11:EXCHANGE | | | | | 4:HASH JOIN | | | join op: INNER JOIN (BROADCAST) | | | hash predicates: | | | colocate: false, reason: Tables are not in the same group | | | equal join conjunct: `lo_partkey` = `p_partkey` | | | runtime filters: RF001[in_or_bloom] <- `p_partkey` | | | cardinality=0 | | | | | |----10:EXCHANGE | | | | | 2:HASH JOIN | | | join op: INNER JOIN (BROADCAST) | | | hash predicates: | | | colocate: false, reason: Tables are not in the same group | | | equal join conjunct: `lo_orderdate` = `d_datekey` | | | runtime filters: RF002[in_or_bloom] <- `d_datekey` | | | cardinality=0 | | | | | |----9:EXCHANGE | | | | | 0:OlapScanNode | | TABLE: lineorder | | PREAGGREGATION: OFF. Reason: conjunct on `lo_orderdate` which is StorageEngine value column | | runtime filters: RF000[in_or_bloom] -> `lo_suppkey`, RF001[in_or_bloom] -> `lo_partkey`, RF002[in_or_bloom] -> `lo_orderdate` | | partitions=0/7 | | rollup: null | | tabletRatio=0/0 | | tabletList= | | cardinality=0 | | avgRowSize=20.0 | | numNodes=1 | | | | PLAN FRAGMENT 2 | | OUTPUT EXPRS: | | PARTITION: HASH_PARTITIONED: `default_cluster:demo`.`supplier`.`s_suppkey` | | | | STREAM DATA SINK | | EXCHANGE ID: 11 | | UNPARTITIONED | | | | 5:OlapScanNode | | TABLE: supplier | | PREAGGREGATION: OFF. Reason: null | | PREDICATES: `s_region` = 'ASIA' | | partitions=0/1 | | rollup: null | | tabletRatio=0/0 | | tabletList= | | cardinality=0 | | avgRowSize=20.0 | | numNodes=1 | | | | PLAN FRAGMENT 3 | | OUTPUT EXPRS: | | PARTITION: HASH_PARTITIONED: `default_cluster:demo`.`part`.`p_partkey` | | | | STREAM DATA SINK | | EXCHANGE ID: 10 | | UNPARTITIONED | | | | 3:OlapScanNode | | TABLE: part | | PREAGGREGATION: OFF. Reason: null | | PREDICATES: `p_brand` >= 'MFGR#2221', `p_brand` <= 'MFGR#2228' | | partitions=0/1 | | rollup: null | | tabletRatio=0/0 | | tabletList= | | cardinality=0 | | avgRowSize=20.0 | | numNodes=1 | | | | PLAN FRAGMENT 4 | | OUTPUT EXPRS: | | PARTITION: HASH_PARTITIONED: `default_cluster:demo`.`date`.`d_datekey` | | | | STREAM DATA SINK | | EXCHANGE ID: 09 | | UNPARTITIONED | | | | 1:OlapScanNode | | TABLE: date | | PREAGGREGATION: OFF. Reason: null | | partitions=0/1 | | rollup: null | | tabletRatio=0/0 | | tabletList= | | cardinality=0 | | avgRowSize=8.0 | | numNodes=1 | +--------------------------------------------------------------------------------------------------------------+ 127 rows in set (0.01 sec)
2. 查看查询 Profile
用户可以通过以下命令打开会话变量
set enable_profile = true
然后执行查询,则 Doris 会产生该查询的一个 Profile。Profile 包含了一个查询各个节点的具体执行情况,有助于我们分析查询瓶颈。
执行完查询后,我们可以通过如下命令先获取 Profile 列表:
mysql> show query profile "/"\G *************************** 1. row *************************** QueryId: 8676cf2c28ac4081-95ead4b1b734b0b3 User: root DefaultDb: default_cluster:demo SQL: SELECT SUM(lo_extendedprice*lo_discount) AS REVENUE FROM lineorder, date WHERE lo_orderdate = d_datekey AND d_weeknuminyear= 6 AND d_year = 1994 AND lo_discount BETWEEN 5 AND 7 AND lo_quantity BETWEEN 26 AND 35 QueryType: Query StartTime: 2022-05-13 11:47:57 EndTime: 2022-05-13 11:47:57 TotalTime: 19ms QueryState: EOF 1 row in set (0.01 sec)
这个命令会列出当前保存的所有 Profile。每行对应一个查询。我们可以选择我们想看的 Profile 对应的 QueryId,查看具体情况。
查看一个Profile分为3个步骤:
2.1 查看整体执行计划树
这一步主要用于从整体分析执行计划,并查看每个Fragment的执行耗时。
mysql> show query profile "/26da5b3dbd5e42d8-8b5c39768b5b78f4"\G *************************** 1. row *************************** Fragments: ┌────────────────────────┐ │[-1: DataBufferSender] │ │Fragment: 0 │ │MaxActiveTime: 355.973us│ └────────────────────────┘ ┌┘ │ ┌─────────────────────┐ │[3: AGGREGATION_NODE]│ │Fragment: 0 │ └─────────────────────┘ │ │ ┌───────────────────┐ │[2: HASH_JOIN_NODE]│ │Fragment: 0 │ └───────────────────┘ ┌───────────┴────────────────────────────┐ │ │ ┌───────────────────┐ ┌──────────────────┐ │[0: OLAP_SCAN_NODE]│ │[4: EXCHANGE_NODE]│ │Fragment: 0 │ │Fragment: 0 │ └───────────────────┘ └──────────────────┘ ┌─────────┴─────────────────┐ │ │ │ │ ┌─────────────────────────────────┐ ┌─────────────┐ ┌──────────────────────┐ │[RuntimeFilter:in_or_bloomfilter]│ │[OlapScanner]│ │[4: DataStreamSender] │ │Fragment: 0 │ │Fragment: 0 │ │Fragment: 1 │ └─────────────────────────────────┘ └─────────────┘ │MaxActiveTime: 4.219ms│ │ └──────────────────────┘ │ ┌┘ ┌─────────────────┐ │ │[SegmentIterator]│ ┌───────────────────┐ │Fragment: 0 │ │[1: OLAP_SCAN_NODE]│ └─────────────────┘ │Fragment: 1 │ └───────────────────┘ │ ┌─────────────┐ │[OlapScanner]│ │Fragment: 1 │ └─────────────┘ │ │ ┌─────────────────┐ │[SegmentIterator]│ │Fragment: 1 │ └─────────────────┘ 1 row in set (0.00 sec)
如上图,每个节点都标注了自己所属的 Fragment,并且在每个 Fragment 的 Sender节点,标注了该 Fragment 的执行耗时(MaxActiveTime)。这个耗时,是Fragment下所有 Instance 执行耗时中最长的一个。这个有助于我们从整体角度发现最耗时的 Fragment。
2.2 查看具体 Fragment 下的 Instance 列表
比如我们发现 Fragment 1 耗时最长,则可以继续查看 Fragment 1 的 Instance 列表:
mysql> show query profile "/8676cf2c28ac4081-95ead4b1b734b0b3/1" ; +-----------------------------------+-------------------+------------+ | Instances | Host | ActiveTime | +-----------------------------------+-------------------+------------+ | 8676cf2c28ac4081-95ead4b1b734b0b5 | 172.28.7.230:9060 | 4.219ms | +-----------------------------------+-------------------+------------+ 1 row in set (0.00 sec)
这里展示了 Fragment 1 上所有的 1 个 Instance 所在的执行节点和耗时。
2.3 查看具体 Instance
我们可以继续查看某一个具体的 Instance 上各个算子的详细 Profile:
mysql> show query profile "/8676cf2c28ac4081-95ead4b1b734b0b3/1/8676cf2c28ac4081-95ead4b1b734b0b5" ; ┌───────────────────────────────────────┐ │[4: DataStreamSender] │ │(Active: 16.473us, non-child: 0.09) │ │ - Counters: │ │ - BytesSent: 0.00 │ │ - IgnoreRows: 0 │ │ - LocalBytesSent: 0.00 │ │ - OverallThroughput: 0.0 /sec │ │ - PeakMemoryUsage: 10.66 KB │ │ - SerializeBatchTime: 0ns │ │ - UncompressedRowBatchSize: 0.00 │ └───────────────────────────────────────┘ │ │ ┌─────────────────────────────────────┐ │[1: OLAP_SCAN_NODE] │ │(Active: 23.934us, non-child: 0.12) │ │ - Counters: │ │ - BatchQueueWaitTime: 225ns │ │ - BytesRead: 0.00 │ │ - NumDiskAccess: 0 │ │ - NumScanners: 0 │ │ - PeakMemoryUsage: 0.00 │ │ - RowsRead: 0 │ │ - RowsReturned: 0 │ │ - RowsReturnedRate: 0 │ │ - ScannerBatchWaitTime: 0ns │ │ - ScannerWorkerWaitTime: 0ns │ │ - TabletCount : 0 │ │ - TotalReadThroughput: 0.0 /sec│ └─────────────────────────────────────┘ ┌┘ │ ┌─────────────────────────────────┐ │[OlapScanner] │ │(Active: 0ns, non-child: 0.00) │ │ - Counters: │ │ - BlockConvertTime: 0ns │ │ - BlockFetchTime: 0ns │ │ - ReaderInitTime: 0ns │ │ - RowsDelFiltered: 0 │ │ - RowsPushedCondFiltered: 0│ │ - ScanCpuTime: 0ns │ │ - ScanTime: 0ns │ │ - ShowHintsTime_V1: 0ns │ └─────────────────────────────────┘ └┐ │ ┌────────────────────────────────────┐ │[SegmentIterator] │ │(Active: 0ns, non-child: 0.00) │ │ - Counters: │ │ - BitmapIndexFilterTimer: 0ns │ │ - BlockLoadTime: 0ns │ │ - BlockSeekCount: 0 │ │ - BlockSeekTime: 0ns │ │ - BlocksLoad: 0 │ │ - CachedPagesNum: 0 │ │ - CompressedBytesRead: 0.00 │ │ - DecompressorTimer: 0ns │ │ - IOTimer: 0ns │ │ - IndexLoadTime_V1: 0ns │ │ - NumSegmentFiltered: 0 │ │ - NumSegmentTotal: 0 │ │ - RawRowsRead: 0 │ │ - RowsBitmapIndexFiltered: 0 │ │ - RowsBloomFilterFiltered: 0 │ │ - RowsConditionsFiltered: 0 │ │ - RowsKeyRangeFiltered: 0 │ │ - RowsStatsFiltered: 0 │ │ - RowsVectorPredFiltered: 0 │ │ - TotalPagesNum: 0 │ │ - UncompressedBytesRead: 0.00 │ │ - VectorPredEvalTime: 0ns │ └────────────────────────────────────┘
上图展示了 Fragment 1 中,Instance 8676cf2c28ac4081-95ead4b1b734b0b5 的各个算子的具体 Profile。
通过以上3个步骤,我们可以逐步排查一个SQL的性能瓶颈。