Linux 时钟精度 与 PostgreSQL auto_explain (explain timing 时钟开销估算)

本文涉及的产品
云数据库 Tair(兼容Redis),内存型 2GB
云数据库 MongoDB,独享型 2核8GB
推荐场景:
构建全方位客户视图
Redis 开源版,标准版 2GB
推荐场景:
搭建游戏排行榜
简介:

标签

PostgreSQL , auto_explain , pg_test_timing , 时钟 , tsc , hpet , acpi , acpi_pm , Linux


背景

我们在诊断SQL的执行计划时,通常会用explain analyze,analyze有几个开关,其中一个是timing,它会帮你记录下SQL每个NODE的执行时间。

但是这部分是有一定的性能开销的,而且这个开销与操作系统的时钟获取接口有关。

有时,你会发现explain analyze的执行时间远大于真实的执行时间,这是为什么呢?

时钟硬件与时间精确度

常见时钟方法精度 tsc > hpet ( 100 纳秒(十亿分之一秒) ) > acpi_pm ( 300 纳秒(十亿分之一秒) )

Clock hardware and timing accuracy

Collecting accurate timing information is normally done on computers using hardware clocks with various levels of accuracy.   
With some hardware the operating systems can pass the system clock time almost directly to programs.   
A system clock can also be derived from a chip that simply provides timing interrupts, periodic ticks at some known time interval.   
In either case, operating system kernels provide a clock source that hides these details.   
But the accuracy of that clock source and how quickly it can return results varies based on the underlying hardware.  

Inaccurate time keeping can result in system instability. Test any change to the clock source very carefully.   
Operating system defaults are sometimes made to favor reliability over best accuracy. And if you are using a virtual machine, look into the recommended time sources compatible with it.   
Virtual hardware faces additional difficulties when emulating timers, and there are often per operating system settings suggested by vendors.  

The Time Stamp Counter (TSC) clock source is the most accurate one available on current generation CPUs.   
It's the preferred way to track the system time when it's supported by the operating system and the TSC clock is reliable.   
There are several ways that TSC can fail to provide an accurate timing source, making it unreliable.   
Older systems can have a TSC clock that varies based on the CPU temperature, making it unusable for timing.   

以前就有遇到过机器时钟跳来跳去的问题,我记得是IBMX3950的堆叠服务器  
Trying to use TSC on some older multicore CPUs can give a reported time that's inconsistent among multiple cores.   
This can result in the time going backwards, a problem this program checks for. And even the newest systems can fail to provide accurate TSC timing with very aggressive power saving configurations.  

Newer operating systems may check for the known TSC problems and switch to a slower, more stable clock source when they are seen.   
If your system supports TSC time but doesn't default to that, it may be disabled for a good reason.   
And some operating systems may not detect all the possible problems correctly, or will allow using TSC even in situations where it's known to be inaccurate.  

The High Precision Event Timer (HPET) is the preferred timer on systems where it's available and TSC is not accurate.   
The timer chip itself is programmable to allow up to 100 nanosecond resolution, but you may not see that much accuracy in your system clock.  

Advanced Configuration and Power Interface (ACPI) provides a Power Management (PM) Timer, which Linux refers to as the acpi_pm.   
The clock derived from acpi_pm will at best provide 300 nanosecond resolution.  

Timers used on older PC hardware include the 8254 Programmable Interval Timer (PIT),   
the real-time clock (RTC), the Advanced Programmable Interrupt Controller (APIC) timer,   
and the Cyclone timer.   
These timers aim for millisecond resolution.  

explain analyze代码

当开启了explain analyze timing开关时,会设置instrument_option |= INSTRUMENT_TIMER;

src/backend/commands/explain.c

ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,  
                           const char *queryString, ParamListInfo params,  
                           const instr_time *planduration)  
{  
        DestReceiver *dest;  
        QueryDesc  *queryDesc;  
        instr_time      starttime;  
        double          totaltime = 0;  
        int                     eflags;  
        int                     instrument_option = 0;  

        if (es->analyze && es->timing)  
                instrument_option |= INSTRUMENT_TIMER;  
        else if (es->analyze)  
                instrument_option |= INSTRUMENT_ROWS;  

        if (es->buffers)  
                instrument_option |= INSTRUMENT_BUFFERS;  

这个设置会直接影响EXECUTE时的时钟统计,这部分还不是很友好,其实没有必要每条TUPLE都统计这部分时间,只要统计进出NODE的时候的时间即可

src/backend/executor/instrument.c

/* Entry to a plan node */  
void  
InstrStartNode(Instrumentation *instr)  
{  
        if (instr->need_timer)  
        {  
                if (INSTR_TIME_IS_ZERO(instr->starttime))  
                        INSTR_TIME_SET_CURRENT(instr->starttime);  
                else  
                        elog(ERROR, "InstrStartNode called twice in a row");  
        }  

        /* save buffer usage totals at node entry, if needed */  
        if (instr->need_bufusage)  
                instr->bufusage_start = pgBufferUsage;  
}  

/* Exit from a plan node */  
void  
InstrStopNode(Instrumentation *instr, double nTuples)  
{  
        instr_time      endtime;  

        /* count the returned tuples */  
        instr->tuplecount += nTuples;  

        /* let's update the time only if the timer was requested */  
        if (instr->need_timer)  
        {  
                if (INSTR_TIME_IS_ZERO(instr->starttime))  
                        elog(ERROR, "InstrStopNode called without start");  

                INSTR_TIME_SET_CURRENT(endtime);  
                INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);  

                INSTR_TIME_SET_ZERO(instr->starttime);  
        }  

        /* Add delta of buffer usage since entry to node's totals */  
        if (instr->need_bufusage)  
                BufferUsageAccumDiff(&instr->bufusage,  
                                                         &pgBufferUsage, &instr->bufusage_start);  

        /* Is this the first tuple of this cycle? */  
        if (!instr->running)  
        {  
                instr->running = true;  
                instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);  
        }  
}  

src/include/portability/instr_time.h

#define INSTR_TIME_SET_CURRENT(t)       gettimeofday(&(t), NULL)  

单条QUERY , EXECUTE可能被多次调用,所以如果要统计时间。

src/backend/executor/execMain.c

void  
ExecutorRun(QueryDesc *queryDesc,  
                        ScanDirection direction, uint64 count)  
{  
        if (ExecutorRun_hook)  
                (*ExecutorRun_hook) (queryDesc, direction, count);  
        else  
                standard_ExecutorRun(queryDesc, direction, count);  
}  

void  
standard_ExecutorRun(QueryDesc *queryDesc,  
                                         ScanDirection direction, uint64 count)  
{  
...  

        /* Allow instrumentation of Executor overall runtime */  
        if (queryDesc->totaltime)  
                InstrStartNode(queryDesc->totaltime);  // 时钟   
...  
        if (queryDesc->totaltime)  
                InstrStopNode(queryDesc->totaltime, estate->es_processed);  

如果需要处理的记录数非常多,由于频繁调用gettimeofday,就导致了explain analyze timing时间拉长的问题。

例子

1000万纪录的count(*)操作,我们看看实际的执行时间,以及开启analyze timing后,不同时钟硬件的性能影响

create table tbl_time(id int);  
insert into tbl_time select generate_series(1,10000000);  

\timing  
postgres=# SELECT COUNT(*) FROM tbl_time;  
  count     
----------  
 10000000  
(1 row)  
Time: 1171.956 ms  

使用tsc hpet acpi_pm三种时钟硬件测试性能影响

#cat /sys/devices/system/clocksource/clocksource0/available_clocksource  
tsc hpet acpi_pm   

1. tsc

#echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource   

postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=169247.71..169247.72 rows=1 width=8) (actual time=2113.432..2113.432 rows=1 loops=1)  
   ->  Seq Scan on tbl_time  (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.013..1128.860 rows=10000000 loops=1)  
 Planning time: 0.062 ms  
 Execution time: 2113.514 ms  
(4 rows)  

2. hpet

#echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource   

postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=169247.71..169247.72 rows=1 width=8) (actual time=13968.218..13968.218 rows=1 loops=1)  
   ->  Seq Scan on tbl_time  (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.018..7067.711 rows=10000000 loops=1)  
 Planning time: 0.059 ms  
 Execution time: 13968.271 ms  
(4 rows)  

3. acpi_pm

#echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource   

postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=169247.71..169247.72 rows=1 width=8) (actual time=19641.242..19641.243 rows=1 loops=1)  
   ->  Seq Scan on tbl_time  (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.018..9896.285 rows=10000000 loops=1)  
 Planning time: 0.060 ms  
 Execution time: 19641.296 ms  
(4 rows)  

使用pg_test_timing测试不同的时钟硬件在使用gettimeofday时带来的性能影响

pg_test_timing是用来测试不同硬件时钟的损耗的,使用gettimeofday进行测试

下面是测试

1. tsc

pg_test_timing   
Testing timing overhead for 3 seconds.  
Per loop time including overhead: 47.70 nsec  
Histogram of timing durations:  
< usec   % of total      count  
     1     95.23986   59893249  
     2      4.75540    2990515  
     4      0.00384       2414  
     8      0.00077        485  
    16      0.00013         79  
    32      0.00000          3  
    64      0.00000          1  

2. hpet

pg_test_timing   
Testing timing overhead for 3 seconds.  
Per loop time including overhead: 696.44 nsec  
Histogram of timing durations:  
< usec   % of total      count  
     1     31.81944    1370669  
     2     67.06767    2889038  
     4      1.03890      44752  
     8      0.05959       2567  
    16      0.01418        611  
    32      0.00016          7  
    64      0.00005          2  

3. acpi_pm

pg_test_timing   
Testing timing overhead for 3 seconds.  
Per loop time including overhead: 919.07 nsec  
Histogram of timing durations:  
< usec   % of total      count  
     1     12.25423     399999  
     2     84.17305    2747553  
     4      3.45019     112620  
     8      0.08648       2823  
    16      0.03468       1132  
    32      0.00132         43  
    64      0.00003          1  
   128      0.00003          1  

根据以上测试,可以预估前面EXPLAIN ANALYZE带来的问题

acpi_pm  
<1到底是多少不知道了,<1越多,误差越大  
10000000*(1*0.1225+2*0.8417)/1000000    

hpet  
<1到底是多少不知道了,<1越多,误差越大  
10000000*(1*0.3182+2*0.6706)/1000000    

tsc   
不太好估算,因为<1到底是多少不知道了  
10000000*(1*0.9524+2*0.0475)/1000000    

auto_explain log timing也有类似问题

当开启auto_explain的timing计数后,由于时钟开销的问题,可能严重的影响性能,如本文所示,即使使用TSC时钟,对于扫描记录数很多的时候,性能下降了一倍。

虽然auto_explain有一个超时阈值,但是当你开启了timing的记录后,就会导致它把所有的NODE执行时间都记录下来,因为执行结束前,并不知道总时间会不会超,所以每条QUERY的ANALYZE TIMING都会被开启。

建议如果不是特殊需求,不要开启auto_explain timing选项。

参考

https://www.ibm.com/developerworks/cn/linux/1308_liuming_linuxtime4/

https://www.postgresql.org/docs/9.6/static/pgtesttiming.html

https://www.postgresql.org/docs/9.6/static/auto-explain.html

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
关系型数据库 Linux PostgreSQL
Linux centos8 docker中安装postgresql12.4及远程访问设置
Linux centos8 docker中安装postgresql12.4及远程访问设置
708 0
|
关系型数据库 数据库 PostgreSQL
使用 Docker 在 Windows、Mac 和 Linux 系统轻松部署 PostgreSQL 数据库
使用 Docker 在 Windows、Mac 和 Linux 系统轻松部署 PostgreSQL 数据库
545 1
|
4月前
|
关系型数据库 MySQL Linux
在Linux中,如何配置数据库服务器(如MySQL或PostgreSQL)?
在Linux中,如何配置数据库服务器(如MySQL或PostgreSQL)?
|
4月前
|
关系型数据库 数据库 PostgreSQL
Linux 环境手动备份postgresql数据库
【8月更文挑战第12天】在Docker环境中使用命令行工具对PostgreSQL数据库进行备份和恢复。首先,通过dockerexec进入容器,使用pg_dump进行数据库模式的备份,然后使用dockercp将备份文件导出。接着,若需导入数据到另一数据库,先将备份文件复制到目标容器,再利用psql命令进行数据恢复。整个过程需确保目标数据库无同名模式,以防止导入失败
59 3
|
7月前
|
安全 Linux 调度
xenomai+linux双内核下的时钟管理机制
clock是操作系统正常运行的发动机,系统利用时钟中断维持系统时间、促使任务调度,以保证所有进程共享CPU资源;可以说,“时钟中断”是整个操作系统的脉搏。那你是否好奇xenomai cobalt内核和Linux内核双内核共存的情况下,时间子系统是如何工作的?一个硬件时钟如何为两个操作系统提供服务的?本文将揭开xenomai双核系统时间机制
154 0
xenomai+linux双内核下的时钟管理机制
|
关系型数据库 Linux 数据库
Linux系统之安装PostgreSQL数据库
Linux系统之安装PostgreSQL数据库
1232 1
|
7月前
|
Shell Linux C语言
【Shell 命令集合 系统设置 】⭐Linux 显示系统的时钟时间 clock命令 使用指南
【Shell 命令集合 系统设置 】⭐Linux 显示系统的时钟时间 clock命令 使用指南
104 0
【Shell 命令集合 系统设置 】⭐Linux 显示系统的时钟时间 clock命令 使用指南
|
7月前
|
关系型数据库 Linux PostgreSQL
linux postgresql 常用指令
linux postgresql 常用指令
35 0
|
7月前
|
存储 Linux Shell
【Shell 命令集合 系统设置 】Linux 设置和显示硬件时钟 hwclock命令 使用指南
【Shell 命令集合 系统设置 】Linux 设置和显示硬件时钟 hwclock命令 使用指南
110 0
|
7月前
|
缓存 Linux 编译器
C/C++ 函数调用以及Linux中系统调用 开销介绍:介绍C/C函数调用以及Linux中系统调用的开销情况
C/C++ 函数调用以及Linux中系统调用 开销介绍:介绍C/C函数调用以及Linux中系统调用的开销情况
114 0