GPDB中gp_vmem_protect_limit参数的意义

简介: GPDB中gp_vmem_protect_limit参数的意义

gp_vmem_protect_limit参数的意义


1、gp_vmem_protect_limit参数说明


1)在启用了基于资源队列的资源管理系统时,gp_vmem_protect_limit参数表示每个segment分配到的内存大小。预估值计算方式:所有GP数据库进程可用内存大小/发生故障时最大的primary segment个数。如果gp_vmem_protect_limit设置过高,则查询可能会失败。

2)该参数单位为MB

3)该参数推荐值计算方法:

     先计算gp_vmem值,若总内存小于256GB:

     gp_vmem = ((SWAP + RAM) – (7.5GB + 0.05 * RAM)) / 1.7

     若总内存大于256GB:

     gp_vmem = ((SWAP + RAM) – (7.5GB + 0.05 * RAM)) / 1.17

 再计算max_acting_primary_segments:当mirror segment启动后,可以在主机上运行primary segment最大个数。块镜像配置下,比如4台主机,每个主机上8个primary segment:SDW2坏掉,会导致SDW3上对应的3个mirror提升主提供服务,也就是说一个主机上最多8+3=11个primary segment(SDW1有2个mirror提升主,SDW4有3个mirror提升主)。

最后计算gp_vmem_protect_limit值:

gp_vmem_protect_limit = <gp_vmem> / <acting_primary_segments>

4)对于产生大量workfile的场景,gp_vmem的计算方式

     总系统内存<256GB:


<gp_vmem> = ((<SWAP> + <RAM>) – (7.5GB + 0.05 * <RAM> - (300KB * <total_#_workfiles>))) / 1.7

     总系统内存>=256GB:


<gp_vmem> = ((<SWAP> + <RAM>) – (7.5GB + 0.05 * <RAM> - (300KB * <total_#_workfiles>))) / 1.17

2、问题


最近,业务执行复杂SQL的场景,会fork上千进程来执行该SQL。按照上述说明将gp_vmem_protect_limit参数调整为合理的值后,操作系统仍旧会OOM,将GP杀掉。

通过脚本统计所有进程的内存占用:/etc/<pid>/smaps中Pss值的和,发现该值远大于gp_vmem_protect_limit*segment个数。

gp_vmem_protect_limit参数为什么没有将这个场景下的内存限制住,导致系统OOM?


3、分析


我们首先看下代码中gp_vmem_protect_limit是如何限制的?

在日志中有时会看到有下面类似的日志:

VM protect failed to allocate %d bytes from system, VM Protect %d MB available

我们找到代码位置:gp_failed_to_alloc函数中,以此为线索进行梳理。

也就是函数VmemTracker_ReserveVmem会中会进行校验,我们看下这个函数:

VmemTracker_ReserveVmemChunks函数进行检测,返回是否能够分配成功,我们接着看下这个函数:因为我们使用的是资源组,所以仅整理资源组相关代码:

内存限制的判断:segmentVmemChunks > vmemChunksQuota

我们看下segmentVmemChunks和vmem_ChunksQuata来源:

//postmaster和fork的进程会调用

CreateSharedMemoryAndSemaphores

   GPMemoryProtect_ShmemInit

       VmemTracker_ShmemInit//初始化vmem tracker共享内存中的状态

关注VmemTracker_ShmemInit函数:

三个初始值segmentVmemChunks在共享内存,初始0;vmemChunksQuota为gp_vmem_protect_limit值;redZoneChunks红线值为gp_vmem_protect_limit*0.8。那么我们搜索代码后,vmemChunksQuota和redZoneChunks是定值了,不再变化,segmentVmemChunks在开始统计内存使用前,还会初始化一个启动使用内存:

可以看到segmentVmemChunks共享内存中的值加上了一个起始值,最大是16MB,为什么要加上这个16MB呢?下面是原因:主要是每个QE进程本身的committed memory

    /*
    * Add the per-process startup committed memory to vmem tracker.
    *
    * Postgresql suggests setting vm.overcommit_memory to 2, so system level OOM
    * is triggered by committed memory size.  Each postgres process has several MB
    * committed memory since it being forked, in practice the size can be 6~16 MB,
    * depends on build-time and runtime configurations.
    *
    * These memory were not tracked by vmem tracker however, as a result an idle
    * (just gets launched, or just finished execution) QE process has 0 chunks in
    * track, but it might have 16MB committed memory.  The vmem tracker protect
    * limit will never be reached no matter how many such processes there are, but
    * when there are enough such processes the system level OOM will be triggered,
    * which will lead to unpredictable failures on not only postgres but also all
    * the other processes on the system.
    *
    * To prevent this worst case we add a startup cost to the vmem tracker, so the
    * vmem tracker OOM will happen instead of the system one, this is less harmful
    * and easier to handle.  The startup cost, however, is a hard coded value from
    * practice for now, we may want to make a better estimation at runtime in the
    * future.  Another thing to improve is that this startup cost should only be
    * added when vm.overcommit_memory is 2.
    */

    我们再返回去看下内存申请时,受该参数限制的地方:

    可以看到,先进行红线判断,超出红线也就是gp_vmem_protect_limit*0.8后,就进行清理回收,回收后再次进行分配,若segmentVmemChunks仍旧大于gp_vmem_protect_limit,则分配失败。

    最后,我们看下不受vmem tracker跟踪的地方:gp_malloc中gp_mp_inited没有启动的地方不受其跟踪:

    gp_malloc为GP代码中palloc、malloc等重定义的函数,在内存上下文中使用,该内存上下文是进程私有的。当然,除了调用gp_malloc外,代码中仍有直接调用操作系统malloc函数也就是不受vmem tracker跟踪的地方,但是通过SQL复现GDB跟踪,发现这种情况比较少,且申请的内存不大。


    4、总结


    至此,我们清晰梳理了gp_vmem_protect_limit参数的使用流程。通过共享内存参数segmentVmemChunks来统计一个segment上所有进程分配的内存,每个进程跟踪的初始值是16MB,在内存上下文中申请的内存都会统计进去。这些都是进程私有的内存,而我们通过/etc/<pid>/smaps中Pss值统计的包括进程私有和共享内存平摊后的内存。gp_vmem_protect_limit没有统计到共享内存,仍旧有操作系统OOM的风险

    目录
    相关文章
    |
    存储 缓存 大数据
    Starrocks执行查询报错:Memory of process exceed limit. Used: XXX, Limit: XXX. Mem usage has exceed the limit of BE
    Starrocks执行查询报错:Memory of process exceed limit. Used: XXX, Limit: XXX. Mem usage has exceed the limit of BE
    |
    SQL 关系型数据库 数据库
    PG/GP group by expression语法
    PG/GP group by expression语法
    119 1
    |
    关系型数据库
    PG/GP limit...offset...实现机制
    PG/GP limit...offset...实现机制
    126 0
    |
    SQL 数据库 数据库管理
    Replication-Replication Distribution Subsystem: agent xxxxxx failed. Column names in each table must be unique
    原文:Replication-Replication Distribution Subsystem: agent xxxxxx failed. Column names in each table must be unique   最近遇到一个关于发布订阅(Replication)的奇葩问题,特此记录一下这个案例。
    1042 0
    |
    SQL 机器学习/深度学习 关系型数据库
    RDS SQL Server– Best Practices of Execution Plan Cache for Missing Indexes
    Execution plan cache is a significant part of SQL Server memory management. It can reveal to you how the execution of a query will occur, or how query execution took place.
    2747 0

    热门文章

    最新文章