PostreSQL崩溃试验全记录

本文涉及的产品
云原生数据库 PolarDB MySQL 版,通用型 2核4GB 50GB
云原生数据库 PolarDB PostgreSQL 版,标准版 2核4GB 50GB
简介:

由于客户提到,运行某些大运算量批处理时,PostgreSQL突发性使用大量资源的问题,

进行了如下的调查,发现确实发生了崩溃。PostgreSQL 需要资源控制方案啊。

现在正在考虑是否必须要用 os 级别的限制方案:

过程如下:

测试环境:

内存:大约1024MB

postgresql.conf 设定:

    使用缺省值:checkpoint_segments =3  shard_buffers=32MB

    这些是有意为之,就是想看看数据量大、shared_buffer小的状况:

建立表(一条记录大约1024字节):

    postgres=# create table test01(id integer, val char(1024)); 

向表中插入大量数据(总共插入2400MB)

   postgres=# insert into test01 values(generate_series(1,2457600),repeat( chr(int4(random()*26)+65),1024));

插入操作会花费一些时间,此时用ps 命令看PostgreSQL个进程对内存使用状况:‘

 

复制代码
[root@server ~]# ps aux | grep post

root      3180  0.0  0.0 105296   712 pts/1    S    16:31   0:00 su - postgres

postgres  3181  0.0  0.0  70304   676 pts/1    S+   16:31   0:00 -bash

postgres  3219  0.0  0.2 113644  2864 pts/1    S    16:32   0:00 /usr/local/pgsql/bin/postgres -D /gao/data

postgres  3221  0.4  3.0 113724 35252 ?        Ss   16:32   0:01 postgres: writer process                  

postgres  3222  0.2  0.1 113644  1616 ?        Ds   16:32   0:00 postgres: wal writer process              

postgres  3223  0.0  0.0 114380  1148 ?        Ss   16:32   0:00 postgres: autovacuum launcher process     

postgres  3224  0.0  0.0  73332   472 ?        Ss   16:32   0:00 postgres: stats collector process         

root      3252  0.0  0.0 105296   712 pts/2    S    16:32   0:00 su - postgres

postgres  3253  0.0  0.0  70304   676 pts/2    S    16:32   0:00 -bash

postgres  3285  0.0  0.0  83488   740 pts/2    S+   16:32   0:00 ./psql

postgres  3286 14.8 80.2 2598332 924308 ?      Ds   16:32   0:35 postgres: postgres postgres [local] INSERT

root      3333  0.0  0.0  65424   812 pts/3    S+   16:36   0:00 grep post
复制代码

可以发现,INSERT操作正在消耗80%以上的内存。

再等片刻,发现psql端传来反馈:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
The connection to the server was lost. Attempting reset: Failed.
!> 

此时看看Log,可以看到:background writer (3321) 已经被干掉了,所有连接被重置。

复制代码
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (1 second apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (1 second apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (1 second apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (4 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (10 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (5 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (5 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (10 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (11 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (15 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (23 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (13 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (7 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (8 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (4 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (5 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
LOG:  background writer process (PID 3221) was terminated by signal 9: Killed
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
FATAL:  the database system is in recovery mode
LOG:  database system was interrupted; last known up at 2013-08-30 16:36:42 CST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  consistent recovery state reached at 0/B7657BD0
LOG:  redo starts at 0/B60FE2B8
LOG:  unexpected pageaddr 0/B044C000 in log file 0, segment 184, offset 4505600
LOG:  redo done at 0/B844B940
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
复制代码

各个进程都重新生成了:

复制代码
[root@server ~]# ps aux | grep post
root      3180  0.0  0.0 105296   660 pts/1    S    16:31   0:00 su - postgres
postgres  3181  0.0  0.0  70304   628 pts/1    S+   16:31   0:00 -bash
postgres  3219  0.0  0.5 113644  6276 pts/1    S    16:32   0:00 /usr/local/pgsql/bin/postgres -D /gao/data
root      3252  0.0  0.0 105296   660 pts/2    S    16:32   0:00 su - postgres
postgres  3253  0.0  0.0  70304   628 pts/2    S    16:32   0:00 -bash
postgres  3285  0.0  0.0  83488  1144 pts/2    S+   16:32   0:00 ./psql
postgres  3348  0.0  0.0 113644   984 ?        Ss   16:40   0:00 postgres: writer process                  
postgres  3349  0.0  0.0 113644   732 ?        Ss   16:40   0:00 postgres: wal writer process              
postgres  3350  0.0  0.1 114336  1756 ?        Ss   16:40   0:00 postgres: autovacuum launcher process     
postgres  3351  0.0  0.0  73332   580 ?        Ss   16:40   0:00 postgres: stats collector process         
root      3361  0.0  0.0  65420   796 pts/3    R+   16:44   0:00 grep post
[root@server ~]# 
复制代码

回到psql端再看,发现连接已经失效了:

!> \
Invalid command \. Try \? for help.
!> \dt;
You are currently not connected to a database.
!> 

根据向社区提问,据说是因为OS级别的OOM错误发生,所以杀死了Postmaster进程。

总之,此种情况表明,如果没有对资源消费总量的控制,那么突发性的用户访问会导致崩溃的。









本文转自健哥的数据花园博客园博客,原文链接:http://www.cnblogs.com/gaojian/p/3291938.html,如需转载请自行联系原作者

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
4月前
|
传感器 安全 文件存储
CrowdStrike更新导致全球Windows系统大规模崩溃,CEO致歉并详解修复措施
CrowdStrike更新导致全球Windows系统大规模崩溃,CEO致歉并详解修复措施
CrowdStrike更新导致全球Windows系统大规模崩溃,CEO致歉并详解修复措施
|
7月前
|
存储 NoSQL 编译器
实战总结|抽丝剥茧,记一次神奇的崩溃
本文详细回放了一个崩溃案例的分析过程。回顾了C++多态和类内存布局、pc指针与芯片异常处理、内存屏障的相关知识。
|
4月前
|
运维 监控
"网络工程师必备秘籍:4大招轻松破解M-LAG故障难题,你的网络还能更稳定!"
【8月更文挑战第19天】M-LAG技术通过多链路聚合提升网络可靠性和带宽。面对M-LAG故障,四步定位法助您迅速排障:1) 检查M-LAG成员状态确保链路活动;2) 验证链路聚合配置一致;3) 分析控制和平面与数据平面状态;4) 排除物理层故障如端口状态异常。结合网络监控和定期检查,保障M-LAG稳定运行。
94 0
|
6月前
|
运维 监控 Java
网络之谜:记一次失败排查的故事
【6月更文挑战第6天】文章详述了一次故障排查经历,故障表现为客户端接口调用延迟,服务器报错(Broken pipe和Connection reset by peer),Nginx连接数异常增加。通过pinpoint平台发现三种错误类型。排查过程涉及数据库、中间链路和第三方服务,但未找到根本原因。监控手段不足(如无法生成Java dump)和故障难以复现增加了难度。尽管最终靠重启服务暂时解决,但提出改进监控和提升故障排查技巧的重要性。总结中强调了故障排查的复杂性、所需专业知识及冷静分析的态度。
|
7月前
|
编解码 安全 定位技术
典型崩溃问题集锦
典型崩溃问题集锦
52 0
|
7月前
|
机器学习/深度学习 存储 算法
算法人生(4):从“选项学习”看“战胜拖延”(担心失败版)
选项学习是强化学习的一种策略,通过定义、学习和切换选项来解决复杂任务,将大任务分解为可重复使用的子任务,以提高学习效率和适应性。面对因担心失败而拖延的问题,我们可以借鉴选项学习的思想:将大任务拆分为小目标,正视失败作为成长的一部分,回顾成功经验并寻求支持。通过这种方式,逐步增强自信,降低拖延现象。
|
7月前
|
程序员
程序员缓解工作压力有哪些小窍门
程序员合理释放工作压力和情绪至关重要,需要找到合适自己的节奏和方式
89 0
|
测试技术 内存技术
|
安全 Windows
Win系统 - 更新后,竟有这么“严重”的后果?
Win系统 - 更新后,竟有这么“严重”的后果?
175 0
Win系统 - 更新后,竟有这么“严重”的后果?
|
JSON 运维 网络协议
平稳运行半年的系统宕机了,记录一次排错调优的全过程!
最近发生了一件很让人头疼的事情,已经上线半年且平稳运行半年系统在年后早高峰的使用时发生了濒临宕机的情况。访问速度特别慢,后台查到大量time_wait的连接,从代码层面到架构层面到网络层面排查了几天几夜,总算是有了结果。