About core dumps on Linux

Introduction to core dumps

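A core dump preserves the state of a process, including its memory, at the moment it crashes, so that the failure can be analyzed later. Sometimes, however, a process crashes without producing a core file, because several factors decide whether a core file is written. A common switch is ulimit -c, which caps the size of the core file that may be written: a value of 0 disables core files, and a core larger than a nonzero limit is truncated to that size.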

If ulimit -c prints 0, core dump output is disabled.

[root@srdsdevapp69 ~]# ulimit -c
0 

Setting ulimit -c unlimited removes the size limit on core files.

[root@srdsdevapp69 ~]# ulimit -c unlimited
[root@srdsdevapp69 ~]# ulimit -c
unlimited 

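A ulimit value set this way applies only to the current shell session and the processes started from it. Processes started from a newly opened terminal are not affected.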
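The same limit can be inspected from inside a process. The sketch below uses Python's standard resource module; the helper names core_limit_status and raise_core_limit_to_hard are made up for illustration. It reads RLIMIT_CORE (the limit behind ulimit -c) and raises the soft limit to the hard limit, which, like ulimit -c in a shell, affects only the calling process and its children.

```python
import resource


def core_limit_status():
    """Return the (soft, hard) RLIMIT_CORE pair as human-readable strings."""
    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return fmt(soft), fmt(hard)


def raise_core_limit_to_hard():
    """Raise the soft limit to the hard limit (needs no extra privileges)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    print("before:", core_limit_status())
    raise_core_limit_to_hard()
    print("after: ", core_limit_status())
```

If the hard limit itself is 0, an unprivileged process cannot raise it, and the limit has to be changed where the process is launched.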

ulimit -c is only one of many factors that affect core output; the others are listed in the man page.

$ man core
...
There are various circumstances in which a core dump file is not produced:

*  The  process  does  not have permission to write the core file.  (By default the core file is called core,
  and is created in the current working directory.  See below for details on naming.)  Writing the core file
  will  fail  if the directory in which it is to be created is non-writable, or if a file with the same name
  exists and is not writable or is not a regular file (e.g., it is a directory or a symbolic link).

*  A (writable, regular) file with the same name as would be used for the core dump already exists, but there
  is more than one hard link to that file.

*  The file system where the core dump file would be created is full; or has run out of inodes; or is mounted
  read-only; or the user has reached their quota for the file system.

*  The directory in which the core dump file is to be created does not exist.

*  The RLIMIT_CORE (core file size) or RLIMIT_FSIZE (file size) resource limits for the process  are  set  to
  zero; see getrlimit(2) and the documentation of the shell’s ulimit command (limit in csh(1)).

*  The binary being executed by the process does not have read permission enabled.

*  The  process  is executing a set-user-ID (set-group-ID) program that is owned by a user (group) other than
  the real user (group) ID of the process.  (However, see the description of  the  prctl(2)  PR_SET_DUMPABLE
  operation, and the description of the /proc/sys/fs/suid_dumpable file in proc(5).) 
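A few of the conditions in this list can be checked up front. The following is a rough sketch, not a complete diagnosis: the function name core_dump_blockers is hypothetical, it covers only the directory, file system, and resource-limit items, and it checks the limits of the Python process itself rather than those of the process that will crash.

```python
import os
import resource


def core_dump_blockers(directory):
    """List reasons from core(5) why a dump into `directory` may not be written."""
    reasons = []
    if not os.path.isdir(directory):
        reasons.append("directory does not exist")
    else:
        if not os.access(directory, os.W_OK):
            reasons.append("directory is not writable")
        vfs = os.statvfs(directory)
        if vfs.f_bavail == 0:
            reasons.append("file system is full")
        # Some file systems report no inode counts at all; skip the check then.
        if vfs.f_files > 0 and vfs.f_favail == 0:
            reasons.append("file system has run out of inodes")
    for name in ("RLIMIT_CORE", "RLIMIT_FSIZE"):
        soft, _hard = resource.getrlimit(getattr(resource, name))
        if soft == 0:
            reasons.append(name + " is zero")
    return reasons


if __name__ == "__main__":
    print(core_dump_blockers(os.getcwd()))
```

An empty list does not guarantee that a core will appear; the remaining items (hard links, set-user-ID binaries, unreadable binaries) are not covered here.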

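The man page leaves out one more case: a process can catch the signals that would otherwise produce a core dump and handle them itself. MySQL, for example, does exactly this.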
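The effect of catching such a signal can be seen with a small experiment. This is only an illustration, not how mysqld is implemented: a child Python process sends itself SIGABRT, once with the default disposition and once with a handler installed. The names CHILD and run_child are invented for this sketch, and the child sets RLIMIT_CORE to 0 so the demo leaves no core file behind.

```python
import subprocess
import sys

# Child program: disable core files, optionally install a handler, then raise SIGABRT.
CHILD = """
import os, resource, signal, sys, time
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))  # keep the demo from writing a core
if sys.argv[1] == "catch":
    signal.signal(signal.SIGABRT, lambda signum, frame: sys.exit(3))
os.kill(os.getpid(), signal.SIGABRT)
time.sleep(1)  # give a pending handler a chance to run
"""


def run_child(mode):
    """Run the child; a negative return value means it was killed by a signal."""
    return subprocess.run([sys.executable, "-c", CHILD, mode]).returncode


if __name__ == "__main__":
    print("default disposition:", run_child("default"))
    print("handler installed:  ", run_child("catch"))
```

With the default disposition the child is terminated by the signal, which is the path that can lead to a core dump. With the handler installed it exits normally with status 3, so the kernel never gets to write a core.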

abrtd

On RHEL/CentOS, abrtd is enabled by default to record crash data (including generating core dumps) and to report failures.

In that case the abrtd process is running:

[root@srdsdevapp69 ~]# service abrtd status
abrtd (pid  8711) is running... 

Core file generation is redirected to abrt-hook-ccpp:

[root@srdsdevapp69 ~]# cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e 
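According to core(5), a core_pattern value that starts with | names a program that receives the core on its standard input, while any other value is a file name template. A minimal sketch of telling the two apart follows; the function names parse_core_pattern and read_core_pattern are illustrative only.

```python
def parse_core_pattern(pattern):
    """Classify a core_pattern value as a pipe handler or a file name template."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        parts = pattern[1:].split()
        program = parts[0] if parts else ""
        return {"kind": "pipe", "program": program, "args": parts[1:]}
    return {"kind": "file", "template": pattern}


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read and classify the current setting (Linux only)."""
    with open(path) as f:
        return parse_core_pattern(f.read())


if __name__ == "__main__":
    print(parse_core_pattern("|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e"))
    print(parse_core_pattern("/data/core-%e-%p-%t"))
```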

Testing core dumps

Create the following program, which triggers a core dump, and run it.

testcoredump.c:

int main()
{
  return 1/0;
} 

Compile and run it:

$gcc testcoredump.c -o testcoredump
$./testcoredump 

The system log shows that a core file was created temporarily but then deleted.

$tail -f  /var/log/messages
...
Dec  8 09:54:44 srdsdevapp69 kernel: testcoredump[4028] trap divide error ip:400489 sp:7fff5a54b200 error:0 in testcoredump[400000+1000]
Dec  8 09:54:44 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-09:54:44-4028' creation detected
Dec  8 09:54:44 srdsdevapp69 abrt[4029]: Saved core dump of pid 4028 (/root/testcoredump) to /var/spool/abrt/ccpp-2016-12-08-09:54:44-4028 (184320 bytes)
Dec  8 09:54:44 srdsdevapp69 abrtd: Executable '/root/testcoredump' doesn't belong to any package
Dec  8 09:54:44 srdsdevapp69 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2016-12-08-09:54:44-4028' exited with 1
Dec  8 09:54:44 srdsdevapp69 abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2016-12-08-09:54:44-4028, deleting 

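By default abrtd keeps core files only for programs that belong to an installed package. Changing the following parameter makes it record core files for all programs.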

$vi /etc/abrt/abrt-action-save-package-data.conf
...
ProcessUnpackaged = yes 

Run the test program again, and this time the core file is kept:

Dec  8 10:04:30 srdsdevapp69 kernel: testcoredump[9189] trap divide error ip:400489 sp:7fff99973b30 error:0 in testcoredump[400000+1000]
Dec  8 10:04:30 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-10:04:30-9189' creation detected
Dec  8 10:04:30 srdsdevapp69 abrt[9190]: Saved core dump of pid 9189 (/root/testcoredump) to /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189 (184320 bytes)
Dec  8 10:04:31 srdsdevapp69 kernel: Bridge firewalling registered
Dec  8 10:04:44 srdsdevapp69 abrtd: Sending an email...
Dec  8 10:04:44 srdsdevapp69 abrtd: Email was sent to: root@localhost
Dec  8 10:04:44 srdsdevapp69 abrtd: New problem directory /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189, processing
Dec  8 10:04:44 srdsdevapp69 abrtd: No actions are found for event 'notify' 

abrtd can recognize a repeated problem and deduplicate it, which keeps a flood of core files from filling the disk.

Dec  8 10:18:35 srdsdevapp69 kernel: testcoredump[16598] trap divide error ip:400489 sp:7fff26cc9f50 error:0 in testcoredump[400000+1000]
Dec  8 10:18:35 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-10:18:35-16598' creation detected
Dec  8 10:18:35 srdsdevapp69 abrt[16599]: Saved core dump of pid 16598 (/root/testcoredump) to /var/spool/abrt/ccpp-2016-12-08-10:18:35-16598 (184320 bytes)
Dec  8 10:18:45 srdsdevapp69 abrtd: Sending an email...
Dec  8 10:18:45 srdsdevapp69 abrtd: Email was sent to: root@localhost
Dec  8 10:18:45 srdsdevapp69 abrtd: Duplicate: UUID
Dec  8 10:18:45 srdsdevapp69 abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189
Dec  8 10:18:45 srdsdevapp69 abrtd: Problem directory is a duplicate of /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189
Dec  8 10:18:45 srdsdevapp69 abrtd: Deleting problem directory ccpp-2016-12-08-10:18:35-16598 (dup of ccpp-2016-12-08-10:04:30-9189)
Dec  8 10:18:45 srdsdevapp69 abrtd: No actions are found for event 'notify_dup' 

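abrtd also limits the size of the crash reports it stores (mostly core files), controlled by the MaxCrashReportsSize parameter. Once the limit is exceeded, not every core file is kept, and abrtd deletes problem directories, as the following log shows.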

Dec  8 14:10:32 srdsdevapp69 abrt[10548]: Saved core dump of pid 10527 (/usr/local/Percona-Server-5.6.29-rel76.2-Linux.x86_64.ssl101/bin/mysqld) to /var/spool/abrt/ccpp-2016-12-08-14:10:00-10527 (10513362944 bytes)
Dec  8 14:10:32 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-14:10:00-10527' creation detected
Dec  8 14:10:32 srdsdevapp69 abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2016-12-08-14:05:43-8080'
Dec  8 14:10:32 srdsdevapp69 abrt[10548]: /var/spool/abrt is 25854515653 bytes (more than 1279MiB), deleting 'ccpp-2016-12-08-14:05:43-8080'
Dec  8 14:10:32 srdsdevapp69 abrt[10548]: Lock file '/var/spool/abrt/ccpp-2016-12-08-14:05:43-8080/.lock' is locked by process 7893
Dec  8 14:10:32 srdsdevapp69 abrt[10548]: '/var/spool/abrt/ccpp-2016-12-08-14:05:43-8080' does not exist
Dec  8 14:10:41 srdsdevapp69 abrtd: Sending an email...
Dec  8 14:10:41 srdsdevapp69 abrtd: Email was sent to: root@localhost
Dec  8 14:10:41 srdsdevapp69 abrtd: New problem directory /var/spool/abrt/ccpp-2016-12-08-14:10:00-10527, processing
Dec  8 14:10:41 srdsdevapp69 abrtd: No actions are found for event 'notify' 

How abrtd works

abrtd is triggered by watching the /var/spool/abrt/ directory, so even copying a directory there triggers it.

[root@srdsdevapp69 abrt]# cp -rf ccpp-2016-12-08-10:04:30-9189 ccpp-2016-12-08-10:04:30-91891 

This produces the following system log entries:

Dec  8 10:35:33 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-10:04:30-91891' creation detected
Dec  8 10:35:33 srdsdevapp69 abrtd: Duplicate: UUID
Dec  8 10:35:33 srdsdevapp69 abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189
Dec  8 10:35:33 srdsdevapp69 abrtd: Problem directory is a duplicate of /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189
Dec  8 10:35:33 srdsdevapp69 abrtd: Deleting problem directory ccpp-2016-12-08-10:04:30-91891 (dup of ccpp-2016-12-08-10:04:30-9189)
Dec  8 10:35:33 srdsdevapp69 abrtd: No actions are found for event 'notify_dup' 

Changing where cores are written, so that core_pattern no longer invokes the abrt-hook-ccpp helper, effectively disables abrtd.

echo "/data/core-%e-%p-%t">/proc/sys/kernel/core_pattern 

When a core dump happens now, /var/log/messages contains no abrtd entries:

Dec  8 10:30:24 srdsdevapp69 kernel: testcoredump[23050] trap divide error ip:400489 sp:7fff9f01dfb0 error:0 in testcoredump[400000+1000] 

The core file is now written directly to the location given by /proc/sys/kernel/core_pattern:

/data/core-testcoredump-23050-1481164224 
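The file name above follows from the template: in core(5), %e is the executable name, %p the PID, and %t the dump time in seconds since the epoch. The sketch below expands only those specifiers plus %%; the function expand_core_pattern is a made-up helper that mimics the substitution and is not the kernel's implementation.

```python
import re


def expand_core_pattern(template, exe, pid, timestamp):
    """Expand the %e, %p, %t and %% specifiers of a core_pattern file template."""
    values = {"e": exe, "p": str(pid), "t": str(timestamp), "%": "%"}

    def replace(match):
        # Specifiers this sketch does not know are left untouched.
        return values.get(match.group(1), match.group(0))

    return re.sub(r"%(.)", replace, template)


if __name__ == "__main__":
    print(expand_core_pattern("/data/core-%e-%p-%t", "testcoredump", 23050, 1481164224))
```

Called with the values from the log entry above, it reproduces the path of the core file shown there.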

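Because /proc/sys/kernel/core_pattern no longer uses the abrt-hook-ccpp helper, checking the abrt-ccpp service status accordingly reports that the service is not running (exit status 3).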

[root@srdsdevapp69 ~]# service abrt-ccpp status
[root@srdsdevapp69 ~]# echo $?
3 

After /proc/sys/kernel/core_pattern is restored, the abrt-ccpp service is reported as normal again:

[root@srdsdevapp69 ~]# echo "|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e">/proc/sys/kernel/core_pattern
[root@srdsdevapp69 ~]# service abrt-ccpp status
[root@srdsdevapp69 ~]# echo $?
0 

If abrtd is stopped while /proc/sys/kernel/core_pattern is still set to

"|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e" 

the core file is written to the current working directory of the crashing process:

Dec  8 10:46:21 srdsdevapp69 kernel: testcoredump[31364] trap divide error ip:400489 sp:7fff15d6f450 error:0 in testcoredump[400000+1000]
Dec  8 10:46:21 srdsdevapp69 abrt[31365]: abrtd is not running. If it crashed, /proc/sys/kernel/core_pattern contains a stale value, consider resetting it to 'core'
Dec  8 10:46:21 srdsdevapp69 abrt[31365]: Saved core dump of pid 31364 to /root/core.31364 (184320 bytes) 

Enabling core dumps for MySQL

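The MySQL server process mysqld itself catches the signals that can cause a crash. By default it prints a stack trace and exits abnormally without generating a core file.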

2016-12-08 11:14:51 14034 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.6.29-76.2-debug-log'  socket: '/mysqlrds/data/mysql.sock'  port: 3306  Source distribution
03:18:43 UTC - mysqld got signal 8 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=33554432
read_buffer_size=2097152
max_used_connections=2
max_threads=100001
thread_count=1
connection_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 307242932 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x2427ca20
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fd53066bca8 thread_stack 0x40000
/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x35)[0xaf23c9]
/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x42e)[0x74d42a]
/lib64/libpthread.so.0[0x3805a0f7e0]
/usr/local/mysql/bin/mysqld(_Z19mysql_rename_tablesP3THDP10TABLE_LISTb+0x6c)[0x82fa64]
/usr/local/mysql/bin/mysqld(_Z21mysql_execute_commandP3THD+0x2aab)[0x8079e9]
/usr/local/mysql/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x588)[0x810ce3]
/usr/local/mysql/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0xd8b)[0x80228a]
/usr/local/mysql/bin/mysqld(_Z10do_commandP3THD+0x3bd)[0x801087]
/usr/local/mysql/bin/mysqld(_Z26threadpool_process_requestP3THD+0x71)[0x8ec721]
/usr/local/mysql/bin/mysqld[0x8ef363]
/usr/local/mysql/bin/mysqld[0x8ef5a0]
/usr/local/mysql/bin/mysqld(pfs_spawn_thread+0x159)[0xe14049]
/lib64/libpthread.so.0[0x3805a07aa1]
/lib64/libc.so.6(clone+0x6d)[0x32286e893d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fd508004d80): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED

You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash. 

To make it produce a core file, the --core-file option must be turned on:

mysqld --defaults-file=/home/mysql/etc/my.cnf --core-file & 

The option can also be added to the my.cnf file, in the [mysqld] section:

core_file 

Core file size

Core file sizes show a curious behavior: the disk space a core actually occupies can be far smaller than its file size.

For example, the core file below has a file size of about 10 GB but occupies only about 1 GB on disk (1940984 * 512 B = 993783808 bytes).

[root@srdsdevapp69 ccpp-2016-12-08-14:10:00-10527]# stat coredump
  File: `coredump'
  Size: 10513362944 Blocks: 1940984    IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 14990       Links: 1
Access: (0640/-rw-r-----)  Uid: (  173/    abrt)   Gid: (  512/   mysql)
Access: 2016-12-08 14:10:41.886280668 +0800
Modify: 2016-12-08 14:10:27.704523443 +0800
Change: 2016-12-08 14:10:27.704523443 +0800 

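This happens because the system skips some all-zero blocks when it writes the core file, so the file contains holes (the effect can be reproduced with dd's seek option). It occurs whether /proc/sys/kernel/core_pattern points to the abrt-hook-ccpp program or directly to a file path. It is a useful optimization: it saves disk space and speeds up core file generation.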
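The hole effect is easy to reproduce without dd. The sketch below (helper names allocated_bytes and make_sparse_file are invented here) seeks past a region without writing it and then compares the apparent size with the allocated size, assuming st_blocks counts 512-byte units as stat does on Linux. How much space is really saved depends on the file system.

```python
import os
import tempfile


def allocated_bytes(blocks, block_unit=512):
    """Convert the Blocks count shown by stat (512-byte units) into bytes."""
    return blocks * block_unit


def make_sparse_file(path, hole_size):
    """Seek past `hole_size` bytes without writing them, then write one byte."""
    with open(path, "wb") as f:
        f.seek(hole_size)
        f.write(b"\0")
    st = os.stat(path)
    return st.st_size, allocated_bytes(st.st_blocks)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        size, used = make_sparse_file(os.path.join(tmp, "sparse.bin"), 10 * 1024 * 1024)
        print("apparent size:", size, "allocated:", used)
    # The Blocks value from the stat output of the core file above.
    print("core file allocation:", allocated_bytes(1940984), "bytes")
```

On a file system that supports sparse files, the allocated figure for the test file is typically a few kilobytes even though its apparent size is just over 10 MiB.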
