Kernel Debugging Tricks

简介: Kernel Debugging Tricks

Debugging the kernel is not necessarily rocket science; in fact it can be achieved using very simple and straight forward techniques and some time, patience and perseverance. This page describes some tricks and techniques to help debug the kernel.

  • printk is your friend
    The simplest, and probably most effective way to debug the kernel is via printk(). This enables one to print messages to the console, and it very similar to printf(). Note that printk() can slow down the execution of code which can alter the way code runs, for example, changing the way race conditions occur.
  • CHANGING THE RING BUFFER SIZE
    The internal kernel console message buffer can sometimes be too small to capture all of the printk messages, especially when debug code generates a lot of printk messages. If the buffer fills up, it wraps around and one can lose valueable debug messages.

To increase the internal buffer, use the kernel boot parameter:

log_buf_len=N

where N is the size of the buffer in bytes, and must be a power of 2.

  • CHANGING DEBUG LEVELS
    One can specify the type of printk() log level by pre-pending the 1st printk() argument with one of the following:
KERN_EMERG    /* system is unusable                   */
KERN_ALERT    /* action must be taken immediately     */
KERN_CRIT     /* critical conditions                  */
KERN_ERR      /* error conditions                     */
KERN_WARNING  /* warning conditions                   */
KERN_NOTICE   /* normal but significant condition     */
KERN_INFO     /* informational                        */
KERN_DEBUG    /* debug-level messages                 */
e.g. printk(KERN_DEBUG "example debug message\n");

If one does not specify the log level then the default log level of KERN_WARNING is used. For example, enable all levels of console message:

echo 7 > /proc/sys/kernel/printk

To view console messages at boot, remove the quite and splash boot parameters from the kernel boot line in grub. This will disable the usplash splash screen and re-enable console messages.

  • Serial Console
    Serial console enables one to dump out console messages over a serial cable. Most modern PCs do not have legacy serial ports, so instead, one can use a USB serial dongle instead. A "null serial cable" or "universal file transfer cable" is needed to connect the target computer with the host. Most commonly this will be a DB9 female to DB9 female null serial cable. In addition, one needs to enable USB serial support as a kernel build configuration:
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL=y

and enable the appropriate driver, e.g.:

CONFIG_USB_SERIAL_PL2303=y

and boot this kernel with

console=ttyUSB0,9600n8

one may need to adjust the baud rate appropriately.

Note: Generally, there is NO hardware or software flow control on serial console drivers, which means one may get dropped characters when running very high speed tty baud rates, such as 115200 baud.

  • Console Messages
    Kernel Oops messages general contain a fair amount of information, ranging from register and process state dump and a stack dump too. Unfortunately the stack dump can be more than 25 lines and can scroll off the top of the 25 line Virtual Console. Hence to capture more of a Oops, try the following:
chvt 1
setfont /usr/share/consolefonts/Uni1-VGA8.psf.gz

Of course, one may still have a stack dump that scrolls the top of the Oops message off the console, so one trick is to rebuild the kernel with the stack dump removed, just to capture the initial Oops information. To do this, modify dump_stack in arch/x86/kernel/dumpstack_*.c and comment out the call to show_trace()

  • Slowing down kernel messages on boot
    One may find a machine hangs during the kernel boot process and one would like to be able to see all the kernel messages but unfortunately they scroll off the console too quickly. One can slow down kernel console messages at boot time using by building the kernel with the following option enabled:
CONFIG_BOOT_PRINTK_DELAY=y

And boot the machine with the following kernel boot parameter:

boot_delay=N

where N = msecs delay between each console message.

  • Kernel panic during suspend
    Debugging suspend/resume issues can be difficult if the kernel panics during suspend, especially late in the suspend because console messages are disabled. One can stop console messages from being suspended by using the kernel parameter no_console_suspend:
    no_console_suspend=1
    This will force the console not to suspend. Boot with this option, chvt 1 (to console #1), and suspend using pm-suspend
  • Serial Console in VirtualBox
    In some debug scenerios it can be helpful to debug the kernel running inside a virtual machine. This is useful for some classes of non-hardware specific bugs, for example generic kernel core problems or debugging file system drivers.

One can capture Linux console messages running inside VirtualBox by setting it the VirtualBox serial log to /tmp/vbox and running a serial tty communications program such as minicom, and configure it to communicate with a named pipe tty called unix#/tmp/vbox

Boot with virtualised kernel boot line:

console=ttyS0,9600

and minicom will capture the console messages

  • Network Console
    One can route console messages over a network using netconsole. Note that it's not useful for capturing kernel panics as kernel halts before the messages can be transmitted over the network. However it can be useful to monitor systems without the need of message serial console cabling.

see Documentation/networking/netconsole.txt and also Kernel/Netconsole.

netconsole=[src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
        where
             src-port      source for UDP packets (defaults to 6665)
             src-ip        source IP to use (interface address)
             dev           network interface (eth0)
             tgt-port      port for logging agent (6666)
             tgt-ip        IP address for logging agent
             tgt-macaddr   ethernet MAC address for logging agent (broadcast)

Examples:

linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc

The remote host can run either netcat -u -l -p <port>or syslogd.

  • gdb on vmlinux
    One can disassemble a built kernel using gdb on the vmlinux image. This is useful when one gets a kernel Oops message and a stack dump - one can then disassemble the object code and see where the Oops is occuring. For example:
gdb  debian/build/build-generic/vmlinux
(gdb) disassemble printk
Dump of assembler code for function printk:
0xffffffff8023dce0 <printk+0>:  sub    $0xd8,%rsp
0xffffffff8023dce7 <printk+7>:  lea    0xe0(%rsp),%rax
0xffffffff8023dcef <printk+15>: mov    %rsi,0x28(%rsp)
0xffffffff8023dcf4 <printk+20>: mov    %rsp,%rsi
0xffffffff8023dcf7 <printk+23>: mov    %rdx,0x30(%rsp)
0xffffffff8023dcfc <printk+28>: mov    %rcx,0x38(%rsp)
0xffffffff8023dd01 <printk+33>: mov    %rax,0x8(%rsp)
0xffffffff8023dd06 <printk+38>: lea    0x20(%rsp),%rax
0xffffffff8023dd0b <printk+43>: mov    %r8,0x40(%rsp)
0xffffffff8023dd10 <printk+48>: mov    %r9,0x48(%rsp)
0xffffffff8023dd15 <printk+53>: movl   $0x8,(%rsp)
0xffffffff8023dd1c <printk+60>: movl   $0x30,0x4(%rsp)
0xffffffff8023dd24 <printk+68>: mov    %rax,0x10(%rsp)
0xffffffff8023dd29 <printk+73>: callq  0xffffffff8023d980 <vprintk>
0xffffffff8023dd2e <printk+78>: add    $0xd8,%rsp
0xffffffff8023dd35 <printk+85>: retq

End of assembler dump.

  • Objdump
    If one has the built object code at hand, one can disassemble the object using objdump as follows:
objdump -SdCg debian/build/build-generic/fs/dcache.o
  • Using GDB to find the location where your kernel panicked or oopsed.
    A quick and easy way to find the line of code where your kernel panicked or oopsed is to use GDB list command. You can do this as follows.

Lets assume your panic/oops message says something like:

[  174.507084] Stack:
[  174.507163]  ce0bd8ac 00000008 00000000 ce4a7e90 c039ce30 ce0bd8ac c0718b04 c07185a0
[  174.507380]  ce4a7ea0 c0398f22 ce0bd8ac c0718b04 ce4a7eb0 c037deee ce0bd8e0 ce0bd8ac
[  174.507597]  ce4a7ec0 c037dfe0 c07185a0 ce0bd8ac ce4a7ed4 c037d353 ce0bd8ac ce0bd8ac
[  174.507888] Call Trace:
[  174.508125]  [<c039ce30>] ? sd_remove+0x20/0x70
[  174.508235]  [<c0398f22>] ? scsi_bus_remove+0x32/0x40
[  174.508326]  [<c037deee>] ? __device_release_driver+0x3e/0x70
[  174.508421]  [<c037dfe0>] ? device_release_driver+0x20/0x40
[  174.508514]  [<c037d353>] ? bus_remove_device+0x73/0x90
[  174.508606]  [<c037bccf>] ? device_del+0xef/0x150
[  174.508693]  [<c0399207>] ? __scsi_remove_device+0x47/0x80
[  174.508786]  [<c0399262>] ? scsi_remove_device+0x22/0x40
[  174.508877]  [<c0399324>] ? __scsi_remove_target+0x94/0xd0
[  174.508969]  [<c03993c0>] ? __remove_child+0x0/0x20
[  174.509060]  [<c03993d7>] ? __remove_child+0x17/0x20
[  174.509148]  [<c037b868>] ? device_for_each_child+0x38/0x60
[  174.509241]  [<c039938f>] ? scsi_remove_target+0x2f/0x60
[  174.509393]  [<d0c38907>] ? __iscsi_unbind_session+0x77/0xa0 [scsi_transport_iscsi]
[  174.509699]  [<c015272e>] ? run_workqueue+0x6e/0x140
[  174.509801]  [<d0c38890>] ? __iscsi_unbind_session+0x0/0xa0 [scsi_transport_iscsi]
[  174.509977]  [<c0152888>] ? worker_thread+0x88/0xe0
[  174.510047]  [<c01566a0>] ? autoremove_wake_function+0x0/0x40

Lets say you want to know what line of code represents sd_remove+0x20/0x70. cd to the ubuntu debian/build/build-generic directory in your kernel tree and run gdb on the ".o" file which has the function sd_remove() in this case in sd.o, and use the gdb "list" command, (gdb) list *(function+0xoffset), in this case function is sd_remove() and offset is 0x20, and gdb should tell you the line number where you hit the panic or oops. This has worked for me very reliably for most cases.

manjo@hungry:~/devel/ubuntu/kernel/ubuntu-karmic-397906/debian/build/build-generic/drivers/scsi$ gdb sd.o
GNU gdb (GDB) 6.8.50.20090628-cvs-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
(gdb) list *(sd_remove+0x20)
0x1650 is in sd_remove (/home/manjo/devel/ubuntu/kernel/ubuntu-karmic-397906/drivers/scsi/sd.c:2125).
2120    static int sd_remove(struct device *dev)
2121    {
2122            struct scsi_disk *sdkp;
2123
2124            async_synchronize_full();
2125            sdkp = dev_get_drvdata(dev);
2126            blk_queue_prep_rq(sdkp->device->request_queue, scsi_prep_fn);
2127            device_del(&sdkp->dev);
2128            del_gendisk(sdkp->disk);
2129            sd_shutdown(dev);
(gdb)
相关实践学习
阿里云图数据库GDB入门与应用
图数据库(Graph Database,简称GDB)是一种支持Property Graph图模型、用于处理高度连接数据查询与存储的实时、可靠的在线数据库服务。它支持Apache TinkerPop Gremlin查询语言,可以帮您快速构建基于高度连接的数据集的应用程序。GDB非常适合社交网络、欺诈检测、推荐引擎、实时图谱、网络/IT运营这类高度互连数据集的场景。 GDB由阿里云自主研发,具备如下优势: 标准图查询语言:支持属性图,高度兼容Gremlin图查询语言。 高度优化的自研引擎:高度优化的自研图计算层和存储层,云盘多副本保障数据超高可靠,支持ACID事务。 服务高可用:支持高可用实例,节点故障迅速转移,保障业务连续性。 易运维:提供备份恢复、自动升级、监控告警、故障切换等丰富的运维功能,大幅降低运维成本。 产品主页:https://www.aliyun.com/product/gdb
相关文章
|
Linux Shell C语言
【Shell 命令集合 磁盘维护 】Linux 分区管理的工具 sfdisk命令使用教程
【Shell 命令集合 磁盘维护 】Linux 分区管理的工具 sfdisk命令使用教程
284 1
|
7月前
|
机器学习/深度学习 人工智能 负载均衡
Trae 04.22版本深度解析:Agent能力升级与MCP市场对复杂任务执行的革新
在当今快速发展的AI技术领域,Agent系统正成为自动化任务执行和智能交互的核心组件。Trae作为一款先进的AI协作平台,在04.22版本中带来了重大更新,特别是在Agent能力升级和MCP市场支持方面。本文将深入探讨这些更新如何重新定义复杂任务的执行方式,为开发者提供更强大的工具和更灵活的解决方案。
818 1
断路器/熔断器? 断路器的状态有哪些
● closed:关闭状态,断路器放行所有请求,并开始统计异常比例、慢请求比例。超过阈值则切换到open状态 ● open:打开状态,服务调用被熔断,访问被熔断服务的请求会被拒绝,快速失败,直接走降级逻辑。Open状态5秒后会进入half-open状态 ● half-open:半开状态,放行一次请求,根据执行结果来判断接下来的操作。 ○ 请求成功:则切换到closed状态 ○ 请求失败:则切换到open状态
|
8月前
|
人工智能 缓存 NoSQL
高并发秒杀系统设计:关键技术解析与典型陷阱规避
在电商、在线票务等场景中,高并发秒杀活动对系统性能和稳定性提出极大挑战。海量请求可能导致服务器资源耗尽、数据库锁争用及库存超卖等问题。通过飞算JavaAI生成的Redis + Lua分布式锁代码,可有效解决高并发下的锁问题,提升系统QPS达70%,同时避免缓存击穿与库存超卖。相较传统写法,AI优化代码显著提高性能与响应速度,为高并发系统开发提供高效解决方案。
|
Ubuntu Linux Shell
使用ramdisk启动ubuntu文件系统(pivot_root)
使用ramdisk启动ubuntu文件系统(pivot_root)
|
人工智能 自然语言处理 机器人
如何从0部署一个大模型RAG应用
本文介绍了如何从零开始部署一套RAG应用,并将其集成到移动端,如钉钉群聊中。应用场景包括客服系统、智能助手、教育辅导和医疗咨询等。通过阿里云PAI和AppFlow,您可以轻松部署大模型RAG应用,并实现智能化的问答服务。具体步骤包括准备向量检索库、训练私有模型、部署RAG对话应用、创建钉钉应用及配置机器人等。
2157 2
|
虚拟化 Windows
VMwareWorkstationPro16的下载与安装,以及vm账号注册的问题
本文介绍了VMware Workstation Pro 16的下载、安装过程以及VMware账号的注册问题,包括如何检查虚拟化支持是否开启、VMware的下载步骤、注册VM账号时的常见问题以及VMware 16的安装步骤。
VMwareWorkstationPro16的下载与安装,以及vm账号注册的问题
|
Linux 数据处理
探索Linux下的readelf命令:深入了解ELF文件
`readelf`是Linux下分析ELF文件的命令行工具,用于查看文件头、节区、符号表等信息。支持可执行文件、共享库等多种类型。常用选项有`-h`(文件头)、`-l`(程序头)、`-S`(节区)、`-s`(符号表)、`-r`(重定位)和`-d`(动态节区)。结合其他工具如`objdump`,能深入理解二进制文件,助力开发和调试。
|
XML 存储 Java
SpringBoot集成Flowable:构建强大的工作流引擎
在企业级应用开发中,工作流管理是核心功能之一。Flowable是一个开源的工作流引擎,它提供了BPMN 2.0规范的实现,并且与SpringBoot框架完美集成。本文将探讨如何使用SpringBoot和Flowable构建一个强大的工作流引擎,并分享一些实践技巧。
3300 0
|
机器学习/深度学习 人工智能 算法
数据挖掘/深度学习-高校实训解决方案
云原生一站式机器学习/深度学习/大模型AI平台,支持sso登录,多租户,大数据平台对接,notebook在线开发,拖拉拽任务流pipeline编排,多机多卡分布式训练,超参搜索,推理服务VGPU,边缘计算,serverless,标注平台,自动化标注,数据集管理,大模型微调,vllm大模型推理,llmops,私有知识库,AI模型应用商店,支持模型一键开发/推理/微调,支持国产cpu/gpu/npu芯片,支持RDMA,支持pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano分布式,私有化部署。
423 0

热门文章

最新文章