perf命令详解从Linux性能诊断到动态跟踪与火焰图-开发者社区-阿里云

背景

Linux在服务端已占据非常大的比例，很多业务很多服务都跑在Linux上面。

软件运行在Linux下，软件本身、以及Linux系统的性能诊断也成为热门的话题。

例如，你要如何回答这些问题

Why is the kernel on-CPU so much? What code-paths?  

Which code-paths are causing CPU level 2 cache misses?  

Are the CPUs stalled on memory I/O?  

Which code-paths are allocating memory, and how much?  

What is triggering TCP retransmits?  

Is a certain kernel function being called, and how often?  

What reasons are threads leaving the CPU?

又或者你是一名DBA或者开发人员，想知道数据库在跑某些benchmark时，性能瓶颈在哪里，是IO，是等待，还是网络，代码瓶颈在哪里？

在Linux下诊断的工具比较多，比如systemtap, dtrace, perf。

本文将介绍一下perf的用法，网上很多叫法如perf_events , perf profiler , Performance Counters for Linux。叫法不同，都指perf。

什么是perf

perf是Linux 2.6+内核中的一个工具，在内核源码包中的位置 tools/perf。

perf利用Linux的trace特性，可以用于实时跟踪，统计event计数(perf stat)；或者使用采样(perf record)，报告(perf report|script|annotate)的使用方式进行诊断。

perf命令行接口并不能利用所有的Linux trace特性，有些trace需要通过ftrace接口得到。

参考 https://github.com/brendangregg/perf-tools

perf工作原理

这张图大致列出了perf支持的跟踪事件，从kernerl到user space，支持块设备、网络、CPU、文件系统、内存等，同时还支持系统调用，用户库的事件跟踪。

你可以使用perf list输出当前内核perf 支持的预置events

perf list

List of pre-defined events (to be used in -e):

  ref-cycles                                         [Hardware event]

  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

.....略.......

  writeback:writeback_pages_written                  [Tracepoint event]
  writeback:writeback_queue                          [Tracepoint event]
  writeback:writeback_task_start                     [Tracepoint event]
  writeback:writeback_task_stop                      [Tracepoint event]

perf background

我们看到perf支持这么多的事件和trace，它依赖了很多的接口来干这件事情。

1. Symbols

没有符号表，无法将内存地址翻译成函数和变量名。

例如，无符号表的跟踪显示如下

    57.14%     sshd  libc-2.15.so        [.] connect           
               |
               --- connect
                  |          
                  |--25.00%-- 0x7ff3c1cddf29
                  |          
                  |--25.00%-- 0x7ff3bfe82761
                  |          0x7ff3bfe82b7c
                  |          
                  |--25.00%-- 0x7ff3bfe82dfc
                   --25.00%-- [...]

有符号表的跟踪显示如下

    57.14%     sshd  libc-2.15.so        [.] __GI___connect_internal
               |
               --- __GI___connect_internal
                  |          
                  |--25.00%-- add_one_listen_addr.isra.0
                  |          
                  |--25.00%-- __nscd_get_mapping
                  |          __nscd_get_map_ref
                  |          
                  |--25.00%-- __nscd_open_socket
                   --25.00%-- [...]

如何安装符号表?

对于内核代码的符号表，在编译内核时，使用CONFIG_KALLSYMS=y。检查如下

# cat /boot/config-`uname -r` |grep CONFIG_KALLSYMS
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y

对于用户安装软件的符号表，如果是yum安装的，可以安装对应的debuginfo包。

如果是用户自己编译的，例如使用GCC编译时加上-g选项。

2. perf annotate

perf annotate can generate sourcecode level information if the application is compiled with -ggdb.

3. Stack Traces (使用perf record -g收集stack traces)

要跟踪完整的stack，编译时需要注意几个东西。

Always compile with frame pointers. 
Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. 

Without them, you may see incomplete stacks from perf_events, like seen in the earlier sshd symbols example. 

There are two ways to fix this: 
either using dwarf data to unwind the stack, or returning the frame pointers.

3.1 编译perf时包含libunwind和-g dwarf，需要3.9以上的内核版本。

Since about the 3.9 kernel, perf_events has supported a workaround for missing frame pointers in user-level stacks: libunwind, which uses dwarf. 

This can be enabled using "-g dwarf".

3.2 有些编译优化项会忽略frame pointer，所以编译软件时必须指定 -fno-omit-frame-pointer ，才能跟踪完整的stack trace.

The earlier sshd example was a default build of OpenSSH, which uses compiler optimizations (-O2), which in this case has omitted the frame pointer. 

Here's how it looks after recompiling OpenSSH with -fno-omit-frame-pointer

3.3 编译内核时包含 CONFIG_FRAME_POINTER=y

总结一下，要愉快的跟踪更完备的信息，就要在编译软件时打开符号表的支持(gcc -g)，开启annotate的支持(gcc -ggdb)，以及Stack trace的支持(gcc -fno-omit-frame-pointer)。

perf pre-defined event说明

Hardware [Cache] Events: 
  CPU相关计数器
  CPU周期、指令重试，内存间隔周期、L2CACHE miss等
  These instrument low-level processor activity based on CPU performance counters. 
  For example, CPU cycles, instructions retired, memory stall cycles, level 2 cache misses, etc. 
  Some will be listed as Hardware Cache Events.

Software Events: 
  内核相关计数器
  These are low level events based on kernel counters. 
  For example, CPU migrations, minor faults, major faults, etc.

Tracepoint Events: 
  内核ftrace框架相关，例如系统调用，TCP事件，文件系统IO事件，块设备事件等。  
  根据LIBRARY归类。如sock表示socket事件。  
  This are kernel-level events based on the ftrace framework. These tracepoints are placed in interesting and logical locations of the kernel, so that higher-level behavior can be easily traced. 
  For example, system calls, TCP events, file system I/O, disk I/O, etc. 
  These are grouped into libraries of tracepoints; 
    eg, "sock:" for socket events, "sched:" for CPU scheduler events.

Dynamic Tracing: 
  动态跟踪，可以在代码中的任何位置创建事件跟踪节点。很好很强大。  
  内核跟踪使用kprobe，user-level跟踪使用uprobe。  
  Software can be dynamically instrumented, creating events in any location. 
  For kernel software, this uses the kprobes framework. 
  For user-level software, uprobes.

Timed Profiling: 
  采样频度，按指定频率采样，被用于perf record。  
  Snapshots can be collected at an arbitrary frequency, using perf record -FHz. 
  This is commonly used for CPU usage profiling, and works by creating custom timed interrupt events.

了解了perf event后，我们可以更精细的，有针对性的对事件进行跟踪，采样，报告。

当然，你也可以不指定事件，全面采样。

build perf

例如centos你可以使用yum安装，也可以使用源码安装。

perf在内核源码的tools/perf中，所以下载与你的内核大版本一致的内核源码即可

uname -a 

wget https://cdn.kernel.org/pub/linux/kernel/v3.x/linux-3.10.104.tar.xz

tar -xvf linux-3.10.104.tar.xz

cd linux-3.10.104/tools/perf/

安装依赖库，有一个小窍门可以找到依赖的库

$cat Makefile |grep found

                msg := $(warning No libelf found, disables 'probe' tool, please install elfutils-libelf-devel/libelf-dev);
                msg := $(error No gnu/libc-version.h found, please install glibc-dev[el]/glibc-static);
                msg := $(warning No libdw.h found or old libdw.h found or elfutils is older than 0.138, disables dwarf support. Please install new elfutils-devel/libdw-dev);
        msg := $(warning No libunwind found, disabling post unwind support. Please install libunwind-dev[el] >= 0.99);
                msg := $(warning No libaudit.h found, disables 'trace' tool, please install audit-libs-devel or libaudit-dev);
                msg := $(warning slang not found, disables TUI support. Please install slang-devel or libslang-dev);
                msg := $(warning GTK2 not found, disables GTK2 support. Please install gtk2-devel or libgtk2.0-dev);
  $(if $(1),$(warning No $(1) was found))
                                                msg := $(warning No bfd.h/libbfd found, install binutils-dev[el]/zlib-static to gain symbol demangling)
                msg := $(warning No numa.h found, disables 'perf bench numa mem' benchmark, please install numa-libs-devel or libnuma-dev);

通常依赖 gcc make bison flex elfutils libelf-dev libdw-dev libaudit-dev python-dev binutils-dev

perf依赖的kernel宏

并不是每个开关都需要，但是有些没有就不方便或者功能缺失，例如没有打开符号表的话，看到的是一堆内存地址。

# for perf_events:
CONFIG_PERF_EVENTS=y
# for stack traces:
CONFIG_FRAME_POINTER=y
# kernel symbols:
CONFIG_KALLSYMS=y
# tracepoints:
CONFIG_TRACEPOINTS=y
# kernel function trace:
CONFIG_FTRACE=y
# kernel-level dynamic tracing:
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
# user-level dynamic tracing:
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y
# full kernel debug info:
CONFIG_DEBUG_INFO=y
# kernel lock tracing:
CONFIG_LOCKDEP=y
# kernel lock tracing:
CONFIG_LOCK_STAT=y
# kernel dynamic tracepoint variables:
CONFIG_DEBUG_INFO=y

一些开关的用途介绍

1. Kernel-level symbols are in the kernel debuginfo package, or when the kernel is compiled with CONFIG_KALLSYMS.

2. The kernel stack traces are incomplete. Now a similar profile with CONFIG_FRAME_POINTER=y

3. 当我们使用perf record [stack traces (-g)]时，可以跟踪stack，但是如果内核编译时没有指定CONFIG_FRAME_POINTER=y，perf report时就会看到缺失的信息。

不包含CONFIG_FRAME_POINTER=y时

    99.97%  swapper  [kernel.kallsyms]  [k] default_idle
            |
            --- default_idle

     0.03%     sshd  [kernel.kallsyms]  [k] iowrite16   
               |
               --- iowrite16
                   __write_nocancel
                   (nil)

包含CONFIG_FRAME_POINTER=y时（Much better -- the entire path from the write() syscall (__write_nocancel) to iowrite16() can be seen.）

    99.97%  swapper  [kernel.kallsyms]  [k] default_idle
            |
            --- default_idle
                cpu_idle
               |          
               |--87.50%-- start_secondary
               |          
                --12.50%-- rest_init
                          start_kernel
                          x86_64_start_reservations
                          x86_64_start_kernel

     0.03%     sshd  [kernel.kallsyms]  [k] iowrite16
               |
               --- iowrite16
                   vp_notify
                   virtqueue_kick
                   start_xmit
                   dev_hard_start_xmit
                   sch_direct_xmit
                   dev_queue_xmit
                   ip_finish_output
                   ip_output
                   ip_local_out
                   ip_queue_xmit
                   tcp_transmit_skb
                   tcp_write_xmit
                   __tcp_push_pending_frames
                   tcp_sendmsg
                   inet_sendmsg
                   sock_aio_write
                   do_sync_write
                   vfs_write
                   sys_write
                   system_call_fastpath
                   __write_nocancel

4. 如果要使用动态跟踪，即跟踪任意指定代码，则需要打开这些开关:

For kernel analysis, using CONFIG_KPROBES=y and CONFIG_KPROBE_EVENTS=y, to enable kernel dynamic tracing. and CONFIG_FRAME_POINTER=y, for frame pointer-based kernel stacks.

For user-level analysis, CONFIG_UPROBES=y and CONFIG_UPROBE_EVENTS=y, for user-level dynamic tracing.

5. 如果打开了CONFIG_DEBUG_INFO，则可以在动态跟踪中打印内核变量的值。

If your kernel has debuginfo (CONFIG_DEBUG_INFO=y), you can fish out kernel variables from functions. This is a simple example of examining a size_t (integer)

例如

1. Listing variables available for tcp_sendmsg():

# perf probe -V tcp_sendmsg
Available variables at tcp_sendmsg
        @<tcp_sendmsg+0>
                size_t  size
                struct kiocb*   iocb
                struct msghdr*  msg
                struct sock*    sk

2. Creating a probe for tcp_sendmsg() with the "size" variable:

# perf probe --add 'tcp_sendmsg size'
Added new event:
  probe:tcp_sendmsg    (on tcp_sendmsg with size)

3. You can now use it in all perf tools, such as: 

    perf record -e probe:tcp_sendmsg -aR sleep 1

通过以下命令可以查看linux的config

cat /boot/config-`uname -r`

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.32-573.el6.x86_64
# Thu Jul 23 15:38:20 2015
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_CMOS_UPDATE=y
.......

perf使用说明

perf --help

先了解一下概貌

perf 命令用法还是挺简单的，根据功能区分了COMMAND，每个COMMAND有各自的用法。

用得比较多的有list, record, report, script, stat, top。

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   inject          Filter to augment the events stream with additional information
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

perf help COMMAND

要得到每个command的用法也蛮简单，可以使用perf help COMMAND得到。

例如

perf help record

PERF-RECORD(1)                    perf Manual                   PERF-RECORD(1)

NAME
       perf-record - Run a command and record its profile into perf.data

SYNOPSIS
       perf record [-e <EVENT> | --event=EVENT] [-l] [-a] <command>
       perf record [-e <EVENT> | --event=EVENT] [-l] [-a] — <command> [<options>]

DESCRIPTION
       This command runs a command and gathers a performance counter profile from it, into perf.data - without displaying anything.

       This file can then be inspected later on, using perf report.

OPTIONS
       <command>...
           Any command you can specify in a shell.
.....

perf top跟踪实时信息

跟踪时可以指定事件，CPU，以及是否跟踪stack trace。

perf top -ag 

       -a, --all-cpus
           System-wide collection. (default)
       -g
           Enables call-graph (stack chain/backtrace) recording.

输出如下

Samples: 240  of event 'cpu-clock', Event count (approx.): 19122881                                                                                                                                                                          
  Children      Self  Shared Object      Symbol  
+   14.64%    14.64%  [kernel]           [k] _spin_unlock_irqrestore   
+   10.91%    10.91%  libslang.so.2.2.1  [.] SLtt_smart_puts       
+    6.02%     6.02%  perf               [.] symbols__insert   
+    6.02%     6.02%  [kernel]           [k] kallsyms_expand_symbol   
+    6.01%     6.01%  [kernel]           [k] copy_page   
+    3.96%     0.64%  libc-2.12.so       [.] _int_malloc   
+    3.61%     3.61%  [kernel]           [k] number          
+    3.31%     3.31%  [kernel]           [k] clear_page   
+    2.71%     2.71%  [kernel]           [k] pointer     
....

输入 ? 可以得到top的帮助介绍

│                                                                                     
│UP/DOWN/PGUP                                                 │
│PGDN/SPACE    Navigate                                       │
│q/ESC/CTRL+C  Exit browser                                   │
│                                                             │
│For multiple event sessions:                                 │
│                                                             │
│TAB/UNTAB     Switch events                                  │
│                                                             │
│For symbolic views (--sort has sym):                         │
│                                                             │
│->            Zoom into DSO/Threads & Annotate current symbol│
│<-            Zoom out                                       │
│a             Annotate current symbol                        │
│C             Collapse all callchains                        │
│d             Zoom into current DSO                          │
│E             Expand all callchains                          │
│F             Toggle percentage of filtered entries          │
│H             Display column headers                         │
│P             Print histograms to perf.hist.N                │
│t             Zoom into current Thread                       │
│V             Verbose (DSO names in callchains, etc)         │
│z             Toggle zeroing of samples                      │
│/             Filter symbol by name

输入E全部展开，展开后可以得到stack trace的结果, 如果发现有地址的信息，但是没有符号表的信息，可能是软件编译时没有指定-g，如果stack信息不完整，则软件编译时需要加-fno-omit-frame-pointer(参考前面章节的介绍)。

Samples: 1K of event 'cpu-clock', Event count (approx.): 17510648      
  Children      Self  Shared Object        Symbol                      
-   28.51%    28.51%  [kernel]             [k] _spin_unlock_irqrestore 
   -→_spin_unlock_irqrestore [kernel] 
        17.24% __write_nocancel libc-2.12.so                                                                                                                               
        13.79% 0xe7cd libpthread-2.12.so                               
        10.34% 0x7f670b15b923 unknown                                  
        10.34% __fdatasync_nocancel libc-2.12.so                       
           0x1b00000017 unknown        4                               
        10.34% __libc_fork libc-2.12.so                                
        6.90% __select libc-2.12.so                                    
        3.45% 0x7f670b15b822 unknown                                   
        3.45% 0x7f670b14d560 unknown                                   
        3.45% 0x7f670b11ede7 unknown                                   
        3.45% 0x7f670b11eb5d unknown                                   
           0x2ad1 pam_unix.so                                          
           0xdcdd4c000000000 unknown                                   
        3.45% __clone libc-2.12.so                                     
        3.45% __libc_writev libc-2.12.so                               
        3.45% __read_nocancel libc-2.12.so                             
        3.45% 0x7b294 libslang.so.2.2.1                                
        3.45% 0x7efee6ee2560 unknown

输入C全部收起

Samples: 240  of event 'cpu-clock', Event count (approx.): 19122881                                                                                                                                                                          
  Children      Self  Shared Object      Symbol  
+   14.64%    14.64%  [kernel]           [k] _spin_unlock_irqrestore   
+   10.91%    10.91%  libslang.so.2.2.1  [.] SLtt_smart_puts       
+    6.02%     6.02%  perf               [.] symbols__insert   
+    6.02%     6.02%  [kernel]           [k] kallsyms_expand_symbol   
+    6.01%     6.01%  [kernel]           [k] copy_page   
+    3.96%     0.64%  libc-2.12.so       [.] _int_malloc   
+    3.61%     3.61%  [kernel]           [k] number          
+    3.31%     3.31%  [kernel]           [k] clear_page   
+    2.71%     2.71%  [kernel]           [k] pointer     
....

Symbol列[]中字符代表的含义如下，通常是k或者.表示kernel和user space事件。

[.]: user level
[k]: kernel level
[g]: guest kernel level (virtualization)
[u]: guest os user space
[H]: hypervisor

perf record 收集统计信息

perf record用来收集统计信息，通常的用法包括

1. 使用record收集统计信息，可以收集全局的，指定PID或者线程ID的，指定CPU的，指定USER ID的。

       -a, --all-cpus
           System-wide collection from all CPUs.
       -p, --pid=
           Record events on existing process ID (comma separated list).
       -t, --tid=
           Record events on existing thread ID (comma separated list). This option also disables inheritance by default. Enable it by adding --inherit.
       -u, --uid=
           Record events in threads owned by uid. Name or number.

2. 收集间隔，可以指定采样的频率，采样的EVENT数，采样时间。

       -c, --count=
           Event period to sample.
       -F, --freq=
           Profile at this frequency.

3. 采集的详细程度，可以指定event，使用CPU实时调度策略的进程（可以参考RHEL 讲CGROUP cpu部分的文章），是否跟踪stack chain，

       -r, --realtime=
           Collect data with this RT SCHED_FIFO priority.
       -g
           Enables call-graph (stack chain/backtrace) recording.
       --call-graph
           Setup and enable call-graph (stack chain/backtrace) recording, implies -g.

               Allows specifying "fp" (frame pointer) or "dwarf"
               (DWARF′s CFI - Call Frame Information) or "lbr"
               (Hardware Last Branch Record facility) as the method to collect
               the information used to show the call graphs.

               In some systems, where binaries are build with gcc
               --fomit-frame-pointer, using the "fp" method will produce bogus
               call graphs, using "dwarf", if available (perf tools linked to
               the libunwind library) should be used instead.
               Using the "lbr" method doesn′t require any compiler options. It
               will produce call graphs from the hardware LBR registers. The
               main limition is that it is only available on new Intel
               platforms, such as Haswell. It can only get user call chain. It
               doesn′t work with branch stack sampling at the same time.
       -e, --event=
           Select the PMU event. Selection can be:
           ·   a symbolic event name (use perf list to list all events)
           ·   a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a hexadecimal event descriptor.
           ·   a symbolically formed PMU event like pmu/param1=0x3,param2/ where param1, param2, etc are defined as formats for the PMU in /sys/bus/event_sources/devices/<pmu>/format/*.
           ·   a symbolically formed event like pmu/config=M,config1=N,config3=K/
                   where M, N, K are numbers (in decimal, hex, octal format). Acceptable
                   values for each of ′config′, ′config1′ and ′config2′ are defined by
                   corresponding entries in /sys/bus/event_sources/devices/<pmu>/format/*
                   param1 and param2 are defined as formats for the PMU in:
                   /sys/bus/event_sources/devices/<pmu>/format/*
           ·   a group of events surrounded by a pair of brace ("{event1,event2,...}"). Each event is separated by commas and the group should be quoted to prevent the shell interpretation. You also need to use
               --group on "perf report" to view group events together.
       --filter=<filter>
           Event filter.

例子

开始跑一个benchmark，同时收集统计信息(指定-a, -g选项)。  
CFLAGS='-g -ggdb -fno-omit-frame-pointer' ./configure --prefix=/home/digoal/pgsql9.6
make world -j 32
make install-world -j 32
初始化略
pgbench -i -s 100

跟踪
perf record -ag  

Ctrl-C，退出perf record，统计信息已输出到perf.data。

perf report

解读前面收集到的perf.data.

常用的开关如下，--tui是交互式的文本显示窗口，--stdio是文本显示窗口。

       -i, --input=
           Input file name. (default: perf.data)
       -v, --verbose
           Be more verbose. (show symbol address, etc)
       -n, --show-nr-samples
           Show the number of samples for each symbol
       -g [type,min], --call-graph
           Display call chains using type and min percent threshold. type can be either:
           ·   flat: single column, linear exposure of call chains.
           ·   graph: use a graph tree, displaying absolute overhead rates.
           ·   fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object.
               Default: fractal,0.5.
       --stdio
           Use the stdio interface.
       --tui
           Use the TUI interface, that is integrated with annotate and allows zooming into DSOs or threads, among other features. Use of --tui requires a tty, if one is not present, as when piping to other
           commands, the stdio interface is used.

交互式显示例子，看概貌挺方便的 (常用的交互命令: E扩展，C收敛，q退出)

#perf report -g --tui

Events: 52K cycles
+     11.17%         postgres  postgres                             [.] hash_seq_search     [.]表示user call, [k]表示kernel call
+      4.41%          pgbench  libc-2.12.so                         [.] __GI_vfprintf                           
+      4.02%         postgres  postgres                             [.] NextCopyFrom                            
+      3.27%         postgres  [kernel.kallsyms]                    [k] copy_user_generic_string                
+      2.83%         postgres  postgres                             [.] CopyReadLineText                        
+      2.25%         postgres  postgres                             [.] pg_comp_crc32c_sse42                    
+      2.18%         postgres  libc-2.12.so                         [.] memcpy                                  
+      1.99%         postgres  postgres                             [.] CopyReadAttributesText                  
+      1.72%         postgres  [xfs]                                [k] ftrace_raw_init_event_xfs_dir2_leafn_add
+      1.65%         postgres  postgres                             [.] heap_fill_tuple                         
+      1.52%         postgres  libc-2.12.so                         [.] __GI_____strtoll_l_internal             
+      1.26%         postgres  postgres                             [.] heap_form_tuple                         
+      1.22%          pgbench  libc-2.12.so                         [.] _IO_default_xsputn_internal             
+      1.17%         postgres  postgres                             [.] AllocSetAlloc                           
+      1.16%         postgres  postgres                             [.] heap_compute_data_size                  
+      1.10%         postgres  postgres                             [.] heap_multi_insert                       
+      1.03%         postgres  postgres                             [.] pg_atoi                                 
+      1.03%         postgres  libc-2.12.so                         [.] __memset_sse2                           
+      1.01%         postgres  postgres                             [.] pg_verify_mbstr_len                     
+      0.95%          pgbench  libc-2.12.so                         [.] _itoa_word                              
+      0.89%         postgres  postgres                             [.] int4in                                  
+      0.83%         postgres  postgres                             [.] resetStringInfo                         
+      0.81%         postgres  postgres                             [.] InputFunctionCall                       
+      0.81%         postgres  postgres                             [.] hash_search_with_hash_value             
+      0.79%         postgres  postgres                             [.] ExecConstraints                         
+      0.71%         postgres  postgres                             [.] CopyFrom                                
+      0.71%          pgbench  libc-2.12.so                         [.] __strchrnul                             
+      0.67%         postgres  postgres                             [.] PageAddItemExtended                     
+      0.66%         postgres  [kernel.kallsyms]                    [k] find_get_page                           
+      0.65%         postgres  postgres                             [.] CopyReadLine                            
+      0.62%         postgres  [kernel.kallsyms]                    [k] page_fault                              
+      0.58%         postgres  postgres                             [.] heap_prepare_insert                     
+      0.57%         postgres  postgres                             [.] heap_prune_chain                        
+      0.52%         postgres  postgres                             [.] bpcharin                                
+      0.52%          pgbench  libpq.so.5.9                         [.] PQputCopyData                           
+      0.52%         postgres  postgres                             [.] AllocSetFreeIndex                       
+      0.52%         postgres  postgres                             [.] pq_getbytes                             
+      0.51%         postgres  postgres                             [.] RelationPutHeapTuple                    
+      0.49%         postgres  postgres                             [.] BufferGetBlockNumber                    
+      0.47%         postgres  postgres                             [.] enlargeStringInfo                       
+      0.47%         postgres  postgres                             [.] bpchar_input                            
+      0.46%         postgres  postgres                             [.] CopyGetData                             
+      0.45%         postgres  postgres                             [.] lazy_scan_heap                          
+      0.45%         postgres  [kernel.kallsyms]                    [k] radix_tree_lookup_slot                  
+      0.44%          pgbench  libpq.so.5.9                         [.] PQputline                               
+      0.44%         postgres  postgres                             [.] CopyLoadRawBuf                          
+      0.42%         postgres  postgres                             [.] HeapTupleSatisfiesVacuum                
+      0.42%         postgres  [kernel.kallsyms]                    [k] _spin_lock_irq                          
+      0.42%         postgres  [kernel.kallsyms]                    [k] list_del                                
+      0.38%       flush-8:16  [xfs]                                [k] ftrace_raw_init_event_xfs_dir2_leafn_add
+      0.37%         postgres  postgres                             [.] ExecStoreTuple                          
+      0.36%         postgres  postgres                             [.] appendBinaryStringInfo                  
+      0.35%         postgres  postgres                             [.] pq_getmessage                           
+      0.35%          pgbench  libpq.so.5.9                         [.] pqPutMsgEnd                             
Press '?' for help on key bindings

文本显示例子，看细节挺方便

#perf report -v -n --showcpuutilization -g --stdio

首先显示了一些异常，如果你发现少了符号表或者什么的，可以根据提示安装debuginfo，或者重新编译内核或软件。

Looking at the vmlinux_path (6 entries long)
dso__load_sym: cannot get elf header.
Using /proc/kallsyms for symbols
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id 3187a0b0fc53e27c19f9fad3e63f9437402e5548 was found
[bianque_driver] with build id 3187a0b0fc53e27c19f9fad3e63f9437402e5548 not found, continuing without symbols
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id e6c06734499e665685cd28ac846d8d69d95cce8c was found
[missing_slash] with build id e6c06734499e665685cd28ac846d8d69d95cce8c not found, continuing without symbols
/lib64/ld-2.12.so was updated, restart the long running apps that use it!
/usr/lib64/libsyslog-ng-3.6.so.0.0.0 was updated, restart the long running apps that use it!
/disk1/gpdb_20160101/bin/postgres was updated, restart the long running apps that use it!
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id 253ed41ca3926c96e4181ac4df30cf3cde868089 was found
[pdeath_signal] with build id 253ed41ca3926c96e4181ac4df30cf3cde868089 not found, continuing without symbols
no symbols found in /sbin/iscsid, maybe install a debug package?
no symbols found in /bin/egrep, maybe install a debug package?
no symbols found in /usr/sbin/dmidecode, maybe install a debug package?
/opt/aegis/lib64/aegis_monitor.so was updated, restart the long running apps that use it!

报告的细节如下

# Events: 52K cycles
#
# Overhead  Samples    sys    us            Command                                   Shared Object                                                               
# 事件百分比 采样百分比  系统消耗占比  系统消耗u秒  命令  共享对象
# ........ ..........  ...............  ............................................  ...............................................................
#

第一条
   事件百分比  采样百分比  系统消耗占比  系统消耗u秒    命令                                         共享对象             
    11.17%     0.00%     5.56%       2173               postgres                              /home/digoal/pgsql9.6/bin/postgres       0x50eb35         B [.] hash_seq_search
                   |
                   --- hash_seq_search  (输出call stack trace，需要在编译软件时开启-g开关, -fno-omit-frame-pointer开关)
                       GetRunningTransactionLocks
                       LogStandbySnapshot    (有两个分支会调用 LogStandbySnapshot)
                      |          
                      |--50.23%-- CreateCheckPoint    (分支开销占比)
                      |          CheckpointerMain
                      |          AuxiliaryProcessMain
                      |          StartChildProcess
                      |          reaper
                      |          __restore_rt
                      |          PostmasterMain
                      |          startup_hacks
                      |          __libc_start_main
                      |          
                       --49.77%-- BackgroundWriterMain  (分支开销占比)
                                 AuxiliaryProcessMain
                                 StartChildProcess
                                 reaper
                                 __restore_rt
                                 PostmasterMain
                                 startup_hacks
                                 __libc_start_main

下面这条全部是系统开销  
     4.41%     0.00%     4.41%       1808            pgbench                                   /lib64/libc-2.12.so                          0x44005          B [.] __GI_vfprintf
                    |
                    --- __GI_vfprintf
                       |          
                       |--99.48%-- __vsnprintf
                       |          |          
后面。。。。。。略

perf script

一条条打印perf.data内的数据，输出的是perf record收集到的原始数据。

生成热力图、火焰图也需要perf script的输出，从最原始的信息中提取数据，生成svg。

perf-script - Read perf.data (created by perf record) and display trace output  

主要的options
       -G, --hide-call-graph
           When printing symbols do not display call chain.
    -v, --verbose         be more verbose (show symbol address, etc)
    -L, --Latency         show latency attributes (irqs/preemption disabled, etc)
    -d, --debug-mode      do various checks like samples ordering and lost events
    -D, --dump-raw-trace  dump raw trace in ASCII

#perf script -v -G|wc -l
52958 

刚好等于前面使用perf report输出的# Events: 52K cycles  

详细内容就不列出了，非常多

perf annotate

解读前面收集到的perf.data，辅以汇编指令显示。

perf-annotate - Read perf.data (created by perf record) and display annotated code  

       -i, --input=
           Input file name. (default: perf.data)
       -v, --verbose
           Be more verbose. (Show symbol address, etc)
       -D, --dump-raw-trace
           Dump raw trace in ASCII.
       -l, --print-line
           Print matching source lines (may be slow).
       -P, --full-paths
           Don’t shorten the displayed pathnames.
       --stdio
           Use the stdio interface.
       --tui
           Use the TUI interface Use of --tui requires a tty, if one is not present, as when piping to other commands, the stdio interface is used. This interfaces starts by centering on the line with more
           samples, TAB/UNTAB cycles through the lines with more samples.

例子

Sorted summary for file /home/digoal/pgsql9.6/bin/postgres
----------------------------------------------

   32.43 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)
   16.20 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)
   14.16 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)
   10.68 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1406
   10.68 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)
    9.97 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1412
    3.76 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)
 Percent |      Source code & Disassembly of /home/digoal/pgsql9.6/bin/postgres
------------------------------------------------
                          50eac0: 466
                          50eac4: 1
                          50eac7: 3
                          50eaca: 1
                          50eaec: 435
                          50eaf5: 12
                          50eafb: 7
                          50eb00: 20
                          50eb08: 5
                          50eb10: 8
                          50eb14: 1
                          50eb18: 5
                          50eb1b: 4
                          50eb1e: 15
                          50eb22: 466
                          50eb26: 2
                          50eb2e: 2
                          50eb32: 6
                          50eb35: 618
                          50eb38: 707
                          50eb3c: 164
                          50eb41: 1415
                          h->sum: 4363
         :
         :
         :
         :      Disassembly of section .text:
         :
         :      000000000090e9cc <hash_seq_search>:
         :                      register_seq_scan(hashp);
         :      }
         :
         :      void *
         :      hash_seq_search(HASH_SEQ_STATUS *status)
         :      {
    0.00 :        90e9cc:       55                      push   %rbp
    0.00 :        90e9cd:       48 89 e5                mov    %rsp,%rbp
    0.00 :        90e9d0:       48 83 ec 60             sub    $0x60,%rsp
    0.00 :        90e9d4:       48 89 7d a8             mov    %rdi,-0x58(%rbp)
         :              long            segment_ndx;
         :              HASHSEGMENT segp;
         :              uint32          curBucket;
         :              HASHELEMENT *curElem;
         :
         :              if ((curElem = status->curEntry) != NULL)
    0.00 :        90e9d8:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90e9dc:       48 8b 40 10             mov    0x10(%rax),%rax
    0.00 :        90e9e0:       48 89 45 d8             mov    %rax,-0x28(%rbp)
    0.00 :        90e9e4:       48 83 7d d8 00          cmpq   $0x0,-0x28(%rbp)
    0.00 :        90e9e9:       74 3a                   je     90ea25 <hash_seq_search+0x59>
         :              {
         :                      /* Continuing scan of curBucket... */
         :                      status->curEntry = curElem->link;
    0.00 :        90e9eb:       48 8b 45 d8             mov    -0x28(%rbp),%rax
    0.00 :        90e9ef:       48 8b 10                mov    (%rax),%rdx
    0.00 :        90e9f2:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90e9f6:       48 89 50 10             mov    %rdx,0x10(%rax)
         :                      if (status->curEntry == NULL)   /* end of this bucket */
    0.00 :        90e9fa:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90e9fe:       48 8b 40 10             mov    0x10(%rax),%rax
    0.00 :        90ea02:       48 85 c0                test   %rax,%rax
    0.00 :        90ea05:       75 11                   jne    90ea18 <hash_seq_search+0x4c>
         :                              ++status->curBucket;
    0.00 :        90ea07:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90ea0b:       8b 40 08                mov    0x8(%rax),%eax
    0.00 :        90ea0e:       8d 50 01                lea    0x1(%rax),%edx
    0.00 :        90ea11:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90ea15:       89 50 08                mov    %edx,0x8(%rax)
         :                      return (void *) ELEMENTKEY(curElem);
    0.00 :        90ea18:       48 8b 45 d8             mov    -0x28(%rbp),%rax
    0.00 :        90ea1c:       48 83 c0 10             add    $0x10,%rax
    0.00 :        90ea20:       e9 54 01 00 00          jmpq   90eb79 <hash_seq_search+0x1ad>
         :              }
         :
         :              /*
         :               * Search for next nonempty bucket starting at curBucket.
         :               */
         :              curBucket = status->curBucket;
    0.00 :        90ea25:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90ea29:       8b 40 08                mov    0x8(%rax),%eax
    0.00 :        90ea2c:       89 45 e4                mov    %eax,-0x1c(%rbp)
         :              hashp = status->hashp;
    0.00 :        90ea2f:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90ea33:       48 8b 00                mov    (%rax),%rax
    0.00 :        90ea36:       48 89 45 d0             mov    %rax,-0x30(%rbp)
         :              hctl = hashp->hctl;
    0.00 :        90ea3a:       48 8b 45 d0             mov    -0x30(%rbp),%rax
    0.00 :        90ea3e:       48 8b 00                mov    (%rax),%rax
    0.00 :        90ea41:       48 89 45 c8             mov    %rax,-0x38(%rbp)
         :              ssize = hashp->ssize;
    0.00 :        90ea45:       48 8b 45 d0             mov    -0x30(%rbp),%rax
    0.00 :        90ea49:       48 8b 40 50             mov    0x50(%rax),%rax
    0.00 :        90ea4d:       48 89 45 c0             mov    %rax,-0x40(%rbp)
         :              max_bucket = hctl->max_bucket;
    0.00 :        90ea51:       48 8b 45 c8             mov    -0x38(%rbp),%rax
    0.00 :        90ea55:       8b 80 10 03 00 00       mov    0x310(%rax),%eax
    0.00 :        90ea5b:       89 45 bc                mov    %eax,-0x44(%rbp)
         :
         :              if (curBucket > max_bucket)
    0.00 :        90ea5e:       8b 45 e4                mov    -0x1c(%rbp),%eax
    0.00 :        90ea61:       3b 45 bc                cmp    -0x44(%rbp),%eax
    0.00 :        90ea64:       76 16                   jbe    90ea7c <hash_seq_search+0xb0>
         :              {
         :                      hash_seq_term(status);
    0.00 :        90ea66:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90ea6a:       48 89 c7                mov    %rax,%rdi
    0.00 :        90ea6d:       e8 09 01 00 00          callq  90eb7b <hash_seq_term>
         :                      return NULL;                    /* search is done */
    0.00 :        90ea72:       b8 00 00 00 00          mov    $0x0,%eax
    0.00 :        90ea77:       e9 fd 00 00 00          jmpq   90eb79 <hash_seq_search+0x1ad>
         :              }
         :
         :              /*
         :               * first find the right segment in the table directory.
         :               */
         :              segment_num = curBucket >> hashp->sshift;
    0.00 :        90ea7c:       48 8b 45 d0             mov    -0x30(%rbp),%rax
    0.00 :        90ea80:       8b 40 58                mov    0x58(%rax),%eax
    0.00 :        90ea83:       8b 55 e4                mov    -0x1c(%rbp),%edx
    0.00 :        90ea86:       89 c1                   mov    %eax,%ecx
    0.00 :        90ea88:       d3 ea                   shr    %cl,%edx
    0.00 :        90ea8a:       89 d0                   mov    %edx,%eax
    0.00 :        90ea8c:       89 c0                   mov    %eax,%eax
    0.00 :        90ea8e:       48 89 45 f8             mov    %rax,-0x8(%rbp)
         :              segment_ndx = MOD(curBucket, ssize);
    0.00 :        90ea92:       8b 45 e4                mov    -0x1c(%rbp),%eax
    0.00 :        90ea95:       48 8b 55 c0             mov    -0x40(%rbp),%rdx
    0.00 :        90ea99:       48 83 ea 01             sub    $0x1,%rdx
    0.00 :        90ea9d:       48 21 d0                and    %rdx,%rax
    0.00 :        90eaa0:       48 89 45 f0             mov    %rax,-0x10(%rbp)
         :
         :              segp = hashp->dir[segment_num];
    0.00 :        90eaa4:       48 8b 45 d0             mov    -0x30(%rbp),%rax
    0.00 :        90eaa8:       48 8b 40 08             mov    0x8(%rax),%rax
    0.00 :        90eaac:       48 8b 55 f8             mov    -0x8(%rbp),%rdx
    0.00 :        90eab0:       48 c1 e2 03             shl    $0x3,%rdx
    0.00 :        90eab4:       48 01 d0                add    %rdx,%rax
    0.00 :        90eab7:       48 8b 00                mov    (%rax),%rax
    0.00 :        90eaba:       48 89 45 e8             mov    %rax,-0x18(%rbp)
         :               * Pick up the first item in this bucket's chain.  If chain is not empty
         :               * we can begin searching it.  Otherwise we have to advance to find the
         :               * next nonempty bucket.  We try to optimize that case since searching a
         :               * near-empty hashtable has to iterate this loop a lot.
         :               */
         :              while ((curElem = segp[segment_ndx]) == NULL)
    0.00 :        90eabe:       eb 62                   jmp    90eb22 <hash_seq_search+0x156>
         :              {
         :                      /* empty bucket, advance to next */
         :                      if (++curBucket > max_bucket)
 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1406
   10.68 :        90eac0:       83 45 e4 01             addl   $0x1,-0x1c(%rbp)
    0.02 :        90eac4:       8b 45 e4                mov    -0x1c(%rbp),%eax
    0.07 :        90eac7:       3b 45 bc                cmp    -0x44(%rbp),%eax
    0.02 :        90eaca:       76 20                   jbe    90eaec <hash_seq_search+0x120>
         :                      {
         :                              status->curBucket = curBucket;
    0.00 :        90eacc:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90ead0:       8b 55 e4                mov    -0x1c(%rbp),%edx
    0.00 :        90ead3:       89 50 08                mov    %edx,0x8(%rax)
         :                              hash_seq_term(status);
    0.00 :        90ead6:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90eada:       48 89 c7                mov    %rax,%rdi
    0.00 :        90eadd:       e8 99 00 00 00          callq  90eb7b <hash_seq_term>
         :                              return NULL;            /* search is done */
    0.00 :        90eae2:       b8 00 00 00 00          mov    $0x0,%eax
    0.00 :        90eae7:       e9 8d 00 00 00          jmpq   90eb79 <hash_seq_search+0x1ad>
         :                      }
         :                      if (++segment_ndx >= ssize)
 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1412
    9.97 :        90eaec:       48 83 45 f0 01          addq   $0x1,-0x10(%rbp)
    0.00 :        90eaf1:       48 8b 45 f0             mov    -0x10(%rbp),%rax
    0.28 :        90eaf5:       48 3b 45 c0             cmp    -0x40(%rbp),%rax
    0.00 :        90eaf9:       7c 27                   jl     90eb22 <hash_seq_search+0x156>
         :                      {
         :                              segment_num++;
    0.16 :        90eafb:       48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
         :                              segment_ndx = 0;
    0.46 :        90eb00:       48 c7 45 f0 00 00 00    movq   $0x0,-0x10(%rbp)
    0.00 :        90eb07:       00 
         :                              segp = hashp->dir[segment_num];
    0.11 :        90eb08:       48 8b 45 d0             mov    -0x30(%rbp),%rax
    0.00 :        90eb0c:       48 8b 40 08             mov    0x8(%rax),%rax
    0.18 :        90eb10:   Can't annotate copy_user_generic_string: No vmlinux file with build id 7f980c04dad5ff7955a0035ceddb54b83d0b1ab1 was found in the path.
Please use 'perf buildid-cache -av vmlinux' or --vmlinux vmlinux.
symbol__annotate: filename=/root/.debug/.build-id/3a/c348a69f62bfc2280da1a8188173961bb2e9be, sym=__GI_vfprintf, start=0x7fed22c73d10, end=0x7fed22c78e78
annotating [0xba0c30]            /lib64/libc-2.12.so : [0x1c7c310]                  __GI_vfprintf
Executing: objdump --start-address=0x0000000000043d10 --stop-address=0x0000000000048e78 -dS -C /root/.debug/.build-id/3a/c348a69f62bfc2280da1a8188173961bb2e9be|grep -v /root/.debug/.build-id/3a/c348a69f62bfc2280da1a8188173961bb2e9be|expa
    48 8b 55 f8             mov    -0x8(%rbp),%rdx
    0.02 :        90eb14:       48 c1 e2 03             shl    $0x3,%rdx
    0.11 :        90eb18:       48 01 d0                add    %rdx,%rax
    0.09 :        90eb1b:       48 8b 00                mov    (%rax),%rax
    0.34 :        90eb1e:       48 89 45 e8             mov    %rax,-0x18(%rbp)
         :               * Pick up the first item in this bucket's chain.  If chain is not empty
         :               * we can begin searching it.  Otherwise we have to advance to find the
         :               * next nonempty bucket.  We try to optimize that case since searching a
         :               * near-empty hashtable has to iterate this loop a lot.
         :               */
         :              while ((curElem = segp[segment_ndx]) == NULL)
 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)
   10.68 :        90eb22:       48 8b 45 f0             mov    -0x10(%rbp),%rax
    0.05 :        90eb26:       48 8d 14 c5 00 00 00    lea    0x0(,%rax,8),%rdx
    0.00 :        90eb2d:       00 
    0.05 :        90eb2e:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    0.14 :        90eb32:       48 01 d0                add    %rdx,%rax
   14.16 :        90eb35:       48 8b 00                mov    (%rax),%rax
   16.20 :        90eb38:       48 89 45 d8             mov    %rax,-0x28(%rbp)
 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)
    3.76 :        90eb3c:       48 83 7d d8 00          cmpq   $0x0,-0x28(%rbp)
 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)
   32.43 :        90eb41:       0f 84 79 ff ff ff       je     90eac0 <hash_seq_search+0xf4>
         :                              segp = hashp->dir[segment_num];
         :                      }
         :              }
         :
         :              /* Begin scan of curBucket... */
         :              status->curEntry = curElem->link;
    0.00 :        90eb47:       48 8b 45 d8             mov    -0x28(%rbp),%rax
    0.00 :        90eb4b:       48 8b 10                mov    (%rax),%rdx
    0.00 :        90eb4e:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90eb52:       48 89 50 10             mov    %rdx,0x10(%rax)
         :              if (status->curEntry == NULL)           /* end of this bucket */
    0.00 :        90eb56:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90eb5a:       48 8b 40 10             mov    0x10(%rax),%rax
    0.00 :        90eb5e:       48 85 c0                test   %rax,%rax
    0.00 :        90eb61:       75 04                   jne    90eb67 <hash_seq_search+0x19b>
         :                      ++curBucket;
    0.00 :        90eb63:       83 45 e4 01             addl   $0x1,-0x1c(%rbp)
         :              status->curBucket = curBucket;
    0.00 :        90eb67:       48 8b 45 a8             mov    -0x58(%rbp),%rax
    0.00 :        90eb6b:       8b 55 e4                mov    -0x1c(%rbp),%edx
    0.00 :        90eb6e:       89 50 08                mov    %edx,0x8(%rax)
         :              return (void *) ELEMENTKEY(curElem);
    0.00 :        90eb71:       48 8b 45 d8             mov    -0x28(%rbp),%rax
    0.00 :        90eb75:       48 83 c0 10             add    $0x10,%rax
         :      }
    0.00 :        90eb79:       c9                      leaveq 

Sorted summary for file /lib64/libc-2.12.so
----------------------------------------------

    4.34 ??:0
    3.71 ??:0
    3.24 ??:0
    2.93 ??:0
    2.93 ??:0
    略

perf bench

用来测试系统的一些常见指标的性能(如IPC, message or pipe, memcpy)。

       perf-bench - General framework for benchmark suites

SYNOPSIS
       perf bench [<common options>] <subsystem> <suite> [<options>]
SUBSYSTEM
       sched
           Scheduler and IPC mechanisms.

   SUITES FOR sched
       messaging
           Suite for evaluating performance of scheduler and IPC mechanisms. Based on hackbench by Rusty Russell.

       Options of messaging
           -p, --pipe
               Use pipe() instead of socketpair()

           -t, --thread
               Be multi thread instead of multi process

           -g, --group=
               Specify number of groups

           -l, --loop=
               Specify number of loops

       Example of messaging
               .ft C
               % perf bench sched messaging                 # run with default
               options (20 sender and receiver processes per group)
               (10 groups == 400 processes run)

                     Total time:0.308 sec

               % perf bench sched messaging -t -g 20        # be multi-thread, with 20 groups
               (20 sender and receiver threads per group)
               (20 groups == 800 threads run)

                     Total time:0.582 sec
               .ft

       pipe
           Suite for pipe() system call. Based on pipe-test-1m.c by Ingo Molnar.

       Options of pipe
           -l, --loop=
               Specify number of loops.

       Example of pipe
               .ft C
               % perf bench sched pipe
               (executing 1000000 pipe operations between two tasks)

                       Total time:8.091 sec
                               8.091833 usecs/op
                               123581 ops/sec

               % perf bench sched pipe -l 1000              # loop 1000
               (executing 1000 pipe operations between two tasks)

                       Total time:0.016 sec
                               16.948000 usecs/op
                               59004 ops/sec
               .ft

例子

测试进程或线程通信性能

使用200个进程
#perf bench sched messaging -g 200 -l 100
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 200 groups == 8000 processes run

     Total time: 2.665 [sec]

使用200个线程代替进程
#perf bench sched messaging -g 200 -l 100 -t
# Running sched/messaging benchmark...
# 20 sender and receiver threads per group
# 200 groups == 8000 threads run

     Total time: 2.025 [sec]

使用线程, 使用pipe()代替socketpair()
#perf bench sched messaging -g 200 -l 100 -t -p
# Running sched/messaging benchmark...
# 20 sender and receiver threads per group
# 200 groups == 8000 threads run

     Total time: 1.028 [sec]

使用进程, 使用pipe()代替socketpair()
#perf bench sched messaging -g 200 -l 100 -p
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 200 groups == 8000 processes run

     Total time: 1.126 [sec]

从测试来看，线程, pipe()通信效率是最高的。

测试pipe()

#perf bench sched pipe -l 100000
# Running sched/pipe benchmark...
# Executed 100000 pipe operations between two tasks

     Total time: 0.785 [sec]

       7.851860 usecs/op
         127358 ops/sec

测试所有，包括memcpy

#perf bench all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.102 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 7.967 [sec]

       7.967202 usecs/op
         125514 ops/sec

# Running mem/memcpy benchmark...
# Copying 1MB Bytes from 0x7f360c80c010 to 0x7f360f5d9010 ...

     855.431993 MB/Sec

perf stat

perf 除了可以采样(使用perf record)（包括call stack trace），还可以用于event计数。

perf stat就是用于event计数的，可以跟踪指定命令的event计数。

NAME
       perf-stat - Run a command and gather performance counter statistics

SYNOPSIS
       perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
       perf stat [-e <EVENT> | --event=EVENT] [-a] — <command> [<options>]

DESCRIPTION
       This command runs a command and gathers performance counter statistics from it.

跟踪指定command，或者pid, tid，或指定cgroup的event计数
       <command>...
           Any command you can specify in a shell.
       -r, --repeat=<n>  重复执行N次命令
           repeat command and print average + stddev (max: 100)
       -p, --pid=<pid>
           stat events on existing process id
       -t, --tid=<tid>
           stat events on existing thread id
       -G name, --cgroup name
           monitor only in the container (cgroup) called "name". This option is available only in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to container "name" are monitored when
           they run on the monitored CPUs. Multiple cgroups can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup to first event, second cgroup to second event and so on. It is
           possible to provide an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have corresponding events, i.e., they always refer to events defined earlier on the command line.

只跟踪指定事件
       -e, --event=
           Select the PMU event. 
       Selection can be a symbolic event name (use perf list to list all events) or a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a hexadecimal event descriptor.

跟踪所有CPU
       -a, --all-cpus
           system-wide collection from all CPUs

输出相关
       -c, --scale
           scale/normalize counter values
       -B, --big-num
           print large numbers with thousands′ separators according to locale
       -v, --verbose
           be more verbose (show counter open errors, etc)

例子

124331 digoal  20   0 93.0g  38g  38g S  0.0 20.6   0:28.21 /home/digoal/pgsql9.6/bin/postgres                                                                                                                                           
124589 digoal  20   0  125m  972  464 S  0.0  0.0   0:00.03 postgres: logger process                                                                                                                                                       
124592 digoal  20   0 93.0g 2.1g 2.1g S  0.0  1.1   0:11.64 postgres: checkpointer process                                                                                                                                                 
124593 digoal  20   0 93.0g 2.1g 2.1g S  0.0  1.1   0:13.18 postgres: writer process                                                                                                                                                       
124594 digoal  20   0 93.0g 129m 128m S  0.0  0.1   0:03.82 postgres: wal writer process                                                                                                                                                   
124595 digoal  20   0 93.0g 2172 1360 S  0.0  0.0   0:00.43 postgres: autovacuum launcher process                                                                                                                                          
124596 digoal  20   0  127m 1076  548 S  0.0  0.0   0:00.33 postgres: archiver process   failed on 00000001000000820000002D                                                                                                                
124597 digoal  20   0  127m 1132  560 S  0.0  0.0   0:00.93 postgres: stats collector process

pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 1000

跟踪10秒 wal writer进程

#perf stat -p 124594 -a -c -B -v sleep 10

task-clock: 240196875907 240196865902 240196818048
context-switches: 966775 240197106316 240197074990
CPU-migrations: 24926 240197787932 240197772224
page-faults: 350142 240196568475 240196568475
cycles: 376379394416 239735813498 198569141526
stalled-cycles-frontend: 305098988594 239733245755 199380423076
stalled-cycles-backend: 205940630988 239730830300 160173568784
instructions: 233667165879 239725298183 200717526856
branches: 38305737741 239719766523 201249886312
branch-misses: 524692310 239712708191 199567372892

 Performance counter stats for process id '124594':

     240196.875907 task-clock                #   23.997 CPUs utilized           [100.00%]
           966,775 context-switches          #    0.004 M/sec                   [100.00%]
            24,926 CPU-migrations            #    0.000 M/sec                   [100.00%]
           350,142 page-faults               #    0.001 M/sec                  
   376,379,394,416 cycles                    #    1.567 GHz                     [82.83%] 
   305,098,988,594 stalled-cycles-frontend   #   81.06% frontend cycles idle    [83.17%]
   205,940,630,988 stalled-cycles-backend    #   54.72% backend  cycles idle    [66.81%]
   233,667,165,879 instructions              #    0.62  insns per cycle     注意一个cycle只执行了0.62条指令，说明有大量的等待，并没有充分发挥CPU的性能，也说明了系统存在瓶颈。       
                                             #    1.31  stalled cycles per insn [83.73%]  每个指令的空闲CPU周期，越大说明CPU越闲，没干活。  
    38,305,737,741 branches                  #  159.476 M/sec                   [83.95%]
       524,692,310 branch-misses             #    1.37% of all branches         [83.25%]

      10.009649400 seconds time elapsed

如果insns per cycle很大，但是实际软件的运行效率确很低时，可能是类似SPIN LOCK，导致CPU空转，实际没有干活。

如果你想跟踪PostgreSQL数据库，可以把数据库的所有进程塞到CGROUP里，然后使用perf stat -G cgname统计整个CGROUP。

perf stat还支持-e指定事件，事件支持通配符。

$ sudo perf stat -e block:block_rq_*,syscalls:sys_enter_write,syscalls:sys_enter_fsync -a -r 5 -- psql -q -U postgres craig -c "drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;";

 Performance counter stats for 'psql -U postgres craig -c drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;' (5 runs):

     0 block:block_rq_abort                       [100.00%]
     0 block:block_rq_requeue                     [100.00%]
    97 block:block_rq_complete      ( +- 14.82% ) [100.00%]
    96 block:block_rq_insert        ( +- 14.97% ) [100.00%] 
    98 block:block_rq_issue         ( +- 14.67% ) [100.00%]
     0 block:block_rq_remap                       [100.00%]
10,607 syscalls:sys_enter_write     ( +-  0.17% ) [100.00%]
     1 syscalls:sys_enter_fsync                                    

0.908835058 seconds time elapsed    ( +- 18.31% )

或者
perf stat -e block:*,syscalls:* -a -r 5 -- psql -q -U postgres craig -c "drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;";

perf event modifier

使用instruction:modifier可以指定要跟踪的instruction在哪里？在kernel space或user space，又或者在虚拟机，虚拟机OS，宿主机OS等。

modifier用法如下，写在event:后面。

Modifiers	Description	Example
u	monitor at priv level 3, 2, 1 (user)	event:u
k	monitor at priv level 0 (kernel)	event:k
h	monitor hypervisor events on a virtualization environment	event:h
H	monitor host machine on a virtualization environment	event:H
G	monitor guest machine on a virtualization environment	event:G

例子 -e instructions:u

perf stat -e instructions:u dd if=/dev/zero of=/dev/null count=100000

perf高级用法 - 动态跟踪

perf probe可以实现动态跟踪，指哪打哪。静态跟踪是预置的，而动态跟踪是补充预置不足的。

比如你想跟踪kernel的某个function, 甚至某一行代码，某些变量的值。

或者你想跟踪用户软件的某个function，甚至某一行代码，某些变量的值。

首先要添加需要动态跟踪的对象（function, var, ...）

然后record，和report分析，这和前面的用法是一样的。

例子

Listing variables available for tcp_sendmsg():

# perf probe -V tcp_sendmsg
Available variables at tcp_sendmsg
        @<tcp_sendmsg+0>
                size_t  size
                struct kiocb*   iocb
                struct msghdr*  msg
                struct sock*    sk
Creating a probe for tcp_sendmsg() with the "size" variable:

# perf probe --add 'tcp_sendmsg size'
Added new event:
  probe:tcp_sendmsg    (on tcp_sendmsg with size)

You can now use it in all perf tools, such as:

    perf record -e probe:tcp_sendmsg -aR sleep 1
Tracing this probe:

# perf record -e probe:tcp_sendmsg -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.052 MB perf.data (~2252 samples) ]
# perf script
# ========
# captured on: Fri Jan 31 23:49:55 2014
# hostname : dev1
# os release : 3.13.1-ubuntu-12-opt
# perf version : 3.13.1
# arch : x86_64
# nrcpus online : 2
# nrcpus avail : 2
# cpudesc : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
# cpuid : GenuineIntel,6,44,2
# total memory : 1796024 kB
# cmdline : /usr/bin/perf record -e probe:tcp_sendmsg -a 
# event : name = probe:tcp_sendmsg, type = 2, config = 0x1dd, config1 = 0x0, config2 = ...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: software = 1, tracepoint = 2, breakpoint = 5
# ========
#
            sshd  1301 [001]   502.424719: probe:tcp_sendmsg: (ffffffff81505d80) size=b0
            sshd  1301 [001]   502.424814: probe:tcp_sendmsg: (ffffffff81505d80) size=40
            sshd  2371 [000]   502.952590: probe:tcp_sendmsg: (ffffffff81505d80) size=27
            sshd  2372 [000]   503.025023: probe:tcp_sendmsg: (ffffffff81505d80) size=3c0
            sshd  2372 [001]   503.203776: probe:tcp_sendmsg: (ffffffff81505d80) size=98
            sshd  2372 [001]   503.281312: probe:tcp_sendmsg: (ffffffff81505d80) size=2d0
            sshd  2372 [001]   503.461358: probe:tcp_sendmsg: (ffffffff81505d80) size=30
            sshd  2372 [001]   503.670239: probe:tcp_sendmsg: (ffffffff81505d80) size=40
            sshd  2372 [001]   503.742565: probe:tcp_sendmsg: (ffffffff81505d80) size=140
            sshd  2372 [001]   503.822005: probe:tcp_sendmsg: (ffffffff81505d80) size=20
            sshd  2371 [000]   504.118728: probe:tcp_sendmsg: (ffffffff81505d80) size=30
            sshd  2371 [000]   504.192575: probe:tcp_sendmsg: (ffffffff81505d80) size=70
[...]
The size is shown as hexadecimal.

跟踪某行代码

# perf probe -L tcp_sendmsg
<tcp_sendmsg@/mnt/src/linux-3.14.5/net/ipv4/tcp.c:0>
      0  int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                        size_t size)
      2  {
                struct iovec *iov;
                struct tcp_sock *tp = tcp_sk(sk);
                struct sk_buff *skb;
      6         int iovlen, flags, err, copied = 0;
      7         int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
                bool sg;
                long timeo;
[...]
     79                 while (seglen > 0) {
                                int copy = 0;
     81                         int max = size_goal;

                                skb = tcp_write_queue_tail(sk);
     84                         if (tcp_send_head(sk)) {
     85                                 if (skb->ip_summed == CHECKSUM_NONE)
                                                max = mss_now;
     87                                 copy = max - skb->len;
                                }

     90                         if (copy <= 0) {
         new_segment:
[...]


# perf probe -V tcp_sendmsg:81
Available variables at tcp_sendmsg:81
        @<tcp_sendmsg+537>
                bool    sg
                int     copied
                int     copied_syn
                int     flags
                int     mss_now
                int     offset
                int     size_goal
                long int        timeo
                size_t  seglen
                struct iovec*   iov
                struct sock*    sk
                unsigned char*  from


Now lets trace line 81, with the seglen variable that is checked in the loop:

# perf probe --add 'tcp_sendmsg:81 seglen'
Added new event:
  probe:tcp_sendmsg    (on tcp_sendmsg:81 with seglen)

You can now use it in all perf tools, such as:

    perf record -e probe:tcp_sendmsg -aR sleep 1

# perf record -e probe:tcp_sendmsg -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.188 MB perf.data (~8200 samples) ]
# perf script
            sshd  4652 [001] 2082360.931086: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x80
   app_plugin.pl  2400 [001] 2082360.970489: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x20
        postgres  2422 [000] 2082360.970703: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x52
   app_plugin.pl  2400 [000] 2082360.970890: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x7b
        postgres  2422 [001] 2082360.971099: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0xb
   app_plugin.pl  2400 [000] 2082360.971140: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x55
[...]

跟踪用户软件的指定function

# perf probe -x /lib/x86_64-linux-gnu/libc-2.15.so --add malloc
Added new event:
  probe_libc:malloc    (on 0x82f20)

You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

Tracing it system-wide:

# perf record -e probe_libc:malloc -a
^C[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 3.522 MB perf.data (~153866 samples) ]
The report:

# perf report -n
[...]
# Samples: 45K of event 'probe_libc:malloc'
# Event count (approx.): 45158
#
# Overhead       Samples          Command  Shared Object      Symbol
# ........  ............  ...............  .............  ..........
#
    42.72%         19292       apt-config  libc-2.15.so   [.] malloc
    19.71%          8902             grep  libc-2.15.so   [.] malloc
     7.88%          3557             sshd  libc-2.15.so   [.] malloc
     6.25%          2824              sed  libc-2.15.so   [.] malloc
     6.06%          2738            which  libc-2.15.so   [.] malloc
     4.12%          1862  update-motd-upd  libc-2.15.so   [.] malloc
     3.72%          1680             stat  libc-2.15.so   [.] malloc
     1.68%           758            login  libc-2.15.so   [.] malloc
     1.21%           546        run-parts  libc-2.15.so   [.] malloc
     1.21%           545               ls  libc-2.15.so   [.] malloc
     0.80%           360        dircolors  libc-2.15.so   [.] malloc
     0.56%           252               tr  libc-2.15.so   [.] malloc
     0.54%           242              top  libc-2.15.so   [.] malloc
     0.49%           222       irqbalance  libc-2.15.so   [.] malloc
     0.44%           200             dpkg  libc-2.15.so   [.] malloc
     0.38%           173         lesspipe  libc-2.15.so   [.] malloc
     0.29%           130  update-motd-fsc  libc-2.15.so   [.] malloc
     0.25%           112            uname  libc-2.15.so   [.] malloc
     0.24%           108              cut  libc-2.15.so   [.] malloc
     0.23%           104           groups  libc-2.15.so   [.] malloc
     0.21%            94  release-upgrade  libc-2.15.so   [.] malloc
     0.18%            82        00-header  libc-2.15.so   [.] malloc
     0.14%            62             mesg  libc-2.15.so   [.] malloc
     0.09%            42  update-motd-reb  libc-2.15.so   [.] malloc
     0.09%            40             date  libc-2.15.so   [.] malloc
     0.08%            35             bash  libc-2.15.so   [.] malloc
     0.08%            35         basename  libc-2.15.so   [.] malloc
     0.08%            34          dirname  libc-2.15.so   [.] malloc
     0.06%            29               sh  libc-2.15.so   [.] malloc
     0.06%            26        99-footer  libc-2.15.so   [.] malloc
     0.05%            24              cat  libc-2.15.so   [.] malloc
     0.04%            18             expr  libc-2.15.so   [.] malloc
     0.04%            17         rsyslogd  libc-2.15.so   [.] malloc
     0.03%            12             stty  libc-2.15.so   [.] malloc
     0.00%             1             cron  libc-2.15.so   [.] malloc
This shows the most malloc() calls were by apt-config, while I was tracing.

User: malloc() with size

As of the Linux 3.13.1 kernel, this is not supported yet:

# perf probe -x /lib/x86_64-linux-gnu/libc-2.15.so --add 'malloc size'
Debuginfo-analysis is not yet supported with -x/--exec option.
  Error: Failed to add events. (-38)
As a workaround, you can access the registers (on Linux 3.7+). For example, on x86_64:

# perf probe -x /lib64/libc-2.17.so '--add=malloc size=%di'
       probe_libc:malloc    (on 0x800c0 with size=%di)

perf tracepoint

event中的一种类型，实际上是一些比较常见的系统调用。

不在里面的可以使用前面介绍的动态跟踪的方式进行跟踪。

支持哪些tracepoint

perf list | awk -F: '/Tracepoint event/ { lib[$1]++ } END {for (l in lib) { printf "  %-16s %d\n", l, lib[l] } }' | sort | column

    block          18       jbd2           11       kvmmmu         9        napi           1        sched          15       skb            3        timer          12       writeback      16
    ext4           46       kmem           42       mce            1        net            4        scsi           5        sock           2        udp            1        xfs            314
    irq            5        kvm            21       module         5        power          3        signal         2        syscalls       548      workqueue      4

perf list 
......
  xfs:xfs_attr_list_sf                               [Tracepoint event]
  xfs:xfs_attr_list_sf_all                           [Tracepoint event]
  xfs:xfs_attr_list_leaf                             [Tracepoint event]
  xfs:xfs_attr_list_leaf_end                         [Tracepoint event]
  xfs:xfs_attr_list_full                             [Tracepoint event]
  xfs:xfs_attr_list_add                              [Tracepoint event]
......

主要包含以下tracepoint subtype

block: block device I/O
ext3, ext4: file system operations
kmem: kernel memory allocation events
random: kernel random number generator events
sched: CPU scheduler events
syscalls: system call enter and exits
task: task events

例子

I used perf_events to record the block request (disk I/O) issue and completion static tracepoints:

# perf record -e block:block_rq_issue -e block:block_rq_complete -a sleep 120
[ perf record: Woken up 36 times to write data ]
[ perf record: Captured and wrote 8.885 MB perf.data (~388174 samples) ]
# perf script
[...]
     randread.pl  2522 [000]  6011.824759: block:block_rq_issue: 254,16 R 0 () 7322849 + 16 [randread.pl]
     randread.pl  2520 [000]  6011.824866: block:block_rq_issue: 254,16 R 0 () 26144801 + 16 [randread.pl]
         swapper     0 [000]  6011.828913: block:block_rq_complete: 254,16 R () 31262577 + 16 [0]
     randread.pl  2521 [000]  6011.828970: block:block_rq_issue: 254,16 R 0 () 70295937 + 16 [randread.pl]
         swapper     0 [000]  6011.835862: block:block_rq_complete: 254,16 R () 26144801 + 16 [0]
     randread.pl  2520 [000]  6011.835932: block:block_rq_issue: 254,16 R 0 () 5495681 + 16 [randread.pl]
         swapper     0 [000]  6011.837988: block:block_rq_complete: 254,16 R () 7322849 + 16 [0]
     randread.pl  2522 [000]  6011.838051: block:block_rq_issue: 254,16 R 0 () 108589633 + 16 [randread.pl]
         swapper     0 [000]  6011.850615: block:block_rq_complete: 254,16 R () 108589633 + 16 [0]
[...]

绘制perf火焰图

使用perf report -tui或者-stdio输出的文本不够直观的话，使用火焰图可以很直观的表现出哪些代码是瓶颈所在。

压测
$pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 100

收集统计信息
#perf record -a -g -v sleep 30

生成火焰图

# git clone https://github.com/brendangregg/FlameGraph      # or download it from github
# mv perf.data FlameGraph/
# cd FlameGraph
# perf script | ./stackcollapse-perf.pl > out.perf-folded
# cat out.perf-folded | ./flamegraph.pl > perf-kernel.svg

绘制perf热力图

压测
$pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 100

收集统计信息
#perf record -a -g -v sleep 30

生成热力图

# git clone https://github.com/brendangregg/HeatMap      # or download it from github
# mv perf.data HeatMap/
# cd HeatMap
# perf script | awk '{ gsub(/:/, "") } $5 ~ /issue/ { ts[$6, $10] = $4 }
    $5 ~ /complete/ { if (l = ts[$6, $9]) { printf "%.f %.f\n", $4 * 1000000,
    ($4 - l) * 1000000; ts[$6, $10] = 0 } }' > out.lat_us
# ./trace2heatmap.pl --unitstime=us --unitslat=us --maxlat=50000 out.lat_us > out.svg

小结

要完备的跟踪和打印跟踪（符号表、call stack trace、汇编指令）信息，建议内核编译时加上

CONFIG_KALLSYMS=y
CONFIG_FRAME_POINTER=y

编译perf时需要支持libunwind, 并加上

gcc -g dwarf

软件编译时加上

gcc -g -ggdb -fno-omit-frame-pointer

如果是yum安装的软件，可以安装对应的debuginfo包。

参考

1. http://www.brendangregg.com/perf.html

2. https://perf.wiki.kernel.org/index.php/Main_Page

3. http://www.linux-kongress.org/2010/slides/lk2010-perf-acme.pdf

4. https://perf.wiki.kernel.org/index.php/Tutorial

5. 火焰图

https://github.com/brendangregg/FlameGraph

6. 热力图

https://github.com/brendangregg/HeatMap

7. https://github.com/brendangregg/perf-tools

8. https://kernel.org/

Linux 性能诊断 perf使用指南

标签

背景