The implementation of numatop
At 霸爷's recommendation I read through the numatop source code. numatop presents system statistics from the NUMA angle: CPU utilization, CPI, memory access hotspots, RMA, LMA, call stacks, and so on.
Very nice, very powerful.
Like perf, numatop is built on top of the kernel's PerfEvent subsystem and observes the various hardware counters through it.
The difference is that numatop collects and aggregates the data per NUMA node.
PerfEvent in turn gets its data from the processor's PMU (Performance Monitoring Unit).
PMU
The Performance Monitoring Unit is a hardware unit provided by the CPU. It exposes various event counters, and can deliver a signal (e.g. SIGIO, with si_code POLL_HUP) when a counter overflows.
See the Intel Software Developer's Manual, section 18.8 and the chapters that follow it.
Running numatop
numatop officially supports kernels 3.8 and above, on Xeon E5/E7 series hardware.
Our development machines run 2.6.32, so the features that require a 3.8 kernel have to be switched off.

vim intel/wsm.c

static plat_event_config_t s_wsmep_profiling[COUNT_NUM] = {
	{ PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, 0x53, 0, "cpu_clk_unhalted.core" },
	{ PERF_TYPE_RAW, 0x01B7, 0x53, 0x2011, "off_core_response_0" },
	{ PERF_TYPE_HARDWARE, 1, 0x53, 0, "cpu_clk_unhalted.ref" },
	{ PERF_TYPE_HARDWARE, 2, 0x53, 0, "instr_retired.any" },
	{ PERF_TYPE_RAW, INVALID_CODE_UMASK, 0, 0, "off_core_response_1" }
};

Then build and run:

make
./numatop
Main screen
The per-node view
The per-thread view within a single process
How does numatop organize the different events on the different CPUs of the different nodes?
How does numatop discover the NUMA layout?
Getting the node list

[zunbao.fengzb@rds064071.sqa.cm4 e2e-qos-0.8]$ cat /sys/devices/system/node/online
0-1

Getting the CPU list of each node

[zunbao.fengzb@rds064071.sqa.cm4 e2e-qos-0.8]$ cat /sys/devices/system/node/node0/cpulist
0-3,8-11
[zunbao.fengzb@rds064071.sqa.cm4 e2e-qos-0.8]$ cat /sys/devices/system/node/node1/cpulist
4-7,12-15
How to get detailed CPU information?
Using the cpuid instruction
For more on cpuid, see the Intel manual or http://en.wikipedia.org/wiki/CPUID
__asm volatile("cpuid\n\t"
               : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
               : "a" (*eax));
This is inline assembly:
Outputs: bound, in order, to the eax, ebx, ecx and edx registers.
Input: bound to the value of the local variable eax, which selects the cpuid leaf.
cpuid return values - getting the CPU vendor string
With eax=0, cpuid returns the CPU's manufacturer ID string: 12 characters stored, in order, in EBX, EDX, ECX.

"AMDisbetter!" – early engineering samples of AMD K5 processor
"AuthenticAMD" – AMD
"CentaurHauls" – Centaur (Including some VIA CPU)
"CyrixInstead" – Cyrix
"GenuineIntel" – Intel
"TransmetaCPU" – Transmeta
"GenuineTMx86" – Transmeta
"Geode by NSC" – National Semiconductor
"NexGenDriven" – NexGen
"RiseRiseRise" – Rise
"SiS SiS SiS " – SiS
"UMC UMC UMC " – UMC
"VIA VIA VIA " – VIA
"Vortex86 SoC" – Vortex
"KVMKVMKVMKVM" – KVM
"Microsoft Hv" – Microsoft Hyper-V or Windows Virtual PC
"VMwareVMware" – VMware
"XenVMMXenVMM" – Xen HVM
cpuid return values - getting the CPU family and model information
With eax=1, cpuid returns the stepping, model, and family information. The only return value is eax itself.
The bit fields of eax are interpreted as follows.

3:0   – Stepping
7:4   – Model
11:8  – Family
13:12 – Processor Type
19:16 – Extended Model
27:20 – Extended Family

For the exact interpretation, refer to the corresponding vendor's developer manual.
Via the cpuid tool
There is already a tool of the same name, cpuid, which uses the CPU's cpuid instruction to dump the information of every CPU.

sudo emerge -avt sys-apps/cpuid

Output of cpuid:

CPU 0:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
      model           = 0xa (10)
      stepping id     = 0x9 (9)
      extended family = 0x0 (0)
      extended model  = 0x3 (3)
      (simple synth)  = Intel Core i3-3000 (Ivy Bridge L1) / i5-3000 (Ivy Bridge E1/N0/L1) / i7-3000 (Ivy Bridge E1) / Mobile Core i3-3000 (Ivy Bridge L1) / i5-3000 (Ivy Bridge L1) / Mobile Core i7-3000 (Ivy Bridge E1/L1) / Xeon E3-1200 v2 (Ivy Bridge E1/N0/L1) / Pentium G1600/G2000/G2100 (Ivy Bridge P0) / Pentium 900/1000/2000/2100 (P0), 22nm
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x0 (0)
      cpu count                      = 0x10 (16)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip               = true
      virtual-8086 mode enhancement = true
      debugging extensions          = true
      page size extensions          = true
      time stamp counter            = true
      RDMSR and WRMSR support       = true
      physical address extensions   = true
      machine check exception       = true
      CMPXCHG8B inst.               = true
      ... ... ...

Extremely detailed, covering the TLB, SYSCALL, caches, and more.
Using perf_event_open to obtain the counters

// On overflow you can either take a signal (SIGIO) or use epoll/select.
// Here SIGIO is used.

#define _GNU_SOURCE   /* for F_SETSIG and siginfo_t.si_fd */
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// perf_event_open has no glibc wrapper; invoke it via syscall(2).
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static long long g_cnt = 0;

// Signal handler, invoked on every counter overflow.
static void perf_event_handler(int signum, siginfo_t *info, void *ucontext)
{
    printf("In signal handler, used %lld instructions\n", (++g_cnt) * 1000);
    // Re-arm the counter for the next overflow.
    ioctl(info->si_fd, PERF_EVENT_IOC_REFRESH, 1);
}

int main()
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(struct sigaction));
    sa.sa_sigaction = perf_event_handler;
    sa.sa_flags = SA_SIGINFO;
    // Register the SIGIO handler.
    // When the counter overflows, the kernel sends SIGIO to the process.
    sigaction(SIGIO, &sa, NULL);

    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    // Count retired instructions.
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    // Event is initially disabled.
    pe.disabled = 1;
    pe.sample_type = PERF_SAMPLE_IP;
    // Sampling period: the register starts at 1000 and is decremented on
    // every event; when it reaches 0 a signal is triggered.
    pe.sample_period = 1000;
    // Exclude events that happen in the kernel and in the hypervisor.
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    // Count the current process.
    pid_t pid = 0;
    // On all CPUs.
    int cpu = -1;
    // This event is the group leader.
    // Several events can be opened at once: the leader passes group_fd = -1,
    // the others pass the fd returned by the leader's perf_event_open.
    // A group is only put on the PMU when all of its events can be counted
    // together.
    int group_fd = -1;
    unsigned long flags = 0;

    // Open the event.
    int fd = perf_event_open(&pe, pid, cpu, group_fd, flags);

    // Deliver overflow notifications as signals.
    fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC);
    fcntl(fd, F_SETSIG, SIGIO);
    fcntl(fd, F_SETOWN, getpid());

    // Reset the counter to 0 and enable it.
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);

    // Generate some payload.
    long loopCount = 1000000;
    long c = 0;
    long i = 0;
    for (i = 0; i < loopCount; i++) {
        c += 1;
    }

    // Disable the event.
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    // Read the final counter value.
    long long counter;
    read(fd, &counter, sizeof(long long));
    printf("Used %lld instructions\n", counter);
    close(fd);
}
Appendix: the dot script for the diagram
dot -Tpng "/tmp/numa_cpu_node.gv" > "/tmp/numa_cpu_node.png"
// /tmp/numa_cpu_node.gv
digraph G {
	node [shape=record, style=filled];

	group[fillcolor=green,label="{node_group|<node>node_t [64]}"];
	node1[fillcolor=orange, label="{node_t|<cpu>perf_cpu_t cpus[64]|int nid}"];
	node2[fillcolor=orange, label="{node_t|<cpu>perf_cpu_t cpus[64]|int nid}"];

	cpu11[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
	cpu12[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
	cpu13[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
	cpu21[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
	cpu22[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];
	cpu23[fillcolor=slateblue1, label="{perf_cpu_t|int cpuid|<fds>int fds[5]|void * map_base}"];

	fd11[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
	fd12[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
	fd13[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
	fd21[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
	fd22[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];
	fd23[fillcolor=wheat, label="{event_group|fd1|fd2|fd3|fd4...}"];

	group -> node1;
	group -> node2;
	node1:cpu -> cpu11; cpu11:fds -> fd11;
	node1:cpu -> cpu12; cpu12:fds -> fd12;
	node1:cpu -> cpu13; cpu13:fds -> fd13;
	node2:cpu -> cpu21; cpu21:fds -> fd21;
	node2:cpu -> cpu22; cpu22:fds -> fd22;
	node2:cpu -> cpu23; cpu23:fds -> fd23;
}