分布式通讯优化篇 – IRQ affinity

简介:       在一次C500K性能压测过程中,发现一个问题:8 processor的CPU,负载基本集中在CPU0,并且负载达到70以上,并通过mpstat发现CPU0每秒总中断(%irq+%soft)次数比较高。       基于对此问题的研究,解决和思考,便有了这篇文章,希望大家能够喜欢,也欢迎大家留言讨论。       在正文开始之前,我们先来看两个跟性能相关的基本概念:中断与上线

      在一次C500K性能压测过程中,发现一个问题:8 processor的CPU,负载基本集中在CPU0,并且负载达到70以上,并通过mpstat发现CPU0每秒总中断(%irq+%soft)次数比较高。

      基于对此问题的研究,解决和思考,便有了这篇文章,希望大家能够喜欢,也欢迎大家留言讨论。

      在正文开始之前,我们先来看两个跟性能相关的基本概念:中断与上线文切换(在实际场景中,发现90%以上的同学说不清楚),希望这篇文章能带给你一些帮助,如果有疑问,欢迎交流。


      中断


        Hardware interrupts are used by devices to communicate that they require attention from the operating system. Internally, hardware interrupts are implemented using electronic alerting signals that are sent to the processor from an external device, which is either a part of the computer itself, such as a disk controller, or an external peripheral. For example, pressing a key on the keyboard or moving the mouse triggers hardware interrupts that cause the processor to read the keystroke or mouse position. Unlike the software type (described below), hardware interrupts are asynchronous and can occur in the middle of instruction execution, requiring additional care in programming. The act of initiating a hardware interrupt is referred to as an interrupt request (IRQ).


         A software interrupt is caused either by an exceptional condition in the processor itself, or a special instruction in the instruction set which causes an interrupt when it is executed. The former is often called a trap or exception and is used for errors or events occurring during program execution that are exceptional enough that they cannot be handled within the program itself. For example, if the processor's arithmetic logic unit is commanded to divide a number by zero, this impossible demand will cause a divide-by-zero exception, perhaps causing the computer to abandon the calculation or display an error message. Software interrupt instructions function similarly to subroutine calls and are used for a variety of purposes, such as to request services from low-level system software such as device drivers. For example, computers often use software interrupt instructions to communicate with the disk controller to request data be read or written to the disk.


        硬中断,硬件中断CPU,通常是异步处理的;软中断,指令中断内核执行,分两种情况,一种是异常,另外一种类似subroutine calls,软中断可以用来实现system call。


      上线文切换


        In computing, a context switch is the process of storing and restoring the state (context) of a process or thread so that execution can be resumed from the same point at a later time. This enables multiple processes to share a single CPU and is an essential feature of a multitasking operating system. What constitutes the context is determined by the processor and the operating system.Context switches are usually computationally intensive, and much of the design of operating systems is to optimize the use of context switches. Switching from one process to another requires a certain amount of time for doing the administration – saving and loading registers and memory maps, updating various tables and lists etc. A context switch can mean a register context switch, a task context switch, as tack frame switch, a thread context switch, or a process context switch.

        上下文切换,发生在内核态,其诱因通常是密集型计算。system call仅仅是kernel mode switch,上下文切换有多种表现形式,如进程之间,线程之间,栈帧之间等。举个例子,比方所pidstat -w 统计出来的  cswch/s 和 nvcswch/s 两个指标就是进程维度的。当然,它的实现也是很直观的,尝试一下如下命令,grep ctxt /proc/${process_id}/status,一目了然。

        问题来了,中断和上下文切换之间究竟存在什么样的数理关系?翻阅了很多文献资料,无果而返。最后去check linux kernal代码中关于cs ,%soft和%irq的统计逻辑,发现中断统计的实现是靠Read stats from /proc/interrupts or /proc/softirqs。更详细的实现请参看here。而进程的上下文切换是通过watch /proc/${process_id}/status来实现的。至于两者的数理关系,至少依靠现在的知识体系还无法拿到,欢迎不吝赐教。


        Ok,下面我们来看一下整个亲核优化过程中所需要掌握的基本技巧:RPS/RFS,irqbalance和irq affinity!

      

      RPS/RFS - Receive Package Steering/Receive Flow Steering


        Google同学开发的patch,从2.6.35开始加入到kernel中。简单来说,其原理是利用hash算法来hash TCP或者 UDP的 package header,并根据应用所在的CPU去选择软中断所需要的CPU。文档中有一句话,最能概括它的使用场景,如下。大致意思是说网卡单队列模式以及队列数少于CPU核数的场景下,如果能保证共享内存,用它无疑是最佳神器。

        For a single queue device, a typical RPS configuration would be to set the rps_cpus to the CPUs in the same memory domain of the interrupting CPU. If NUMA locality is not an issue, this could also be all CPUs in the system. At high interrupt rate, it might be wise to exclude the interrupting CPU from the map since that already performs much work. For a multi-queue system, if RSS is configured so that a hardware receive queue is mapped to each CPU, then RPS is probably redundant and unnecessary. If there are fewer hardware queues than CPUs, then RPS might be beneficial if the rps_cpus for each queue are the ones that share the same memory domain as the interrupting CPU for that queue.

        那问题又来了,如何辨别多队列网卡?如何保障共享内存?提供一种思路,对于第一个问题,可以用命令

        lspci -vvv | grep 'Ethernet controller'
        

        如果有MSI-X && Enable+ && TabSize > 1,则该网卡是多队列网卡。对于第二个问题,可以考虑在lscpu的帮助下,将中断绑定到具体的物理CPU上。

      Irqbalance


        手册上是这么说的,distribute hardware interrupts across processors on a multiprocessor system。在SMP体系结构上问题还是蛮多的,可以参看Ubuntu的Bug追踪系统。当然,国内褚霸同学对其源码进行了详细分析,感兴趣的可以也参看这里

       SMP IRQ Affinity


        最后,来看一下kernel 2.4加入的SMP IRQ Affinity:

        An interrupt request (IRQ) is a request for service, sent at the hardware level. Interrupts can be sent by either a dedicated hardware line, or across a hardware bus as an information packet (a Message Signaled Interrupt, or MSI). When interrupts are enabled, receipt of an IRQ prompts a switch to interrupt context. Kernel interrupt dispatch code retrieves the IRQ number and its associated list of registered Interrupt Service Routines (ISRs), and calls each ISR in turn. The ISR acknowledges the interrupt and ignores redundant interrupts from the same IRQ, then queues a deferred handler to finish processing the interrupt and stop the ISR from ignoring future interrupts.

       /proc/interrupts列出了IRQ number, the number of that interrupt handled by each CPU core, the interrupt type, and a comma-delimited list of drivers that are registered to receive that interrupt. (Refer to the proc(5) man page for further details: man 5 proc).


       /proc/irq/IRQ_NUMBER/smp_affinity,smp_affinity是用来描述中断亲和特性的,this property can be used to improve application performance by assigning both interrupt affinity and the application's thread affinity to one or more specific CPU cores. This allows cache line sharing between the specified interrupt and application threads.

       如何验证你的中断亲核性设置是否OK呢?请参看下面的流程:


       a. 查看网卡中断号:

       cat /proc/interrupts

      b. 查看该中断号的cpu affinity:

       sudo cat /proc/irq/42/smp_affinity

      c. 修改绑定:

       sudo echo ff > /proc/irq/42/smp_affinity

      d. 访问特定网站:

       ping -f www.creative.com

      e. 查看中断绑定结果:

       cat /proc/interrupts | grep  'CPU\|42:'

小结:

       在我的多队列网卡中,手动绑定了SMP IRQ Affinity的值,并且排除了其它两种优化方式的干扰,解决掉了开篇提到的性能问题。 但我文章里提到的那个数理关系,还有代进一步挖掘,如果有更多的发现,会及时分享给大家,希望大家能够喜欢!


参考文档:

1. http://en.wikipedia.org/wiki/Interrupt
2. http://en.wikipedia.org/wiki/Context_switch
3. http://wenku.baidu.com/view/315d2c8571fe910ef12df838.html
4. https://www.kernel.org/doc/Documentation/networking/scaling.txt
5. http://kernelnewbies.org/Linux_2_6_35
6. https://cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt
7. http://www.linfo.org/context_switch.html
8. http://lwn.net/Articles/328339/
9. http://lwn.net/Articles/398385/
10. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/network-rps.html
11. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html smp_affinity

12. http://www.linfo.org/context_switch.html

13. http://www.softpanorama.org/Admin/Monitoring/Sar/linux_implementation_of_sar.shtml

14. http://choices.cs.uiuc.edu/ExpCS07.pdf

15. http://sebastien.godard.pagesperso-orange.fr/

16. http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html



目录
相关文章
|
3月前
|
人工智能 弹性计算 PyTorch
【Hello AI】安装和使用AIACC-ACSpeed-分布式训练场景的通信优化库
AIACC-ACSpeed专注于分布式训练场景的通信优化库,通过模块化的解耦优化设计,实现了分布式训练在兼容性、适用性和性能加速等方面的升级。本文为您介绍安装和使用AIACC-ACSpeed v1.1.0的方法。
|
3月前
|
人工智能 算法 PyTorch
【Hello AI】AIACC-ACSpeed-AI分布式训练通信优化库
AIACC-ACSpeed(AIACC 2.0-AIACC Communication Speeding)是阿里云推出的AI分布式训练通信优化库AIACC-Training 2.0版本。相比较于分布式训练AIACC-Training 1.5版本,AIACC-ACSpeed基于模块化的解耦优化设计方案,实现了分布式训练在兼容性、适用性和性能加速等方面的升级。
|
8月前
|
安全 调度 vr&ar
考虑设备动作损耗的配电网分布式电压无功优化(Matlab代码实现)
考虑设备动作损耗的配电网分布式电压无功优化(Matlab代码实现)
|
9月前
|
供应链 安全 新能源
基于主从博弈的社区综合能源系统分布式协同优化运行策略(Matlab代码实现)
基于主从博弈的社区综合能源系统分布式协同优化运行策略(Matlab代码实现)
112 0
|
9月前
|
安全 数据挖掘 新能源
考虑极端天气线路脆弱性的配电网分布式电源配置优化模型【IEEE33节点】(Matlab代码实现)
考虑极端天气线路脆弱性的配电网分布式电源配置优化模型【IEEE33节点】(Matlab代码实现)
|
6月前
|
固态存储 关系型数据库 MySQL
109分布式电商项目 - MySQL优化(服务器优化)
109分布式电商项目 - MySQL优化(服务器优化)
37 0
|
6月前
|
存储 关系型数据库 MySQL
108分布式电商项目 - MySQL优化(插入数据优化)
108分布式电商项目 - MySQL优化(插入数据优化)
32 0
|
6月前
|
关系型数据库 MySQL 数据库
107分布式电商项目 - MySQL优化(数据库结构优化)
107分布式电商项目 - MySQL优化(数据库结构优化)
39 0
|
6月前
|
SQL 关系型数据库 MySQL
106分布式电商项目 - MySQL优化(查询优化)
106分布式电商项目 - MySQL优化(查询优化)
48 0
|
8月前
|
算法 安全 新能源
基于粒子群优化算法的分布式电源优化调度实现配电网稳定运行(Matlab代码实现)
基于粒子群优化算法的分布式电源优化调度实现配电网稳定运行(Matlab代码实现)
112 0

热门文章

最新文章