Distributed Communication Optimization: IRQ Affinity


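      During a C500K performance stress test, I ran into a problem: on a machine with 8 processors, the load was concentrated almost entirely on CPU0, which was running at 70% or more, and mpstat showed that CPU0's total interrupt load (%irq + %soft) was high.

      This article grew out of investigating, fixing, and reflecting on that problem. I hope you find it useful, and comments are welcome.

      Before getting into the main content, let's look at two basic performance concepts: interrupts and context switches. In practice, I have found that more than 90% of engineers cannot explain them clearly, so I hope this article helps. If you have questions, feel free to get in touch.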


      Interrupts


        Hardware interrupts are used by devices to communicate that they require attention from the operating system. Internally, hardware interrupts are implemented using electronic alerting signals that are sent to the processor from an external device, which is either a part of the computer itself, such as a disk controller, or an external peripheral. For example, pressing a key on the keyboard or moving the mouse triggers hardware interrupts that cause the processor to read the keystroke or mouse position. Unlike the software type (described below), hardware interrupts are asynchronous and can occur in the middle of instruction execution, requiring additional care in programming. The act of initiating a hardware interrupt is referred to as an interrupt request (IRQ).


         A software interrupt is caused either by an exceptional condition in the processor itself, or a special instruction in the instruction set which causes an interrupt when it is executed. The former is often called a trap or exception and is used for errors or events occurring during program execution that are exceptional enough that they cannot be handled within the program itself. For example, if the processor's arithmetic logic unit is commanded to divide a number by zero, this impossible demand will cause a divide-by-zero exception, perhaps causing the computer to abandon the calculation or display an error message. Software interrupt instructions function similarly to subroutine calls and are used for a variety of purposes, such as to request services from low-level system software such as device drivers. For example, computers often use software interrupt instructions to communicate with the disk controller to request data be read or written to the disk.


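        In short: a hardware interrupt is raised by a device to interrupt the CPU and is usually asynchronous. A software interrupt is triggered by an instruction that interrupts execution and enters the kernel. It comes in two forms: exceptions, and instructions that behave like subroutine calls. Software interrupts can be used to implement system calls.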


      Context switches


        In computing, a context switch is the process of storing and restoring the state (context) of a process or thread so that execution can be resumed from the same point at a later time. This enables multiple processes to share a single CPU and is an essential feature of a multitasking operating system. What constitutes the context is determined by the processor and the operating system. Context switches are usually computationally intensive, and much of the design of operating systems is to optimize the use of context switches. Switching from one process to another requires a certain amount of time for doing the administration – saving and loading registers and memory maps, updating various tables and lists etc. A context switch can mean a register context switch, a task context switch, a stack frame switch, a thread context switch, or a process context switch.

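        Context switches happen in kernel mode and are usually computationally expensive. A system call by itself is only a switch into kernel mode, not a context switch. Context switches take several forms, for example between processes, between threads, or between stack frames. For instance, the cswch/s and nvcswch/s metrics reported by pidstat -w are per-process figures. Where they come from is easy to see: run grep ctxt /proc/${process_id}/status and the underlying counters are right there.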
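To make the grep above concrete, here is a minimal Python sketch that extracts the two counters from the text of /proc/<pid>/status. The sample text, process name, and counter values are made up for illustration, and the helper names are hypothetical. The counters are cumulative, so a rate like cswch/s would come from sampling twice and dividing the difference by the interval.

```python
import re

# Illustrative excerpt of /proc/<pid>/status; the name and values are made up.
SAMPLE_STATUS = """\
Name:\tnginx
Pid:\t1234
voluntary_ctxt_switches:\t150
nonvoluntary_ctxt_switches:\t7
"""


def parse_ctxt_switches(status_text):
    """Return (voluntary, nonvoluntary) context-switch counts from status text."""
    counts = {}
    for line in status_text.splitlines():
        match = re.match(r"(voluntary|nonvoluntary)_ctxt_switches:\s+(\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts["voluntary"], counts["nonvoluntary"]


def read_ctxt_switches(pid):
    """Read the counters of a live process (Linux only, not called here)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_ctxt_switches(handle.read())


print(parse_ctxt_switches(SAMPLE_STATUS))  # (150, 7)
```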

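        So here is the question: what is the quantitative relationship between interrupts and context switches? I went through a lot of literature and came up empty. In the end I checked the Linux kernel code that computes cs, %soft, and %irq and found that the interrupt statistics are read from /proc/interrupts or /proc/softirqs (see the implementation for more detail), while per-process context switches are obtained by watching /proc/${process_id}/status. As for the quantitative relationship between the two, I cannot derive it from what I know so far, so any pointers are welcome.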
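As a rough illustration of reading those files, the sketch below parses text in the layout of /proc/softirqs and computes each CPU's share of the NET_RX softirq count. The sample numbers are invented to mimic the CPU0-heavy situation described at the start, and the function names are hypothetical. The counters are cumulative since boot, so this is only a proxy for the time-based %soft figure that mpstat reports.

```python
# Illustrative text in the layout of /proc/softirqs; all counts are made up.
SAMPLE_SOFTIRQS = """\
                    CPU0       CPU1       CPU2       CPU3
          HI:          0          0          0          0
       TIMER:       5000       4800       5100       4900
      NET_TX:         10          0          0          0
      NET_RX:       9000        400        300        300
"""


def parse_softirqs(text):
    """Return {softirq_name: {cpu_name: count}} from /proc/softirqs-style text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        table[name.strip()] = dict(zip(cpus, (int(v) for v in rest.split())))
    return table


def share_per_cpu(row):
    """Return each CPU's fraction of the row total (0.0 if the row is empty)."""
    total = sum(row.values())
    return {cpu: (count / total if total else 0.0) for cpu, count in row.items()}


net_rx = parse_softirqs(SAMPLE_SOFTIRQS)["NET_RX"]
shares = share_per_cpu(net_rx)
busiest = max(shares, key=shares.get)
print(busiest, round(shares[busiest], 2))  # CPU0 0.9
```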


        OK, now let's look at the basic techniques you need for CPU affinity tuning: RPS/RFS, irqbalance, and IRQ affinity.

      

      RPS/RFS - Receive Packet Steering/Receive Flow Steering


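        This is a patch developed by engineers at Google and merged into the kernel in 2.6.35. In short, RPS hashes the TCP or UDP packet header to choose the CPU that handles the softirq, and RFS additionally takes into account the CPU where the consuming application is running. One passage in the documentation sums up the use case best (quoted below). Roughly, it says that for a single-queue NIC, or when there are fewer hardware queues than CPU cores, RPS is an excellent tool as long as the target CPUs share the same memory domain.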

        For a single queue device, a typical RPS configuration would be to set the rps_cpus to the CPUs in the same memory domain of the interrupting CPU. If NUMA locality is not an issue, this could also be all CPUs in the system. At high interrupt rate, it might be wise to exclude the interrupting CPU from the map since that already performs much work. For a multi-queue system, if RSS is configured so that a hardware receive queue is mapped to each CPU, then RPS is probably redundant and unnecessary. If there are fewer hardware queues than CPUs, then RPS might be beneficial if the rps_cpus for each queue are the ones that share the same memory domain as the interrupting CPU for that queue.
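The rps_cpus value mentioned in the quote is a hexadecimal CPU bitmask. The sketch below shows one way to build such a mask from a list of CPU indexes, assuming an 8-CPU machine where CPU0 is the interrupting CPU and is excluded, as the quote suggests for high interrupt rates. The function name, the CPU layout, and the eth0 device name in the comment are assumptions for illustration.

```python
def cpus_to_mask(cpus):
    """Return a hex bitmask string with bit N set for every CPU index N.

    Masks wider than 32 bits are written as comma-separated 32-bit groups,
    which is the form the kernel uses for CPU masks.
    """
    value = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError(f"invalid CPU index: {cpu}")
        value |= 1 << cpu
    digits = format(value, "x")
    groups = []
    while digits:
        groups.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(groups))


ALL_CPUS = range(8)      # assumption: 8 CPUs in one memory domain
INTERRUPTING_CPU = 0     # assumption: the NIC interrupt lands on CPU0

rps_cpus = [cpu for cpu in ALL_CPUS if cpu != INTERRUPTING_CPU]
mask = cpus_to_mask(rps_cpus)
print(mask)  # fe
# As root, the mask would then be written to a receive queue, for example:
#   /sys/class/net/eth0/queues/rx-0/rps_cpus   (eth0 is a placeholder name)
```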

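        That raises two more questions: how do you tell whether a NIC is multi-queue, and how do you make sure the CPUs share memory? Here is one approach. For the first question, use this command: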

        lspci -vvv | grep 'Ethernet controller'
        

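        In the full lspci -vvv output for that controller, if the MSI-X capability is present with Enable+ and TabSize > 1 (newer lspci versions print Count instead of TabSize), the NIC is multi-queue. For the second question, consider using lscpu to look up the CPU and NUMA topology and then binding the interrupts to specific physical CPUs.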
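The MSI-X check can be automated along these lines. This is only a heuristic sketch: the sample capability lines are illustrative, their exact layout is an assumption about lspci output that varies between pciutils versions, and the function names are hypothetical.

```python
import re

# Matches the enable flag and the table size under either field name.
MSIX_PATTERN = re.compile(r"MSI-X:\s+Enable([+-]).*?(?:TabSize|Count)=(\d+)")


def msix_vectors(lspci_text):
    """Return the MSI-X table size if MSI-X is enabled, otherwise 0."""
    match = MSIX_PATTERN.search(lspci_text)
    if match and match.group(1) == "+":
        return int(match.group(2))
    return 0


def looks_multiqueue(lspci_text):
    """Apply the rule from the text: MSI-X enabled and more than one vector."""
    return msix_vectors(lspci_text) > 1


# Illustrative capability lines (assumed formats, values made up).
OLD_STYLE = "Capabilities: [a0] MSI-X: Enable+ Mask- TabSize=10"
NEW_STYLE = "Capabilities: [70] MSI-X: Enable+ Count=10 Masked-"
DISABLED = "Capabilities: [70] MSI-X: Enable- Count=10 Masked-"

print(looks_multiqueue(OLD_STYLE), looks_multiqueue(NEW_STYLE), looks_multiqueue(DISABLED))
```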

      Irqbalance


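        The man page describes it as: distribute hardware interrupts across processors on a multiprocessor system. It has had quite a few problems on SMP systems; see Ubuntu's bug tracker. Chu Ba (褚霸) has also published a detailed analysis of its source code, which is worth reading if you are interested.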

       SMP IRQ Affinity


        Finally, let's look at SMP IRQ Affinity, which was added in kernel 2.4:

        An interrupt request (IRQ) is a request for service, sent at the hardware level. Interrupts can be sent by either a dedicated hardware line, or across a hardware bus as an information packet (a Message Signaled Interrupt, or MSI). When interrupts are enabled, receipt of an IRQ prompts a switch to interrupt context. Kernel interrupt dispatch code retrieves the IRQ number and its associated list of registered Interrupt Service Routines (ISRs), and calls each ISR in turn. The ISR acknowledges the interrupt and ignores redundant interrupts from the same IRQ, then queues a deferred handler to finish processing the interrupt and stop the ISR from ignoring future interrupts.

       /proc/interrupts lists the IRQ number, the number of that interrupt handled by each CPU core, the interrupt type, and a comma-delimited list of drivers that are registered to receive that interrupt. (Refer to the proc(5) man page for further details: man 5 proc).


       /proc/irq/IRQ_NUMBER/smp_affinity holds the interrupt affinity of a given IRQ as a hexadecimal bitmask of CPUs. This property can be used to improve application performance by assigning both interrupt affinity and the application's thread affinity to one or more specific CPU cores. This allows cache line sharing between the specified interrupt and application threads.
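Since smp_affinity is a hex bitmask, it helps to decode it into a list of CPU indexes when checking a setting. The following is a small sketch with a hypothetical helper name; it accepts the comma-separated form used for masks wider than 32 bits.

```python
def mask_to_cpus(mask):
    """Decode a hex CPU bitmask such as 'ff' into a sorted list of CPU indexes."""
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


print(mask_to_cpus("ff"))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(mask_to_cpus("04"))  # [2]
```

On Linux, an application could pin itself to the same CPUs with os.sched_setaffinity(0, cpus), so that the interrupt handler and the consuming threads share caches as the quoted passage describes.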

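       How do you verify that your IRQ affinity setting works? Follow these steps:


       a. Find the NIC's IRQ number:

       cat /proc/interrupts

      b. Check the CPU affinity of that IRQ (42 in this example):

       sudo cat /proc/irq/42/smp_affinity

      c. Change the binding (ff allows CPU0 through CPU7). The write needs root, so pipe through sudo tee; a plain shell redirection after sudo echo would run without privileges:

       echo ff | sudo tee /proc/irq/42/smp_affinity

      d. Generate network traffic (flood ping requires root):

       sudo ping -f www.creative.com

      e. Check how the interrupts are now distributed:

       cat /proc/interrupts | grep  'CPU\|42:'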
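The final check is easier to read as a difference between two snapshots of /proc/interrupts, one taken before generating traffic and one after. The sketch below uses made-up snapshots for IRQ 42 on a 4-CPU machine bound to CPU2 (mask 4); the counts, the eth0 label, and the function names are assumptions for illustration.

```python
# Two illustrative /proc/interrupts snapshots; all counts are made up.
BEFORE = """\
           CPU0       CPU1       CPU2       CPU3
 42:       1000          0          0          0   PCI-MSI-edge      eth0
"""
AFTER = """\
           CPU0       CPU1       CPU2       CPU3
 42:       1000          0        520          0   PCI-MSI-edge      eth0
"""


def irq_counts(text, irq):
    """Return the per-CPU counts for one IRQ from /proc/interrupts-style text."""
    lines = text.splitlines()
    n_cpus = len(lines[0].split())
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == f"{irq}:":
            return [int(value) for value in fields[1:1 + n_cpus]]
    raise KeyError(irq)


def irq_delta(before, after, irq):
    """Return how many interrupts each CPU handled between two snapshots."""
    old, new = irq_counts(before, irq), irq_counts(after, irq)
    return [b - a for a, b in zip(old, new)]


print(irq_delta(BEFORE, AFTER, 42))  # [0, 0, 520, 0]
```

If the binding took effect, only the CPUs allowed by the mask should show a non-zero delta.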

Summary:

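       On my multi-queue NIC, I set the SMP IRQ affinity manually and ruled out interference from the other two mechanisms (RPS/RFS and irqbalance), which resolved the performance problem described at the start. The quantitative relationship mentioned earlier still needs further digging; if I find out more, I will share it.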


References:

1. http://en.wikipedia.org/wiki/Interrupt
2. http://en.wikipedia.org/wiki/Context_switch
3. http://wenku.baidu.com/view/315d2c8571fe910ef12df838.html
4. https://www.kernel.org/doc/Documentation/networking/scaling.txt
5. http://kernelnewbies.org/Linux_2_6_35
6. https://cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt
7. http://www.linfo.org/context_switch.html
8. http://lwn.net/Articles/328339/
9. http://lwn.net/Articles/398385/
10. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/network-rps.html
11. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html (smp_affinity)
12. http://www.softpanorama.org/Admin/Monitoring/Sar/linux_implementation_of_sar.shtml
13. http://choices.cs.uiuc.edu/ExpCS07.pdf
14. http://sebastien.godard.pagesperso-orange.fr/
15. http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html


