Further Adventures With CAS Instructions And Micro Benchmarking

简介:

原文地址:http://mechanical-sympathy.blogspot.com/2013/01/further-adventures-with-cas.html

In a previous article I reported what appeared to be a performance issue with CAS/LOCK instructions on the Sandy Bridge microarchitecture compared to the previous Nehalem microarchitecture.  Since then I’ve worked with the good people of Intel to understand what was going on and I’m now pleased to be able to shine some light on the previous results.

I observed a small drop in throughput with the uncontended single-thread case, and an order-of-magnitude decrease in throughput once multiple threads contend when performing updates.  This testing spawned out of observations testing Java Queue implementations and the Disruptor for the multi-producer case.  I was initially puzzled by these findings because almost every other performance test I applied to Sandy Bridge indicated a major step forward for this microarchitecture.

After digging deeper into this issue it has come to light that my tests have once again fallen fowl of the difficulties in micro-benchmarking.  My test is not a good means of testing throughput and it is actually testing fairness in a roundabout manner.  Let’s revisit the code and work through what is going on.

Test Code


001 #include <time.h>
002 #include <pthread.h>
003 #include <stdlib.h>
004 #include <iostream>
005   
006 typedef unsigned long long uint64;
007   
008 const uint64 COUNT = 500 * 1000 * 1000;
009   
010 volatile uint64 counter = 0;
011   
012 void* run_add(void* numThreads)
013 {
014     register uint64 value = (COUNT / *((int*)numThreads)) + 1;
015   
016     while (--value != 0)
017     {
018         __sync_add_and_fetch(&counter, 1);
019     }
020 }
021   
022 void* run_xadd(void*)
023 {
024     register uint64 value = counter;
025   
026     while (value < COUNT)
027     {
028         value = __sync_add_and_fetch(&counter, 1);
029     }
030 }
031   
032 void* run_cas(void*)
033 {
034     register uint64 value = 0;
035   
036     while (value < COUNT)
037     {
038         do
039         {
040             value = counter;
041         }
042         while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
043     }
044 }
045   
046 void* run_cas2(void*)
047 {
048     register uint64 value = 0;
049     register uint64 next = 0;
050   
051     while (value < COUNT)
052     {
053         value = counter;
054         do
055         {
056             next = value + 1;
057             value = __sync_val_compare_and_swap(&counter, value, next);
058         }
059         while (value != next);
060     }
061 }
062   
063 int main (int argc, char *argv[])
064 {
065     const int NUM_THREADS = atoi(argv[1]);
066     const int TESTCASE = atoi(argv[2]);
067   
068     pthread_t threads[NUM_THREADS];
069     void* status;
070   
071     timespec ts_start;
072     timespec ts_finish;
073     clock_gettime(CLOCK_MONOTONIC, &ts_start);
074   
075   
076     for (int i = 0; i < NUM_THREADS; i++)
077     {
078         switch (TESTCASE)
079         {
080             case 1:
081                 std::cout << "LOCK ADD" << std::endl;
082                 pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
083                 break;
084   
085             case 2:
086                 std::cout << "LOCK XADD" << std::endl;
087                 pthread_create(&threads[i], NULL, run_xadd, (void*)&NUM_THREADS);
088                 break;
089   
090             case 3:
091                 std::cout << "LOCK CMPXCHG BOOL" << std::endl;
092                 pthread_create(&threads[i], NULL, run_cas, (void*)&NUM_THREADS);
093                 break;
094   
095             case 4:
096                 std::cout << "LOCK CMPXCHG VAL" << std::endl;
097                 pthread_create(&threads[i], NULL, run_cas2, (void*)&NUM_THREADS);
098                 break;
099   
100             default:
101                 exit(1);
102         }
103     }
104   
105     for (int i = 0; i < NUM_THREADS; i++)
106     {
107         pthread_join(threads[i], &status);
108     }
109   
110     clock_gettime(CLOCK_MONOTONIC, &ts_finish);
111   
112     uint64 start = (ts_start.tv_sec * 1000000000) + ts_start.tv_nsec;
113     uint64 finish = (ts_finish.tv_sec * 1000000000) + ts_finish.tv_nsec;
114     uint64 duration = finish - start;
115   
116     std::cout << "threads = " << NUM_THREADS << std::endl;
117     std::cout << "duration = " <<  duration << std::endl;
118     std::cout << "ns per op = " << (duration / (COUNT * 2)) << std::endl;
119     std::cout << "op/sec = " << ((COUNT * 2 * 1000 * 1000 * 1000) / duration) << std::endl;
120     std::cout << "counter = " << counter << std::endl;
121   
122     return 0;
123 }

The code above makes it possible to test the major CAS based techniques on x86. For full clarity an objdump -d of the binary reveals the compiler generated assembly instructions for the above methods. The “lock” instruction in each section is where the atomic update is happening.


0000000000400dc0 <_z8run_cas2pv>:
  400dc0: 48 8b 05 d9 07 20 00  mov    0x2007d9(%rip),%rax      
  400dc7: 66 0f 1f 84 00 00 00  nopw   0x0(%rax,%rax,1)
  400dce: 00 00 
  400dd0: 48 8d 50 01           lea    0x1(%rax),%rdx
  400dd4: f0 48 0f b1 15 c3 07  lock cmpxchg %rdx,0x2007c3(%rip)
  400ddb: 20 00 
  400ddd: 48 39 c2              cmp    %rax,%rdx
  400de0: 75 ee                 jne    400dd0 <_z8run_cas2pv>
  400de2: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
  400de8: 76 d6                 jbe    400dc0 <_z8run_cas2pv>
  400dea: f3 c3                 repz retq 
  400dec: 0f 1f 40 00           nopl   0x0(%rax)

0000000000400df0 <_z7run_caspv>:
  400df0: 48 8b 15 a9 07 20 00  mov    0x2007a9(%rip),%rdx     
  400df7: 48 8d 4a 01           lea    0x1(%rdx),%rcx
  400dfb: 48 89 d0              mov    %rdx,%rax
  400dfe: f0 48 0f b1 0d 99 07  lock cmpxchg %rcx,0x200799(%rip)  
  400e05: 20 00 
  400e07: 75 e7                 jne    400df0 <_z7run_caspv>
  400e09: 48 81 fa ff 64 cd 1d  cmp    $0x1dcd64ff,%rdx
  400e10: 76 de                 jbe    400df0 <_z7run_caspv>
  400e12: f3 c3                 repz retq 
  400e14: 66 66 66 2e 0f 1f 84  data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400e1b: 00 00 00 00 00 

0000000000400e20 <_z8run_xaddpv>:
  400e20: 48 8b 05 79 07 20 00  mov    0x200779(%rip),%rax    
  400e27: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
  400e2d: 77 1b                 ja     400e4a <_z8run_xaddpv>
  400e2f: 90                    nop
  400e30: b8 01 00 00 00        mov    $0x1,%eax
  400e35: f0 48 0f c1 05 62 07  lock xadd %rax,0x200762(%rip) 
  400e3c: 20 00 
  400e3e: 48 83 c0 01           add    $0x1,%rax
  400e42: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
  400e48: 76 e6                 jbe    400e30 <_z8run_xaddp>
  400e4a: f3 c3                 repz retq 
  400e4c: 0f 1f 40 00           nopl   0x0(%rax)

0000000000400e50 <_z7run_addpv>:
  400e50: 48 63 0f              movslq (%rdi),%rcx
  400e53: 31 d2                 xor    %edx,%edx
  400e55: b8 00 65 cd 1d        mov    $0x1dcd6500,%eax
  400e5a: 48 f7 f1              div    %rcx
  400e5d: 48 85 c0              test   %rax,%rax
  400e60: 74 15                 je     400e77 <_z7run_addpv>
  400e62: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
  400e68: f0 48 83 05 2f 07 20  lock addq $0x1,0x20072f(%rip)    
  400e6f: 00 01 
  400e71: 48 83 e8 01           sub    $0x1,%rax
  400e75: 75 f1                 jne    400e68 <_z7run_addpv>
  400e77: f3 c3                 repz retq 
  400e79: 90                    nop
  400e7a: 90                    nop
  400e7b: 90                    nop
  400e7c: 90                    nop
  400e7d: 90                    nop
  400e7e: 90                    nop
  400e7f: 90                    nop


To purely isolate the performance of the CAS operation the test should be run using the lock xadd option for an atomic increment in hardware.  This instruction lets us avoid the spin-retry loop of a pure software CAS that can dirty the experiment.

I repeated the experiment from the previous article and got very similar results.  Previously, I thought I’d observed a throughput drop even in the uncontended single-threaded case.  So I focused in on this to confirm.  To do this I had to find two processors that once Turbo Boost had kicked in then the clock speeds would be comparable.  I found this by using a 2.8GHz Nehalem and 2.4GHz Sandy Bridge.  For the single-threaded case they are both operating at ~3.4GHz.


Nehalem 2.8GHz
==============
$ perf stat ./atomic_inc 1 2
LOCK XADD
threads = 1
duration = 3090445546
ns per op = 3
op/sec = 323577938

 Performance counter stats for './atomic_inc 1 2':

       3085.466216 task-clock                #    0.997 CPUs utilized          
               331 context-switches          #    0.107 K/sec                  
                 4 CPU-migrations            #    0.001 K/sec                  
               360 page-faults               #    0.117 K/sec                  
    10,527,264,923 cycles                    #    3.412 GHz                 
     9,394,575,677 stalled-cycles-frontend   #   89.24% frontend cycles idle
     7,423,070,202 stalled-cycles-backend    #   70.51% backend  cycles idle 
     2,517,668,293 instructions              #    0.24  insns per cycle        
                                             #    3.73  stalled cycles per insn
       503,526,119 branches                  #  163.193 M/sec                  
           110,695 branch-misses             #    0.02% of all branches       

       3.093402966 seconds time elapsed

Sandy Bridge 2.4GHz
===================
$ perf stat ./atomic_inc 1 2
LOCK XADD
threads = 1
duration = 3394221940
ns per op = 3
op/sec = 294618330

 Performance counter stats for './atomic_inc 1 2':

       3390.404400 task-clock                #    0.998 CPUs utilized          
               357 context-switches          #    0.105 K/sec                  
                 1 CPU-migrations            #    0.000 K/sec                  
               358 page-faults               #    0.106 K/sec                  
    11,522,932,068 cycles                    #    3.399 GHz                 
     9,542,667,311 stalled-cycles-frontend   #   82.81% frontend cycles idle  
     6,721,330,874 stalled-cycles-backend    #   58.33% backend  cycles idle  
     2,518,638,461 instructions              #    0.22  insns per cycle        
                                             #    3.79  stalled cycles per insn
       502,490,710 branches                  #  148.210 M/sec                  
            36,955 branch-misses             #    0.01% of all branches        

       3.398206155 seconds time elapsed


Analysis

So repeating the tests with comparable clock speeds confirmed the previous results.  The single-threaded case shows a ~10% drop in throughput and the multi-threaded contended case displays an order-of-magnitude difference in throughput.

Now the big question is what is going on and why has throughput dropped?  Well the single-threaded case suggests nothing major has happened to number of cycles required to execute the instruction when uncontended.  The small differences could be attributed to system noise or the changes in the CPU front-end for Sandy Bridge with introduction of the additional load address generation unit.

For the multi-threaded case we found an interesting surprise when Intel monitored what the instructions are doing.  We found that each thread on Nehalem was able to perform more updates in a batch before loosing the exclusive state on the cacheline containing the counter.  This is because the inter-core latency has improved with Sandy Bridge so other threads are able to faster claim the cacheline containing the counter to do their own updates.  What we are actually measuring with this micro-benchmark is how long a core can hold a cacheline before it is released to another core.  Sandy Bridge is exhibiting greater fairness which is what you’d want in a real world application.

This micro-benchmark is very unrealistic for a real world application.  Normally between performing counter updates a core would be doing a lot of other work.  At the point when the counter needs to be updated the reduced latency inter-core would then be a benefit.

In all my macro application benchmarks Sandy Bridge has proved to have better performance than Nehalem at comparable clock speeds.

Conclusion

What did I learn from this?  Well once again that writing micro-benchmarks is notoriously difficult.  It is so hard to know what you are measuring and what effects can come into play.  To illustrate how difficult it is to recognise such a flaw, for all those who have read this blog, no one has identified the issue and fed this back to me.

It also shows that what on first blush can be considered a performance bug is actually the opposite.  This shows how it is possible to have a second order effect when a performance improvement can make a specific work case run more slowly. 

目录
相关文章
|
10月前
|
Java API Spring
读书笔记系列 - Operating Systems: Three Easy Pieces - Virtualization - Chapter 4: Processes
读书笔记系列 - Operating Systems: Three Easy Pieces - Virtualization - Chapter 4: Processes
62 0
读书笔记系列 - Operating Systems: Three Easy Pieces - Virtualization - Chapter 4: Processes
|
10月前
|
安全 Unix Shell
读书笔记系列 - Operating Systems: Three Easy Pieces - Virtualization - Chapter 5: Process API
读书笔记系列 - Operating Systems: Three Easy Pieces - Virtualization - Chapter 5: Process API
69 0
|
11月前
|
网络协议 安全 Unix
Coroutine made DPDK development easy
This is a English translation of the article 《有了协程库,开发DPDK应用程序第一次可以这么简单》. The original link is https://developer.aliyun.com/article/1208390
|
分布式计算 Spark
《OPTIMIZING SPARK DEPLOYMENTS FOR CONTAINERS ISOLATION,SAFETY,AND PERFORMANCE》电子版地址
OPTIMIZING SPARK DEPLOYMENTS FOR CONTAINERS: ISOLATION,SAFETY,AND PERFORMANCE
59 0
《OPTIMIZING SPARK DEPLOYMENTS FOR CONTAINERS ISOLATION,SAFETY,AND PERFORMANCE》电子版地址
|
人工智能 Linux vr&ar
High-performance RISC-V Processor Xuantie C908
High-performance RISC-V Processor Xuantie C908
342 0
High-performance RISC-V Processor Xuantie C908
|
机器学习/深度学习 数据采集 人工智能
Re10:读论文 Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous gr
Re10:读论文 Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous gr
Re10:读论文 Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous gr
|
TensorFlow 算法框架/工具 Python
成功解决Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
成功解决Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
成功解决Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
|
SQL 编译器 API
Efficiently Compiling Efficient Query Plans for Modern Hardware 论文解读
这应该是SQL查询编译的一篇经典文章了,作者是著名的Thomas Neumann,主要讲解了TUM的HyPer数据库中对于CodeGen的应用。 在morsel-driven那篇paper 中,介绍了HyPer的整个执行框架,会以task为单位处理一个morsel的数据,而执行的处理逻辑(一个pipeline job)就被编译为一个函数。这篇paper则具体讲如何实现动态编译。
373 0
Efficiently Compiling Efficient Query Plans for Modern Hardware 论文解读
|
JavaScript 前端开发
Guidelines for Function Compute Development - Troubleshoot Timeout Issues
Endless codes and endless bugs When you write code, you may inadvertently introduce some hidden bugs, even if you test a large proportion of the codes to the maximum extent possible.
1598 0