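1. The cacheline concept
The CPU does not move data between the cache and main memory one byte at a time. The smallest unit of transfer is a fixed-size block called a cacheline. For details, see the Wikipedia article: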
http://en.wikipedia.org/wiki/CPU_cache#Cache_entry_structure
2. How to check the cacheline size
The earlier post "Viewing CPU cache information" (cpu cache信息查看) showed how to check the cacheline size:
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
3. The effect of cachelines on performance
Igor Ostrovsky has an excellent article on how the CPU cache affects performance:
http://igoro.com/archive/gallery-of-processor-cache-effects/
This post tries to verify the claims in that article with the following example program:
cacheline.c
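#include <stdio.h>
#include <stdlib.h>   /* atoi */

#define BUF_SIZE 8388608
#define LOOPS 16

/* 8 MiB buffer aligned to a 64-byte boundary */
char arr[BUF_SIZE] __attribute__((__aligned__((64)),__section__(".data.cacheline_aligned")));

int main(int argc, char **argv)
{
    int step;
    int i = 0;
    int j = 0;
    int iter = 0;

    /* a missing or non-positive step would crash or loop forever */
    if (argc < 2 || (step = atoi(argv[1])) <= 0) {
        fprintf(stderr, "usage: %s <step>\n", argv[0]);
        return 1;
    }

    for (i = 0; i < LOOPS; i++) {
        /* write one byte every `step` bytes */
        for (j = 0; j < BUF_SIZE; j += step) {
            iter++;
            arr[j] = 3;
        }
    }

    printf("%d\n", iter);   /* total number of writes */
    return 0;
}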
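Compile it: gcc -O0 -o cacheline cacheline.c
Now let's look at how the cacheline affects the program's performance. arr is a char array, so step is a stride in bytes. From the definition of a cacheline, we would expect the number of cacheline loads to stay the same for any step from 1 to 64, and to fall once step grows beyond 64.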
The results:
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1
134217728
Performance counter stats for './cacheline 1':
2,352,446 L1-dcache-loads-misses # 0.35% of all L1-dcache hits
673,338,076 L1-dcache-load
1,041,209,909 cycles # 0.000 GHz
0.433421077 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 2
67108864
Performance counter stats for './cacheline 2':
2,326,564 L1-dcache-loads-misses # 0.69% of all L1-dcache hits
337,577,957 L1-dcache-load
524,684,462 cycles # 0.000 GHz
0.254773008 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 4
33554432
Performance counter stats for './cacheline 4':
2,309,318 L1-dcache-loads-misses # 1.36% of all L1-dcache hits
169,703,215 L1-dcache-load
255,623,966 cycles # 0.000 GHz
0.154640897 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 64
2097152
Performance counter stats for './cacheline 64':
2,292,510 L1-dcache-loads-misses # 18.64% of all L1-dcache hits
12,299,250 L1-dcache-load
55,040,163 cycles # 0.000 GHz
0.034769960 seconds time elapsed
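Two things stand out:
i) As step goes from 1 to 64, the number of L1 cache misses stays almost the same (about 2.3 million in every run).
ii) Execution time does not depend on cache misses alone; many other factors matter too (for example, CPU clocks). With nearly identical miss counts, cycles fall from 1,041,209,909 at step 1 to 55,040,163 at step 64.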
Increasing step further:
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 128
1048576
Performance counter stats for './cacheline 128':
1,308,532 L1-dcache-loads-misses # 18.56% of all L1-dcache hits
7,048,673 L1-dcache-load
38,773,055 cycles # 0.000 GHz
0.024586981 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1024
131072
Performance counter stats for './cacheline 1024':
442,176 L1-dcache-loads-misses # 18.21% of all L1-dcache hits
2,427,631 L1-dcache-load
17,618,913 cycles # 0.000 GHz
0.011433279 seconds time elapsed
L1 cache misses drop markedly: from 2,292,510 at step 64 to 1,308,532 at step 128 and 442,176 at step 1024.