False Sharing-阿里云开发者社区

开发者社区> 行者武松> 正文

False Sharing


Memory is stored within the cache system in units know as cache lines.  Cache lines are a power of 2 of contiguous bytes which are typically 32-256 in size.  The most common cache line size is 64 bytes.   False sharing is a term which applies when threads unwittingly impact the performance of each other while modifying independent variables sharing the same cache line.  Write contention on cache lines is the single most limiting factor on achieving scalability for parallel threads of execution in an SMP system.  I’ve heard false sharing described as the silent performance killer because it is far from obvious when looking at code.

To achieve linear scalability with number of threads, we must ensure no two threads write to the same variable or cache line.  Two threads writing to the same variable can be tracked down at a code level.   To be able to know if independent variables share the same cache line we need to know the memory layout, or we can get a tool to tell us.  Intel VTune is such a profiling tool.  In this article I’ll explain how memory is laid out for Java objects and how we can pad out our cache lines to avoid false sharing.

Figure 1.

Figure 1. above illustrates the issue of false sharing.  A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y.  Unfortunately these two hot variables reside in the same cache line.  Each thread will race for ownership of the cache line so they can update it.  If core 1 gets ownership then the cache sub-system will need to invalidate the corresponding cache line for core 2.  When Core 2 gets ownership and performs its update, then core 1 will be told to invalidate its copy of the cache line.  This will ping pong back and forth via the L3 cache greatly impacting performance.  The issue would be further exacerbated if competing cores are on different sockets and additionally have to cross the socket interconnect.Java Memory Layout

For the Hotspot JVM, all objects have a 2-word header.  First is the “mark” word which is made up of 24-bits for the hash code and 8-bits for flags such as the lock state, or it can be swapped for lock objects.  The second is a reference to the class of the object.  Arrays have an additional word for the size of the array.  Every object is aligned to an 8-byte granularity boundary for performance.  Therefore to be efficient when packing, the object fields are re-ordered from declaration order to the following order based on size in bytes:

  1. doubles (8) and longs (8)
  2. ints (4) and floats (4)
  3. shorts (2) and chars (2)
  4. booleans (1) and bytes (1)
  5. references (4/8)
  6. <repeat for sub-class fields>

With this knowledge we can pad a cache line between any fields with 7 longs.  Within the Disruptor we pad cache lines around the RingBuffer cursor and BatchEventProcessor sequences.To show the performance impact let’s take a few threads each updating their own independent counters.  These counters will be volatile longs so the world can see their progress.

public final class FalseSharing
    implements Runnable
    public final static int NUM_THREADS = 4; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
        for (int i = 0; i < longs.length; i++)
            longs[i] = new VolatileLong();

    public FalseSharing(final int arrayIndex)
        this.arrayIndex = arrayIndex;

    public static void main(final String[] args) throws Exception
        final long start = System.nanoTime();
        System.out.println("duration = " + (System.nanoTime() - start));

    private static void runTest() throws InterruptedException
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
            threads[i] = new Thread(new FalseSharing(i));

        for (Thread t : threads)

        for (Thread t : threads)

    public void run()
        long i = ITERATIONS + 1;
        while (0 != --i)
            longs[arrayIndex].value = i;

    public final static class VolatileLong
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // comment out


Running the above code while ramping the number of threads and adding/removing the cache line padding,  I get the results depicted in Figure 2. below.  This is measuring the duration of test runs on my 4-core Nehalem.

Figure 2.

The impact of false sharing can clearly be seen by the increased execution time required to complete the test.  Without the cache line contention we achieve near linear scale up with threads.

This is not a perfect test because we cannot be sure where the VolatileLongs will be laid out in memory.  They are independent objects.  However experience shows that objects allocated at the same time tend to be co-located.

So there you have it.  False sharing can be a silent performance killer.

Note: Please read my further adventures with false sharing in this follow on blog.

文章转自 并发编程网-ifeve.com


6911 0
7757 0
windows server 2008阿里云ECS服务器安全设置
最近我们Sinesafe安全公司在为客户使用阿里云ecs服务器做安全的过程中,发现服务器基础安全性都没有做。为了为站长们提供更加有效的安全基础解决方案,我们Sinesafe将对阿里云服务器win2008 系统进行基础安全部署实战过程! 比较重要的几部分 1.
5457 0
虽然0.0.0.0/0使用非常方便,但是发现很多同学使用它来做内网互通,这是有安全风险的,实例有可能会在经典网络被内网IP访问到。下面介绍一下四种安全的内网互联设置方法。 购买前请先:领取阿里云幸运券,有很多优惠,可到下文中领取。
9425 0
腾讯云服务器 设置ngxin + fastdfs +tomcat 开机自启动
在tomcat中新建一个可以启动的 .sh 脚本文件 /usr/local/tomcat7/bin/ export JAVA_HOME=/usr/local/java/jdk7 export PATH=$JAVA_HOME/bin/:$PATH export CLASSPATH=.
2140 0
购买阿里云ECS云服务器后如何登录?场景不同,云吞铺子总结大概有三种登录方式: 登录到ECS云服务器控制台 在ECS云服务器控制台用户可以更改密码、更换系统盘、创建快照、配置安全组等操作如何登录ECS云服务器控制台? 1、先登录到阿里云ECS服务器控制台 2、点击顶部的“控制台” 3、通过左侧栏,切换到“云服务器ECS”即可,如下图所示 通过ECS控制台的远程连接来登录到云服务器 阿里云ECS云服务器自带远程连接功能,使用该功能可以登录到云服务器,简单且方便,如下图:点击“远程连接”,第一次连接会自动生成6位数字密码,输入密码即可登录到云服务器上。
16826 0