什么情况会导致 Stop the world 时间过长?
Stop the world 阶段可以简单分为(这段时间内,JVM 都是出于所有线程进入 Safepoint 就 block 的状态):
- 某个操作,需要 Stop the world(就是上面提到的哪些情况下会让所有线程进入 SafePoint, 即发生 Stop the world 的那些操作)
- 向 Signal Dispatcher 这个 JVM 守护线程发起 Safepoint 同步信号并交给对应的模块执行。
- 对应的模块,采集所有线程信息,并对每个线程根据状态做不同的操作以及标记(根据之前源代码那一块的描述,有5种情况)
- 所有线程都进入 Safepoint 并 block。
- 做需要发起 Stop the world 的操作。
- 操作完成,所有线程从 Safepoint 恢复。
基于这些阶段,导致 Stop the world 时间过长的原因有:
- 阶段 4 耗时过长,即等待所有线程中的某些线程进入 Safepoint 的时间过长,这个很可能和有 大有界循环与JIT优化 有关,也很可能是 OpenJDK 11 引入的获取调用堆栈的类
StackWalker
的使用导致的,也可能是系统 CPU 资源问题或者是系统内存脏页过多或者发生 swap 导致的。 - 阶段 5 耗时过长,需要看看是哪些操作导致的,例如偏向锁撤销过多, GC时间过长等等,需要想办法减少这些操作消耗的时间,或者直接关闭这些事件(例如关闭偏向锁,关闭 JFR 的 OldObjectSample 事件采集)减少进入,这个和本篇内容无关,这里不赘述。
- 阶段2,阶段3耗时过长,由于 Signal Dispatcher 是单线程的,可以看看当时 Signal Dispatcher 这个线程在干什么,可能是 Signal Dispatcher 做其他操作导致的。也可能是系统 CPU 资源问题或者是系统内存脏页过多或者发生 swap 导致的。
大有界循环与 JIT 优化会给 SafePoint 带来哪些问题?
已知:只有线程执行到 Safepoint 代码才会知道Thread.intterupted()
的最新状态 ,而不是线程的本地缓存。
我们来看下面一段代码:
static int algorithm(int n) { int bestSoFar = 0; for (int i=0; i<n; ++i) { if (Thread.interrupted()) { System.out.println("broken by interrupted"); break; } //增加pow计算,增加计算量,防止循环执行不超过1s就结束了 bestSoFar = (int) Math.pow(i, 0.3); } return bestSoFar; } public static void main(String[] args) throws InterruptedException { Runnable task = () -> { Instant start = Instant.now(); int bestSoFar = algorithm(1000000000); double durationInMillis = Duration.between(start, Instant.now()).toMillis(); System.out.println("after "+durationInMillis+" ms, the result is "+bestSoFar); }; //延迟1ms之后interrupt Thread t = new Thread(task); t.start(); Thread.sleep(1); t.interrupt(); //延迟10ms之后interrupt t = new Thread(task); t.start(); Thread.sleep(10); t.interrupt(); //延迟100ms之后interrupt t = new Thread(task); t.start(); Thread.sleep(100); t.interrupt(); //延迟1s之后interrupt //这时候 algorithm 里面的for循环调用次数应该足够了,会发生代码即时编译优化并 OSR t = new Thread(task); t.start(); Thread.sleep(1000); //发现线程这次不会对 interrupt 有反应了 t.interrupt(); }
之后利用 JVM 参数 -Xlog:jit+compilation=debug:file=jit_compile%t.log:uptime,level,tags:filecount=10,filesize=100M
打印 JIT 编译日志到另一个文件,便于观察。最后控制台输出:
broken by interrupted broken by interrupted after 10.0 ms, the result is 27 after 1.0 ms, the result is 10 broken by interrupted after 99.0 ms, the result is 69 after 29114.0 ms, the result is 501
可以看出,最后一次循环直接运行结束了,并没有看到线程已经 interrupted 了。并且 JIT 编译日志可以看到,在最后一线程执行循环的时候发生了发生代码即时编译优化并 OSR:
[0.782s][debug][jit,compilation] 460 % 3 com.test.TypeTest::algorithm @ 4 (44 bytes) [0.784s][debug][jit,compilation] 468 3 com.test.TypeTest::algorithm (44 bytes) [0.794s][debug][jit,compilation] 486 % 4 com.test.TypeTest::algorithm @ 4 (44 bytes) [0.797s][debug][jit,compilation] 460 % 3 com.test.TypeTest::algorithm @ 4 (44 bytes) made not entrant [0.799s][debug][jit,compilation] 503 4 com.test.TypeTest::algorithm (44 bytes)
3 还有 4 表示编译级别,% 表示是 OSR 栈上替换方法,也就是 for 循环还在执行的时候,进行了执行代码的机器码替换。在这之后,线程就看不到线程已经 interrupted 了,这说明,** JIT 优化后的代码,for 循环里面的 Safepoint 会被拿掉**。 这样带来的问题,也显而易见了,当需要 Stop the world 的时候,所有线程都会等着这个循环执行完,因为这个线程只有执行完这个大循环,才能进入 Safepoint。
那么,如何优化呢?
第一种方式是修改代码,将 for int 的循环变成 for long 类型:
for (long i=0; i<n; ++i) { if (Thread.interrupted()) { System.out.println("broken by interrupted"); break; } //增加pow计算,增加计算量,防止循环执行不超过1s就结束了 bestSoFar = (int) Math.pow(i, 0.3); }
第二种是通过-XX:+UseCountedLoopSafepoints
参数,让 JIT 优化代码的时候,不会拿掉有界循环里面的 SafePoint
用这两种方式其中一种之后的控制台输出:
broken by interrupted broken by interrupted after 0.0 ms, the result is 0 after 10.0 ms, the result is 29 broken by interrupted after 100.0 ms, the result is 73 broken by interrupted after 998.0 ms, the result is 170
如何通过日志分析 SafePoint?
目前,在 OpenJDK 11 版本,主要有两种 SafePoint 相关的日志。一种基本上只在开发时使用,另一种可以在线上使用持续采集。
第一个是-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
,这个会定时采集,但是采集的时候会触发所有线程进入 Safepoint,所以,线程一般不打开(之前我们对于定时让所有线程进入 Safepoint 都要关闭,这个就更不可能打开了)。并且,在 Java 12 中已经被移除,并且接下来的日志配置基本上可以替代这个,所以这里我们就不赘述这个了。
另外是通过-Xlog:safepoint=trace:stdout:utctime,level,tags
,对于 OpenJDK 的日志配置,可以参考我的另一篇文章详细解析配置的格式,这里我们直接用。
我们这里配置了所有的 safepoint 相关的 JVM 日志都输出到控制台,一次 Stop the world 的时候,就会像下面这样输出:
[2020-07-14T07:08:26.197+0000][debug][safepoint] Safepoint synchronization initiated. (112 threads) [2020-07-14T07:08:26.197+0000][info ][safepoint] Application time: 12.4565068 seconds [2020-07-14T07:08:26.197+0000][trace][safepoint] Setting thread local yield flag for threads [2020-07-14T07:08:26.197+0000][trace][safepoint] Thread: 0x0000022c7c494b30 [0x61dc] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0 [2020-07-14T07:08:26.197+0000][trace][safepoint] Thread: 0x0000022c7c497f30 [0x4ff8] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0 ......省略一些处于 _at_poll_safepoint 的线程 [2020-07-14T07:08:26.197+0000][trace][safepoint] Thread: 0x0000022c10c010b0 [0x5878] State: _call_back _has_called_back 0 _at_poll_safepoint 0 [2020-07-14T07:08:26.348+0000][trace][safepoint] Thread: 0x0000022c10bfe560 [0x5038] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0 [2020-07-14T07:08:26.197+0000][debug][safepoint] Waiting for 1 thread(s) to block [2020-07-14T07:08:29.348+0000][info ][safepoint] Entering safepoint region: G1CollectForAllocation [2020-07-14T07:08:29.350+0000][info ][safepoint] Leaving safepoint region [2020-07-14T07:08:29.350+0000][info ][safepoint] Total time for which application threads were stopped: 3.1499371 seconds, Stopping threads took: 3.1467255 seconds
首先,阶段 1 会打印日志,这个是 debug 级别的,代表要开始全局所有线程 Safepoint 了,这时候,JVM 就开始无法响应请求了,也就是 Stop the world 开始:
[2020-07-14T07:08:29.347+0000][debug][safepoint] Safepoint synchronization initiated.
阶段 2 不会打印日志,阶段 3 会打印:
[2020-07-14T07:08:26.197+0000][info ][safepoint] Application time: 12.4565068 seconds [2020-07-14T07:08:26.197+0000][trace][safepoint] Setting thread local yield flag for threads [2020-07-14T07:08:26.197+0000][trace][safepoint] Thread: 0x0000022c7c494b30 [0x61dc] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0 [2020-07-14T07:08:26.197+0000][trace][safepoint] Thread: 0x0000022c7c497f30 [0x4ff8] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0 ......省略一些处于 _at_poll_safepoint 的线程 [2020-07-14T07:08:26.197+0000][trace][safepoint] Thread: 0x0000022c10c010b0 [0x5878] State: _call_back _has_called_back 0 _at_poll_safepoint 0 [2020-07-14T07:08:26.348+0000][trace][safepoint] Thread: 0x0000022c10bfe560 [0x5038] State: _at_safepoint _has_called_back 0 _at_poll_safepoint 0 [2020-07-14T07:08:26.197+0000][debug][safepoint] Waiting for 1 thread(s) to block
Application time: 12.4565068 seconds
代表上次全局 Safepoint 与这次 Safepoint 间隔了多长时间。后面 trace 的日志表示每个线程的状态,其中没有处于 Safepoint 的只有一个:
Thread: 0x0000022c10c010b0 [0x5878] State: _call_back _has_called_back 0 _at_poll_safepoint 0
这里有详细的线程号,可以通过 jstack 知道这个线程是干啥的。 最后的Waiting for 1 thread(s) to block
也代表到底需要等待几个线程走到 Safepoint。
阶段 4 执行完,开始阶段 5 的时候,会打印:
[2020-07-14T07:08:29.348+0000][info ][safepoint] Entering safepoint region: G1CollectForAllocation
阶段 5 执行完之后,会打印:
[2020-07-14T07:08:29.350+0000][info ][safepoint] Leaving safepoint region
最后阶段 6 开始的时候,会打印:
[2020-07-14T07:08:29.350+0000][info ][safepoint] Total time for which application threads were stopped: 3.1499371 seconds, Stopping threads took: 3.1467255 seconds
Total time for which application threads were stopped
是这次阶段1到阶段6开始,一共过了多长时间,也就是 Stop the world 多长时间。后面的Stopping threads took
是这次等待线程走进 Safepoint 过了多长时间,一般除了 阶段 5 执行触发 Stop the world 以外,都是由于 等待线程走进 Safepoint 时间长。这是就要看 trace 的线程哪些没有处于 Safepoint,看他们干了什么,是否有大循环,或者是使用了StackWalker
这个类.
如何通过 JFR 分析 SafePoint?
JFR 相关的配置以及说明以及如何通过 JFR 分析 SafePoint 相关事件,可以参考我的另一个系列JFR全解系列
常见的 SafePoint 调优参数以及讲解
1. 建议关闭定时让所有线程进入 Safepoint
对于微服务高并发应用,没必要定时进入 Safepoint,所以关闭 -XX:+UnlockDiagnosticVMOptions -XX:GuaranteedSafepointInterval=0
2. 建议取消偏向锁
在高并发应用中,偏向锁并不能带来性能提升,反而因为偏向锁取消带来了很多没必要的某些线程进入Safepoint 或者 Stop the world。所以建议关闭:-XX:-UseBiasedLocking
3. 建议打开循环内添加 Safepoint 参数
防止大循环 JIT 编译导致内部 Safepoint 被优化省略,导致进入 SafePoint 时间变长:-XX:+UseCountedLoopSafepoints
4. 建议打开 debug 级别的 safepoint 日志(和第五个选一个)
debug 级别虽然看不到每次是哪些线程需要等待进入 Safepoint,但是整体每阶段耗时已经很清楚了。如果是 trace 级别,每次都能看到是那些线程,但是这样每次进入 safepoint 时间就会增加几毫秒。
-Xlog:safepoint=debug:file=safepoint.log:utctime,level,tags:filecount=50,filesize=100M
5. 建议打开 JFR 关于 safepoint 的采集(和第四个选一个)
修改,或者新建 jfc 文件:
<event name="jdk.SafepointBegin"> <setting name="enabled">true</setting> <setting name="threshold">10 ms</setting> </event> <event name="jdk.SafepointStateSynchronization"> <setting name="enabled">true</setting> <setting name="threshold">10 ms</setting> </event> <event name="jdk.SafepointWaitBlocked"> <setting name="enabled">true</setting> <setting name="threshold">10 ms</setting> </event> <event name="jdk.SafepointCleanup"> <setting name="enabled">true</setting> <setting name="threshold">10 ms</setting> </event> <event name="jdk.SafepointCleanupTask"> <setting name="enabled">true</setting> <setting name="threshold">10 ms</setting> </event> <event name="jdk.SafepointEnd"> <setting name="enabled">true</setting> <setting name="threshold">10 ms</setting> </event> <event name="jdk.ExecuteVMOperation"> <setting name="enabled">true</setting> <setting name="threshold">10 ms</setting> </event>