起因

最近排查一个历史老问题,历史问题就是大家都知道,但是没人愿意处理。
线上节点 16 和 32 G 内存频繁的出现GC,并且由于 GC 还导至两个问题,1.CPU使用率在GC时变高,2.业务出现异常。
这个问题变得无法忍耐了,跟踪程序线程使用,CMS GC 线程CPU 使用在初始化时都 11-99% 之间,严重影响业务处理。
通过 jstat 观察,FullGC 不多,但是 YoungGC 非常频繁,YoungGC 多不影响服务性能吗?也影响。
计算一个服务GC的好坏,不是看 FullGC 多少,而是看整体服务的吞吐量。

1
2
3
4
5
6
7
8
9
10
11
12
13
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
68096.0 68096.0 16840.1 0.0 545344.0 115481.1 6658496.0 4835834.6 63280.0 60712.9 7084.0 6615.1 1930261 53981.101 4804 1248.371 55229.471
68096.0 68096.0 0.0 17347.7 545344.0 12072.1 6658496.0 4836565.4 63280.0 60712.9 7084.0 6615.1 1930262 53981.125 4804 1248.371 55229.496
68096.0 68096.0 0.0 12854.1 545344.0 252246.5 6658496.0 4838818.8 63280.0 60712.9 7084.0 6615.1 1930264 53981.168 4804 1248.371 55229.539
68096.0 68096.0 19501.0 0.0 545344.0 187933.5 6658496.0 4838869.4 63280.0 60712.9 7084.0 6615.1 1930265 53981.189 4804 1248.371 55229.559
68096.0 68096.0 19501.0 0.0 545344.0 445775.5 6658496.0 4838869.4 63280.0 60712.9 7084.0 6615.1 1930265 53981.189 4804 1248.371 55229.559
68096.0 68096.0 11858.3 0.0 545344.0 515043.9 6658496.0 4845614.9 63280.0 60712.9 7084.0 6615.1 1930267 53981.239 4804 1248.371 55229.610
68096.0 68096.0 0.0 16469.0 545344.0 361227.7 6658496.0 4846356.5 63280.0 60712.9 7084.0 6615.1 1930268 53981.263 4804 1248.371 55229.634
68096.0 68096.0 0.0 20075.6 545344.0 226522.4 6658496.0 4847100.6 63280.0 60712.9 7084.0 6615.1 1930270 53981.316 4804 1248.371 55229.687
68096.0 68096.0 19591.2 0.0 545344.0 325872.6 6658496.0 4852312.8 63280.0 60712.9 7084.0 6615.1 1930271 53981.344 4804 1248.371 55229.715
68096.0 68096.0 0.0 19282.2 545344.0 206195.9 6658496.0 4853497.6 63280.0 60712.9 7084.0 6615.1 1930272 53981.368 4804 1248.371 55229.739
68096.0 68096.0 0.0 16319.6 545344.0 520305.2 6658496.0 4858184.1 63280.0 60712.9 7084.0 6615.1 1930274 53981.414 4804 1248.371 55229.785
68096.0 68096.0 19240.3 0.0 545344.0 349365.5 6658496.0 4859016.3 63280.0 60712.9 7084.0 6615.1 1930275 53981.445 4804 1248.371 55229.815

计算吞吐量

排查

先排除环境原因,从开发环境到线上环境发现情况是相同的,可以初步认定是一个普遍问题。

定位 16G 内存服务情况

先从 16G 内存开始,收集一些基本信息,再观察运行情况。

1
2
3
4
5
6
7
8
9
10
11
java -Xms9g -Xmx9g
-XX:+UseConcMarkSweepGC
-XX:+PrintGCDetails -Xloggc:./gc.log
-XX:+PrintGCDateStamps
-XX:+CMSParallelRemarkEnabled
-XX:ReservedCodeCacheSize=256m
-XX:+CMSScavengeBeforeRemark
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxDirectMemorySize=1g
-jar Node.jar
-c config.conf

从这上面能看到关于 CMS 关相的几个参数。

1
2
3
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark

内存使用情况

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 9663676416 (9216.0MB)
NewSize = 697892864 (665.5625MB)
MaxNewSize = 697892864 (665.5625MB)
OldSize = 8965783552 (8550.4375MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 628162560 (599.0625MB)
used = 511774816 (488.0664978027344MB)
free = 116387744 (110.99600219726562MB)
81.47171585648148% used
Eden Space:
capacity = 558432256 (532.5625MB)
used = 487164440 (464.5962142944336MB)
free = 71267816 (67.9662857055664MB)
87.23787617311275% used
From Space:
capacity = 69730304 (66.5MB)
used = 24610376 (23.47028350830078MB)
free = 45119928 (43.02971649169922MB)
35.29365941097862% used
To Space:
capacity = 69730304 (66.5MB)
used = 0 (0.0MB)
free = 69730304 (66.5MB)
0.0% used
concurrent mark-sweep generation:
capacity = 8965783552 (8550.4375MB)
used = 4737287088 (4517.829025268555MB)
free = 4228496464 (4032.6084747314453MB)
52.837401890471156% used

20983 interned Strings occupying 2039024 bytes.

这里看出一个问题

1
2
3
4
5
6
7
8
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 9663676416 (9216.0MB)
NewSize = 697892864 (665.5625MB)
MaxNewSize = 697892864 (665.5625MB)
OldSize = 8965783552 (8550.4375MB)
NewRatio = 2

NewRatio=2 但是实际新生代获得的内存空间只有 665.5625MB,这就有点奇怪了,不是说好的老年代:新生代 2 : 1 吗。
这是一个问题,另外从当前的 GC 回收时间推算出,当前吞吐量为: 96%,还没有达到最优的情况。

修复新生代比例大小

添加启动参数 -XX:NewRatio=2,完整启动命令:

1
2
3
4
5
6
7
8
9
10
11
12
java -Xms9g -Xmx9g
-XX:NewRatio=2
-XX:+UseConcMarkSweepGC
-XX:+PrintGCDetails -Xloggc:./gc.log
-XX:+PrintGCDateStamps
-XX:+CMSParallelRemarkEnabled
-XX:ReservedCodeCacheSize=256m
-XX:+CMSScavengeBeforeRemark
-XX:+HeapDumpOnOutOfMemoryError
-XX:MaxDirectMemorySize=1g
-jar Node.jar
-c config.conf

通过 jstat -gc pid 1000 观察一下 GC 情况
添加 -XX:NewRatio=2 启动参数的机器情况

jmap -heap 查看一下调整后的新生代占比,现在 NewSize=3072.0MB OldSize=6144.0MB
这时比例生效了。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 9663676416 (9216.0MB)
NewSize = 3221225472 (3072.0MB)
MaxNewSize = 3221225472 (3072.0MB)
OldSize = 6442450944 (6144.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 2899116032 (2764.8125MB)
used = 2516795744 (2400.2034606933594MB)
free = 382320288 (364.6090393066406MB)
86.81252203154317% used
Eden Space:
capacity = 2577006592 (2457.625MB)
used = 2194686304 (2093.0159606933594MB)
free = 382320288 (364.6090393066406MB)
85.16417112836008% used
From Space:
capacity = 322109440 (307.1875MB)
used = 322109440 (307.1875MB)
free = 0 (0.0MB)
100.0% used
To Space:
capacity = 322109440 (307.1875MB)
used = 0 (0.0MB)
free = 322109440 (307.1875MB)
0.0% used
concurrent mark-sweep generation:
capacity = 6442450944 (6144.0MB)
used = 525052336 (500.7289276123047MB)
free = 5917398608 (5643.271072387695MB)
8.149884889523188% used

13727 interned Strings occupying 1224528 bytes.
1
2
3
4
5
6
7
8
9
10
11
12
S0C      S1C      S0U    S1U     EC        EU        OC         OU         MC      MU      CCSC   CCSU     YGC   YGCT     FGC    FGCT    GCT
314560.0 314560.0 0.0 85539.9 2516608.0 1975197.4 6291456.0 4613067.7 63012.0 59752.3 7024.0 6520.3 9317 494.820 222 22.566 517.386
314560.0 314560.0 0.0 85539.9 2516608.0 2228635.2 6291456.0 4613067.7 63012.0 59752.3 7024.0 6520.3 9317 494.820 222 22.566 517.386
314560.0 314560.0 87689.1 0.0 2516608.0 681984.2 6291456.0 4621812.0 63012.0 59752.3 7024.0 6520.3 9318 494.875 222 22.566 517.441
314560.0 314560.0 87689.1 0.0 2516608.0 1965292.1 6291456.0 4621812.0 63012.0 59752.3 7024.0 6520.3 9318 494.875 222 22.566 517.441
314560.0 314560.0 0.0 95789.9 2516608.0 145469.7 6291456.0 4629480.3 63012.0 59752.3 7024.0 6520.3 9319 494.928 222 22.566 517.494
314560.0 314560.0 0.0 95789.9 2516608.0 1741481.8 6291456.0 4629480.3 63012.0 59752.3 7024.0 6520.3 9319 494.928 222 22.566 517.494
314560.0 314560.0 0.0 95789.9 2516608.0 1978876.4 6291456.0 4629480.3 63012.0 59752.3 7024.0 6520.3 9319 494.928 222 22.566 517.494
314560.0 314560.0 74735.8 0.0 2516608.0 306828.0 6291456.0 4638677.5 63012.0 59752.3 7024.0 6520.3 9320 494.982 222 22.566 517.549
314560.0 314560.0 74735.8 0.0 2516608.0 1513196.3 6291456.0 4638677.5 63012.0 59752.3 7024.0 6520.3 9320 494.982 222 22.566 517.549
314560.0 314560.0 74735.8 0.0 2516608.0 1941220.4 6291456.0 4638677.5 63012.0 59752.3 7024.0 6520.3 9320 494.982 222 22.566 517.549
314560.0 314560.0 74735.8 0.0 2516608.0 2415940.5 6291456.0 4638677.5 63012.0 59752.3 7024.0 6520.3 9320 494.982 222 22.566 517.549

对比存在问题的情况

1
2
3
4
5
6
7
8
9
10
11
12
S0C     S1C     S0U    S1U     EC       EU       OC         OU         MC      MU      CCSC   CCSU    YGC    YGCT     FGC     FGCT  GCT
68096.0 68096.0 0.0 16853.2 545344.0 371165.3 8755648.0 5890791.6 62312.0 59746.5 7076.0 6608.5 43879 1312.752 60 5.206 1317.958
68096.0 68096.0 19377.9 0.0 545344.0 420827.3 8755648.0 5891528.9 62312.0 59746.5 7076.0 6608.5 43880 1312.785 60 5.206 1317.991
68096.0 68096.0 0.0 22147.2 545344.0 295435.0 8755648.0 5891577.4 62312.0 59746.5 7076.0 6608.5 43881 1312.816 60 5.206 1318.022
68096.0 68096.0 21257.5 0.0 545344.0 461020.2 8755648.0 5898299.9 62312.0 59746.5 7076.0 6608.5 43882 1312.850 60 5.206 1318.056
68096.0 68096.0 0.0 13895.9 545344.0 419534.8 8755648.0 5901746.3 62312.0 59746.5 7076.0 6608.5 43885 1312.961 60 5.206 1318.167
68096.0 68096.0 0.0 17542.7 545344.0 63902.2 8755648.0 5906856.1 62312.0 59746.5 7076.0 6608.5 43887 1313.028 60 5.206 1318.234
68096.0 68096.0 0.0 17542.7 545344.0 444878.0 8755648.0 5906856.1 62312.0 59746.5 7076.0 6608.5 43887 1313.028 60 5.206 1318.234
68096.0 68096.0 15241.8 0.0 545344.0 403475.0 8755648.0 5909706.0 62312.0 59746.5 7076.0 6608.5 43890 1313.108 60 5.206 1318.314
68096.0 68096.0 24031.2 0.0 545344.0 52019.0 8755648.0 5916571.4 62312.0 59746.5 7076.0 6608.5 43892 1313.174 60 5.206 1318.380
68096.0 68096.0 17381.4 0.0 545344.0 349457.1 8755648.0 5919082.7 62312.0 59746.5 7076.0 6608.5 43894 1313.232 60 5.206 1318.438
68096.0 68096.0 0.0 20887.7 545344.0 442091.1 8755648.0 5919151.3 62312.0 59746.5 7076.0 6608.5 43895 1313.261 60 5.206 1318.467

这两组数据能看出什么?

YGC : 新生代垃圾回收次数
YGCT : 新生代垃圾回收时间
FGC : 老年代垃圾回收次数
FGCT : 老年代垃圾回收时间
GCT : 垃圾回收总消耗时间

对比两组GC回收时间数据

  1. YGC 9320 : 43895,次数减少 4.7倍
  2. YGCT 494.982 : 1313.261, 减少 2.6倍
  3. GCT 517.549: 1318.467,减少 2.5 倍
  4. FGC 222 : 60 却增加了 3.7 倍的次数

这有点出乎意料,如果新生代空间够大之后,对象有足够的新生代空间可以分配对象,保证尽可能多的对象在新生代中被回收,熬不到老年代。