Recently our production ES cluster ran into a problem where heap memory filled up within seconds, GC became very frequent, and the cluster stopped responding. Going through the cluster log and the slow query log, everything we found was an after-effect of the unresponsiveness caused by the frequent GC: errors thrown once the cluster stopped responding, high-frequency GC overhead warnings, and queries that are normally fast showing up in the slow query log.
The Kibana monitoring showed that the heap filled up almost within seconds; once it was full, the lack of free memory kicked off back-to-back full GCs, CPU usage stayed high, and the cluster was essentially unable to respond. Whenever this happened, only a restart would fix it; the cluster never recovered on its own. Even though the heap was completely full, there was no OOM exception in the cluster log, and the occurrences seemed fairly random.
Since I had never seen this before, I wanted to add Prometheus monitoring to check whether any metrics looked abnormal while the cluster was misbehaving. Unexpectedly, once the monitoring was deployed the problem never occurred again.
I asked a colleague what had been changed in the meantime:

  1. The index.max_result_window setting in ES was changed from 100000 back to 10000 (a settings sketch follows this list)
  2. Some limits were added to the dynamic ES queries built inside the application
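
For reference, max_result_window is a dynamic per-index setting, so reverting it only takes an index settings update. A minimal sketch in Python using requests, assuming a node at localhost:9200 and a hypothetical index name my_index:

    import requests

    # Revert index.max_result_window to the ES default of 10000.
    # "my_index" and the host are placeholders for illustration.
    resp = requests.put(
        "http://localhost:9200/my_index/_settings",
        json={"index": {"max_result_window": 10000}},
    )
    print(resp.json())  # expect {"acknowledged": true}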

However, production is a 3-node cluster where each node has 16 cores, 32 GB of host memory and a 16 GB ES heap, so even fairly complex (frankly outrageous) dynamic queries perform well under normal load, including paginated queries with 10+ wildcard clauses plus sorting whose total match count is only a few thousand documents. So the dynamic queries themselves do not look like the cause.
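
For context, the dynamic queries described above are roughly the shape of the following sketch: a bool query combining many wildcard clauses, plus a sort and from/size pagination. The index name and all field names here are made up for illustration.

    import requests

    # Illustrative only: a paginated query with several wildcard clauses and a
    # sort, similar in shape to the application's dynamic queries.
    # The index name and all field names are hypothetical.
    query = {
        "query": {
            "bool": {
                "must": [
                    {"wildcard": {"customer_name": "*zhang*"}},
                    {"wildcard": {"order_no": "*2022*"}},
                    # ... the real queries can contain 10+ wildcard clauses
                ]
            }
        },
        "sort": [{"created_at": {"order": "desc"}}],
        "from": 0,
        "size": 20,
    }
    resp = requests.post("http://localhost:9200/my_index/_search", json=query)
    print(len(resp.json()["hits"]["hits"]))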
Could it instead be that a single query pulled far too much data into the heap at once, and because the query never finished, the memory could neither be released nor could the cluster respond, leading to the full heap and the frequent GC?
I tried to reproduce the scenario on a development cluster of 2 nodes, each with 8 cores, 16 GB of host memory and a 4 GB ES heap, which has roughly 2 GB of heap free when idle. Because individual documents in the development environment are small, fetching about 1,000,000 documents (300+ MB of stored data) in a single request reproduces the symptoms: the heap fills up within seconds, GC runs constantly, and the cluster stops responding. One difference is that the local cluster writes the offending query body and the related fetch-phase entries to the slow query log, while production does not; I do not know why yet, so I cannot be certain this is really the same problem.
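
The reproduction itself is just a single search request asking for an enormous page, something like the sketch below (index name and host are placeholders; the request only passes validation if the index's max_result_window has been raised at least that high):

    import requests

    # Reproduction sketch: pull roughly 1,000,000 documents in one request.
    # Only accepted if index.max_result_window >= 1000000;
    # "my_index" and the host are placeholders.
    resp = requests.post(
        "http://localhost:9200/my_index/_search",
        json={"query": {"match_all": {}}, "size": 1000000},
    )
    print(len(resp.json()["hits"]["hits"]))

On the 4 GB dev heap a request like this was enough to reproduce the symptoms described above.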

Below is an excerpt of the high-frequency GC entries from the cluster log:

[2022-07-04T11:12:17,141][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][431][8] duration [6.9s], collections [1]/[7.8s], total [6.9s]/[25.8s], memory [3.5gb]->[3.7gb]/[3.9gb], all_pools {[young] [1mb]->[79.7mb]/[266.2mb]}{[survivor] [994.2kb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:17,141][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][431] overhead, spent [7s] collecting in the last [7.8s]
[2022-07-04T11:12:23,186][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][432][9] duration [5.9s], collections [1]/[6s], total [5.9s]/[31.7s], memory [3.7gb]->[3.6gb]/[3.9gb], all_pools {[young] [79.7mb]->[34.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:23,187][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][432] overhead, spent [5.9s] collecting in the last [6s]
[2022-07-04T11:12:31,575][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][433][10] duration [8.3s], collections [1]/[8.3s], total [8.3s]/[40.1s], memory [3.6gb]->[3.6gb]/[3.9gb], all_pools {[young] [34.4mb]->[13.5mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:31,575][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][433] overhead, spent [8.3s] collecting in the last [8.3s]
[2022-07-04T11:12:37,633][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][434][11] duration [5.9s], collections [1]/[6s], total [5.9s]/[46s], memory [3.6gb]->[3.6gb]/[3.9gb], all_pools {[young] [13.5mb]->[21mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:37,633][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][434] overhead, spent [5.9s] collecting in the last [6s]
[2022-07-04T11:12:45,879][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][435][12] duration [8.1s], collections [1]/[8.2s], total [8.1s]/[54.2s], memory [3.6gb]->[3.7gb]/[3.9gb], all_pools {[young] [21mb]->[36.5mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:45,879][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][435] overhead, spent [8.1s] collecting in the last [8.2s]
[2022-07-04T11:12:51,901][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][436][13] duration [5.9s], collections [1]/[6s], total [5.9s]/[1m], memory [3.7gb]->[3.6gb]/[3.9gb], all_pools {[young] [36.5mb]->[5.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:51,901][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][436] overhead, spent [5.9s] collecting in the last [6s]
[2022-07-04T11:12:51,911][WARN ][o.e.c.InternalClusterInfoService] [CqSCISw] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2022-07-04T11:13:00,316][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][437][14] duration [8.3s], collections [1]/[8.4s], total [8.3s]/[1.1m], memory [3.6gb]->[3.6gb]/[3.9gb], all_pools {[young] [5.8mb]->[10.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:13:00,337][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][437] overhead, spent [8.3s] collecting in the last [8.4s]
[2022-07-04T11:13:06,533][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][438][15] duration [6s], collections [1]/[6.2s], total [6s]/[1.2m], memory [3.6gb]->[3.7gb]/[3.9gb], all_pools {[young] [10.1mb]->[41.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:13:06,534][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][438] overhead, spent [6s] collecting in the last [6.2s]

So unless it is really necessary, it is best not to change index.max_result_window from its default of 10000; and if you do need to raise it, test carefully before rolling the change out to production.
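
If a workload genuinely needs to walk deep into a result set, the usual alternative to raising max_result_window is cursor-style pagination with search_after (or the scroll API for one-off exports), which fetches one bounded page at a time instead of materializing a huge window. A minimal search_after sketch, again with placeholder index and field names:

    import requests

    # Cursor-style deep pagination with search_after instead of a large from/size.
    # Needs a deterministic sort; "created_at" and "order_no" (a unique field)
    # are hypothetical placeholders, as is "my_index".
    url = "http://localhost:9200/my_index/_search"
    body = {
        "size": 1000,
        "query": {"match_all": {}},
        "sort": [{"created_at": "asc"}, {"order_no": "asc"}],
    }
    while True:
        hits = requests.post(url, json=body).json()["hits"]["hits"]
        if not hits:
            break
        # ... process this page of hits ...
        body["search_after"] = hits[-1]["sort"]  # cursor for the next page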

