Docker error: Exited (137)

张某某啊哈 · published 2022-09-30 10:42:39

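The project suddenly went down for no obvious reason, and the application logs showed no errors. Further investigation showed that the container had exited abnormally with exit code 137, and the root cause turned out to be an OOM (out of memory) kill.

The Docker container died unexpectedly, and docker ps -a reported the status: Exited (137) *** ago

At this point docker logs showed no error output from inside the container, and on the Mesos side the only relevant line in stderr was: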

I0409 16:56:26.408077 8583 executor.cpp:736] Container exited with status 137

docker inspect showed the following container state:

"State": {
      "Status": "exited",
      "Running": false,
      "Paused": false,
      "Restarting": false,
      "OOMKilled": true,
      "Dead": false,
      "Pid": 0,
      "ExitCode": 137,
      "Error": "",
      "StartedAt": "2019-04-09T08:50:48.058583459Z",
      "FinishedAt": "2019-04-09T08:50:55.456317695Z"
  },
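The State block above can also be checked programmatically. The following is a minimal Python sketch, not part of the original troubleshooting session: the function name diagnose_exit and its return strings are made up for illustration. It relies on the convention that an exit code above 128 means the process was terminated by signal (code - 128), so 137 = 128 + 9, i.e. SIGKILL, which is what the kernel OOM killer sends.

```python
import json

def diagnose_exit(state: dict) -> str:
    """Classify a container exit from the "State" object of docker inspect.

    The labels returned here are illustrative, not Docker terminology.
    """
    code = state.get("ExitCode", 0)
    if state.get("OOMKilled"):
        # Docker sets this flag when the kernel OOM killer ended the container.
        return "oom-killed"
    if code > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        return f"killed by signal {code - 128}"
    if code != 0:
        return "application error"
    return "clean exit"

# Trimmed copy of the State block shown above.
sample = json.loads('{"Status": "exited", "OOMKilled": true, "ExitCode": 137}')
print(diagnose_exit(sample))  # oom-killed
```

If OOMKilled were false with the same exit code, the sketch would report "killed by signal 9", which points at a manual docker kill or some other SIGKILL rather than memory pressure.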

So the cause was OOMKilled. The OOM entries in the kernel log, viewed with journalctl, are as follows:

journalctl -k | grep -i -e memory -e oom
For a detailed guide to viewing system logs with journalctl, see: https://blog.csdn.net/qq_36595013/article/details/107318025

[root@centos7 ~]# journalctl -k | grep -i -e memory -e oom
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: Out of memory: Kill process 67686 (java) score 71 or sacrifice child
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: RibbonApacheHtt invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: Out of memory: Kill process 67731 (java) score 71 or sacrifice child
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: grpc-default-wo invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 06 02:08:40 centos7.linuxvmimages.com kernel: Out of memory: Kill process 67784 (java) score 71 or sacrifice child
Aug 08 04:03:59 centos7.linuxvmimages.com kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 08 04:03:59 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 08 04:03:59 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 08 04:03:59 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 08 04:04:00 centos7.linuxvmimages.com kernel: Out of memory: Kill process 124434 (java) score 91 or sacrifice child
Aug 08 04:04:00 centos7.linuxvmimages.com kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 08 04:04:00 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 08 04:04:00 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 08 04:04:00 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 08 04:04:00 centos7.linuxvmimages.com kernel: Out of memory: Kill process 124534 (java) score 91 or sacrifice child
Aug 26 02:48:02 centos7.linuxvmimages.com kernel: SimplePauseDete invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 26 02:48:09 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 26 02:48:10 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 26 02:48:25 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 26 02:49:10 centos7.linuxvmimages.com kernel: Out of memory: Kill process 96564 (java) score 96 or sacrifice child
Aug 26 13:35:37 centos7.linuxvmimages.com kernel: kworker/13:0 invoked oom-killer: gfp_mask=0x200d2, order=0, oom_score_adj=0
Aug 26 13:35:40 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 26 13:35:40 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 26 13:35:44 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 26 13:35:53 centos7.linuxvmimages.com kernel: Out of memory: Kill process 6846 (java) score 88 or sacrifice child
Aug 26 19:11:48 centos7.linuxvmimages.com kernel: lettuce-epollEv invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 26 19:11:49 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 26 19:11:49 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 26 19:11:50 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 26 19:11:51 centos7.linuxvmimages.com kernel: Out of memory: Kill process 77873 (java) score 87 or sacrifice child
Aug 26 19:56:19 centos7.linuxvmimages.com kernel: kworker/13:0 invoked oom-killer: gfp_mask=0x200d2, order=0, oom_score_adj=0
Aug 26 19:56:19 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 26 19:56:19 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 26 19:56:20 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 26 19:56:21 centos7.linuxvmimages.com kernel: Out of memory: Kill process 25618 (java) score 86 or sacrifice child
Aug 26 22:54:15 centos7.linuxvmimages.com kernel: http-nio-19004- invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 26 22:54:15 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 26 22:54:15 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 26 22:54:15 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug 26 22:54:17 centos7.linuxvmimages.com kernel: Out of memory: Kill process 4066 (java) score 88 or sacrifice child
Aug 27 03:05:05 centos7.linuxvmimages.com kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 27 03:05:06 centos7.linuxvmimages.com kernel:  [<ffffffff985c251e>] oom_kill_process+0x25e/0x3f0
Aug 27 03:05:06 centos7.linuxvmimages.com kernel:  [<ffffffff985c2d76>] out_of_memory+0x4b6/0x4f0
Aug 27 03:05:07 centos7.linuxvmimages.com kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name

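As you can see, the container was killed many times.

It looks like memory usage needs to be optimized.

Solutions
1. Set memory limits when running the container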
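To put a number on "many times", the kill lines in the journalctl output can be counted. This is a small Python sketch under the assumption that the kernel prints the "Out of memory: Kill process N (name) score S" message exactly as in the log above (newer kernels word it differently); the helper name summarize_oom_kills is hypothetical.

```python
import re
from collections import Counter

# Matches the kill message format seen in the log above.
KILL_RE = re.compile(
    r"^(\w{3} \d{2}) \S+ \S+ kernel: "
    r"Out of memory: Kill process (\d+) \(([^)]+)\) score (\d+)"
)

def summarize_oom_kills(lines):
    """Count OOM kills per (day, process name) from journalctl -k output."""
    kills = Counter()
    for line in lines:
        m = KILL_RE.match(line)
        if m:
            day, _pid, name, _score = m.groups()
            kills[(day, name)] += 1
    return kills

# Three lines copied from the log above, plus one non-kill line.
log = [
    "Aug 06 02:08:40 centos7.linuxvmimages.com kernel: Out of memory: Kill process 67686 (java) score 71 or sacrifice child",
    "Aug 06 02:08:40 centos7.linuxvmimages.com kernel: Out of memory: Kill process 67731 (java) score 71 or sacrifice child",
    "Aug 08 04:03:59 centos7.linuxvmimages.com kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0",
    "Aug 08 04:04:00 centos7.linuxvmimages.com kernel: Out of memory: Kill process 124434 (java) score 91 or sacrifice child",
]
for (day, name), count in sorted(summarize_oom_kills(log).items()):
    print(day, name, count)
```

In practice the input would come from piping journalctl -k into the script; the sample list just keeps the sketch self-contained.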

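-m, --memory                 Memory limit, written as a number plus a unit; the unit can be b, k, m or g. Minimum is 4M (newer Docker documentation lists 6m)
--memory-swap                Combined limit for memory plus swap. Same format. Must be set together with -m and be no smaller than it
--memory-reservation         Soft memory limit. Same format
--oom-kill-disable           Whether to stop the OOM killer from killing the container; not set by default
--oom-score-adj              Priority with which the OOM killer picks the container, range [-1000, 1000], default 0
--memory-swappiness          Controls the container's swap behavior. An integer from 0 to 100
--kernel-memory              Kernel memory limit. Same format, minimum 4M (deprecated since Docker 20.10)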


[root@sannian ~]# docker run -d -m 1G --memory-swap 3G -p 9999:80  --restart=always --name gitlab twang2218/gitlab-ce-zh
a3254078a79a084f3f3bed5f4ade3e26c7d86951cd822d95b113227d75b00097
[root@sannian ~]# docker ps
CONTAINER ID        IMAGE                    COMMAND             CREATED             STATUS                   PORTS                                   NAMES
a3254078a79a        twang2218/gitlab-ce-zh   "/assets/wrapper"   21 minutes ago      Up 2 minutes (healthy)   22/tcp, 443/tcp, 0.0.0.0:9999->80/tcp   gitlab
[root@sannian ~]# docker images
REPOSITORY                                               TAG                 IMAGE ID            CREATED             SIZE
twang2218/gitlab-ce-zh                                   latest              18da462b5ff5        3 months ago        1.61GB
registry-vpc.cn-hangzhou.aliyuncs.com/wenty/jumpserver   latest              055f42f305f5        7 months ago        1.41GB
registry.cn-hangzhou.aliyuncs.com/wenty/jumpserver       latest              055f42f305f5        7 months ago        1.41GB
registry.jumpserver.org/public/jumpserver                1.0.0               055f42f305f5        7 months ago        1.41GB
registry.jumpserver.org/public/jumpserver                latest              055f42f305f5        7 months ago        1.41GB
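Because --memory-swap is the total of memory plus swap, the docker run example above (-m 1G --memory-swap 3G) gives the container 1G of RAM and 2G of swap, not 3G of swap. The Python sketch below only illustrates that arithmetic; parse_size and swap_allowance are hypothetical helpers, not part of any Docker tooling, and they handle only the single-letter units listed in the table.

```python
UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def parse_size(text: str) -> int:
    """Convert a size such as '512m' or '1G' to bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # a bare number is taken as bytes

def swap_allowance(memory: str, memory_swap: str) -> int:
    """Bytes of swap a container gets for given -m and --memory-swap values."""
    mem = parse_size(memory)
    total = parse_size(memory_swap)
    if total < mem:
        raise ValueError("--memory-swap must not be smaller than --memory")
    # Equal values mean the container gets no swap at all.
    return total - mem

# Values from the docker run command above.
print(swap_allowance("1G", "3G") // UNITS["g"], "GiB of swap")  # 2 GiB of swap
```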

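2. Add more memory
3. In addition to adding memory, optimize the code (a Docker tuning guide is available at: https://blog.csdn.net/Rambo_Yang/article/details/118929604)
Use docker stats to monitor container resource usage and identify the offending container, then find the PID of the Java process inside it (for example with docker top).
Run jmap -histo pid to print the instance count and memory footprint of every class currently in the heap, as shown below. In the output, class name is the name of each class ([B is a byte array, [C is a char array, [I is an int array), bytes is the total memory used by all instances of that class, and instances is the number of instances.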

[Figure: jmap -histo output; image not included]
6. Dump a snapshot of the current heap to the file dumpfile_jmap.hprof, then analyze the snapshot

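Use jmap -dump:format=b,file=<file name> [pid] to write a heap snapshot of the given Java process to the specified file. jmap -dump:format is usually fairly slow, however, so you can also export a memory snapshot with the gcore tool.

For example: jmap -dump:format=b,file=D:/log/jvm/dumpfile_jmap.hprof 20886

Next, analyze the hprof file with the standalone MAT tool or the Eclipse MAT memory analysis plugin to see exactly which pesky objects have piled up and caused the memory overflow.

Alternatively, use jvisualvm.exe from the bin directory of the JDK (bundled with JDK 8 and earlier), or download JProfiler (worth a look; I personally find it quite handy).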

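8. Summary:

A typical OOM has one of a few causes. Either a flood of objects arrives in a short time and the system simply cannot cope, in which case you can optimize the code or add machines. Or, over a longer period, many objects are no longer used but are still referenced, which is a memory leak, and again the fix is to optimize the code. With a leak, large numbers of objects keep getting promoted to the old generation, repeated full GCs can never reclaim them, and the heap eventually fills up.

Or too many classes are loaded, so too much class metadata is kept in the permanent generation (Metaspace since Java 8) and can never be released, which eventually fills it up.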

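One last tip: interviewers will certainly ask whether you have handled production issues, and you should say yes. The simplest story is this: a colleague used a local cache backed by a plain map and never bounded its size, so it could grow without limit and eventually exhausted memory. The fix was to switch to the ehcache framework, which automatically evicts old entries using LRU and keeps memory usage under control.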
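The difference between the unbounded map and an LRU-bounded cache can be shown in a few lines. This is a Python sketch of the LRU eviction idea only; it is not ehcache and does not mirror its API, and the class name LRUCache and the max_entries parameter are chosen for illustration.

```python
from collections import OrderedDict

class LRUCache:
    """A size-bounded cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max = max_entries
        self._data = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the least recently used

    def __len__(self):
        return len(self._data)

cache = LRUCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now the most recently used entry
cache.put("c", 3)      # evicts "b", the least recently used
print(len(cache), cache.get("b"))  # 2 None
```

However many keys are inserted, the cache never holds more than max_entries values, which is the property the unbounded map was missing.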

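Also be sure to mention that production JVMs must be configured with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=/path/heap/dump. With these, the JVM automatically writes a heap snapshot when an OOM occurs, so you can analyze the snapshot from the moment of the OOM and find out where the problem is. Note that this applies to a java.lang.OutOfMemoryError thrown inside the JVM; a process killed by the kernel OOM killer with SIGKILL gets no chance to write a dump.

9. Optimize the code, tune the JVM configuration, then deploy and load-test the interfaces

Optimize the code and tune the JVM parameters based on the load-test results. Watch the system's QPS and the performance of its interfaces, and, once the load reaches a certain level, the machine's CPU, memory, I/O and disk load as well as the behavior of the JVM.

10. Flow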

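[Figure: flow chart; image not included]

Appendix: frequent full GC:

Frequent full GC is somewhat better than OOM. With an OOM the system simply dies on its own, which is painful and certainly a major incident. Frequent full GC is much milder: requests to the system often stall, a single request can hang for a long time without a response, and the system just feels slow.

First, you need to add some JVM parameters so that the production system routinely writes GC logs (these are the JDK 8 flag names; JDK 9 and later use unified logging via -Xlog:gc*):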

-XX:+PrintGCTimeStamps

-XX:+PrintGCDetails

-Xloggc:

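Then, whenever the production system stalls frequently, you can check the GC log right away. It looks roughly like this:

[Figure: sample GC log; image not included]

If after every Full GC the ParOldGen figure, i.e. old-generation usage, never comes down, then a large amount of memory is sitting in the old generation, doing nothing and impossible to reclaim. That is why full GC runs so often, and every full GC inevitably causes a stop-the-world pause, which cannot be avoided entirely.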
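The "old generation never comes down" check can be automated against the GC log. The Python sketch below assumes the JDK 8 Parallel GC line layout, in which a Full GC entry contains "ParOldGen: <before>K-><after>K(<capacity>K)"; the two sample lines and the 0.8 threshold are invented for illustration and are not taken from the original log.

```python
import re

OLD_GEN_RE = re.compile(r"ParOldGen: (\d+)K->(\d+)K\((\d+)K\)")

def old_gen_after_full_gc(log_lines):
    """Return old-generation occupancy (0..1) after each Full GC."""
    ratios = []
    for line in log_lines:
        if "Full GC" not in line:
            continue
        m = OLD_GEN_RE.search(line)
        if m:
            _before, after, capacity = map(int, m.groups())
            ratios.append(after / capacity)
    return ratios

def looks_stuck(ratios, threshold=0.8):
    """True if at least two Full GCs all left the old gen above threshold."""
    return len(ratios) >= 2 and all(r >= threshold for r in ratios)

# Made-up sample lines in the assumed format.
sample_log = [
    "[Full GC (Ergonomics) [PSYoungGen: 1024K->0K(2048K)] "
    "[ParOldGen: 9000K->8900K(10000K)] 10024K->8900K(12048K), 0.0500000 secs]",
    "[Full GC (Ergonomics) [PSYoungGen: 1024K->0K(2048K)] "
    "[ParOldGen: 9500K->9400K(10000K)] 10524K->9400K(12048K), 0.0600000 secs]",
]
ratios = old_gen_after_full_gc(sample_log)
print(looks_stuck(ratios))  # True
```

A True result is only a hint that objects are being retained; the heap dump analysis is still needed to find out which ones.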

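Then use the same approach as before: dump a heap snapshot and analyze it with the Eclipse MAT plugin to see which objects are too numerous.

What follows depends on the specific business scenario, and you have to look at what is actually happening. The common causes are either a memory leak or too many loaded classes nearly filling the permanent generation, and the usual remedy is to optimize the code logic.

Here is an example, a real case from one of our production systems that you can use in interviews. We had a system running in the background that, on each run, loaded several hundred thousand rows from MySQL in one go for various processing, much like a scheduled batch job. If processing those rows was slow, then for several minutes a large amount of data would pile up in the old generation where it could not be reclaimed, causing frequent full GC.

Based on this we realized that our two machines were no longer enough: production was running on two 4-core, 8 GB virtual machines, which was clearly insufficient. So we added machines so that each one handled less data, and that quickly relieved the frequent full GC.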