Fault Description

An internal environment raised a Pod anomaly alert:

[Alerting] Pod status alert
Pods in the cluster have been in an abnormal state for more than 1 minute
 1. ti-inf/etcd-1 (Pending): 1.000
Details: http://xx.xx.xx.xx/grafana/d/default/alert-dashboard?tab=alert&viewPanel=19&orgId=1

Listing the abnormal Pods in the k8s cluster shows that they are data-component Pods.
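As a minimal sketch of that listing step, the filter below keeps only Pods whose STATUS is neither Running nor Completed. The function name `abnormal_pods` is hypothetical, and the sketch assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/usr/bin/env bash
# Print "namespace/name status" for every Pod that is not Running or Completed.
# Expects the output of `kubectl get pods -A` on stdin; the first line is the
# header and is skipped. Column 4 is assumed to be STATUS.
abnormal_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage on a live cluster:
#   kubectl get pods -A | abnormal_pods
```

For the alert above, this would be expected to list ti-inf/etcd-1 with status Pending.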

Troubleshooting

1. Try restarting the Pod
~]# kubectl delete pod etcd-1 -nti-inf
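The Pod is still in an abnormal state after being recreated.
2. Check the Pod events (redis-server-2, another data-component Pod in the same namespace, is used as the example here)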
~]# kubectl describe pod redis-server-2 -nti-inf
Events:
  Type     Reason       Age                     From     Message
  ----     ------       ----                    ----     -------
  Normal   Scheduled    28m                     volcano  Successfully assigned ti-inf/redis-server-2 to x.x.x.x
  Warning  FailedMount  3m17s (x3599 over 28m)  kubelet  MountVolume.SetUp failed for volume "pvc-9d1c0e76-6d56-439d-8070-741d8846d569" : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error
The events show that kubelet fails at the MountVolume.SetUp step; the reported error is "stat /csi-data-dir/ti-database/pv: input/output error".
3. Check the kubelet logs
[root@VM-2-29-centos prometheus-db]# grep -i error /var/log/messages| tail -n 5
Jun 28 20:14:13 VM-2-29-centos kubelet: E0628 20:14:13.819828  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada podName: nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.319804053 +0800 CST m=+11760883.388055363 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\") pod \"etcd-1\" (UID: \"1c99773c-3845-4141-ac30-1c3d26f1f30a\") : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error"
Jun 28 20:14:13 VM-2-29-centos kubelet: E0628 20:14:13.901519  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada podName:4c5d9bdf-498a-4456-9c6c-e6f7b456e693 nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.401482582 +0800 CST m=+11760883.469733942 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"data\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\") pod \"4c5d9bdf-498a-4456-9c6c-e6f7b456e693\" (UID: \"4c5d9bdf-498a-4456-9c6c-e6f7b456e693\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/4c5d9bdf-498a-4456-9c6c-e6f7b456e693/volumes/kubernetes.io~csi/pvc-668750fa-cc0a-4105-96f3-7fa184db4ada/mount: input/output error"
Jun 28 20:14:14 VM-2-29-centos kubelet: E0628 20:14:14.018249  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569 podName: nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.518217097 +0800 CST m=+11760883.586468437 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"pvc-9d1c0e76-6d56-439d-8070-741d8846d569\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569\") pod \"redis-server-2\" (UID: \"5550e257-2245-4401-bd9a-cf275ff94675\") : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error"
Jun 28 20:14:14 VM-2-29-centos kubelet: E0628 20:14:14.102735  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569 podName:daea4ba4-b97c-46c6-866b-aa7cc29af0a8 nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.602692068 +0800 CST m=+11760883.670943428 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"data\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569\") pod \"daea4ba4-b97c-46c6-866b-aa7cc29af0a8\" (UID: \"daea4ba4-b97c-46c6-866b-aa7cc29af0a8\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/daea4ba4-b97c-46c6-866b-aa7cc29af0a8/volumes/kubernetes.io~csi/pvc-9d1c0e76-6d56-439d-8070-741d8846d569/mount: input/output error"

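The logs show MountVolume.SetUp and UnmountVolume.TearDown failing repeatedly for both etcd-1 and redis-server-2, each time with an input/output error when stat is called on the volume path. This points to an I/O problem on the underlying disk rather than in Kubernetes itself.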
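To see at a glance which volumes are affected, the kubelet errors can be reduced to a list of distinct PV names. This is only a sketch: the function name `io_error_volumes` is hypothetical, and it assumes PV names follow the `pvc-<uuid>` pattern seen in the log lines above.

```shell
#!/usr/bin/env bash
# Read kubelet log lines on stdin and print each distinct PV name
# (pvc-<uuid>) that appears on a line containing "input/output error".
io_error_volumes() {
  grep -F 'input/output error' \
    | grep -oE 'pvc-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
    | sort -u
}

# Usage on the node (log path taken from the transcript above):
#   io_error_volumes < /var/log/messages
```

Each name in the output can then be matched against the VOLUME column of `kubectl get pvc` to find the owning claim.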
4. Check the PVC and PV objects
[root@VM-2-29-centos ~]# kubectl get pvc -nti-inf |grep redis
data-redis-server-0                  Bound    pvc-59fde781-e03e-4b26-b07c-7de93f608395   10Gi       RWO            csi-localpv-tidb   136d
data-redis-server-1                  Bound    pvc-6bf28ec2-40e1-4b52-8d54-b4ab0aa9f67a   10Gi       RWO            csi-localpv-tidb   136d
data-redis-server-2                  Bound    pvc-9d1c0e76-6d56-439d-8070-741d8846d569   10Gi       RWO            csi-localpv-tidb   136d
[root@VM-2-29-centos ~]# 
[root@VM-2-29-centos ~]# kubectl get pv |grep redis
pvc-59fde781-e03e-4b26-b07c-7de93f608395   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-0                                                    csi-localpv-tidb            136d
pvc-6bf28ec2-40e1-4b52-8d54-b4ab0aa9f67a   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-1                                                    csi-localpv-tidb            136d
pvc-9d1c0e76-6d56-439d-8070-741d8846d569   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-2                                                    csi-localpv-tidb            136d

Both the PVCs and the PVs are Bound and look healthy.
5. Check the disk mounts

[Figure: disk mount status on the node]

dmesg prints the kernel ring buffer, i.e. kernel-level messages:

[Tue Jun 28 20:22:47 2022] buffer_io_error: 6 callbacks suppressed
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971392, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971393, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971394, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971395, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971396, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971397, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971398, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971399, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop4, logical block 20971392, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop4, logical block 20971393, async page read

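The disk backing these PVCs is /dev/vdc, which is managed as an LVM logical volume, so the errors point to a failure of that logical volume.

Listing the directory from a terminal on the node confirms that it can no longer be accessed: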
~]# ls /data/ti-database 
ls: cannot access /data/ti-database: Input/output error

Note: the dmesg entries are buffer I/O errors, e.g. the asynchronous page read of logical block 20971393 failed.
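When the kernel log is long, it can help to count these errors per device to see how many loop devices are affected. The following is a rough sketch; the function name `buffer_io_errors_by_dev` is hypothetical and it assumes the "Buffer I/O error on dev <name>," message format shown above.

```shell
#!/usr/bin/env bash
# Read dmesg output on stdin and print "<device> <count>" for every device
# that reported a "Buffer I/O error on dev <device>," message.
buffer_io_errors_by_dev() {
  awk '/Buffer I\/O error on dev/ {
         for (i = 1; i < NF; i++) {
           if ($i == "dev") {
             dev = $(i + 1)
             sub(/,$/, "", dev)   # strip the trailing comma after the name
             count[dev]++
             break
           }
         }
       }
       END { for (d in count) print d, count[d] }' | sort
}

# Usage on the node:
#   dmesg -T | buffer_io_errors_by_dev
```

`losetup -l` lists the backing file of each loop device, which can be used to relate loop0 and loop4 to files on the failed volume.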

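Solution

The platform's data components (etcd/redis/es) each run with 3 replicas and can tolerate a single point of failure, and this logical volume was planned from the start for data components only, so other services are unaffected. It is enough to rebuild the LVM logical volume.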

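Detailed procedure:
1. Back up the mysql/etcd/es data
2. Unmount the logical volume
3. Remove the logical volume (LV) with lvremove
4. Remove the volume group (VG) with vgremove
5. Remove the physical volume with pvremove
   Once these steps are done, lvdisplay, vgdisplay and pvdisplay no longer show any information for the removed LVM objects.
6. Delete the PV and PVC belonging to this node
7. Recreate the LVM logical volume and mount it
8. Create the PV and PVC objects and bind them to the Pod
9. Verify the Pod status
10. Check the health of the redis and etcd clusters and verify data consistency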
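The LVM part of the procedure (steps 2 to 5 and step 7) could be sketched as the dry-run script below. The device /dev/vdc and the mount point /data/ti-database come from the incident above; the volume group name `vg_data`, the logical volume name `lv_data` and the xfs filesystem are assumptions made for illustration and must be replaced with the real values. Backups and the Kubernetes objects (steps 1, 6 and 8 to 10) are not covered.

```shell
#!/usr/bin/env bash
# Print (or, with DRY_RUN=0, execute) the LVM rebuild for the failed volume.
DEVICE="${DEVICE:-/dev/vdc}"                     # from the incident above
MOUNT_POINT="${MOUNT_POINT:-/data/ti-database}"  # from the incident above
VG="${VG:-vg_data}"                              # ASSUMED volume group name
LV="${LV:-lv_data}"                              # ASSUMED logical volume name
DRY_RUN="${DRY_RUN:-1}"                          # 1 = only print the commands

# Echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

rebuild_lvm() {
  run umount "$MOUNT_POINT"                # step 2: unmount the logical volume
  run lvremove -y "$VG/$LV"                # step 3: remove the LV
  run vgremove -y "$VG"                    # step 4: remove the VG
  run pvremove -y "$DEVICE"                # step 5: remove the physical volume
  run pvcreate "$DEVICE"                   # step 7: recreate PV, VG and LV
  run vgcreate "$VG" "$DEVICE"
  run lvcreate -l 100%FREE -n "$LV" "$VG"
  run mkfs.xfs "/dev/$VG/$LV"              # ASSUMED filesystem type
  run mount "/dev/$VG/$LV" "$MOUNT_POINT"
}

# Usage: call rebuild_lvm to print the plan; set DRY_RUN=0 only after the
# backups from step 1 have been verified.
```

Defaulting to dry-run is a deliberate choice here: the commands are destructive, so printing the plan first gives a chance to review the device and volume names before anything is removed.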

References:
https://github.com/longhorn/longhorn/issues/1210
https://developer.aliyun.com/article/521158
