etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired (1): How was the problem discovered?
1. Environment Overview
The problem environment is a K8S cluster deployed on OpenStack virtual machines, with three master nodes:
# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-m1 Ready master 56d v1.17.0 172.0.2.139 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
k8s-m2 Ready master 56d v1.17.0 172.0.2.146 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
k8s-m3 Ready master 56d v1.17.0 172.0.2.234 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
The K8S core components and the etcd cluster are all deployed as static Pods (a quick way to verify this follows the pod listing below):
# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-564b6667d7-hh4fc 1/1 Running 43 56d
calico-node-27p7r 1/1 Running 43 56d
calico-node-ccqmh 1/1 Running 7 56d
calico-node-s4rkj 1/1 Running 41 56d
coredns-6955765f44-pktkk 1/1 Running 39 56d
coredns-6955765f44-zsqtw 1/1 Running 39 56d
etcd-k8s-m1 1/1 Running 44 56d
etcd-k8s-m2 1/1 Running 2 4h3m
etcd-k8s-m3 1/1 Running 3779 56d
kube-apiserver-k8s-m1 1/1 Running 40 56d
kube-apiserver-k8s-m2 1/1 Running 6544 56d
kube-apiserver-k8s-m3 1/1 Running 17973 56d
kube-controller-manager-k8s-m1 1/1 Running 61 51d
kube-controller-manager-k8s-m2 1/1 Running 123 51d
kube-controller-manager-k8s-m3 1/1 Running 20 51d
kube-proxy-ddn5z 1/1 Running 6 56d
kube-proxy-rxldv 1/1 Running 39 56d
kube-proxy-vsqn5 1/1 Running 39 56d
kube-scheduler-k8s-m1 1/1 Running 61 56d
kube-scheduler-k8s-m2 1/1 Running 61 56d
kube-scheduler-k8s-m3 1/1 Running 24 56d
tiller-deploy-7d7dcb8d58-l8t7k 1/1 Running 39 55d
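Since everything above runs as static Pods, the manifests live on each master's local disk and the kubelet manages the containers directly, without going through the API server. A quick way to verify, assuming kubeadm's default manifest path:
# ls /etc/kubernetes/manifests/
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml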
2. Symptoms
When the problem occurred, the OpenStack compute node hosting k8s-m2 had been rebooted. After the compute node recovered, the K8S virtual machines could not be logged into and had to be hard-rebooted manually:
$ nova list
| ccec3aee-15e5-4710-8e4c-5a4db6826022 | k8s-m1 | ACTIVE | - | Running | 5gc_mgt=180.0.0.3; clu_net2=172.0.2.139, 192.168.32.198 |
| 9eacafc1-8ce7-4073-a8c2-73270cffae31 | k8s-m2 | ACTIVE | - | Running | 5gc_mgt=180.0.0.206; clu_net2=172.0.2.146, 192.168.32.186 |
| 8128acd7-fd05-4367-985d-a36023f87de8 | k8s-m3 | ACTIVE | - | Running | 5gc_mgt=180.0.0.8; clu_net2=172.0.2.234, 192.168.32.159 |
$ nova reboot ccec3aee-15e5-4710-8e4c-5a4db6826022 --hard
Request to reboot server k8s-m1 (ccec3aee-15e5-4710-8e4c-5a4db6826022) has been accepted.
$ nova reboot 9eacafc1-8ce7-4073-a8c2-73270cffae31 --hard
Request to reboot server k8s-m2 (9eacafc1-8ce7-4073-a8c2-73270cffae31) has been accepted.
The K8S cluster came back, but not cleanly: the initial symptom was that the k8s-m2 node was stuck in NotReady state.
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-m1 Ready master 55d v1.17.0
k8s-m2 NotReady master 55d v1.17.0
k8s-m3 Ready master 55d v1.17.0
3. Troubleshooting
Judging from the initial symptom alone, the problem did not appear to be related to etcd at all.
3.1 k8s-m2 node status
The k8s-m2 node's status conditions are Unknown; the messages show that the node's kubelet stopped posting status:
# kubectl describe nodes k8s-m2
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 10 Apr 2020 03:47:35 -0400 Fri, 10 Apr 2020 03:47:35 -0400 CalicoIsUp Calico is running on this node
MemoryPressure Unknown Sat, 11 Apr 2020 01:47:00 -0400 Mon, 13 Apr 2020 02:30:06 -0400 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Sat, 11 Apr 2020 01:47:00 -0400 Mon, 13 Apr 2020 02:30:06 -0400 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Sat, 11 Apr 2020 01:47:00 -0400 Mon, 13 Apr 2020 02:30:06 -0400 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Sat, 11 Apr 2020 01:47:00 -0400 Mon, 13 Apr 2020 02:30:06 -0400 NodeStatusUnknown Kubelet stopped posting node status.
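The same check can be scripted with a jsonpath filter instead of scanning the full describe output (a generic one-liner, not from the original session):
# kubectl get node k8s-m2 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
Unknown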
3.2 kubelet status on the node
The kubelet service is active (running), but its log keeps reporting node "k8s-m2" not found:
[root@k8s-m2 ~]# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2020-04-13 04:34:51 EDT; 1min 1s ago
Docs: http://kubernetes.io/docs/
Process: 26143 ExecStartPre=/usr/bin/kubelet-pre-start.sh (code=exited, status=0/SUCCESS)
Main PID: 26156 (kubelet)
Tasks: 18
Memory: 33.1M
CGroup: /system.slice/kubelet.service
└─26156 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib...
Apr 13 04:35:51 k8s-m2 kubelet[26156]: E0413 04:35:51.607268 26156 kubelet.go:2263] node "k8s-m2" not found
Apr 13 04:35:51 k8s-m2 kubelet[26156]: E0413 04:35:51.707388 26156 kubelet.go:2263] node "k8s-m2" not found
Apr 13 04:35:51 k8s-m2 kubelet[26156]: E0413 04:35:51.710741 26156 kubelet_node_status.go:92] Unable to register node "k8s-m2" with API server: P...ke timeout
Apr 13 04:35:51 k8s-m2 kubelet[26156]: E0413 04:35:51.807510 26156 kubelet.go:2263] node "k8s-m2" not found
Apr 13 04:35:51 k8s-m2 kubelet[26156]: E0413 04:35:51.907623 26156 kubelet.go:2263] node "k8s-m2" not found
Apr 13 04:35:52 k8s-m2 kubelet[26156]: E0413 04:35:52.007760 26156 kubelet.go:2263] node "k8s-m2" not found
Apr 13 04:35:52 k8s-m2 kubelet[26156]: E0413 04:35:52.107909 26156 kubelet.go:2263] node "k8s-m2" not found
Apr 13 04:35:52 k8s-m2 kubelet[26156]: E0413 04:35:52.208042 26156 kubelet.go:2263] node "k8s-m2" not found
Apr 13 04:35:52 k8s-m2 kubelet[26156]: E0413 04:35:52.308158 26156 kubelet.go:2263] node "k8s-m2" not found
Apr 13 04:35:52 k8s-m2 kubelet[26156]: E0413 04:35:52.309049 26156 eviction_manager.go:246] eviction manager: failed to get summary stats: failed... not found
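The truncated lines are easier to read straight from the kubelet journal:
[root@k8s-m2 ~]# journalctl -u kubelet --no-pager --since "5 minutes ago" | tail -n 20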
3.3 Pod status
Pick a pod on the k8s-m2 node and inspect its status: the Ready condition is False.
# kubectl describe pods etcd-k8s-m2 -n kube-system
...
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
...
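To list every pod scheduled on this node in one shot, a field selector on the node name works (a generic check, not from the original session):
# kubectl get pods -n kube-system --field-selector spec.nodeName=k8s-m2 -o wide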
kubectl logs cannot retrieve the logs of pods on the k8s-m2 node:
[root@k8s-m1 ~]# kubectl logs kube-apiserver-k8s-m2 -n kube-system
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
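kubectl logs is served through the kubelet on the target node (hence the nodes/proxy subresource in the error above); here the request likely fails because the kubelet on k8s-m2 cannot authorize it against its broken local apiserver. One fallback is to read the kubelet's per-container log files on the node itself:
[root@k8s-m2 ~]# ls /var/log/containers/ | grep etcd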
The etcd logs on the other two nodes show that they cannot connect to etcd-k8s-m2:
# kubectl logs etcd-k8s-m3 -n kube-system
2020-04-14 19:09:22.030794 W | rafthttp: health check for peer 1e2fb9983e528532 could not connect: dial tcp 172.0.2.146:2380: connect: connection refused
2020-04-14 19:09:22.030833 W | rafthttp: health check for peer 1e2fb9983e528532 could not connect: dial tcp 172.0.2.146:2380: connect: connection refused
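Cluster membership and health can also be queried from a healthy member; a sketch assuming kubeadm's default etcd certificate paths:
# kubectl exec -n kube-system etcd-k8s-m3 -- etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health --cluster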
Since kubectl logs cannot provide more information, try reading the container logs directly.
3.4 Container status on k8s-m2
The kube-apiserver and etcd containers do not show up in docker ps at all.
docker ps -a reveals why: both containers keep exiting and restarting.
[root@k8s-m2 ~]# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0907f00317b0 0cae8d5cc64c "kube-apiserver --ad…" 34 seconds ago Exited (2) 11 seconds ago k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_229
ce44cbffdb45 303ce5db0e90 "etcd --advertise-cl…" 39 seconds ago Exited (1) 38 seconds ago k8s_etcd_etcd-k8s-m2_kube-system_c45c8fe716669e896c01df9357b80855_317
Catch these two containers during their brief restart windows and use docker logs to read their output.
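Alternatively, docker logs also works on exited containers, so the most recent etcd container can be picked up by name instead of racing the restart loop (a helper sketch; the k8s_etcd name prefix comes from the docker ps -a output above):
[root@k8s-m2 ~]# docker logs $(docker ps -a -q --filter name=k8s_etcd | head -n 1) 2>&1 | tail -n 20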
kube-apiserver container logs
The apiserver log shows that kube-apiserver cannot connect to the local etcd:
[root@k8s-m2 ~]# docker logs 60e41c29afec
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0413 08:17:15.314983 1 server.go:596] external host was not specified, using 172.0.2.146
I0413 08:17:15.315295 1 server.go:150] Version: v1.17.0
I0413 08:17:16.272794 1 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0413 08:17:16.272823 1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I0413 08:17:16.273986 1 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I0413 08:17:16.274005 1 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I0413 08:17:16.276810 1 client.go:361] parsed scheme: "endpoint"
I0413 08:17:16.276856 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 0 <nil>}]
W0413 08:17:16.277260 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
I0413 08:17:17.270943 1 client.go:361] parsed scheme: "endpoint"
I0413 08:17:17.270987 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 0 <nil>}]
W0413 08:17:17.271304 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0413 08:17:17.277637 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
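The connection-refused errors are consistent with nothing listening on etcd's client port, which can be confirmed from the host (the command prints nothing while etcd is down):
[root@k8s-m2 ~]# ss -lntp | grep 2379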
etcd container logs
The etcd log shows that the local WAL data has been corrupted:
[root@k8s-m2 ~]# docker logs ae7571b20883
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-04-13 08:25:50.975264 I | etcdmain: etcd Version: 3.4.3
2020-04-13 08:25:50.975305 I | etcdmain: Git SHA: 3cf2f69b5
2020-04-13 08:25:50.975311 I | etcdmain: Go Version: go1.12.12
2020-04-13 08:25:50.975316 I | etcdmain: Go OS/Arch: linux/amd64
2020-04-13 08:25:50.975321 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2020-04-13 08:25:50.975397 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-04-13 08:25:50.975447 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2020-04-13 08:25:50.976806 I | embed: name = k8s-m2
2020-04-13 08:25:50.976817 I | embed: data dir = /var/lib/etcd
2020-04-13 08:25:50.976822 I | embed: member dir = /var/lib/etcd/member
2020-04-13 08:25:50.976827 I | embed: heartbeat = 100ms
2020-04-13 08:25:50.976832 I | embed: election = 1000ms
2020-04-13 08:25:50.976836 I | embed: snapshot count = 10000
2020-04-13 08:25:50.976848 I | embed: advertise client URLs = https://172.0.2.146:2379
2020-04-13 08:25:50.976854 I | embed: initial advertise peer URLs = https://172.0.2.146:2380
2020-04-13 08:25:50.976860 I | embed: initial cluster =
2020-04-13 08:25:50.983972 I | etcdserver: recovered store from snapshot at index 10651074
2020-04-13 08:25:50.985535 I | mvcc: restore compact to 8003675
2020-04-13 08:25:51.287881 C | etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired
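The fatal line points at the WAL files under the data dir logged above (/var/lib/etcd); they can be located on disk for later inspection or repair attempts (standard etcd on-disk layout, file names vary):
[root@k8s-m2 ~]# ls -l /var/lib/etcd/member/wal/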
At this point the problem is clear: the etcd data on the k8s-m2 node was corrupted, most likely by the hard reboot of the virtual machine hosting k8s-m2.