Quickly recovering a kubeadm-installed cluster after etcd is accidentally deleted
Once etcd has been deleted, all of the previous cluster's data is gone unless it was backed up; even after the cluster is brought back, it is effectively a brand-new cluster.
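For future incidents, a periodic etcd snapshot would make a real restore possible instead of a rebuild. A minimal sketch, assuming etcdctl is available on the master and the default kubeadm certificate paths under /etc/kubernetes/pki/etcd (the snapshot destination /backup/etcd-snapshot.db is only an example):
# ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
>   --endpoints=https://127.0.0.1:2379 \
>   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
>   --cert=/etc/kubernetes/pki/etcd/server.crt \
>   --key=/etc/kubernetes/pki/etcd/server.key
No such snapshot exists in the scenario below, so the cluster has to be re-initialized from scratch.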
【Symptom】:
# kubectl get nodes
No resources found
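To confirm that the etcd data is actually gone (and not just that the API server is down), check the default kubeadm etcd data directory, /var/lib/etcd (the same path listed in the reset output below); this check is an assumption about where the data lived, not part of the original procedure:
# ls -l /var/lib/etcd
If the member directory is missing or empty and there is no backup, the only option is to rebuild the cluster as follows.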
【Recovery steps】
On the master node, run:
# kubeadm reset
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W0309 16:33:41.211977 5001 reset.go:101] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get config map: configmaps "kubeadm-config" not found
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W0309 16:33:42.783650 5001 removeetcdmember.go:80] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/etcd /var/lib/kubelet /var/lib/dockershim /var/run/kubernetes /var/lib/cni]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
Delete all files under $HOME/.kube/.
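For example, assuming the kubeconfig lives in the default $HOME/.kube location:
# rm -rf $HOME/.kube/*
Then re-initialize the control plane: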
# kubeadm init \
> --apiserver-advertise-address=10.1.1.2 \
> --image-repository registry.aliyuncs.com/google_containers \
> --kubernetes-version v1.23.4 \
> --service-cidr=10.96.0.0/12 \
> --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.23.4
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [master kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.1.1.2]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [master localhost] and IPs [10.1.1.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [master localhost] and IPs [10.1.1.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 27.904841 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.23" in namespace kube-system with the configuration for the kubelets in the cluster
NOTE: The "kubelet-config-1.23" naming of the kubelet ConfigMap is deprecated. Once the UnversionedKubeletConfigMap feature gate graduates to Beta the default name will become just "kubelet-config". Kubeadm upgrade will handle this transition transparently.
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node master as control-plane by adding the labels: [node-role.kubernetes.io/master(deprecated) node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node master as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[bootstrap-token] Using token: 1h8pgf.bk8n7d2h254p20b9
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 10.1.1.2:6443 --token 1h8pgf.bk8n7d2h254p20b9 \
--discovery-token-ca-cert-hash sha256:909c8ab47ada04d43e6e02408ffbfe8efd5019bfce4e8a95401db0f84cb22de5
Finally, copy the admin kubeconfig into the .kube directory under the user's home directory (the same commands kubeadm printed above):
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Running kubectl on the master now shows the node:
# kubectl get node
NAME STATUS ROLES AGE VERSION
master Ready control-plane,master 7m20s v1.23.4
【Steps on the worker nodes】
Running kubeadm join directly on a worker node fails with the following errors:
# kubeadm join 10.1.1.2:6443 --token 1h8pgf.bk8n7d2h254p20b9 \
> --discovery-token-ca-cert-hash sha256:909c8ab47ada04d43e6e02408ffbfe8efd5019bfce4e8a95401db0f84cb22de5
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR FileAvailable--etc-kubernetes-pki-ca.crt]: /etc/kubernetes/pki/ca.crt already exists
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
1. Stop the kubelet service:
# systemctl stop kubelet
2. Delete the files reported as already existing:
# rm -rf /etc/kubernetes/kubelet.conf
# rm -rf /etc/kubernetes/pki/ca.crt
3. Run the join command again. It still fails, this time with:
error execution phase kubelet-start: error uploading crisocket: Unauthorized
# kubeadm join 10.1.1.2:6443 --token 1h8pgf.bk8n7d2h254p20b9 \
> --discovery-token-ca-cert-hash sha256:909c8ab47ada04d43e6e02408ffbfe8efd5019bfce4e8a95401db0f84cb22de5
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
error execution phase kubelet-start: error uploading crisocket: Unauthorized
To see the stack trace of this error execute with --v=5 or higher
4. This is because the node had previously joined a cluster, so its environment needs to be reset first by running the following commands:
swapoff -a
kubeadm reset
rm /etc/cni/net.d/* -f
systemctl daemon-reload
systemctl restart kubelet
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# swapoff -a
# kubeadm reset
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W0309 16:54:22.671794 5840 removeetcdmember.go:80] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please, manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /var/lib/dockershim /var/run/kubernetes /var/lib/cni]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
# rm /etc/cni/net.d/* -f
# systemctl daemon-reload
# systemctl restart kubelet
# iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
5. Run the kubeadm join command again:
# kubeadm join 10.1.1.2:6443 --token 1h8pgf.bk8n7d2h254p20b9 \
> --discovery-token-ca-cert-hash sha256:909c8ab47ada04d43e6e02408ffbfe8efd5019bfce4e8a95401db0f84cb22de5
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
The node has now joined the cluster successfully; repeat the same steps on the remaining worker nodes.
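If the original bootstrap token has expired by the time the remaining nodes are joined (kubeadm tokens expire after 24 hours by default), a fresh join command can be generated on the master; this is standard kubeadm behaviour rather than a step from the original walkthrough:
# kubeadm token create --print-join-command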
6. Check the nodes from the master:
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready control-plane,master 23m v1.23.4
node1 NotReady <none> 7m29s v1.23.4
node2 NotReady <none> 7m22s v1.23.4
Both nodes are in NotReady state. Log in to each of them and inspect the kubelet logs with journalctl -f -u kubelet; the error is:
"Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
# journalctl -f -u kubelet
-- Logs begin at Mon 2022-01-17 14:30:01 CST. --
Mar 09 17:01:45 node1 kubelet[6191]: I0309 17:01:45.427108 6191 cni.go:240] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
Mar 09 17:01:47 node1 kubelet[6191]: E0309 17:01:47.775298 6191 kubelet.go:2347] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
Mar 09 17:01:50 node1 kubelet[6191]: I0309 17:01:50.428224 6191 cni.go:240] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
Mar 09 17:01:52 node1 kubelet[6191]: E0309 17:01:52.805348 6191 kubelet.go:2347] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
Mar 09 17:01:55 node1 kubelet[6191]: I0309 17:01:55.429269 6191 cni.go:240] "Unable to update cni config" err="no networks found in /etc/cni/net.d"
This is because flannel is not running yet; the flannel manifests just need to be applied on the master.
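If the manifests are not already present on the master, a current kube-flannel.yml can usually be downloaded from the flannel project, for example (an assumption about where to fetch it; the output below was produced with an older manifest that still shipped per-architecture DaemonSets and a separate kube-flannel-rbac.yml):
# wget https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
Then apply the manifests: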
# kubectl apply -f kube-flannel-rbac.yml
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
# kubectl apply -f kube-flannel.yml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel configured
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
serviceaccount/flannel created
configmap/kube-flannel-cfg created
Warning: spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key: beta.kubernetes.io/os is deprecated since v1.14; use "kubernetes.io/os" instead
Warning: spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[1].key: beta.kubernetes.io/arch is deprecated since v1.14; use "kubernetes.io/arch" instead
daemonset.apps/kube-flannel-ds-amd64 created
daemonset.apps/kube-flannel-ds-arm64 created
daemonset.apps/kube-flannel-ds-arm created
daemonset.apps/kube-flannel-ds-ppc64le created
daemonset.apps/kube-flannel-ds-s390x created
At this point the cluster is fully recovered:
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready control-plane,master 26m v1.23.4
node1 Ready <none> 10m v1.23.4
node2 Ready <none> 9m58s v1.23.4
# kubectl get ns
NAME STATUS AGE
default Active 27m
kube-node-lease Active 27m
kube-public Active 27m
kube-system Active 27m
# kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6d8c4cb4d-7lswb 1/1 Running 0 26m
coredns-6d8c4cb4d-84z48 1/1 Running 0 26m
etcd-host-10-19-83-151 1/1 Running 3 27m
kube-apiserver-host-10-19-83-151 1/1 Running 0 27m
kube-controller-manager-host-10-19-83-151 1/1 Running 19 27m
kube-flannel-ds-amd64-8j4nd 1/1 Running 0 82s
kube-flannel-ds-amd64-gbn6r 1/1 Running 0 82s
kube-flannel-ds-amd64-tx8cd 1/1 Running 0 82s
kube-proxy-d4bb2 1/1 Running 0 10m
kube-proxy-k2skv 1/1 Running 0 26m
kube-proxy-x9k76 1/1 Running 1 (5m11s ago) 11m
kube-scheduler-host-10-19-83-151 1/1 Running 19 27m