Analyzing the Causes of Abnormal Elasticsearch Cluster Status (Red, Yellow)

努力者Mr李 · Published 2022-04-20 14:43:54

Note: some of the concept descriptions below are taken from online sources.

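I. The three states of an Elasticsearch cluster:
Green - all data is available; every primary and replica shard is allocated.
Yellow - all data is available, but some replica shards are not yet allocated. Queries are unaffected, but recovery may be: if a node fails, some data may be unavailable until that node is repaired.
Red - at least one primary shard is unassigned, so some data is unavailable and queries are affected.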
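
The three states above reduce to a simple rule on unassigned shard counts. The following is a minimal sketch of that rule only; the function name and parameters are illustrative and not part of any Elasticsearch client.

```python
def cluster_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Map unassigned shard counts to the health colour described above."""
    if unassigned_primaries > 0:
        return "red"      # some data cannot be queried at all
    if unassigned_replicas > 0:
        return "yellow"   # data is complete, redundancy is reduced
    return "green"        # every primary and replica is allocated


print(cluster_status(0, 0))  # green
print(cluster_status(0, 3))  # yellow
print(cluster_status(1, 0))  # red
```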

II. Finding the cause of an index's Yellow status
1. Check cluster health, including per-index status

GET /_cluster/health?level=indices
{
  "cluster_name" : "elasticsearch-1",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
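  # Number of active primary shards
  "active_primary_shards" : 28,
  # Total number of active primary and replica shards
  "active_shards" : 55,
  # Number of shards currently relocating
  "relocating_shards" : 0,
  # Number of shards currently initializing
  "initializing_shards" : 0,
  # Number of unassigned shards
  "unassigned_shards" : 3,
  # Number of shards whose allocation is delayed by the timeout setting
  "delayed_unassigned_shards" : 0,
  # Number of cluster-level changes that have not yet been executed
  "number_of_pending_tasks" : 0,
  # Number of unfinished fetches
  "number_of_in_flight_fetch" : 0,
  # Time (in milliseconds) that the earliest queued task has been waiting to execute
  "task_max_waiting_in_queue_millis" : 0,
  # Percentage of active shards in the cluster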
  "active_shards_percent_as_number" : 100.0,
  "indices" : {
    "elasticsearch-1" : {
      "status" : "green",
      "number_of_shards" : 3,
      "number_of_replicas" : 3,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 3
    }
  }
}
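
On a cluster with many indices, it can help to filter the health response down to the indices that actually have unassigned shards. The sketch below assumes the response has already been parsed into a Python dict (with the inline # comments removed, since they are not valid JSON); the function name and the second index in the sample are hypothetical.

```python
def indices_with_unassigned(health: dict) -> dict:
    """Return {index_name: unassigned_shards} for indices with unassigned shards."""
    result = {}
    for name, info in health.get("indices", {}).items():
        count = info.get("unassigned_shards", 0)
        if count > 0:
            result[name] = count
    return result


# Trimmed sample: "other-index" is made up for illustration.
sample = {
    "unassigned_shards": 3,
    "indices": {
        "elasticsearch-1": {"unassigned_shards": 3},
        "other-index": {"unassigned_shards": 0},
    },
}
print(indices_with_unassigned(sample))  # {'elasticsearch-1': 3}
```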

2. Check shard allocation on each node

GET /_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
    19       86.7kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 master
    18       73.1kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 node-003
    18       67.8kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 node-002
     3                                                                               UNASSIGNED
#unassigned_shards=3: the unassigned shards are replicas, which is why the cluster status is Yellow
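
The allocation table can also be read programmatically. This is a rough sketch that parses the plain-text output shown above, assuming the first column is the shard count, the last column is the node name, and node names contain no spaces; the function name is illustrative.

```python
def parse_allocation(text: str) -> dict:
    """Map node name -> shard count from `_cat/allocation?v` text output."""
    counts = {}
    lines = [line for line in text.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        parts = line.split()
        # First field is the shard count, last field is the node name.
        counts[parts[-1]] = int(parts[0])
    return counts


sample = """\
shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
    19       86.7kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 master
    18       73.1kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 node-003
    18       67.8kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 node-002
     3                                                                               UNASSIGNED
"""
counts = parse_allocation(sample)
print(counts.get("UNASSIGNED", 0))  # 3
```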

3. Find out why the shards are unassigned

GET /_cluster/allocation/explain?pretty
{
    "index" : "elasticsearch-1",
    "shard" : 3,
    "primary" : false,
    "current_state" : "unassigned",
    "unassigned_info" : {
        "reason" : "CLUSTER_RECOVERED",
        "at" : "2022-04-20T11:01:43.051Z",
        "last_allocation_status" : "no_attempt"
    },
    "can_allocate" : "no",
    # Reason the shard cannot be allocated
    "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
    "node_allocation_decisions" : [
    {
        "node_id" : "NfmBH4nSSpGmtf7aPNuvXQ",
        "node_name" : "master",
        "transport_address" : "127.0.0.1:9300",
        "node_decision" : "no",
        "deciders" : [{
        "decider" : "same_shard",
        "decision" : "NO",
        "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists"
        }]
    }]
}

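The decision for every node says that it already holds a copy of this shard, so the replica cannot be allocated there. With 3 data nodes and number_of_replicas set to 3, each shard needs 4 copies on 4 different nodes, so one replica per shard can never be assigned.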
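
When the explain response lists many nodes, a short summary of which deciders said NO on which node is easier to read. The sketch below works on the response parsed into a dict; the function name is hypothetical and the sample is trimmed from the output above.

```python
def blocking_deciders(explain: dict) -> dict:
    """Map node name -> names of the deciders that answered NO."""
    blocked = {}
    for node in explain.get("node_allocation_decisions", []):
        names = [
            d["decider"]
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
        if names:
            blocked[node["node_name"]] = names
    return blocked


sample = {
    "index": "elasticsearch-1",
    "primary": False,
    "can_allocate": "no",
    "node_allocation_decisions": [
        {
            "node_name": "master",
            "node_decision": "no",
            "deciders": [{"decider": "same_shard", "decision": "NO"}],
        }
    ],
}
print(blocking_deciders(sample))  # {'master': ['same_shard']}
```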
4. List all shards

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
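
The output of this request can be tallied by unassigned reason. The sketch below assumes the five columns requested in the h= parameter, whitespace-separated, with the reason column empty for started shards. The sample rows are invented for illustration and are not output from the author's cluster.

```python
from collections import Counter


def unassigned_reasons(text: str) -> Counter:
    """Count unassigned shards per unassigned.reason value."""
    reasons = Counter()
    for line in text.splitlines():
        parts = line.split()
        # Columns: index, shard, prirep, state, unassigned.reason
        if len(parts) >= 5 and parts[3] == "UNASSIGNED":
            reasons[parts[4]] += 1
    return reasons


# Hypothetical rows in the requested column order.
sample = """\
elasticsearch-1 0 p STARTED
elasticsearch-1 0 r STARTED
elasticsearch-1 0 r UNASSIGNED CLUSTER_RECOVERED
elasticsearch-1 1 p STARTED
elasticsearch-1 1 r UNASSIGNED CLUSTER_RECOVERED
elasticsearch-1 2 r UNASSIGNED CLUSTER_RECOVERED
"""
print(dict(unassigned_reasons(sample)))  # {'CLUSTER_RECOVERED': 3}
```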

5. Change the number of replicas for the index

PUT /elasticsearch-1/_settings
{
    "number_of_replicas": 2
}
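
Why 2 replicas works where 3 did not comes down to arithmetic: a node can hold at most one copy of a given shard, so each shard can have at most (data nodes - 1) replicas assigned. The sketch below applies only that rule and assumes nothing else blocks allocation (disk watermarks, allocation filtering, and so on); the function name is illustrative.

```python
def unassignable_replicas(data_nodes: int, primaries: int, replicas: int) -> int:
    """Replica shards that can never be assigned under the one-copy-per-node rule."""
    max_replicas_per_shard = max(data_nodes - 1, 0)
    excess = max(replicas - max_replicas_per_shard, 0)
    return primaries * excess


# 3 data nodes, 3 primary shards, as in the index above.
print(unassignable_replicas(3, 3, 3))  # 3
print(unassignable_replicas(3, 3, 2))  # 0
```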

6. Query again after the change

GET /_cluster/health?level=indices
  "unassigned_shards" : 0

III. Summary (Red, Yellow)
When the cluster turns Red or Yellow, troubleshoot level by level:

Cluster level: curl -s 172.31.30.28:9200/_cat/nodes or GET /_cluster/health
Index level: GET /_cluster/health?pretty&level=indices
Shard level: GET /_cluster/health?pretty&level=shards
Recovery progress: GET /_recovery?pretty

1. Troubleshooting unassigned shards:

Diagnose first: GET /_cluster/allocation/explain
Reallocate: POST /_cluster/reroute
If the shards still cannot be allocated, rebuild the index:
1.1. Create a backup index:
curl -XPUT 'http://xxxx:9200/a_index_copy/' -H 'Content-Type: application/json' -d '{ "settings": { "index": { "number_of_shards": 3, "number_of_replicas": 1 } } }'
1.2. Copy the data from a_index to a_index_copy with the reindex API:
POST _reindex { "source": { "index": "a_index" }, "dest": { "index": "a_index_copy", "op_type": "create" } }
1.3. Delete the a_index index. This must be done first, otherwise the alias cannot be added:
curl -XDELETE 'http://xxxx:9200/a_index'
1.4. Add the alias a_index to a_index_copy:
curl -XPOST 'http://xxxx:9200/_aliases' -H 'Content-Type: application/json' -d '{ "actions": [ {"add": {"index": "a_index_copy", "alias": "a_index"}} ] }'
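
The four rebuild steps can be written down as an ordered list of requests, which makes the order explicit and easy to review before anything is sent. The sketch below only builds the (method, path, body) tuples and does not contact a cluster; the function name and its parameters are hypothetical.

```python
import json


def rebuild_requests(index: str, copy: str, shards: int, replicas: int) -> list:
    """Return the rebuild steps as (method, path, body) tuples, in order."""
    return [
        # 1.1 create the backup index with the desired settings
        ("PUT", f"/{copy}", {"settings": {"index": {
            "number_of_shards": shards, "number_of_replicas": replicas}}}),
        # 1.2 copy the documents into the backup index
        ("POST", "/_reindex", {"source": {"index": index},
                               "dest": {"index": copy, "op_type": "create"}}),
        # 1.3 delete the original index so its name is free for the alias
        ("DELETE", f"/{index}", None),
        # 1.4 point the old name at the backup index
        ("POST", "/_aliases", {"actions": [
            {"add": {"index": copy, "alias": index}}]}),
    ]


steps = rebuild_requests("a_index", "a_index_copy", 3, 1)
for method, path, body in steps:
    print(method, path, json.dumps(body) if body is not None else "")
```

Because step 1.3 is destructive, it seems prudent to compare document counts of the two indices (for example with GET /a_index/_count and GET /a_index_copy/_count) before running it.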