Troubleshooting an Elasticsearch index whose status shows as red
Error: Unassigned Shards 4

1.1.1. Check the cluster status

GET /_cluster/health?pretty
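
All commands in this post are written in Kibana Dev Tools console form. The same call can be made from a shell with curl (a sketch assuming the cluster listens on localhost:9200; adjust the host as needed):

curl -s 'http://localhost:9200/_cluster/health?pretty'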

The result looks something like:

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 426,
  "active_shards" : 851,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

As the output shows, the cluster status is red and unassigned_shards is 4. The root cause is an index with unassigned shards.
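
To see which indices are driving the red status directly from the health API, the level parameter breaks the report down per index (a sketch; output omitted here):

GET /_cluster/health?level=indices&pretty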

1.1.2. Check the index status

GET /_cat/indices?v

The result looks something like:

health status index                               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test_test_operation_to              pyz9euqTQQ6GF0ulPsnX4g   3   1          1            0      9.2kb          4.6kb
green  open   cba1                                r5fvWeeAQ7uxQMRtNJxuwA   5   1          1            0     10.2kb          5.1kb
green  open   positiveinfo                        97r4mnToS1OVx04QVzF5Rw   3   1       3091            7    993.4kb        496.7kb
green  open   dc_rep_pub_issue_output_month       lLfxHpsZR8GPecqMLTvLsg   5   1     311845           87    163.3mb         81.5mb
       close  emplyee_test                        bsVDbqFWS0uYekpFI4Wnng                                                          
green  open   .monitoring-kibana-6-2021.05.27     cqlsx2crQyuc0WtSc_74zw   1   1       2711            0      1.8mb        959.1kb
green  open   filtertableinfo                     KGoc6kxqRtuZxPHG7Z6oXw   3   1         67            1      171kb         85.5kb
red    open   sg_house_rent_info_prod             fAVmV5aqTROVbHjqw0GRKg   5   1   60313716     16540955     19.7gb         10.2gb

Looking for rows whose health is red, we can pinpoint the problem index: sg_house_rent_info_prod
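
On a cluster with many indices, the _cat/indices API can also filter by health directly (a sketch):

GET /_cat/indices?v&health=red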

1.1.3. Check how many shards each node holds and how much disk space they use

The _cat/allocation API shows the number of shards allocated to each node and the disk space they use:

GET _cat/allocation?v

The result looks something like:

shards disk.indices disk.used disk.avail disk.total disk.percent host    ip            node
   284       88.2gb     218gb      2.7tb      2.9tb            7 hadoop1 xxx.xxx.xxx.xxx hadoop1
   284      104.7gb   248.9gb      2.7tb      2.9tb            8 hadoop3 xxx.xxx.xxx.xxx hadoop3
   283       96.6gb   234.6gb      2.8tb        3tb            7 hadoop2 xxx.xxx.xxx.xxx hadoop2
     4                                                                                     UNASSIGNED

We can see that 4 shards are in the UNASSIGNED state.

You can also check the cluster health with GET /_cat/health?v. When everything is healthy, the output looks like this:

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1622101651 07:47:31  elasticsearch green           3         3    851 426    0    0        0             0                  -                100.0%

1.1.4. How do we fix it?

First, pinpoint exactly which shards are unassigned. Note that grep is a shell tool, so this form only works with curl from the command line (adjust the host as needed), not inside the Kibana console:

curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
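
A rough equivalent inside the Kibana console is to sort by state with the standard _cat s parameter, so the UNASSIGNED rows group together:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state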

Then check the specific reason with:

GET _cluster/allocation/explain?pretty
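
Called without a body, this API explains the first unassigned shard it finds. When several shards are unassigned, a request body can target a specific one (a sketch using the index and shard number from this incident):

GET _cluster/allocation/explain
{
  "index": "sg_house_rent_info_prod",
  "shard": 2,
  "primary": true
}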

In my case, the result was:

{
  "index" : "sg_house_rent_info_prod",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2021-05-24T20:47:04.790Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [w__xIKWBT5KJZg1CEcmFGA]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[sg_house_rent_info_prod][2]: obtaining shard lock timed out after 5000ms]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "BOIgtPqgQSyIfAICLDuEfQ",
      "node_name" : "hadoop1",
      "transport_address" : "xxx.xxx.xxx.xxx:9300",
      "node_attributes" : {
        "ml.machine_memory" : "269924302848",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "AxRuU1gfS3yimqXUd7SoJw"
      }
    },
    {
      "node_id" : "eGv9Jjs_S8GcNLKzkCxzMA",
      "node_name" : "hadoop2",
      "transport_address" : "xxx.xxx.xxx.xxx:9300",
      "node_attributes" : {
        "ml.machine_memory" : "269924302848",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "wjR9jkfjQ-28OBKl_xFi1A",
        "store_exception" : {
          "type" : "file_not_found_exception",
          "reason" : "no segments* file found in SimpleFSDirectory@/home/admin/es/esdata/nodes/0/indices/fAVmV5aqTROVbHjqw0GRKg/2/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@473b450: files: [write.lock]"
        }
      }
    },
    {
      "node_id" : "w__xIKWBT5KJZg1CEcmFGA",
      "node_name" : "hadoop3",
      "transport_address" : "xxx.xxx.xxx.xxx:9300",
      "node_attributes" : {
        "ml.machine_memory" : "269924302848",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "PQWOPdxnQDqVfQbRLgh32A",
        "store_exception" : {
          "type" : "shard_lock_obtain_failed_exception",
          "reason" : "[sg_house_rent_info_prod][2]: obtaining shard lock timed out after 5000ms",
          "index_uuid" : "fAVmV5aqTROVbHjqw0GRKg",
          "shard" : "2",
          "index" : "sg_house_rent_info_prod"
        }
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-05-24T20:47:04.790Z], failed_attempts[5], delayed=false, details[failed shard on node [w__xIKWBT5KJZg1CEcmFGA]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[sg_house_rent_info_prod][2]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}

Searching online for "failed to create shard, failure IOException[failed to obtain in-memory shard lock]" shows that this usually happens when a new copy of a shard is being created on a node while the old copy on that node still holds its shard lock (for example after a node restart, a long GC pause, or a slow shard shutdown). The lock could not be obtained within 5000 ms, and after 5 failed attempts the allocator stopped retrying.

Solution:
Run the following command in Kibana:

POST /_cluster/reroute?retry_failed=true

retry_failed: (optional, Boolean) If true, retries the allocation of shards that are blocked because too many consecutive allocation attempts have failed.
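
After the reroute, re-run GET /_cluster/health?pretty; once the shards are assigned, the status should return to green. If the retry still fails and no usable in-sync copy remains, the reroute API can force-promote a stale copy as primary. A last-resort sketch, here targeting the stale copy that the explain output above showed on hadoop1; accept_data_loss acknowledges that any writes that copy missed are permanently lost, so use it with care:

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "sg_house_rent_info_prod",
        "shard": 2,
        "node": "hadoop1",
        "accept_data_loss": true
      }
    }
  ]
}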

Appendix: common reasons for ES shard allocation failures:

1) INDEX_CREATED: unassigned as a result of the index-creation API.
2) CLUSTER_RECOVERED: unassigned as a result of a full cluster recovery.
3) INDEX_REOPENED: unassigned as a result of opening or closing an index.
4) DANGLING_INDEX_IMPORTED: unassigned as a result of importing a dangling index.
5) NEW_INDEX_RESTORED: unassigned as a result of restoring a snapshot into a new index.
6) EXISTING_INDEX_RESTORED: unassigned as a result of restoring a snapshot into a closed index.
7) REPLICA_ADDED: unassigned as a result of explicitly adding a replica shard.
8) ALLOCATION_FAILED: unassigned because the shard allocation failed.
9) NODE_LEFT: unassigned because the node hosting the shard left the cluster.
10) REINITIALIZED: unassigned because the shard moved from started back to initializing (for example, with shadow replicas).
11) REROUTE_CANCELLED: allocation was cancelled by an explicit cancel-reroute command.
12) REALLOCATED_REPLICA: a better location for a replica was identified, so the existing replica allocation was cancelled, leaving it unassigned.

Another good blog post on this topic:
Elasticsearch 集群和索引健康状态及常见错误说明 (an explanation of cluster/index health states and common errors)

Tuning elasticsearch.yml and jvm.options for elasticsearch-6.7.1

# The following are tuning parameters

# Cache and buffer sizes, as a percentage of the JVM heap
indices.queries.cache.size: 15%
indices.fielddata.cache.size: 20%
indices.memory.index_buffer_size: 20%

# Zen fault detection: tolerate slower nodes before declaring them dead
discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 6
discovery.zen.fd.ping_interval: 30s

# Index thread pool: number of threads and queue length
thread_pool.index.size: 50
thread_pool.index.queue_size: 500

You might also consider adjusting the following settings; a way to check which pools actually need tuning is sketched after this list:

thread_pool.index.queue_size
thread_pool.get.queue_size
thread_pool.write.queue_size
thread_pool.bulk.queue_size
thread_pool.listener.queue_size
thread_pool.analyze.queue_size
thread_pool.search.queue_size
thread_pool.index.size
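
Before raising queue sizes, it helps to see which pools are actually rejecting work. A sketch using the _cat/thread_pool API; pools with a growing rejected count are the ones worth tuning:

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected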

jvm.options adjustments

# Set the initial and maximum heap to the same size to avoid resize pauses
-Xms6g
-Xmx6g
# Young-generation sizing
-XX:SurvivorRatio=4
-XX:NewRatio=2
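
After restarting with the new heap settings, you can confirm they took effect on each node (a sketch; heap.max should reflect the configured -Xmx):

GET /_cat/nodes?v&h=name,heap.percent,heap.max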