The IK Chinese Analyzer for Elasticsearch

神秘的渊虹 · Published 2021-03-06 14:53:53

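Overview

This article covers configuring the IK analyzer in Elasticsearch (ES) and how it behaves: downloading and installing it, making ES use it, how it differs from the default analyzer, customizing its dictionaries, and hot-reloading them.

Why word segmentation is needed is not covered here; see the Baidu Baike entry on Chinese word segmentation (中文分词) or search for it yourself.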

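1. Download and Installation

IK analyzer: https://github.com/medcl/elasticsearch-analysis-ik

IK plugin releases for ES: https://github.com/medcl/elasticsearch-analysis-ik/releases

After downloading, unzip the archive into a subdirectory of the plugins directory of your elasticsearch installation (for example plugins/ik). ES loads the plugin automatically the next time it starts.

⚠️ The IK version you download must be the same as your ES version.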
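The version requirement above can be checked before unzipping anything. The following is only a minimal sketch: it assumes release archives are named like `elasticsearch-analysis-ik-7.10.2.zip`, and both the file name and the ES version `7.10.2` are hypothetical examples rather than values from this article.

```python
import re


def plugin_version(zip_name):
    """Extract the version from a release archive name.

    Assumes names of the form 'elasticsearch-analysis-ik-<version>.zip'.
    """
    match = re.fullmatch(r"elasticsearch-analysis-ik-(\d+(?:\.\d+)*)\.zip", zip_name)
    if match is None:
        raise ValueError(f"unrecognized plugin file name: {zip_name}")
    return match.group(1)


def versions_match(zip_name, es_version):
    """Return True when the plugin archive targets exactly this ES version."""
    return plugin_version(zip_name) == es_version


# Hypothetical values, for illustration only.
print(versions_match("elasticsearch-analysis-ik-7.10.2.zip", "7.10.2"))
```

The ES version itself can be read from the `version.number` field returned by `GET /`.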

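2. Configuring ES to Use the IK Analyzer

ES uses an analyzer in two situations:
- At index time, text fields are analyzed and the resulting terms are written to the inverted index
- At query time, the input for a text field is analyzed first, and the resulting terms are then looked up in the inverted index
How ES chooses the analyzer:
- At index time (inserting a document), ES uses the analyzer defined on the field if there is one, otherwise the ES default
- At query time, ES uses the search_analyzer defined on the searched field if there is one; otherwise it falls back to the field's analyzer, and to the ES default (the standard analyzer) if neither is set

⚠️ analyzer and search_analyzer apply to individual fields; every field can have its own index analyzer and search analyzer.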
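The selection rules can be written down as a small function. This is a simplified model for illustration, not ES code: `choose_analyzer` is a hypothetical helper, and it ignores analyzers passed in the query itself and index-level default settings. It does model one documented ES behavior, namely that a field without search_analyzer is searched with its analyzer.

```python
def choose_analyzer(field_mapping, phase, default="standard"):
    """Pick the analyzer name for one field; phase is 'index' or 'search'."""
    if phase == "index":
        return field_mapping.get("analyzer", default)
    if phase == "search":
        # search_analyzer wins; otherwise reuse the index-time analyzer.
        fallback = field_mapping.get("analyzer", default)
        return field_mapping.get("search_analyzer", fallback)
    raise ValueError(f"unknown phase: {phase}")


content = {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart"}
print(choose_analyzer(content, "index"))             # ik_max_word
print(choose_analyzer(content, "search"))            # ik_smart
print(choose_analyzer({"type": "text"}, "search"))   # standard
```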

## Create the test_index index
PUT /test_index

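## Set the mapping of test_index
## Configure the content field so that
## documents are analyzed in ik_max_word mode when they are indexed, and
## search terms are analyzed in ik_smart mode when the field is searched

## Recommended practice for the two modes: ik_max_word at index time, ik_smart at search time.
## That is, split the content as exhaustively as possible when indexing, and segment the query more precisely when searching.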

POST /test_index/_mapping
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"ik_max_word",
            "search_analyzer":"ik_smart"
        }
    }
}

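3. Comparing the Results

We compare two indices, ik_test_index and default_test_index, to see directly what IK gains us when segmenting Chinese text.

Note: every ES request below can be run in Kibana; convert them to curl yourself if you need to.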
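For readers who prefer curl, the conversion is mechanical. The helper below is a sketch, not part of any ES tooling: `to_curl` is a hypothetical name, and the host `http://localhost:9200` is an assumption about where ES is listening.

```python
import json
import shlex


def to_curl(method, path, body=None, host="http://localhost:9200"):
    """Render a Kibana console request as an equivalent curl command line."""
    parts = ["curl", "-X", method, shlex.quote(host + path)]
    if body is not None:
        parts += [
            "-H", shlex.quote("Content-Type: application/json"),
            # ensure_ascii=False keeps Chinese text readable in the command.
            "-d", shlex.quote(json.dumps(body, ensure_ascii=False)),
        ]
    return " ".join(parts)


mapping = {
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart",
        }
    }
}
print(to_curl("PUT", "/ik_test_index"))
print(to_curl("POST", "/ik_test_index/_mapping", mapping))
```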

First, create both indices, and configure the content field of ik_test_index to use the IK analyzer:

PUT /ik_test_index
PUT /default_test_index

POST /ik_test_index/_mapping
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"ik_max_word",
            "search_analyzer":"ik_smart"
        }
    }
}

Load the same data into both indices:

## Populate ik_test_index
POST /ik_test_index/_create/1
{
  "content":"美国留给伊拉克的是个烂摊子吗"
}


POST /ik_test_index/_create/2
{"content":"公安部：各地校车将享最高路权"}


POST /ik_test_index/_create/3
{
  "content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
}


POST /ik_test_index/_create/4
{
  "content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}



## Populate default_test_index
POST /default_test_index/_create/1
{
  "content":"美国留给伊拉克的是个烂摊子吗"
}


POST /default_test_index/_create/2
{"content":"公安部：各地校车将享最高路权"}


POST /default_test_index/_create/3
{
  "content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
}


POST /default_test_index/_create/4
{
  "content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
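The eight requests above can also be sent in one round trip through the ES `_bulk` API, which takes newline-delimited JSON: an action line followed by a document line, with a trailing newline at the end. The sketch below only builds that request body; `bulk_body` is a hypothetical helper, and sending the result (for example with `POST /_bulk` and the `Content-Type: application/x-ndjson` header) is left to the reader.

```python
import json

DOCS = {
    1: "美国留给伊拉克的是个烂摊子吗",
    2: "公安部：各地校车将享最高路权",
    3: "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
    4: "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首",
}


def bulk_body(indices, docs):
    """Build an NDJSON body that creates every document in every index."""
    lines = []
    for index in indices:
        for doc_id, content in docs.items():
            # Action line: same effect as POST /<index>/_create/<id>.
            lines.append(json.dumps({"create": {"_index": index, "_id": str(doc_id)}}))
            # Source line: the document itself.
            lines.append(json.dumps({"content": content}, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


print(bulk_body(["ik_test_index", "default_test_index"], DOCS))
```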

Search default_test_index:

POST /default_test_index/_search
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<font color='#21ABA0'>"],
        "post_tags" : ["</font>"],
        "fields" : {
            "content" : {}
        }
    }
}

## Search results ##
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.2257079,
    "hits" : [
      {
        "_index" : "default_test_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.2257079,
        "_source" : {
          "content" : "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
        },
        "highlight" : {
          "content" : [
            "<font color='#21ABA0'>中</font>韩渔警冲突调查：韩警平均每天扣1艘<font color='#21ABA0'>中</font><font color='#21ABA0'>国</font>渔船"
          ]
        }
      },
      {
        "_index" : "default_test_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.9640833,
        "_source" : {
          "content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight" : {
          "content" : [
            "<font color='#21ABA0'>中</font><font color='#21ABA0'>国</font>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      },
      {
        "_index" : "default_test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.3864615,
        "_source" : {
          "content" : "美国留给伊拉克的是个烂摊子吗"
        },
        "highlight" : {
          "content" : [
            "美<font color='#21ABA0'>国</font>留给伊拉克的是个烂摊子吗"
          ]
        }
      }
    ]
  }
}

Search ik_test_index:

POST /ik_test_index/_search
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<font color='#21ABA0'>"],
        "post_tags" : ["</font>"],
        "fields" : {
            "content" : {}
        }
    }
}

## Search results ##
{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.642793,
    "hits" : [
      {
        "_index" : "ik_test_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.642793,
        "_source" : {
          "content" : "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
        },
        "highlight" : {
          "content" : [
            "中韩渔警冲突调查：韩警平均每天扣1艘<font color='#21ABA0'>中国</font>渔船"
          ]
        }
      },
      {
        "_index" : "ik_test_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.642793,
        "_source" : {
          "content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight" : {
          "content" : [
            "<font color='#21ABA0'>中国</font>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      }
    ]
  }
}

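In summary, the content field analyzed with IK matches "中国" precisely. With the default analyzer, the field is split into the separate tokens "中" and "国", so the query effectively searches for each character on its own. The last hit it returns ("美国留给伊拉克的是个烂摊子吗") matches only on "国" and has nothing to do with China, which is not what we wanted. IK clearly segments Chinese text better and gives the results we expect.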
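The difference can be reproduced without ES using a toy inverted index. This is a sketch under loud assumptions: `per_character` only approximates what the default analyzer does to Chinese text, and `dictionary_based` uses plain forward maximum matching over a tiny hypothetical word list, which is not IK's actual algorithm or dictionary.

```python
from collections import defaultdict

DOCS = {
    1: "美国留给伊拉克的是个烂摊子吗",
    2: "公安部：各地校车将享最高路权",
    3: "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船",
    4: "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首",
}

# Hypothetical mini dictionary; a real analyzer ships a far larger one.
WORDS = {"中国", "美国", "伊拉克", "公安部", "校车", "渔船", "领事馆", "洛杉矶"}


def per_character(text):
    """One token per character, skipping punctuation and spaces."""
    return [ch for ch in text if ch.isalnum()]


def dictionary_based(text, words=WORDS, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each
    position and fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                break
        if piece.isalnum():
            tokens.append(piece)
        i += size
    return tokens


def search(query, tokenize):
    """Index DOCS with tokenize, then return ids matching any query token."""
    index = defaultdict(set)
    for doc_id, text in DOCS.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return sorted(hits)


print(search("中国", per_character))     # [1, 3, 4]
print(search("中国", dictionary_based))  # [3, 4]
```

Document 1 is returned by the per-character index only because it shares the character "国" with the query, which mirrors the unexpected third hit in the default_test_index results.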

Let us look at how the two analyzers segment the same text:

## Analyze with IK; "analyzer" selects IK's ik_max_word mode
POST /ik_test_index/_analyze
{
  "text": "中华人民共和国",
  "analyzer": "ik_max_word"
}

## Analysis result ##
{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "华人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民共和国",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共和国",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    }
  ]
}

As we can see, IK extracts the common or meaningful words and phrases and leaves out most meaningless fragments and single characters.

Now the default analyzer:

POST /ik_test_index/_analyze
{
  "text": "中华人民共和国"
}

## Analysis result ##
{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "华",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "民",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "共",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "和",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}

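The default analyzer splits the text into single characters with no words or phrases at all, which makes it hard to get the expected results when searching Chinese.

Imagine searching for "中国" with the default analyzer: the query is split into "中" and "国", and most of what ES returns falls outside what you wanted. Likewise, when a document is indexed with the default analyzer, the inverted index built from single characters is of little help.

For these reasons, we recommend using the IK analyzer for both the query text and the indexed data when searching Chinese text in ES.

⚠️ Search terms and indexed data should be segmented by the same analyzer (here, IK in both cases) so that the segmentation strategy stays consistent.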

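4. Custom Dictionaries for IK

IK's dictionary can be extended. Although the bundled dictionary already contains many words, newly coined phrases or business-specific terms still have to be added by hand. IK currently supports two kinds of dictionary extension:

- ext_dict
- ext_stopwords

ext_dict lets us add specific phrases so that IK always emits them as tokens. For example, "中华人民共和国" is split into nine tokens: "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民", "共和国", "共和" and "国". If we add "中华人" to the extension dictionary, IK's output contains the additional token "中华人".

ext_stopwords suppresses words. In the same example, if we add "华人" to ext_stopwords, the token "华人" no longer appears in IK's output.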
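The effect of the two dictionaries can be illustrated with a toy matcher that lists every dictionary word occurring in the text. This is only a sketch: `BASE_WORDS` is a hypothetical stand-in for IK's bundled dictionary, the exhaustive substring scan is not IK's real algorithm, and unlike the _analyze output above it does not emit the leftover single character "国".

```python
# Hypothetical stand-in for the bundled dictionary.
BASE_WORDS = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和",
}


def all_dictionary_words(text, ext_dict=(), ext_stopwords=()):
    """List every known word in text, by start offset, longest first."""
    # Extension words are added, then stop words are removed.
    words = (BASE_WORDS | set(ext_dict)) - set(ext_stopwords)
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in words:
                found.append(text[start:end])
    return found


text = "中华人民共和国"
print(all_dictionary_words(text))
print(all_dictionary_words(text, ext_dict=["中华人"]))       # gains "中华人"
print(all_dictionary_words(text, ext_stopwords=["华人"]))    # loses "华人"
```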

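5. Configuring the Custom Dictionaries

Both ext_dict and ext_stopwords can be configured in two ways:

- Local file (ES must be restarted for new words to take effect)
- Remote file (new words take effect dynamically without restarting ES, provided certain conditions are met). Recommended.

Configuring a local ext_dict:

In the IK plugin directory, config/IKAnalyzer.cfg.xml holds the configuration. Add an <entry key="ext_dict"></entry> element (and, for stop words, <entry key="ext_stopwords"></entry>) under the <properties> node: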

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict">/opt/ik_ext/my_ext.dic</entry>
    <!-- Configure your own extension stop-word dictionary here -->
    <entry key="ext_stopwords">/opt/ik_ext/my_ext_stopwords.dic</entry>
</properties>
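The files referenced by ext_dict and ext_stopwords are plain word lists. The sketch below writes one, assuming the local format is the same as the remote one described in this article (one word per line) and that the file should be UTF-8 encoded; `write_dictionary` is a hypothetical helper and the path is the one used in the configuration above.

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write words one per line as UTF-8, dropping blanks and duplicates."""
    seen, lines = set(), []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            lines.append(word)
    # write_bytes avoids platform-specific newline translation.
    Path(path).write_bytes(("\n".join(lines) + "\n").encode("utf-8"))
    return lines


# Example call (commented out because /opt/ik_ext may not exist on your machine):
# write_dictionary("/opt/ik_ext/my_ext.dic", ["中华人"])
```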

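Configuring a remote ext_dict:

Set up a remote file server and let the plugin fetch the extension dictionary file from it.

⚠️ For this to take effect, the following conditions must hold:

- The HTTP response must include two headers, Last-Modified and ETag. Both are strings; whenever either one changes, the plugin fetches the new word list and updates its dictionary. Most HTTP file servers already satisfy this condition.
- The response body must contain one word per line, separated by \n.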

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">http://127.0.0.1:8888/my_ext.dic</entry>
    <!-- Configure a remote extension stop-word dictionary here -->
    <entry key="remote_ext_stopwords">http://127.0.0.1:8888/my_ext_stopwords.dic</entry>
</properties>
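Any HTTP server that meets the two conditions will do. As a minimal sketch using only the Python standard library, the handler below serves a single dictionary file with both headers. The file name `my_ext.dic` and port 8888 mirror the example URL above; the MD5-based ETag is an arbitrary choice, and answering HEAD as well as GET is an assumption about how the plugin polls.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

DICT_PATH = Path("my_ext.dic")  # hypothetical location of the word list


def dictionary_headers(path):
    """Return the file content plus the headers the plugin checks."""
    data = path.read_bytes()
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(data)),
        # Changes whenever the file is modified.
        "Last-Modified": formatdate(path.stat().st_mtime, usegmt=True),
        # Changes whenever the content changes.
        "ETag": '"' + hashlib.md5(data).hexdigest() + '"',
    }
    return data, headers


class DictHandler(BaseHTTPRequestHandler):
    def _respond(self, with_body):
        data, headers = dictionary_headers(DICT_PATH)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if with_body:
            self.wfile.write(data)

    def do_HEAD(self):
        self._respond(with_body=False)

    def do_GET(self):
        self._respond(with_body=True)


def serve(port=8888):
    """Serve the dictionary until interrupted."""
    HTTPServer(("127.0.0.1", port), DictHandler).serve_forever()
```

Calling `serve()` starts the server; editing the file then changes both headers, which is what signals the plugin to reload.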

That concludes this introduction to the IK analyzer.
