Tokenizing numeric strings with a custom Elasticsearch analyzer

Handling numeric values with a custom ES analyzer

白衣神棍 · published 2022-08-03 10:18:08

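Background: a single search box where users can type a product name, product code, pinyin spelling, abbreviation, and so on.

Problem: a product code such as 00034 is still the single token 00034 after analysis, so searching for 00 returns nothing.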

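My first approach was a multi_match over product name, abbreviation, and the other name-like fields, plus a prefix match on product code, combined in a bool should. One catch: the prefix-style match (match_phrase_prefix) expands the prefix to at most 50 terms by default, so codes beyond that are missed. The limit can be raised with max_expansions.

I don't recommend raising it, though. A prefix lookup has to enumerate every indexed term that starts with the prefix, so a short prefix that matches a large number of terms gets expensive.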
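For reference, a minimal sketch of what approach one could look like as a request body, built in Python. The field names come from the mapping in this post. The function name build_search_body is hypothetical, and using match_phrase_prefix for the product code clause is an assumption (it is the query type that accepts max_expansions), not necessarily the exact query I started with.

```python
import json


def build_search_body(text, max_expansions=50):
    """Approach one: name-like fields via multi_match, product code via a prefix-style match."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Analyzed match over the name-like fields.
                    {"multi_match": {
                        "query": text,
                        "fields": ["fullName", "liteName", "spell", "firstSpellLetter"],
                    }},
                    # Prefix-style match on the code; expands to at most
                    # max_expansions indexed terms.
                    {"match_phrase_prefix": {
                        "productCode": {"query": text, "max_expansions": max_expansions},
                    }},
                ],
                # At least one of the two clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("00"), indent=2))
```

Raising max_expansions only changes the number in the second clause, which is why it is tempting, and why the cost is easy to overlook.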

So the real fix is in the analysis step: to make product codes tokenize the way I want, I need a custom analyzer.

{
    "settings": {
        "index": {
            "number_of_shards": "1",
            "number_of_replicas": "0"
        },
        "index.max_ngram_diff": 6,
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "tokenizer": "ngram_tokenizer"
                }
            },
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 6,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "dataProductId": {
                "type": "keyword"
            },
            "endedDate": {
                "type": "date"
            },
            "endedFlag": {
                "type": "keyword"
            },
            "firstSpellLetter": {
                "type": "text",
                "analyzer": "ik_smart"
            },
            "foundDate": {
                "type": "date"
            },
            "fullName": {
                "type": "text",
                "analyzer": "ik_smart"
            },
            "lastModifyTime": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "liteName": {
                "type": "text",
                "analyzer": "ik_smart"
            },
            "offlineDate": {
                "type": "date"
            },
            "onlineDate": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "onlineStatus": {
                "type": "keyword"
            },
            "productCode": {
                "type": "text",
                "analyzer": "ngram_analyzer"
            },
            "productId": {
                "type": "keyword"
            },
            "productType": {
                "type": "keyword"
            },
            "rowHash": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "salesSystemCreateTime": {
                "type": "date"
            },
            "salesSystemFlag": {
                "type": "keyword"
            },
            "spell": {
                "type": "text",
                "analyzer": "ik_smart"
            }
        }
    }
}
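To see why indexing prefixes turns a prefix search into a plain term lookup, here is a small pure-Python imitation of what ends up in the inverted index for productCode. It is only a sketch: the sample codes are made up, edge_ngrams and build_index are hypothetical helpers, and the real tokenizer also splits on characters outside token_chars, which this ignores.

```python
from collections import defaultdict


def edge_ngrams(text, min_gram=1, max_gram=6):
    """Prefixes of `text` from min_gram to max_gram characters, like edge_ngram."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def build_index(codes):
    """Map every edge n-gram to the set of product codes that produced it."""
    index = defaultdict(set)
    for code in codes:
        for token in edge_ngrams(code):
            index[token].add(code)
    return index


codes = ["00034", "00120", "01200", "10003"]  # made-up sample codes
index = build_index(codes)

# An exact term lookup now behaves like a prefix search.
print(sorted(index.get("00", set())))  # codes starting with "00"
print(sorted(index.get("34", set())))  # "34" is not a prefix of any code
```

Two things to keep in mind. With max_gram set to 6, only the first 6 characters of a longer code are indexed as prefixes. And the lookup above treats the query as a single term: if the query text is itself run through ngram_analyzer, 00 becomes 0 and 00, and a match query with the default OR operator would then also hit every code starting with 0. Setting a separate search_analyzer on the field is one common way to avoid that.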

Here I use edge_ngram. 00034 is tokenized as:

0 00 000 0003 00034

With ngram instead (min_gram=1, max_gram=3), the tokens would be

0 00 000 0 00 003 0 03 034 3 34 4

My requirement is left (prefix) matching only, so edge_ngram is the better fit.
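The difference between the two tokenizers can be reproduced with a few lines of Python. This is an imitation for illustration, not Elasticsearch's implementation: edge_ngram_tokens and ngram_tokens are hypothetical helpers, and the input is assumed to contain only letters and digits, so the token_chars splitting never kicks in.

```python
def edge_ngram_tokens(text, min_gram=1, max_gram=6):
    """Only grams anchored at the start of the text."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def ngram_tokens(text, min_gram=1, max_gram=3):
    """Grams from every start position, shortest first at each position."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


print(" ".join(edge_ngram_tokens("00034")))
print(" ".join(ngram_tokens("00034")))
```

For this input edge_ngram yields 5 tokens and ngram yields 12. The extra ngram tokens only help with matches in the middle of the code, which I don't need, and they make the index larger.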

Appendix:

Inspect the terms actually indexed for one field of a specific document

GET <index>/<type>/<id>/_termvectors?fields=<field>
GET product_info_search/_doc/1011241001/_termvectors?fields=productCode

Inspect the output of a given analyzer

Against a specific index

http://10.105.100.70:9200/product_info_search/_analyze

Globally (no index)

http://10.105.100.70:9200/_analyze
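The two URLs above need a JSON body to be useful. A minimal sketch in Python, assuming the cluster is reachable without authentication; build_analyze_request and send are hypothetical helper names, and send is defined but not called here so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer=None, index=None, tokenizer=None):
    """Return (url, body) for an _analyze call; nothing is sent here."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    if tokenizer is not None:
        body["tokenizer"] = tokenizer
    return base_url.rstrip("/") + path, body


def send(url, body):
    """POST the JSON body and return the token strings from the response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


# Index-level: can refer to the custom analyzer defined in that index's settings.
url, body = build_analyze_request(
    "http://10.105.100.70:9200", "00034",
    analyzer="ngram_analyzer", index="product_info_search")
print(url, json.dumps(body))

# Global: no index settings are available, so pass a built-in analyzer
# or an inline tokenizer definition instead.
url, body = build_analyze_request(
    "http://10.105.100.70:9200", "00034",
    tokenizer={"type": "edge_ngram", "min_gram": 1, "max_gram": 6,
               "token_chars": ["letter", "digit"]})
print(url, json.dumps(body))
```

Calling send(url, body) against a live cluster should return the token list, which makes it easy to try tokenizer settings before creating the index.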

