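[ES in Practice] Pagination and Deduplication in Elasticsearch

An overview of how to implement pagination and deduplication in Elasticsearch (ES).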

顧棟


顧棟 · Published 2022-09-17 16:17:47

Pagination and deduplication in Search


Simple pagination

`from + size`

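Every request re-sorts the result set up to from + size, so this approach is only suitable for small result sets: from + size must stay within 10,000 (the default value of index.max_result_window). Deep pages are memory-hungry. At 10 hits per page, that limit corresponds to a depth of 1,000 pages. Capping the paging depth is meant to cap the length of the priority queue on the coordinating node.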

GET  /sales/_search
{
    "from" : 0, "size" : 2,
    "query" : {
        "term" : { "type" : "hat" }
    }
}

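The problem

Suppose we search an index with 5 primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the coordinating node, which sorts all 50 results to select the overall top 10.

Now suppose we request page 100 (results 991 to 1000). Everything works the same way, except that each shard has to produce its top 1,000 results. The coordinating node then sorts all 5,000 results and discards 4,990 of them.

As you can see, in a distributed system the cost of sorting results grows with paging depth: each shard must produce from + size results, and the coordinating node must sort number_of_shards × (from + size) of them. This is why web search engines do not return more than 10,000 results for any query.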
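The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name paging_cost and its return keys are made up for illustration, and it counts hits rather than measuring real memory or CPU usage.

```python
def paging_cost(num_shards: int, from_: int, size: int) -> dict:
    """Count the hits handled by a `from + size` request.

    Each shard builds its own top (from + size) list; the coordinating
    node merges all of those lists and keeps only `size` hits.
    """
    per_shard = from_ + size
    merged = num_shards * per_shard
    return {"per_shard": per_shard, "merged": merged, "discarded": merged - size}


if __name__ == "__main__":
    # First page vs. a deep page, on an index with 5 primary shards.
    print(paging_cost(5, from_=0, size=10))
    print(paging_cost(5, from_=990, size=10))
```

The second call corresponds to the deep request discussed above (from = 990, size = 10): 1,000 hits per shard, 5,000 merged on the coordinating node, 4,990 thrown away.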

Deep pagination

`search after`

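Pagination always goes together with sorting, so the cursor should include a field that is unique per document as a tiebreaker. If documents share the same sort values, results may be lost or duplicated across pages (inaccurate pagination). search_after looks for the first document that fully or partially matches the values provided, tiebreaker included.

search_after is not a solution for jumping freely to a random page, but rather for scrolling many queries in parallel. It is very similar to the scroll API, but unlike it, the search_after parameter is stateless: it is always resolved against the latest version of the searcher. For this reason, the sort order may change during a walk, depending on the updates and deletes made to the index.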

❗️ search_after requirements

The sort fields must have doc_values enabled.
from must be 0 or -1.

GET  /sales/_search
{
    "size": 2,
    "query": {
        "match" : {
            "type" : "hat"
        }
    },
    "search_after":[1422748800000],
    "sort": [
        {"date": "asc"}
    ]
}
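To see why the tiebreaker matters, the cursor logic can be simulated in memory. The sketch below is not Elasticsearch code: DOCS, sort_values, search_after_page, and paginate are hypothetical names, and the "index" is a plain Python list in which three documents share the same date.

```python
from typing import Optional

# Hypothetical in-memory "index": several documents share the same date.
DOCS = [
    {"id": "a", "date": 1420070400000},
    {"id": "b", "date": 1420070400000},
    {"id": "c", "date": 1420070400000},
    {"id": "d", "date": 1422748800000},
    {"id": "e", "date": 1425168000000},
]


def sort_values(doc: dict) -> tuple:
    # Sort by date, then by the unique id as a tiebreaker.
    return (doc["date"], doc["id"])


def search_after_page(docs: list, size: int, after: Optional[tuple] = None) -> list:
    """Return the next `size` docs whose sort values come strictly after `after`."""
    ordered = sorted(docs, key=sort_values)
    if after is not None:
        ordered = [d for d in ordered if sort_values(d) > after]
    return ordered[:size]


def paginate(docs: list, size: int) -> list:
    """Walk every page, feeding the last hit's sort values into the next request."""
    seen, after = [], None
    while True:
        page = search_after_page(docs, size, after)
        if not page:
            return seen
        seen.extend(d["id"] for d in page)
        after = sort_values(page[-1])


if __name__ == "__main__":
    print(paginate(DOCS, size=2))
```

With the (date, id) cursor every document is visited exactly once. If the cursor held only the date, the second page of this simulation would start after 1420070400000 and document c would be skipped, which is the kind of loss described above.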

`scroll`

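Scrolling is used to process large amounts of data. Scroll contexts are costly, so scroll is not recommended for real-time user requests. A scroll works on a snapshot taken when the initial search request is made, so data written afterwards is not visible to it.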

POST /sales/_search?scroll=1m
{
    "size": 100,
    "query": {
        "match" : {
            "type" : "elasticsearch"
        }
    }
}

The result of the request above includes a _scroll_id, which should be passed to the scroll API to retrieve the next batch of results.

POST /_search/scroll 
{
    "scroll" : "1m", 
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}

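The search context is removed automatically once the scroll timeout has been exceeded. However, keeping scrolls open has a cost, as noted above, so a scroll should be explicitly cleared with the clear-scroll API as soon as it is no longer in use: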

DELETE /_search/scroll
{
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
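The three requests above (open the scroll, fetch batches, clear the scroll) are usually wrapped in one loop. The following is a minimal sketch under stated assumptions: send(method, path, body) is a hypothetical callable that performs the HTTP request and returns the decoded JSON response, and scroll_all is an illustrative name rather than a client library function. Only the endpoints and fields shown above are used.

```python
from typing import Callable, Iterator

# `send(method, path, body)` is a hypothetical transport callable that
# performs the HTTP request and returns the decoded JSON response.
Send = Callable[[str, str, dict], dict]


def scroll_all(send: Send, index: str, query: dict,
               keep_alive: str = "1m", size: int = 100) -> Iterator[dict]:
    """Yield every hit of a scrolled search, then clear the scroll context."""
    resp = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": size, "query": query})
    scroll_id = resp.get("_scroll_id")
    try:
        while resp["hits"]["hits"]:
            yield from resp["hits"]["hits"]
            # Each call renews the keep-alive and may return a new scroll id.
            resp = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Free the search context even if the caller stops early.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Putting the DELETE in a finally clause means the context is released both when the results are exhausted and when the consumer closes the generator before reaching the end.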

For detailed usage, see the article on scroll queries in Search (Scroll).

Field collapsing (deduplication)

`Field Collapsing`

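Allows search results to be collapsed based on field values. Collapsing is done by selecting only the top sorted document per collapse key. For example, the query below retrieves the best tweet for each user and sorts them by number of likes.

❗️ Requirements for the collapse field:

Its type must be [keyword] or [numeric].
[doc_values] must be enabled.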

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user" 
    },
    "sort": ["likes"], 
    "from": 10 
}

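collapse: collapses the result set using the "user" field.
sort: sorts the top documents by number of likes.
from: defines the offset of the first collapsed result.

⚠️ The total number of hits in the response indicates the number of matching documents without collapsing. The total number of distinct groups is unknown.

The example below queries sales, groups by type, and sorts by price in descending order to get the highest-priced item of each group.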

GET /sales/_search
{
    "collapse" : {
        "field" : "type" 
    },
    "sort": [{"price":{"order" : "desc"}}]
}
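The effect of this request can be mimicked in plain Python to make the "top sorted document per key" rule concrete. This is an in-memory sketch, not how Elasticsearch implements collapsing: collapse is a hypothetical helper, the data is the article's seven sales test documents reduced to three fields, and the order of groups whose top prices tie is an artifact of this simulation.

```python
# The seven test documents of the `sales` index: (date, price, type).
SALES = [
    ("2015/01/01", 200, "hat"),
    ("2015/01/01", 200, "t-shirt"),
    ("2015/01/01", 150, "bag"),
    ("2015/02/01", 50, "hat"),
    ("2015/02/01", 10, "t-shirt"),
    ("2015/03/01", 200, "hat"),
    ("2015/03/01", 175, "t-shirt"),
]
DOCS = [{"date": d, "price": p, "type": t} for d, p, t in SALES]


def collapse(docs, field, sort_field, descending=True):
    """Sort the hits, then keep only the first hit seen for each value of `field`."""
    ordered = sorted(docs, key=lambda d: d[sort_field], reverse=descending)
    seen, result = set(), []
    for doc in ordered:
        if doc[field] not in seen:
            seen.add(doc[field])
            result.append(doc)
    return result


if __name__ == "__main__":
    for hit in collapse(DOCS, field="type", sort_field="price"):
        print(hit["type"], hit["price"])
```

Seven matching documents collapse into three hits, one per type, each carrying the highest price of its group.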

Expand collapse results

The top hit of each collapsed group can be expanded with the inner_hits option.

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user", 
        "inner_hits": {
            "name": "last_tweets", 
            "size": 5, 
            "sort": [{ "date": "asc" }] 
        },
        "max_concurrent_group_searches": 4 
    },
    "sort": ["likes"]
}

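collapse.field: collapses the result set using the "user" field.
collapse.inner_hits.name: the name used for the inner hit section in the response.
collapse.inner_hits.size: the number of inner_hits to retrieve per collapse key.
collapse.inner_hits.sort: how to sort the documents inside each group.
collapse.max_concurrent_group_searches: the number of concurrent requests allowed to retrieve the inner_hits per group.

The example below queries sales, groups by type, and sorts by price in descending order to get the highest-priced item of each group. It also uses inner_hits to show the top 5 documents of each group.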

GET /sales/_search
{
    "collapse" : {
        "field" : "type",
        "inner_hits": {
            "name": "XXX", 
            "size": 5, 
            "sort": [{ "price": "desc" }] 
        }
    },
    "sort": [{"price":{"order" : "desc"}}]
}

Possible response:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 7,
        "max_score": null,
        "hits": [
            {
                "_index": "sales",
                "_type": "_doc",
                "_id": "N1wXRYMBCLf16WD0DCm6",
                "_score": null,
                "_source": {
                    "date": "2015/01/01 00:00:00",
                    "price": 200,
                    "promoted": true,
                    "rating": 1,
                    "type": "hat"
                },
                "fields": {
                    "type": [
                        "hat"
                    ]
                },
                "sort": [
                    200
                ],
                "inner_hits": {
                    "XXX": {
                        "hits": {
                            "total": 3,
                            "max_score": null,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "N1wXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "hat"
                                    },
                                    "sort": [
                                        200
                                    ]
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "PFwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/03/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "hat"
                                    },
                                    "sort": [
                                        200
                                    ]
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OlwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/02/01 00:00:00",
                                        "price": 50,
                                        "promoted": false,
                                        "rating": 1,
                                        "type": "hat"
                                    },
                                    "sort": [
                                        50
                                    ]
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "sales",
                "_type": "_doc",
                "_id": "OFwXRYMBCLf16WD0DCm6",
                "_score": null,
                "_source": {
                    "date": "2015/01/01 00:00:00",
                    "price": 200,
                    "promoted": true,
                    "rating": 1,
                    "type": "t-shirt"
                },
                "fields": {
                    "type": [
                        "t-shirt"
                    ]
                },
                "sort": [
                    200
                ],
                "inner_hits": {
                    "XXX": {
                        "hits": {
                            "total": 3,
                            "max_score": null,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OFwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "t-shirt"
                                    },
                                    "sort": [
                                        200
                                    ]
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "PVwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/03/01 00:00:00",
                                        "price": 175,
                                        "promoted": false,
                                        "rating": 2,
                                        "type": "t-shirt"
                                    },
                                    "sort": [
                                        175
                                    ]
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "O1wXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/02/01 00:00:00",
                                        "price": 10,
                                        "promoted": true,
                                        "rating": 4,
                                        "type": "t-shirt"
                                    },
                                    "sort": [
                                        10
                                    ]
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "sales",
                "_type": "_doc",
                "_id": "OVwXRYMBCLf16WD0DCm6",
                "_score": null,
                "_source": {
                    "date": "2015/01/01 00:00:00",
                    "price": 150,
                    "promoted": true,
                    "rating": 5,
                    "type": "bag"
                },
                "fields": {
                    "type": [
                        "bag"
                    ]
                },
                "sort": [
                    150
                ],
                "inner_hits": {
                    "XXX": {
                        "hits": {
                            "total": 1,
                            "max_score": null,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OVwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 150,
                                        "promoted": true,
                                        "rating": 5,
                                        "type": "bag"
                                    },
                                    "sort": [
                                        150
                                    ]
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

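Second level of collapsing (second-level deduplication)

A second level of collapsing is also supported and is applied to inner_hits. For example, the following request finds the top-scored tweet for each country, and within each country finds the top-scored tweet for each user.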

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "country",
        "inner_hits" : {
            "name": "by_location",
            "collapse" : {"field" : "user"},
            "size": 3
        }
    }
}

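collapse.field: collapses the result set using the "country" field (first-level collapse field).
collapse.inner_hits.name: the name used for the inner hit section in the response.
collapse.inner_hits.size: the number of inner_hits to retrieve per collapse key.
collapse.inner_hits.collapse: collapses the inner hits using the "user" field (second-level collapse field).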

`Top Hits Aggregation`

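The top_hits metric aggregator keeps track of the most relevant documents being aggregated. It is intended to be used as a sub-aggregator, so that the top matching documents can be aggregated per bucket.

The top_hits aggregator can effectively be used to group result sets by certain fields via a bucket aggregator. One or more bucket aggregators determine by which properties the result set is sliced.

Options

from - the offset from the first result to fetch.
size - the maximum number of top matching hits to return per bucket. By default, the top three matching hits are returned.
sort - how the top matching hits should be sorted. By default, hits are sorted by the score of the main query.

In the example below, we group the sales by type and show the last sale per type. For each sale, only the date and price fields are included in the source.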

POST /sales/_search?size=0
{
    "aggs": {
        "top_tags": {
            "terms": {
                "field": "type",
                "size": 3
            },
            "aggs": {
                "top_sales_hits": {
                    "top_hits": {
                        "sort": [
                            {
                                "date": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "_source": {
                            "includes": [ "date", "price" ]
                        },
                        "size" : 1
                    }
                }
            }
        }
    }
}

Possible response:

{
  ...
  "aggregations": {
    "top_tags": {
       "doc_count_error_upper_bound": 0,
       "sum_other_doc_count": 0,
       "buckets": [
          {
             "key": "hat",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total": 3,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmauCQpcRyxw6ChK",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 200
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "t-shirt",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total": 3,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmauCQpcRyxw6ChL",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 175
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "bag",
             "doc_count": 1,
             "top_sales_hits": {
                "hits": {
                   "total": 1,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmatCQpcRyxw6ChH",
                         "_source": {
                            "date": "2015/01/01 00:00:00",
                            "price": 150
                         },
                         "sort": [
                            1420070400000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          }
       ]
    }
  }
}
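As a cross-check of the response above, the same grouping can be sketched in memory: bucket the documents by type, then keep the most recent sale of each bucket with only the requested source fields. The helper top_hits_by_bucket and its parameters are hypothetical, the data is the article's sales test set, and ordering equal-sized buckets by key is an assumption of this sketch.

```python
from collections import defaultdict

# The seven test documents of the `sales` index: (date, price, type).
SALES = [
    ("2015/01/01", 200, "hat"),
    ("2015/01/01", 200, "t-shirt"),
    ("2015/01/01", 150, "bag"),
    ("2015/02/01", 50, "hat"),
    ("2015/02/01", 10, "t-shirt"),
    ("2015/03/01", 200, "hat"),
    ("2015/03/01", 175, "t-shirt"),
]
DOCS = [{"date": d, "price": p, "type": t} for d, p, t in SALES]


def top_hits_by_bucket(docs, bucket_field, sort_field, size=1,
                       includes=None, max_buckets=3):
    """Group docs like a terms aggregation, then keep the top hits of each bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc)
    # Largest buckets first; ties broken by key (an assumption of this sketch).
    keys = sorted(groups, key=lambda k: (-len(groups[k]), k))[:max_buckets]
    buckets = []
    for key in keys:
        hits = sorted(groups[key], key=lambda d: d[sort_field], reverse=True)[:size]
        if includes:
            # Source filtering: keep only the requested fields.
            hits = [{f: h[f] for f in includes} for h in hits]
        buckets.append({"key": key, "doc_count": len(groups[key]), "hits": hits})
    return buckets


if __name__ == "__main__":
    for bucket in top_hits_by_bucket(DOCS, "type", "date", includes=["date", "price"]):
        print(bucket)
```

Unlike collapse, this form also reports doc_count per group, at the price of running an aggregation instead of returning ordinary search hits.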

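Field collapsing

Field collapsing, or result grouping, is a feature that logically groups a result set and returns the top documents per group. The ordering of the groups is determined by the relevance of the first document in a group. In Elasticsearch this can be implemented with a bucket aggregator that wraps a top_hits aggregator as a sub-aggregator.

In the example below, we group the sales by item type, returning the top 3 groups, and then compute the maximum price of each group for display in the result set. The aggregation named top_hit is a max aggregation: it returns only that maximum value, not the document that holds it.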

POST /sales/_search
{
    "aggs": {
        "top_tags": {
            "terms": {
                "field": "type",
                "size": 3
            },
            "aggs": {
                "top_sales_hits": {
                    "top_hits": {
                        "size": 1,
                        "_source": {
                            "includes": [
                                "date",
                                "price"
                            ]
                        }
                    }
                },
                "top_hit": {
                    "max": {
                        "field": "price"
                    }
                }
            }
        }
    }
}

Possible response:

{
	...
    "aggregations": {
        "top_tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "hat",
                    "doc_count": 3,
                    "top_sales_hits": {
                        "hits": {
                            "total": 3,
                            "max_score": 1,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "N1wXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "hat"
                                    }
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OlwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/02/01 00:00:00",
                                        "price": 50,
                                        "promoted": false,
                                        "rating": 1,
                                        "type": "hat"
                                    }
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "PFwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/03/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "hat"
                                    }
                                }
                            ]
                        }
                    },
                    "top_hit": {
                        "value": 200
                    }
                },
                {
                    "key": "t-shirt",
                    "doc_count": 3,
                    "top_sales_hits": {
                        "hits": {
                            "total": 3,
                            "max_score": 1,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "O1wXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/02/01 00:00:00",
                                        "price": 10,
                                        "promoted": true,
                                        "rating": 4,
                                        "type": "t-shirt"
                                    }
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OFwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "t-shirt"
                                    }
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "PVwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/03/01 00:00:00",
                                        "price": 175,
                                        "promoted": false,
                                        "rating": 2,
                                        "type": "t-shirt"
                                    }
                                }
                            ]
                        }
                    },
                    "top_hit": {
                        "value": 200
                    }
                },
                {
                    "key": "bag",
                    "doc_count": 1,
                    "top_sales_hits": {
                        "hits": {
                            "total": 1,
                            "max_score": 1,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OVwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 150,
                                        "promoted": true,
                                        "rating": 5,
                                        "type": "bag"
                                    }
                                }
                            ]
                        }
                    },
                    "top_hit": {
                        "value": 150
                    }
                }
            ]
        }
    }
}

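Cardinality Aggregation (count after deduplication)

A single-value metric aggregation that calculates an approximate count of distinct values. Values can be extracted from specific fields in the document or generated by a script.

The following example finds how many item types the sales index contains in total: the type values are deduplicated first, and the result is then counted.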

POST /sales/_search
{
    "aggs" : {
        "type_count" : {
            "cardinality" : {
                "field" : "type"
            }
        }
    }
}

Response:

{
    ...
    "aggregations" : {
        "type_count" : {
            "value" : 3
        }
    }
}

Test data

Create the index

PUT sales
{
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    },
    "mappings": {
        "_doc": {
            "properties": {
                "type": {
                    "type": "keyword"
                }
            }
        }
    }
}

Index the data

POST /sales/_doc/_bulk?refresh
{"index": {}}
{"date": "2015/01/01 00:00:00", "price": 200, "promoted": true, "rating": 1, "type": "hat"}
{"index": {}}
{"date": "2015/01/01 00:00:00", "price": 200, "promoted": true, "rating": 1, "type": "t-shirt"}
{"index": {}}
{"date": "2015/01/01 00:00:00", "price": 150, "promoted": true, "rating": 5, "type": "bag"}
{"index": {}}
{"date": "2015/02/01 00:00:00", "price": 50, "promoted": false, "rating": 1, "type": "hat"}
{"index": {}}
{"date": "2015/02/01 00:00:00", "price": 10, "promoted": true, "rating": 4, "type": "t-shirt"}
{"index": {}}
{"date": "2015/03/01 00:00:00", "price": 200, "promoted": true, "rating": 1, "type": "hat"}
{"index": {}}
{"date": "2015/03/01 00:00:00", "price": 175, "promoted": false, "rating": 2, "type": "t-shirt"}

Caveats

collapse cannot be used for deep pagination

collapse cannot be used together with search_after
collapse cannot be used together with scroll

{"type":"search_context_exception","reason":"cannot use `collapse` in conjunction with `search_after`"}

{"type":"search_context_exception","reason":"cannot use `collapse` in a scroll context"}

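When collapse is combined with an aggregation, the aggregation is computed over all matching documents, not just the documents of the current page. For example: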

POST /sales/_search
{
    "from": 0,
    "size": 2,
    "sort": [
        "date"
    ],
    "collapse": {
        "field": "type"
    },
    "aggs": {
        "type_count": {
            "cardinality": {
                "field": "type"
            }
        }
    }
}

Response:

{
	...
    "hits": {
        "total": 7,
        "max_score": null,
        "hits": [
            {
                "_index": "sales",
                "_type": "_doc",
                "_id": "N1wXRYMBCLf16WD0DCm6",
                "_score": null,
                "_source": {
                    "date": "2015/01/01 00:00:00",
                    "price": 200,
                    "promoted": true,
                    "rating": 1,
                    "type": "hat"
                },
                "fields": {
                    "type": [
                        "hat"
                    ]
                },
                "sort": [
                    1420070400000
                ]
            },
            {
                "_index": "sales",
                "_type": "_doc",
                "_id": "OVwXRYMBCLf16WD0DCm6",
                "_score": null,
                "_source": {
                    "date": "2015/01/01 00:00:00",
                    "price": 150,
                    "promoted": true,
                    "rating": 5,
                    "type": "bag"
                },
                "fields": {
                    "type": [
                        "bag"
                    ]
                },
                "sort": [
                    1420070400000
                ]
            }
        ]
    },
    "aggregations": {
        "type_count": {
            "value": 3
        }
    }
}
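The scoping rule in this last caveat can be illustrated with a small in-memory model. It is a sketch, not Elasticsearch behavior reproduced exactly: search is a hypothetical function, the data is the article's sales test set, and which two groups land on the page depends on how ties on date are broken, so only the counts are meaningful here.

```python
# The seven test documents of the `sales` index: (date, type).
SALES = [
    ("2015/01/01", "hat"), ("2015/01/01", "t-shirt"), ("2015/01/01", "bag"),
    ("2015/02/01", "hat"), ("2015/02/01", "t-shirt"),
    ("2015/03/01", "hat"), ("2015/03/01", "t-shirt"),
]
DOCS = [{"date": d, "type": t} for d, t in SALES]


def search(docs, collapse_field, sort_field, from_=0, size=10):
    """Model a collapsed, paginated search with a cardinality aggregation."""
    seen, collapsed = set(), []
    for doc in sorted(docs, key=lambda d: d[sort_field]):
        if doc[collapse_field] not in seen:
            seen.add(doc[collapse_field])
            collapsed.append(doc)
    return {
        # hits.total counts the matching documents before collapsing.
        "total": len(docs),
        # Only the requested page of collapsed hits is returned.
        "hits": collapsed[from_:from_ + size],
        # The aggregation sees every matching document, not just the page.
        "type_count": len({d[collapse_field] for d in docs}),
    }


if __name__ == "__main__":
    result = search(DOCS, collapse_field="type", sort_field="date", size=2)
    print(result["total"], len(result["hits"]), result["type_count"])
```

The page holds 2 collapsed hits, yet the distinct count is 3 and the total is 7, matching the shape of the response above.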