Search: pagination and deduplication

This article describes how to implement pagination and deduplication in Elasticsearch (ES).

Simple pagination

from + size

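Every request pages over the full result set, so this approach is only suitable for small result sets: from + size must stay within 10,000 (the default value of index.max_result_window). Deep pages consume a lot of memory. At 10 hits per page, the maximum depth is page 1000. Limiting the paging depth is meant to bound the length of the queue on the coordinating node.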

GET  /sales/_search
{
    "from" : 0, "size" : 2,
    "query" : {
        "term" : { "type" : "hat" }
    }
}
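
Client code usually thinks in page numbers rather than offsets. The sketch below converts a 1-based page number into from/size and rejects pages that would exceed the default index.max_result_window of 10,000. `page_request` is a hypothetical helper written for this article, not part of any Elasticsearch client.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def page_request(page, size, query, max_window=MAX_RESULT_WINDOW):
    """Build a from/size search body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    if offset + size > max_window:
        raise ValueError(
            f"from + size = {offset + size} exceeds the window of {max_window}"
        )
    return {"from": offset, "size": size, "query": query}


# Same request as the console example above: first page, two hits per page.
body = page_request(1, 2, {"term": {"type": "hat"}})
```

Checking the limit on the client side gives a clear error before the request is sent, instead of relying on the error returned by the cluster.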

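The problem

Suppose we search an index with 5 primary shards. When we request the first page (results 1 to 10), each shard produces its own top 10 results and returns them to the coordinating node, which then sorts all 50 results to select the overall top 10.

Now suppose we request page 100 (results 991 to 1000). Everything works the same way, except that each shard must produce its top 1000 results. The coordinating node then sorts all 5000 results and discards 4990 of them.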

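As you can see, in a distributed system the cost of sorting results grows with the paging depth: every shard must produce from + size hits, and the coordinating node must merge shards × (from + size) of them. This is why web search engines do not return more than 10,000 results for any query.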
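
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the counting argument, with a hypothetical function name; it does not talk to a cluster. It compares the first page with a page that ends at result 1000 (page 100 at 10 hits per page) on 5 shards.

```python
def deep_page_cost(shards, page, size):
    """Return (hits per shard, hits merged by the coordinator, hits discarded)."""
    per_shard = page * size      # each shard must produce its top from + size hits
    merged = shards * per_shard  # the coordinating node sorts all of them
    discarded = merged - size    # only one page of results is returned
    return per_shard, merged, discarded


first = deep_page_cost(shards=5, page=1, size=10)
deep = deep_page_cost(shards=5, page=100, size=10)
```

Here `first` is (10, 50, 40) and `deep` is (1000, 5000, 4990), matching the numbers in the text.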

Deep pagination

search after

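search_after uses the results of the previous page to help retrieve the next page in real time.

Pagination always goes together with sorting, so the cursor should, wherever possible, be based on a field with unique values. If documents share the same sort value, results may be lost or duplicated across pages. search_after looks for the first document that fully or partially matches the values provided for the tiebreaker.

search_after is not a solution for jumping freely to a random page, but rather for scrolling many queries in parallel. It is very similar to the scroll API, but unlike it, the search_after parameter is stateless: it is always resolved against the latest version of the searcher. For this reason the sort order may change during a walk, depending on updates and deletes in the index.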

❗️ search_after requirements

  • The sort field must have doc_values enabled
  • from must be 0 or -1

GET  /sales/_search
{
    "size": 2,
    "query": {
        "match" : {
            "type" : "hat"
        }
    },
    "search_after":[1422748800000],
    "sort": [
        {"date": "asc"}
    ]
}
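
In client code, search_after is typically driven by a loop that copies the sort values of the last hit into the next request. The sketch below shows that loop under stated assumptions: `search` is any callable that takes a request body and returns an Elasticsearch-style response dict, and `fake_search` is an in-memory stand-in for a cluster, sorted by a date plus a unique id that acts as the tiebreaker. Neither name comes from a real client library.

```python
def iterate_search_after(search, body):
    """Yield every hit by feeding the last hit's sort values into the next request."""
    request = dict(body)  # shallow copy so the caller's body is not modified
    while True:
        hits = search(request)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        request["search_after"] = hits[-1]["sort"]


# Five documents already ordered by (date, id); some dates are duplicated,
# so the unique id is what keeps the ordering unambiguous.
DOCS = [{"_id": str(i), "sort": [1420070400000 + i // 2, str(i)]} for i in range(5)]


def fake_search(request):
    """Return the next `size` documents strictly after the search_after values."""
    after = request.get("search_after")
    rows = [d for d in DOCS if after is None or d["sort"] > after]
    return {"hits": {"hits": rows[: request["size"]]}}


all_ids = [h["_id"] for h in iterate_search_after(fake_search, {"size": 2})]
```

Because every sort tuple is unique, each document is returned exactly once even though the page boundary falls between documents with the same date.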

scroll

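Scrolling is meant for processing large amounts of data. A scroll context is expensive to keep alive, so it is not recommended for real-time user requests. A scroll works on a snapshot taken when the initial request is made, so data written afterwards is not visible to it.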

POST /sales/_search?scroll=1m
{
    "size": 100,
    "query": {
        "match" : {
            "type" : "elasticsearch"
        }
    }
}

The result of the request above includes a _scroll_id, which should be passed to the scroll API to retrieve the next batch of results.

POST /_search/scroll 
{
    "scroll" : "1m", 
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}

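The search context is removed automatically once the scroll timeout is exceeded. However, keeping scrolls open has a cost, as noted above, so a scroll should be cleared explicitly with the clear-scroll API as soon as it is no longer needed: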

DELETE /_search/scroll
{
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
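
The three requests above (start, continue, clear) can be wrapped in a single generator so that the context is always released. This is a minimal sketch: the three callables are hypothetical stand-ins for the HTTP calls, and `fake_response` simulates a cluster that returns two batches and then an empty one.

```python
def scroll_all(start, next_page, clear):
    """Yield hits from a scroll and always clear the context afterwards.

    start() returns the first response, next_page(scroll_id) the next one,
    and clear(scroll_id) releases the search context.
    """
    response = start()
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = next_page(scroll_id)
            scroll_id = response["_scroll_id"]  # always use the most recent id
    finally:
        clear(scroll_id)


BATCHES = iter([[{"_id": "a"}, {"_id": "b"}], [{"_id": "c"}], []])
cleared = []


def fake_response(_scroll_id=None):
    """Simulated cluster: returns the next batch with a fixed scroll id."""
    return {"_scroll_id": "scroll-1", "hits": {"hits": next(BATCHES)}}


ids = [h["_id"] for h in scroll_all(fake_response, fake_response, cleared.append)]
```

The `finally` block runs when the generator is exhausted or closed early, so the clear call is not skipped if the consumer stops halfway.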

For detailed usage, see the article on Search scroll queries (Scroll).

Field collapsing (deduplication)

Field Collapsing

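Field collapsing allows search results to be collapsed based on field values. Collapsing is done by selecting only the top sorted document per collapse key. For example, the query below retrieves the best tweet for each user and sorts them by number of likes.

❗️ Requirements for the collapse field:

  • Its type must be keyword or numeric
  • doc_values must be enabled
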
GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user" 
    },
    "sort": ["likes"], 
    "from": 10 
}
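  1. collapse: collapses the result set using the "user" field.
  2. sort: sorts the top documents by number of likes.
  3. from: defines the offset of the first collapsed result.

⚠️ The total number of hits in the response is the number of matching documents without collapsing. The total number of distinct groups is unknown.

The following example queries sales, collapses on type and sorts by price in descending order, which returns the highest-priced item of each type.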

GET /sales/_search
{
    "collapse" : {
        "field" : "type" 
    },
    "sort": [{"price":{"order" : "desc"}}]
}
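
What collapse does to the result list can be modelled in plain Python: sort the documents, then keep only the first document seen for each key. The sketch below applies this to the type and price values of the test data listed at the end of this article. The `collapse` function is a local illustration of the semantics, not Elasticsearch code.

```python
SALES = [
    {"type": "hat", "price": 200},
    {"type": "t-shirt", "price": 200},
    {"type": "bag", "price": 150},
    {"type": "hat", "price": 50},
    {"type": "t-shirt", "price": 10},
    {"type": "hat", "price": 200},
    {"type": "t-shirt", "price": 175},
]


def collapse(docs, field, sort_key, reverse=False):
    """Sort the documents, then keep only the first document for each key."""
    seen = set()
    result = []
    for doc in sorted(docs, key=sort_key, reverse=reverse):
        if doc[field] not in seen:
            seen.add(doc[field])
            result.append(doc)
    return result


top_per_type = collapse(SALES, "type", sort_key=lambda d: d["price"], reverse=True)
```

The result holds one document per type: hat at 200, t-shirt at 200 and bag at 150.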
Expand collapse results

Each collapsed top hit can be expanded with the inner_hits option.

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user", 
        "inner_hits": {
            "name": "last_tweets", 
            "size": 5, 
            "sort": [{ "date": "asc" }] 
        },
        "max_concurrent_group_searches": 4 
    },
    "sort": ["likes"]
}
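  1. collapse.field: collapses the result set using the "user" field.
  2. collapse.inner_hits.name: the name used for the inner hits section in the response.
  3. collapse.inner_hits.size: the number of inner_hits to retrieve per collapse key.
  4. collapse.inner_hits.sort: how to sort the documents inside each group.
  5. max_concurrent_group_searches: the number of concurrent requests allowed to retrieve the inner_hits per group.

The following example queries sales, collapses on type and sorts by price in descending order, which returns the highest-priced item of each type. In addition, inner_hits shows the top 5 documents of each group.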

GET /sales/_search
{
    "collapse" : {
        "field" : "type",
        "inner_hits": {
            "name": "XXX", 
            "size": 5, 
            "sort": [{ "price": "desc" }] 
        }
    },
    "sort": [{"price":{"order" : "desc"}}]
}

Possible response:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 7,
        "max_score": null,
        "hits": [
            {
                "_index": "sales",
                "_type": "_doc",
                "_id": "N1wXRYMBCLf16WD0DCm6",
                "_score": null,
                "_source": {
                    "date": "2015/01/01 00:00:00",
                    "price": 200,
                    "promoted": true,
                    "rating": 1,
                    "type": "hat"
                },
                "fields": {
                    "type": [
                        "hat"
                    ]
                },
                "sort": [
                    200
                ],
                "inner_hits": {
                    "XXX": {
                        "hits": {
                            "total": 3,
                            "max_score": null,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "N1wXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "hat"
                                    },
                                    "sort": [
                                        200
                                    ]
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "PFwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/03/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "hat"
                                    },
                                    "sort": [
                                        200
                                    ]
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OlwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/02/01 00:00:00",
                                        "price": 50,
                                        "promoted": false,
                                        "rating": 1,
                                        "type": "hat"
                                    },
                                    "sort": [
                                        50
                                    ]
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "sales",
                "_type": "_doc",
                "_id": "OFwXRYMBCLf16WD0DCm6",
                "_score": null,
                "_source": {
                    "date": "2015/01/01 00:00:00",
                    "price": 200,
                    "promoted": true,
                    "rating": 1,
                    "type": "t-shirt"
                },
                "fields": {
                    "type": [
                        "t-shirt"
                    ]
                },
                "sort": [
                    200
                ],
                "inner_hits": {
                    "XXX": {
                        "hits": {
                            "total": 3,
                            "max_score": null,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OFwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "t-shirt"
                                    },
                                    "sort": [
                                        200
                                    ]
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "PVwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/03/01 00:00:00",
                                        "price": 175,
                                        "promoted": false,
                                        "rating": 2,
                                        "type": "t-shirt"
                                    },
                                    "sort": [
                                        175
                                    ]
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "O1wXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/02/01 00:00:00",
                                        "price": 10,
                                        "promoted": true,
                                        "rating": 4,
                                        "type": "t-shirt"
                                    },
                                    "sort": [
                                        10
                                    ]
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "sales",
                "_type": "_doc",
                "_id": "OVwXRYMBCLf16WD0DCm6",
                "_score": null,
                "_source": {
                    "date": "2015/01/01 00:00:00",
                    "price": 150,
                    "promoted": true,
                    "rating": 5,
                    "type": "bag"
                },
                "fields": {
                    "type": [
                        "bag"
                    ]
                },
                "sort": [
                    150
                ],
                "inner_hits": {
                    "XXX": {
                        "hits": {
                            "total": 1,
                            "max_score": null,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OVwXRYMBCLf16WD0DCm6",
                                    "_score": null,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 150,
                                        "promoted": true,
                                        "rating": 5,
                                        "type": "bag"
                                    },
                                    "sort": [
                                        150
                                    ]
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}
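
A response of this shape is deeply nested, so client code usually flattens it into one list per group. The sketch below does that for a trimmed-down sample with the same structure as the response above; `groups_from_collapse` is a hypothetical helper, and the sample keeps only the fields the helper reads.

```python
def groups_from_collapse(response, field, inner_name):
    """Map each collapse key to the _source documents of its inner hits."""
    groups = {}
    for hit in response["hits"]["hits"]:
        key = hit["fields"][field][0]  # the collapse key of this group
        inner = hit["inner_hits"][inner_name]["hits"]["hits"]
        groups[key] = [h["_source"] for h in inner]
    return groups


def _group(type_name, prices):
    """Build one collapsed hit in the shape Elasticsearch returns."""
    inner = [{"_source": {"type": type_name, "price": p}} for p in prices]
    return {
        "fields": {"type": [type_name]},
        "inner_hits": {"XXX": {"hits": {"hits": inner}}},
    }


sample = {"hits": {"hits": [_group("hat", [200, 200, 50]), _group("bag", [150])]}}

groups = groups_from_collapse(sample, "type", "XXX")
prices = {key: [doc["price"] for doc in docs] for key, docs in groups.items()}
```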
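Second level of collapsing (second-level deduplication)

A second level of collapsing is also supported and is applied to inner_hits. For example, the following request finds the top-scored tweet for each country and, within each country, the top-scored tweet for each user.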

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "country",
        "inner_hits" : {
            "name": "by_location",
            "collapse" : {"field" : "user"},
            "size": 3
        }
    }
}
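  1. collapse.field: collapses the result set using the "country" field (first-level collapse field).
  2. collapse.inner_hits.name: the name used for the inner hits section in the response.
  3. collapse.inner_hits.size: the number of inner_hits to retrieve per collapse key.
  4. collapse.inner_hits.collapse: collapses the inner hits using the "user" field (second-level collapse field).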

Top Hits Aggregation

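The top_hits metric aggregator keeps track of the most relevant documents being aggregated. It is intended to be used as a sub-aggregator, so that the top matching documents can be aggregated per bucket.

The top_hits aggregator can effectively be used to group result sets by certain fields via a bucket aggregator. One or more bucket aggregators determine by which properties the result set is sliced.

Options

  • from - The offset from the first result to fetch.
  • size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
  • sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.

In the following example we group the sales by type and show the last sale for each type. For each sale, only the date and price fields are included in the source.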

POST /sales/_search?size=0
{
    "aggs": {
        "top_tags": {
            "terms": {
                "field": "type",
                "size": 3
            },
            "aggs": {
                "top_sales_hits": {
                    "top_hits": {
                        "sort": [
                            {
                                "date": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "_source": {
                            "includes": [ "date", "price" ]
                        },
                        "size" : 1
                    }
                }
            }
        }
    }
}

Possible response:

{
  ...
  "aggregations": {
    "top_tags": {
       "doc_count_error_upper_bound": 0,
       "sum_other_doc_count": 0,
       "buckets": [
          {
             "key": "hat",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total": 3,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmauCQpcRyxw6ChK",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 200
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "t-shirt",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total": 3,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmauCQpcRyxw6ChL",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 175
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "bag",
             "doc_count": 1,
             "top_sales_hits": {
                "hits": {
                   "total": 1,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmatCQpcRyxw6ChH",
                         "_source": {
                            "date": "2015/01/01 00:00:00",
                            "price": 150
                         },
                         "sort": [
                            1420070400000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          }
       ]
    }
  }
}
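
The same grouping can be checked by hand against the test data. The sketch below mimics a terms bucket with a top_hits sub-aggregation of size 1 sorted by date descending; it is an in-memory illustration with a hypothetical function name, and it ignores how Elasticsearch orders the buckets themselves.

```python
from collections import defaultdict

SALES = [
    {"date": "2015/01/01 00:00:00", "price": 200, "type": "hat"},
    {"date": "2015/01/01 00:00:00", "price": 200, "type": "t-shirt"},
    {"date": "2015/01/01 00:00:00", "price": 150, "type": "bag"},
    {"date": "2015/02/01 00:00:00", "price": 50, "type": "hat"},
    {"date": "2015/02/01 00:00:00", "price": 10, "type": "t-shirt"},
    {"date": "2015/03/01 00:00:00", "price": 200, "type": "hat"},
    {"date": "2015/03/01 00:00:00", "price": 175, "type": "t-shirt"},
]


def top_hits_per_bucket(docs, bucket_field, sort_field, size=1):
    """Group documents by bucket_field and keep the `size` highest by sort_field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc)
    return {
        key: sorted(group, key=lambda d: d[sort_field], reverse=True)[:size]
        for key, group in buckets.items()
    }


# The date strings use a fixed yyyy/MM/dd layout, so string order equals time order.
latest = top_hits_per_bucket(SALES, "type", "date")
```

The latest sale per type is hat at 200, t-shirt at 175 and bag at 150, as in the response above.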
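Field collapsing

Field collapsing, or result grouping, is a feature that logically groups a result set and returns the top documents per group. The ordering of the groups is determined by the relevancy of the first document in a group. In Elasticsearch this can be implemented via a bucket aggregator that wraps a top_hits aggregator as a sub-aggregator.

In the following example we group the sales by item type and return the top 3 groups; for each group we then look up the highest price and show it in the result. The top_hit aggregation only yields that maximum value.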

POST /sales/_search
{
    "aggs": {
        "top_tags": {
            "terms": {
                "field": "type",
                "size": 3
            },
            "aggs": {
                "top_sales_hits": {
                    "top_hits": {
                        "size": 1,
                        "_source": {
                            "includes": [
                                "date",
                                "price"
                            ]
                        }
                    }
                },
                "top_hit": {
                    "max": {
                        "field": "price"
                    }
                }
            }
        }
    }
}

Possible response:

{
	...
    "aggregations": {
        "top_tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "hat",
                    "doc_count": 3,
                    "top_sales_hits": {
                        "hits": {
                            "total": 3,
                            "max_score": 1,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "N1wXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "hat"
                                    }
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OlwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/02/01 00:00:00",
                                        "price": 50,
                                        "promoted": false,
                                        "rating": 1,
                                        "type": "hat"
                                    }
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "PFwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/03/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "hat"
                                    }
                                }
                            ]
                        }
                    },
                    "top_hit": {
                        "value": 200
                    }
                },
                {
                    "key": "t-shirt",
                    "doc_count": 3,
                    "top_sales_hits": {
                        "hits": {
                            "total": 3,
                            "max_score": 1,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "O1wXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/02/01 00:00:00",
                                        "price": 10,
                                        "promoted": true,
                                        "rating": 4,
                                        "type": "t-shirt"
                                    }
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OFwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 200,
                                        "promoted": true,
                                        "rating": 1,
                                        "type": "t-shirt"
                                    }
                                },
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "PVwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/03/01 00:00:00",
                                        "price": 175,
                                        "promoted": false,
                                        "rating": 2,
                                        "type": "t-shirt"
                                    }
                                }
                            ]
                        }
                    },
                    "top_hit": {
                        "value": 200
                    }
                },
                {
                    "key": "bag",
                    "doc_count": 1,
                    "top_sales_hits": {
                        "hits": {
                            "total": 1,
                            "max_score": 1,
                            "hits": [
                                {
                                    "_index": "sales",
                                    "_type": "_doc",
                                    "_id": "OVwXRYMBCLf16WD0DCm6",
                                    "_score": 1,
                                    "_source": {
                                        "date": "2015/01/01 00:00:00",
                                        "price": 150,
                                        "promoted": true,
                                        "rating": 5,
                                        "type": "bag"
                                    }
                                }
                            ]
                        }
                    },
                    "top_hit": {
                        "value": 150
                    }
                }
            ]
        }
    }
}

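Cardinality Aggregation (distinct count)

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted from specific fields in the document or generated by a script.

The following example finds how many item types exist in the sales index: the type values are deduplicated first, then counted.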

POST /sales/_search
{
    "aggs" : {
        "type_count" : {
            "cardinality" : {
                "field" : "type"
            }
        }
    }
}

Response:

{
    ...
    "aggregations" : {
        "type_count" : {
            "value" : 3
        }
    }
}
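
On a data set this small the result can be verified by hand. The sketch below computes the exact distinct count in plain Python over the type values of the test data. Elasticsearch itself relies on the approximate HyperLogLog++ algorithm, so on high-cardinality fields its answer may differ slightly from an exact count like this one. `distinct_count` is a hypothetical helper.

```python
def distinct_count(docs, field):
    """Exact number of distinct values of `field` across the documents."""
    return len({doc[field] for doc in docs if field in doc})


TYPES = ["hat", "t-shirt", "bag", "hat", "t-shirt", "hat", "t-shirt"]
type_count = distinct_count([{"type": t} for t in TYPES], "type")
```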

Test data

Create the index

PUT sales
{
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    },
    "mappings": {
        "_doc": {
            "properties": {
                "type": {
                    "type": "keyword"
                }
            }
        }
    }
}

Index the data

POST /sales/_doc/_bulk?refresh
{"index": {}}
{"date": "2015/01/01 00:00:00", "price": 200, "promoted": true, "rating": 1, "type": "hat"}
{"index": {}}
{"date": "2015/01/01 00:00:00", "price": 200, "promoted": true, "rating": 1, "type": "t-shirt"}
{"index": {}}
{"date": "2015/01/01 00:00:00", "price": 150, "promoted": true, "rating": 5, "type": "bag"}
{"index": {}}
{"date": "2015/02/01 00:00:00", "price": 50, "promoted": false, "rating": 1, "type": "hat"}
{"index": {}}
{"date": "2015/02/01 00:00:00", "price": 10, "promoted": true, "rating": 4, "type": "t-shirt"}
{"index": {}}
{"date": "2015/03/01 00:00:00", "price": 200, "promoted": true, "rating": 1, "type": "hat"}
{"index": {}}
{"date": "2015/03/01 00:00:00", "price": 175, "promoted": false, "rating": 2, "type": "t-shirt"}
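
The bulk API expects newline-delimited JSON (NDJSON): each action and each document on a single line, with a trailing newline at the end of the body. When the request is sent from code rather than from a console, the body can be generated. The following is a minimal sketch using only the standard library; `bulk_body` is a hypothetical helper and only two of the seven documents are shown.

```python
import json


def bulk_body(docs):
    """Serialize documents as NDJSON: an action line, then a source line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


payload = bulk_body([
    {"date": "2015/01/01 00:00:00", "price": 200, "promoted": True, "rating": 1, "type": "hat"},
    {"date": "2015/01/01 00:00:00", "price": 150, "promoted": True, "rating": 5, "type": "bag"},
])
```

`json.dumps` never emits raw newlines inside a document by default, which is what keeps each document on its own line.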

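Caveats

  1. collapse cannot be used for deep pagination. In the Elasticsearch version used for this article:

    • collapse cannot be used together with search_after

    • collapse cannot be used together with scroll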

    {"type":"search_context_exception","reason":"cannot use `collapse` in conjunction with `search_after`"}
    
    {"type":"search_context_exception","reason":"cannot use `collapse` in a scroll context"}
    
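  2. When collapse is combined with an aggregation, the aggregation is computed over all matching documents, not only over the current page. For example: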

    POST /sales/_search
    {
        "from": 0,
        "size": 2,
        "sort": [
            "date"
        ],
        "collapse": {
            "field": "type"
        },
        "aggs": {
            "type_count": {
                "cardinality": {
                    "field": "type"
                }
            }
        }
    }
    

    Response:

    {
    	...
        "hits": {
            "total": 7,
            "max_score": null,
            "hits": [
                {
                    "_index": "sales",
                    "_type": "_doc",
                    "_id": "N1wXRYMBCLf16WD0DCm6",
                    "_score": null,
                    "_source": {
                        "date": "2015/01/01 00:00:00",
                        "price": 200,
                        "promoted": true,
                        "rating": 1,
                        "type": "hat"
                    },
                    "fields": {
                        "type": [
                            "hat"
                        ]
                    },
                    "sort": [
                        1420070400000
                    ]
                },
                {
                    "_index": "sales",
                    "_type": "_doc",
                    "_id": "OVwXRYMBCLf16WD0DCm6",
                    "_score": null,
                    "_source": {
                        "date": "2015/01/01 00:00:00",
                        "price": 150,
                        "promoted": true,
                        "rating": 5,
                        "type": "bag"
                    },
                    "fields": {
                        "type": [
                            "bag"
                        ]
                    },
                    "sort": [
                        1420070400000
                    ]
                }
            ]
        },
        "aggregations": {
            "type_count": {
                "value": 3
            }
        }
    }
    