Deduplicating search results in Elasticsearch with collapse
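1. Deduplicated counts with cardinality
Deduplication in Elasticsearch is usually done with the cardinality aggregation, which returns the number of distinct values of a field. For example, if two documents both have the value 111 in the field "one_account.one_account_no", the deduplicated count for "one_account.one_account_no" is 1.
DSL query: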
POST user_onoffline_log/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "one_account_no_aggs": {
      "cardinality": {
        "field": "one_account.one_account_no"
      }
    }
  }
}
Result:
{
  "took": 565,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 21,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "one_account_no_aggs": {
      "value": 14
    }
  }
}
The cardinality aggregation reports 14 distinct values of one_account.one_account_no across the 21 matching documents.
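To make the meaning of that number concrete, here is a minimal plain-Java sketch of what a distinct count computes. It does not call Elasticsearch: the class name CardinalitySketch, the method distinctCount, and the sample values are hypothetical and used only for illustration. Note also that the real cardinality aggregation is an approximate count based on HyperLogLog++, whereas this sketch counts exactly with a HashSet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: an exact in-memory distinct count.
// Elasticsearch's cardinality aggregation is approximate (HyperLogLog++);
// this version only shows what is being counted.
class CardinalitySketch {

    // Counts distinct non-null values; null stands for a document without the field.
    static long distinctCount(List<String> accountNos) {
        Set<String> seen = new HashSet<>();
        for (String no : accountNos) {
            if (no != null) {
                seen.add(no);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical data: two documents share "111", a third has "222".
        List<String> values = Arrays.asList("111", "111", "222");
        System.out.println(distinctCount(values)); // prints 2
    }
}
```

With only the two documents from the example in this section (both 111), distinctCount returns 1, which matches the count described there.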
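2. Deduplicating content with collapse
The cardinality aggregation above only yields a count. To list the distinct one_account_no values themselves, rather than just the number 14, use collapse on the search results.
DSL query: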
GET customer/_search
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {
      "filter": [
        {
          "exists": {
            "field": "one_account.one_account_no",
            "boost": 1
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "one_account.one_account_no"
  },
  "_source": {
    "includes": [
      "one_account.one_account_no"
    ],
    "excludes": []
  },
  "aggregations": {
    "count": {
      "cardinality": {
        "field": "one_account.one_account_no"
      }
    }
  }
}
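Notes: the exists filter keeps only documents that have the field one_account.one_account_no; collapse returns a single document for each distinct value of that field; _source limits which fields are returned; and the cardinality aggregation counts the distinct one_account.one_account_no values.
Result: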
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 21,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "105100015130",
        "_score" : 0.0,
        "_source" : {
          "one_account" : {
            "one_account_no" : "105100015130"
          }
        },
        "fields" : {
          "one_account.one_account_no" : [
            "105100015130"
          ]
        }
      },
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "99522458",
        "_score" : 0.0,
        "_source" : {
          "one_account" : {
            "one_account_no" : "99522458"
          }
        },
        "fields" : {
          "one_account.one_account_no" : [
            "99522458"
          ]
        }
      },
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "105500032110",
        "_score" : 0.0,
        "_source" : {
          "one_account" : {
            "one_account_no" : "105500032110"
          }
        },
        "fields" : {
          "one_account.one_account_no" : [
            "105500032110"
          ]
        }
      },
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "110600001247",
        "_score" : 0.0,
        "_routing" : "110600001247",
        "_source" : {
          "one_account" : {
            "one_account_no" : "110600001248"
          }
        },
        "fields" : {
          "one_account.one_account_no" : [
            "110600001248"
          ]
        }
      },
      {
        "_index" : "customer_v2025",
        "_type" : "customer_info",
        "_id" : "110600000858",
        "_score" : 0.0,
        "_source" : {
          "one_account" : {
            "one_account_no" : "110600000858"
          }
        },
        "fields" : {
          "one_account.one_account_no" : [
            "110600000858"
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "count" : {
      "value" : 14
    }
  }
}
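The result shows 21 documents that contain one_account.one_account_no (hits.total counts matching documents before collapsing) and 14 distinct values after deduplication (the value of the cardinality aggregation).
Because the query uses from/size paging, hits contains only 5 entries. To see all 14, set from to 5 and then to 10 to fetch the second and third pages.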
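The interaction between collapse and from/size can be sketched in plain Java. This is an in-memory illustration under stated assumptions, not the Elasticsearch implementation: the names CollapsePagingSketch, Doc, collapse, and page are hypothetical, and the 21 generated documents with 14 distinct account numbers only mimic the counts seen in the result above. The sketch keeps the first document per distinct value, then pages over the collapsed list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: collapse followed by from/size paging, done in memory.
class CollapsePagingSketch {

    // Hypothetical minimal document: an id plus the collapse field value.
    static final class Doc {
        final String id;
        final String accountNo;

        Doc(String id, String accountNo) {
            this.id = id;
            this.accountNo = accountNo;
        }
    }

    // Keeps the first document seen for each distinct accountNo, preserving order.
    static List<Doc> collapse(List<Doc> sortedHits) {
        Map<String, Doc> firstPerKey = new LinkedHashMap<>();
        for (Doc d : sortedHits) {
            firstPerKey.putIfAbsent(d.accountNo, d);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    // Applies from/size to the already collapsed list.
    static List<Doc> page(List<Doc> collapsed, int from, int size) {
        if (from >= collapsed.size()) {
            return Collections.emptyList();
        }
        int to = Math.min(from + size, collapsed.size());
        return new ArrayList<>(collapsed.subList(from, to));
    }

    public static void main(String[] args) {
        // 21 hypothetical documents spread over 14 distinct account numbers.
        List<Doc> hits = new ArrayList<>();
        for (int i = 0; i < 21; i++) {
            hits.add(new Doc("doc-" + i, "acct-" + (i % 14)));
        }
        List<Doc> collapsed = collapse(hits);
        // Walk the pages with from = 0, 5, 10 and size = 5.
        for (int from = 0; from < collapsed.size(); from += 5) {
            System.out.println("from=" + from + " -> " + page(collapsed, from, 5).size() + " hits");
        }
    }
}
```

Running main prints page sizes of 5, 5, and 4 for from=0, 5, and 10, which is why three requests are needed to see all 14 distinct values with size 5.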
Java code:
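import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.collapse.CollapseBuilder;

SearchRequest searchRequest = new SearchRequest("customer");
searchRequest.types("customer_info");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
// Keep only documents that have the field.
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
boolQueryBuilder.filter(QueryBuilders.existsQuery("one_account.one_account_no"));
// Return one hit per distinct value, plus the distinct count as an aggregation.
searchSourceBuilder.collapse(new CollapseBuilder("one_account.one_account_no"));
searchSourceBuilder.aggregation(AggregationBuilders.cardinality("count").field("one_account.one_account_no"));
searchSourceBuilder.from(0);
searchSourceBuilder.size(5);
searchSourceBuilder.query(boolQueryBuilder);
searchRequest.source(searchSourceBuilder);
// restHighLevelClient is an existing RestHighLevelClient; on client 6.4+ pass RequestOptions.DEFAULT as a second argument.
SearchResponse searchresponse = restHighLevelClient.search(searchRequest);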