Elasticsearch: grouped statistics, fetching the latest record in each group

Statistics requirement

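Elasticsearch statistics for a single task ID, grouped by site ID
Query inputs:
one task ID and several site IDs
Goal:
for each site ID, return the single record with the largest crawTotal and, among ties, the most recent time.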
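
To make the requirement concrete, here is a minimal in-memory sketch of the same selection rule in Python. The function name `pick_latest_per_site` and the sample records are made up for illustration; only the field names (siteId, crawTotal, statTime, hour) come from the index used in this post, and statTime is assumed to be comparable (epoch milliseconds here).

```python
def pick_latest_per_site(records):
    """Return {siteId: record}, keeping for each site the record that
    ranks first by (crawTotal, statTime, hour), all descending."""
    best = {}
    for rec in records:
        rank = (rec["crawTotal"], rec["statTime"], rec["hour"])
        site = rec["siteId"]
        # Keep the record whose ranking tuple is the largest seen so far.
        if site not in best or rank > best[site][0]:
            best[site] = (rank, rec)
    return {site: rec for site, (_, rec) in best.items()}


# Illustrative records only (not real index data).
sample_records = [
    {"siteId": 2810, "crawTotal": 1012, "statTime": 1655078400000, "hour": 23},
    {"siteId": 2810, "crawTotal": 1056, "statTime": 1655078400000, "hour": 24},
    {"siteId": 46871, "crawTotal": 768, "statTime": 1654704000000, "hour": 24},
    {"siteId": 46871, "crawTotal": 768, "statTime": 1654790400000, "hour": 24},
]

latest = pick_latest_per_site(sample_records)
```

Comparing tuples mirrors a multi-field sort: crawTotal decides first, and statTime and hour only break ties.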

Implementation


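# query: restricts which documents are considered, filtering by task ID (taskId) and site IDs (siteId)
# aggs: a terms aggregation buckets the documents by site ID; the key piece is the top_hits sub-aggregation
# top_hits.size: how many documents to return inside each bucket
# top_hits.sort: the sort order inside each bucket; several sort fields may be used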


GET stat_craw_page/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "taskId": {
              "value": 227796352
            }
          }
        },
        {
          "terms": {
            "siteId": [
              "46871",
              "2810"
            ]
          }
        }
      ]
    }
  },
  "size": 0, 
  "track_total_hits": true,
  "aggs": {
    "group_by_siteid": {
      "aggs": {
        "latestRecord": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "crawTotal": {
                  "order": "desc"
                }
              },
              {
                "statTime": {
                  "order": "desc"
                }
              },
              {
                "hour": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      },
      "terms": {
        "field": "siteId",
        "size": 10000
      }
    }
  }
}
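
If the task ID and site IDs come from application code, the same request body can be assembled programmatically. The following is a sketch in Python that only builds the body as a plain dict; `build_search_body` and its `max_sites` parameter are hypothetical names, and sending the request (with whichever client you use) is left out.

```python
import json


def build_search_body(task_id, site_ids, max_sites=10000):
    """Build the search body shown above for one task and several sites."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"taskId": {"value": task_id}}},
                    {"terms": {"siteId": list(site_ids)}},
                ]
            }
        },
        "size": 0,  # no top-level hits; only the aggregation is needed
        "track_total_hits": True,
        "aggs": {
            "group_by_siteid": {
                # One bucket per site ID, at most max_sites buckets.
                "terms": {"field": "siteId", "size": max_sites},
                "aggs": {
                    "latestRecord": {
                        "top_hits": {
                            "size": 1,  # one document per bucket
                            "sort": [
                                {"crawTotal": {"order": "desc"}},
                                {"statTime": {"order": "desc"}},
                                {"hour": {"order": "desc"}},
                            ],
                        }
                    }
                },
            }
        },
    }


body = build_search_body(227796352, ["46871", "2810"])
body_json = json.dumps(body, indent=2)  # text suitable for an HTTP request body
```

The site IDs are passed through unchanged, so strings stay strings as in the query above.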

Output

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 15,
    "successful" : 15,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 72,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_siteid" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 2810,
          "doc_count" : 48,
          "latestRecord" : {
            "hits" : {
              "total" : 48,
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "sz_stat_craw_page",
                  "_type" : "json",
                  "_id" : "227796352-2810-2425",
                  "_score" : null,
                  "_source" : {
                    "insertTime" : 1655091229871,
                    "crawTotal" : 1056,
                    "crawAdd" : 44,
                    "crawNew" : 22,
                    "hour" : 24,
                    "analysisTotal" : 1056,
                    "siteId" : 2810,
                    "analysisAdd" : 44,
                    "statTime" : "2022-06-13",
                    "taskId" : 227796352
                  },
                  "sort" : [
                    1056,
                    1655078400000,
                    24
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : 46871,
          "doc_count" : 24,
          "latestRecord" : {
            "hits" : {
              "total" : 24,
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "sz_stat_craw_page",
                  "_type" : "json",
                  "_id" : "227796352-46871-24",
                  "_score" : null,
                  "_source" : {
                    "insertTime" : 1654848313146,
                    "crawTotal" : 768,
                    "crawAdd" : 32,
                    "crawNew" : 16,
                    "hour" : 24,
                    "analysisTotal" : 552,
                    "siteId" : 46871,
                    "analysisAdd" : 23,
                    "statTime" : 1654790400000,
                    "taskId" : 227796352
                  },
                  "sort" : [
                    768,
                    1654790400000,
                    24
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}
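
Reading the result back out of this structure takes a few nested lookups. A minimal Python sketch follows, assuming the response has already been parsed into a dict and uses the aggregation names from the query above (group_by_siteid, latestRecord); `extract_latest_records` is a hypothetical helper name.

```python
def extract_latest_records(response):
    """Map each bucket key (site ID) to the _source of its top hit."""
    buckets = response["aggregations"]["group_by_siteid"]["buckets"]
    result = {}
    for bucket in buckets:
        hits = bucket["latestRecord"]["hits"]["hits"]
        if hits:  # skip a bucket whose top_hits list is empty
            result[bucket["key"]] = hits[0]["_source"]
    return result


# Trimmed-down response with the same shape as the output above.
example_response = {
    "aggregations": {
        "group_by_siteid": {
            "buckets": [
                {
                    "key": 2810,
                    "doc_count": 48,
                    "latestRecord": {
                        "hits": {"hits": [{"_source": {"siteId": 2810, "crawTotal": 1056}}]}
                    },
                },
                {
                    "key": 46871,
                    "doc_count": 24,
                    "latestRecord": {
                        "hits": {"hits": [{"_source": {"siteId": 46871, "crawTotal": 768}}]}
                    },
                },
            ]
        }
    }
}

latest_by_site = extract_latest_records(example_response)
```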
