Elasticsearch Aggregation (Part 3)

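Bucket aggregations

date histogram aggregation

Date histogram aggregation. This multi-bucket aggregation is very similar to the histogram aggregation, but it can only be used on date or date range values. Internally, Elasticsearch represents dates as long values, so the regular histogram can also be applied to dates, although less accurately. The main difference between the two APIs is that the interval of a date histogram can be specified with date/time expressions.

As with the histogram, values are rounded down into the closest bucket. For example, if the interval is one calendar day, 2020-01-03T07:00:01Z is rounded down to 2020-01-03T00:00:00Z. Values are rounded as follows: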

bucket_key = Math.floor(value / interval) * interval
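
As a minimal sketch of the rounding formula above, the following Python snippet applies it to the example timestamp. It assumes a fixed one-day interval in UTC and epoch-millisecond values; the function name bucket_key is illustrative only, and calendar-aware intervals in Elasticsearch use calendar rounding rather than this plain arithmetic.

```python
from datetime import datetime, timezone

DAY_MS = 24 * 60 * 60 * 1000  # one fixed day in milliseconds


def bucket_key(value_ms: int, interval_ms: int) -> int:
    """Round an epoch-millisecond value down to the start of its bucket."""
    return (value_ms // interval_ms) * interval_ms


value = datetime(2020, 1, 3, 7, 0, 1, tzinfo=timezone.utc)
value_ms = int(value.timestamp() * 1000)

key_ms = bucket_key(value_ms, DAY_MS)
key = datetime.fromtimestamp(key_ms / 1000, tz=timezone.utc)
print(key.isoformat())  # 2020-01-03T00:00:00+00:00
```

Integer floor division plays the role of Math.floor here, so the value always moves to the bucket start at or before it, never to the next bucket.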

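When configuring a date histogram aggregation, the interval can be specified in two ways: calendar-aware intervals and fixed intervals.

Calendar-aware intervals understand that daylight saving time changes the length of specific days, that months have different numbers of days, and that leap seconds can be tacked onto a particular year.

Fixed intervals, by contrast, are always multiples of SI units and do not change based on calendar context.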

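Calendar intervals

Calendar-aware intervals are configured with the calendar_interval parameter. You can specify a calendar interval using the unit name, such as month, or as a single unit quantity, such as 1M. For example, day and 1d are equivalent. Multiple quantities, such as 2d, are not supported.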

Calendar intervals accept the following values:

  • minute, 1m
  • hour, 1h
  • day, 1d
  • week, 1w
  • month, 1M
  • quarter, 1q
  • year, 1y

Calendar interval example

The following example is an aggregation request that buckets documents by a calendar interval of one month.

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      }
    }
  }
}
'

If you attempt to use a multiple of a calendar unit, the aggregation fails, because calendar intervals support only a single calendar unit.

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "2d"
      }
    }
  }
}
'
{
  "error" : {
    "root_cause" : [...],
    "type" : "x_content_parse_exception",
    "reason" : "[1:82] [date_histogram] failed to parse field [calendar_interval]",
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "The supplied interval [2d] could not be parsed as a calendar interval.",
      "stack_trace" : "java.lang.IllegalArgumentException: The supplied interval [2d] could not be parsed as a calendar interval."
    }
  }
}
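Fixed intervals

Fixed intervals are configured with the fixed_interval parameter.

In contrast to calendar-aware intervals, fixed intervals are a fixed number of SI units and never deviate, regardless of where they fall on the calendar. A fixed interval may be specified as any multiple of the supported units.

However, fixed intervals cannot express other units such as months, because the duration of a month is not a fixed quantity. Attempting to specify a calendar interval such as month in fixed_interval raises an exception.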
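
To make the difference between the two interval kinds concrete, here is a small Python sketch. It assumes UTC and fixed buckets aligned to the Unix epoch; the helper names month_start and fixed_bucket_start are hypothetical and only emulate the rounding, they are not Elasticsearch APIs.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def month_start(ts: datetime) -> datetime:
    """Calendar-aware rounding: start of the month that contains ts."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def fixed_bucket_start(ts: datetime, interval: timedelta) -> datetime:
    """Fixed rounding: floor the time since the epoch to a multiple of interval."""
    return EPOCH + ((ts - EPOCH) // interval) * interval


a = datetime(2015, 1, 31, 12, tzinfo=timezone.utc)
b = datetime(2015, 2, 1, 12, tzinfo=timezone.utc)
thirty_days = timedelta(days=30)

# Different calendar months...
print(month_start(a).date(), month_start(b).date())  # 2015-01-01 2015-02-01
# ...but the same fixed 30-day bucket.
print(fixed_bucket_start(a, thirty_days).date(),
      fixed_bucket_start(b, thirty_days).date())  # 2015-01-05 2015-01-05
```

The two timestamps are one day apart: a month interval separates them at the month boundary, while a 30d interval keeps them together because its boundaries ignore the calendar.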

Fixed intervals accept the following units:

  • milliseconds (ms)
  • seconds (s)
  • minutes (m)
  • hours (h)
  • days (d)
Fixed interval example
curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "fixed_interval": "30d"
      }
    }
  }
}
'
keys

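Internally, Elasticsearch represents a date as a 64-bit long holding a timestamp in milliseconds. These timestamps are returned as the key of each bucket. key_as_string is the same timestamp converted to a string in the specified format.

Note: if no format is specified, the first date format specified in the field mapping is used.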

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1M",
        "format": "yyyy-MM-dd" 
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "sales_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-01-01",
          "key": 1420070400000,
          "doc_count": 3
        },
        {
          "key_as_string": "2015-02-01",
          "key": 1422748800000,
          "doc_count": 2
        },
        {
          "key_as_string": "2015-03-01",
          "key": 1425168000000,
          "doc_count": 2
        }
      ]
    }
  }
}
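
The relationship between key and key_as_string in the response above can be checked with a short Python sketch. The helper name key_as_string is illustrative, and Python's %Y-%m-%d pattern is used here as a stand-in for the yyyy-MM-dd format of the request; UTC is assumed.

```python
from datetime import datetime, timezone


def key_as_string(key_ms: int, fmt: str = "%Y-%m-%d") -> str:
    """Format an epoch-millisecond bucket key as a UTC date string."""
    return datetime.fromtimestamp(key_ms / 1000, tz=timezone.utc).strftime(fmt)


for key in (1420070400000, 1422748800000, 1425168000000):
    print(key, key_as_string(key))
# 1420070400000 2015-01-01
# 1422748800000 2015-02-01
# 1425168000000 2015-03-01
```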
keyed response

When keyed is set to true, a unique string key is associated with each bucket, and the buckets are returned as a hash rather than as an array.

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1M",
        "format": "yyyy-MM-dd",
        "keyed": true
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "sales_over_time": {
      "buckets": {
        "2015-01-01": {
          "key_as_string": "2015-01-01",
          "key": 1420070400000,
          "doc_count": 3
        },
        "2015-02-01": {
          "key_as_string": "2015-02-01",
          "key": 1422748800000,
          "doc_count": 2
        },
        "2015-03-01": {
          "key_as_string": "2015-03-01",
          "key": 1425168000000,
          "doc_count": 2
        }
      }
    }
  }
}
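
For client code, the practical difference between the two response shapes is how a bucket is looked up. The following Python sketch, using the bucket values from the responses above and a hypothetical helper to_keyed, converts the array form into the keyed form.

```python
# Buckets as returned in the default (array) form.
buckets = [
    {"key_as_string": "2015-01-01", "key": 1420070400000, "doc_count": 3},
    {"key_as_string": "2015-02-01", "key": 1422748800000, "doc_count": 2},
    {"key_as_string": "2015-03-01", "key": 1425168000000, "doc_count": 2},
]


def to_keyed(bucket_list):
    """Index buckets by key_as_string, mirroring the keyed response shape."""
    return {bucket["key_as_string"]: bucket for bucket in bucket_list}


keyed = to_keyed(buckets)
print(keyed["2015-02-01"]["doc_count"])  # 2
```

With the array form a client has to scan for the bucket it wants; with the keyed form it can address the bucket directly by its formatted date.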
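Script

If the data in your documents does not exactly match what you want to aggregate, use a runtime field. For example, suppose the revenue from promoted sales should be recognized one day after the sale date: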

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "runtime_mappings": {
    "date.promoted_is_tomorrow": {
      "type": "date",
      "script": "long date = doc[\u0027date\u0027].value.toInstant().toEpochMilli();\nif (doc[\u0027promoted\u0027].value) {\n  date += 86400000;\n}\nemit(date);"
    }
  },
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date.promoted_is_tomorrow",
        "calendar_interval": "1M"
      }
    }
  }
}
'
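
The intent of the runtime field can be sketched in Python as follows. This is only an emulation of the logic described in the text, with a hypothetical function name; it assumes epoch-millisecond dates, in which one day is 86,400,000 ms.

```python
from datetime import datetime, timezone

DAY_MS = 24 * 60 * 60 * 1000  # 86,400,000 ms in one day


def promoted_is_tomorrow(date_ms: int, promoted: bool) -> int:
    """Shift the date of a promoted sale forward by one day."""
    return date_ms + DAY_MS if promoted else date_ms


sale = datetime(2015, 1, 31, tzinfo=timezone.utc)
sale_ms = int(sale.timestamp() * 1000)

shifted = promoted_is_tomorrow(sale_ms, promoted=True)
print(datetime.fromtimestamp(shifted / 1000, tz=timezone.utc).date())  # 2015-02-01
```

A promoted sale on the last day of January therefore lands in the February bucket of a monthly date histogram, while a non-promoted sale stays in January.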
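Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but they can also be treated as if they had a value.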

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sale_date": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "year",
        "missing": "2000/01/01" 
      }
    }
  }
}
'
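Order

By default, the returned buckets are sorted by their key in ascending order, but you can control the order with the order setting.

date range aggregation

A range aggregation dedicated to date values. The main difference between this aggregation and the regular range aggregation is that the from and to values can be expressed as Date Math expressions, and that a date format can be specified for the from and to fields returned in the response. Note that this aggregation includes the from value and excludes the to value of each range.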

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyyy",
        "ranges": [
          { "to": "now-10M/M" },  
          { "from": "now-10M/M" } 
        ]
      }
    }
  }
}
'

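The example above creates two buckets:

  • documents dated earlier than ten months ago, rounded down to the start of the month
  • documents dated from ten months ago, rounded down to the start of the month, up to now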
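
The Date Math expression now-10M/M can be read as "subtract ten months from now, then round down to the start of that month". A rough Python emulation follows; the function name and the value used for "now" are assumptions for illustration, and UTC is assumed.

```python
from datetime import datetime, timezone


def months_ago_month_start(now: datetime, months: int) -> datetime:
    """Emulate now-<months>M/M: go back whole months, round down to month start."""
    total = now.year * 12 + (now.month - 1) - months
    return datetime(total // 12, total % 12 + 1, 1, tzinfo=now.tzinfo)


now = datetime(2016, 8, 15, 9, 30, tzinfo=timezone.utc)  # hypothetical "now"
boundary = months_ago_month_start(now, 10)
print(boundary.isoformat())  # 2015-10-01T00:00:00+00:00


def bucket_index(doc_date: datetime) -> int:
    """0 for the { "to": boundary } range, 1 for the { "from": boundary } range."""
    return 0 if doc_date < boundary else 1


print(bucket_index(datetime(2015, 9, 30, tzinfo=timezone.utc)))  # 0
print(bucket_index(boundary))                                    # 1
```

Because from is inclusive and to is exclusive, a document dated exactly at the boundary belongs to the second bucket only.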
GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "range": {
      "date_range": {
        "field": "birthday",
        "ranges": [
          {
            "from": "2015-01",
            "to": "2015-12"
          }
        ]
      }
    }
  }
}

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "range" : {
      "buckets" : [
        {
          "key" : "2015-01-01T00:00:00.000Z-2015-12-01T00:00:00.000Z",
          "from" : 1.4200704E12,
          "from_as_string" : "2015-01-01T00:00:00.000Z",
          "to" : 1.448928E12,
          "to_as_string" : "2015-12-01T00:00:00.000Z",
          "doc_count" : 3
        }
      ]
    }
  }
}
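
The from-inclusive, to-exclusive rule can be written as a half-open interval check. The sketch below uses the from and to values of the response above; the helper in_range is illustrative only.

```python
FROM_MS = 1420070400000  # 2015-01-01T00:00:00.000Z
TO_MS = 1448928000000    # 2015-12-01T00:00:00.000Z


def in_range(value_ms, from_ms=None, to_ms=None):
    """Half-open check: from_ms <= value_ms < to_ms; None means unbounded."""
    if from_ms is not None and value_ms < from_ms:
        return False
    if to_ms is not None and value_ms >= to_ms:
        return False
    return True


print(in_range(FROM_MS, FROM_MS, TO_MS))    # True: from is included
print(in_range(TO_MS - 1, FROM_MS, TO_MS))  # True: last millisecond before to
print(in_range(TO_MS, FROM_MS, TO_MS))      # False: to is excluded
```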
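Missing values

The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but they can also be treated as if they had a value. This is done by adding a set of fieldname : value mappings that specify a default value per field.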

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
   "aggs": {
       "range": {
           "date_range": {
               "field": "date",
               "missing": "1976/11/30",
               "ranges": [
                  {
                    "key": "Older",
                    "to": "2016/02/01"
                  }, 
                  {
                    "key": "Newer",
                    "from": "2016/02/01",
                    "to" : "now/d"
                  }
              ]
          }
      }
   }
}
'

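In the example above, "missing": "1976/11/30" means that documents without a date field are treated, for the purposes of the date range aggregation, as if their date were 1976/11/30, so they fall into the "Older" bucket.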
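
A small Python sketch of this behavior follows. The documents and the value standing in for now/d are hypothetical, and the function bucket_for only emulates how a default value affects bucketing.

```python
from datetime import date

MISSING = date(1976, 11, 30)  # default for documents without a date
CUTOFF = date(2016, 2, 1)     # boundary between "Older" and "Newer"
TODAY = date(2016, 6, 1)      # hypothetical stand-in for now/d


def bucket_for(doc):
    """Return the range key for a document, or None if no range matches."""
    value = doc.get("date") or MISSING
    if value < CUTOFF:
        return "Older"
    if value < TODAY:
        return "Newer"
    return None


docs = [
    {"id": 1, "date": date(2015, 7, 4)},
    {"id": 2, "date": date(2016, 3, 10)},
    {"id": 3},  # no date field
]
print([bucket_for(doc) for doc in docs])  # ['Older', 'Newer', 'Older']
```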

keyed response

Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array:

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ],
        "keyed": true
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "range": {
      "buckets": {
        "*-10-2015": {
          "to": 1.4436576E12,
          "to_as_string": "10-2015",
          "doc_count": 7
        },
        "10-2015-*": {
          "from": 1.4436576E12,
          "from_as_string": "10-2015",
          "doc_count": 0
        }
      }
    }
  }
}

You can also customize the key for each range:

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "from": "01-2015", "to": "03-2015", "key": "quarter_01" },
          { "from": "03-2015", "to": "06-2015", "key": "quarter_02" }
        ],
        "keyed": true
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "range": {
      "buckets": {
        "quarter_01": {
          "from": 1.4200704E12,
          "from_as_string": "01-2015",
          "to": 1.425168E12,
          "to_as_string": "03-2015",
          "doc_count": 5
        },
        "quarter_02": {
          "from": 1.425168E12,
          "from_as_string": "03-2015",
          "to": 1.4331168E12,
          "to_as_string": "06-2015",
          "doc_count": 2
        }
      }
    }
  }
}

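filter aggregation

Filter aggregation. Documents are filtered before any bucket aggregation runs, and the filtered set of documents forms a single bucket on which the sub-aggregations are computed. For example: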

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0, 
  "aggs": {
    "avg_age":{
      "avg": {
        "field": "age"
      }
    },
    "filter_agg": {
      "filter": {
        "term": {
          "email": "123456@qq.com"
        }
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_age" : {
      "value" : 39.75
    },
    "filter_agg" : {
      "doc_count" : 2,
      "avg_age" : {
        "value" : 60.0
      }
    }
  }
}
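
To illustrate how the two avg_age values in the response above relate, here is a Python emulation over a handful of hypothetical documents. The data is invented so that the numbers match the response; the real index contents are not shown in this article.

```python
# Hypothetical documents; one of them has no age value.
docs = [
    {"email": "123456@qq.com", "age": 50},
    {"email": "123456@qq.com", "age": 70},
    {"email": "110@qq.com", "age": 13},
    {"email": "other@example.com", "age": 26},
    {"email": "noage@example.com"},
]


def avg(documents, field):
    """Average of a field over the documents that have it, or None."""
    values = [d[field] for d in documents if field in d]
    return sum(values) / len(values) if values else None


# Top-level avg_age: computed over every document in the search context.
print(avg(docs, "age"))  # 39.75

# filter_agg: narrow to matching documents first, then run the sub-aggregation.
filtered = [d for d in docs if d["email"] == "123456@qq.com"]
print(len(filtered), avg(filtered, "age"))  # 2 60.0
```

The filter only affects the sub-aggregations nested under it; the sibling avg_age at the top level still sees all documents.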

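Use a top-level query to limit all aggregations

To limit the documents on which all aggregations in a search run, use a top-level query. This is faster than a single filter aggregation with sub-aggregations. For example: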

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0, 
  "query": {
    "term": {
      "email": {
        "value": "123456@qq.com"
      }
    }
  },
  "aggs": {
    "age_avg": {
      "avg": {
        "field": "age"
      }
    }
  }
}'

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_avg" : {
      "value" : 60.0
    }
  }
}

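Use the filters aggregation for multiple filters

To group documents using multiple filters, use the filters aggregation. This is much faster than using several separate filter aggregations. For example: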

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0, 
  "aggs": {
    "myfilters": {
      "filters": {
        "filters": {
          "a": {"term": {
            "email": "123456@qq.com"
          }},
          "b":{
            "term": {
              "email": "110@qq.com"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

The following request is equivalent to the one above, but the one above performs much better:

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "a": {
      "filter": {"term": {
        "email": "123456@qq.com"
      }},
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    },
    "b":{
      "filter": {"term": {
        "email": "110@qq.com"
      }},
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

Response to the filters request:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "myfilters" : {
      "buckets" : {
        "a" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 60.0
          }
        },
        "b" : {
          "doc_count" : 1,
          "age_avg" : {
            "value" : 13.0
          }
        }
      }
    }
  }
}

filters aggregation

A multi-bucket aggregation in which each bucket contains the documents that match a query. For example:

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0, 
  "aggs": {
    "my_filters": {
      "filters": {
        "filters": {
          "li": {
            "match": {
              "name": "li"
            }
          },
          "wang":{
            "match":{
              "name":"wang"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : {
        "li" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        "wang" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        }
      }
    }
  }
}
Anonymous filters

The filters field can also be provided as an array of filters, as in the following request:

GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "filters": [
          {"match":{"name":"li"}},
          {"match":{"name":"wang"}}
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : [
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        }
      ]
    }
  }
}
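Other bucket

Setting the other_bucket parameter adds a bucket to the response that contains all documents that do not match any of the given filters. The value can be one of the following:

  • false: the other bucket is not computed.

  • true: the other bucket is returned. With anonymous filters it is the last bucket. With named filters it is a bucket whose name is set by the other_bucket_key parameter (_other_ by default).

In the following example, the other bucket is named other_messages. Setting other_bucket_key implicitly sets other_bucket to true.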

GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "other_bucket_key": "other_messages",
        "filters": {
          "li": {
            "match": {
              "name": "li"
            }
          },
          "wang": {
            "match": {
              "name": "wang"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : {
        "li" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        "wang" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        },
        "other_messages" : {
          "doc_count" : 0,
          "age_avg" : {
            "value" : null
          }
        }
      }
    }
  }
}

The following example uses anonymous filters, so the last bucket is the other bucket:

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "other_bucket_key": "other_messages",
        "filters": [
          {
            "match": {
              "name": "li"
            }
          },
          {
            "match": {
              "name": "wang"
            }
          }
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : [
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        },
        {
          "doc_count" : 0,
          "age_avg" : {
            "value" : null
          }
        }
      ]
    }
  }
}
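
The bucketing rule of the filters aggregation with an other bucket can be emulated in a few lines of Python. The documents below are hypothetical, the match query is approximated by a simple token test, and filters_agg is an illustrative helper rather than anything provided by Elasticsearch.

```python
def filters_agg(docs, filters, other_bucket_key=None):
    """Count documents per named filter; optionally count the unmatched rest."""
    counts = {name: 0 for name in filters}
    if other_bucket_key is not None:
        counts[other_bucket_key] = 0
    for doc in docs:
        matched = False
        for name, predicate in filters.items():
            if predicate(doc):  # a document may fall into several buckets
                counts[name] += 1
                matched = True
        if not matched and other_bucket_key is not None:
            counts[other_bucket_key] += 1
    return counts


docs = [
    {"name": "li ming"},
    {"name": "li lei"},
    {"name": "wang fang"},
    {"name": "wang wei"},
    {"name": "zhao yun"},
]
filters = {
    "li": lambda d: "li" in d["name"].split(),
    "wang": lambda d: "wang" in d["name"].split(),
}

print(filters_agg(docs, filters, other_bucket_key="other_messages"))
# {'li': 2, 'wang': 2, 'other_messages': 1}
```

All filters are evaluated in a single pass over the documents, and a document reaches the other bucket only when none of the filters matched it.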