Elasticsearch Aggregations (Part 3)

666呀 · Published 2021-07-15 11:16:54

Bucket aggregations

Date histogram aggregation

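The date histogram aggregation is a multi-bucket aggregation similar to the normal histogram, but it can only be used with date or date range values. Because Elasticsearch represents dates internally as long values, the normal histogram can also be applied to dates, although less accurately. The main difference between the two APIs is that the interval of a date histogram can be specified with date/time expressions.

As with the histogram, values are rounded down into the closest bucket. For example, if the interval is one day, 2020-01-03T07:00:01Z is rounded to 2020-01-03T00:00:00Z. Values are rounded as follows: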

bucket_key = Math.floor(value / interval) * interval
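
The rounding formula can be sketched in Python for a fixed one-day interval. This is only an illustration: the function name bucket_key and the DAY_MS constant are assumptions made here, and calendar-aware intervals additionally depend on time zones and calendar rules that this sketch ignores.

```python
from datetime import datetime, timezone

DAY_MS = 24 * 60 * 60 * 1000  # one fixed day in milliseconds


def bucket_key(value_ms: int, interval_ms: int) -> int:
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    # Floor division plays the role of Math.floor(value / interval).
    return (value_ms // interval_ms) * interval_ms


ts = datetime(2020, 1, 3, 7, 0, 1, tzinfo=timezone.utc)
value_ms = int(ts.timestamp() * 1000)
key = bucket_key(value_ms, DAY_MS)
print(datetime.fromtimestamp(key / 1000, tz=timezone.utc).isoformat())
# prints 2020-01-03T00:00:00+00:00
```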

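When configuring a date histogram aggregation, the interval can be specified in two ways: as a calendar-aware interval or as a fixed interval.

Calendar-aware intervals understand that daylight saving time changes the length of specific days, that months have different numbers of days, and that leap seconds can be added to a particular year.

Fixed intervals, by contrast, are always multiples of SI units and do not change with the calendar context.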

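Calendar intervals

Calendar-aware intervals are configured with the calendar_interval parameter. You can specify a calendar interval by unit name, such as month, or as a single-unit quantity, such as 1M. For example, day and 1d are equivalent. Multiples, such as 2d, are not supported.

Calendar intervals accept the following values:

minute, 1m
hour, 1h
day, 1d
week, 1w
month, 1M
quarter, 1q
year, 1y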

Calendar interval examples

The following request buckets documents into calendar months:

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      }
    }
  }
}
'

If you try to use a multiple of a calendar unit, the aggregation fails, because calendar intervals support only a single calendar unit:

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "2d"
      }
    }
  }
}
'

{
  "error" : {
    "root_cause" : [...],
    "type" : "x_content_parse_exception",
    "reason" : "[1:82] [date_histogram] failed to parse field [calendar_interval]",
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "The supplied interval [2d] could not be parsed as a calendar interval.",
      "stack_trace" : "java.lang.IllegalArgumentException: The supplied interval [2d] could not be parsed as a calendar interval."
    }
  }
}
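
To catch this kind of error before sending a request, a client could check the interval string first. The sketch below is a hypothetical helper, not part of any Elasticsearch client; it simply encodes the accepted unit names and single-unit quantities. Note that the month quantity is written 1M (uppercase), since 1m means one minute.

```python
# Values accepted by calendar_interval: a unit name or a single-unit quantity.
CALENDAR_INTERVALS = {
    "minute", "1m",
    "hour", "1h",
    "day", "1d",
    "week", "1w",
    "month", "1M",
    "quarter", "1q",
    "year", "1y",
}


def is_calendar_interval(interval: str) -> bool:
    """Return True if the string is a valid calendar interval (hypothetical helper)."""
    return interval in CALENDAR_INTERVALS


for candidate in ("month", "1d", "2d"):
    print(candidate, is_calendar_interval(candidate))
```

Multiples such as 2d are rejected here, mirroring the parse error shown above.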

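Fixed intervals

Fixed intervals are configured with the fixed_interval parameter.

In contrast to calendar-aware intervals, a fixed interval is a fixed number of SI units and never deviates, regardless of where it falls on the calendar. This allows a fixed interval to be specified as any multiple of a supported unit.

However, fixed intervals cannot express units such as months, because the duration of a month is not a fixed quantity. Specifying a calendar unit such as month or quarter in fixed_interval throws an exception.

Fixed intervals accept the following units:

milliseconds (ms)
seconds (s)
minutes (m)
hours (h)
days (d)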

Fixed interval example

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "fixed_interval": "30d"
      }
    }
  }
}
'
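
Because every fixed unit has a constant length, a fixed interval such as 30d maps to an exact number of milliseconds. The following Python sketch shows that conversion; the function name fixed_interval_ms and the regular expression are illustrative assumptions, not the parser Elasticsearch actually uses.

```python
import re

# Length of each supported fixed unit in milliseconds.
UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def fixed_interval_ms(interval: str) -> int:
    """Convert a fixed interval like '30d' to milliseconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", interval)
    if match is None:
        # Calendar units such as '1M' (month) have no fixed length.
        raise ValueError(f"not a fixed interval: {interval!r}")
    count, unit = match.groups()
    return int(count) * UNIT_MS[unit]


print(fixed_interval_ms("30d"))  # prints 2592000000
```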

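Keys

Internally, Elasticsearch represents a date as a 64-bit integer: a timestamp in milliseconds since the epoch. These timestamps are returned as the key of each bucket. key_as_string is the same timestamp converted to a formatted date string using the format parameter.

Tip: If you don't specify format, the first date format specified in the field mapping is used.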

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1M",
        "format": "yyyy-MM-dd" 
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "sales_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-01-01",
          "key": 1420070400000,
          "doc_count": 3
        },
        {
          "key_as_string": "2015-02-01",
          "key": 1422748800000,
          "doc_count": 2
        },
        {
          "key_as_string": "2015-03-01",
          "key": 1425168000000,
          "doc_count": 2
        }
      ]
    }
  }
}
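
The relationship between key and key_as_string in this response can be reproduced in Python. The helper name is an assumption for illustration, and Python's strftime pattern "%Y-%m-%d" stands in for the yyyy-MM-dd format used in the request.

```python
from datetime import datetime, timezone


def key_as_string(key_ms: int, fmt: str = "%Y-%m-%d") -> str:
    """Format an epoch-millisecond bucket key as a UTC date string."""
    return datetime.fromtimestamp(key_ms / 1000, tz=timezone.utc).strftime(fmt)


# Bucket keys taken from the response above.
for key in (1420070400000, 1422748800000, 1425168000000):
    print(key, key_as_string(key))
```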

Keyed response

Setting keyed to true associates a unique string key with each bucket and returns the buckets as a hash rather than an array:

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "1M",
        "format": "yyyy-MM-dd",
        "keyed": true
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "sales_over_time": {
      "buckets": {
        "2015-01-01": {
          "key_as_string": "2015-01-01",
          "key": 1420070400000,
          "doc_count": 3
        },
        "2015-02-01": {
          "key_as_string": "2015-02-01",
          "key": 1422748800000,
          "doc_count": 2
        },
        "2015-03-01": {
          "key_as_string": "2015-03-01",
          "key": 1425168000000,
          "doc_count": 2
        }
      }
    }
  }
}
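
The difference between the array form and the keyed form is purely structural. As a rough sketch, the keyed shape can be derived from the array shape on the client side; to_keyed is a hypothetical helper that uses key_as_string as the hash key, matching the response above.

```python
# Buckets in the default array form, copied from the earlier response.
buckets = [
    {"key_as_string": "2015-01-01", "key": 1420070400000, "doc_count": 3},
    {"key_as_string": "2015-02-01", "key": 1422748800000, "doc_count": 2},
    {"key_as_string": "2015-03-01", "key": 1425168000000, "doc_count": 2},
]


def to_keyed(bucket_list: list) -> dict:
    """Index each bucket by its key_as_string, like a keyed response."""
    return {bucket["key_as_string"]: bucket for bucket in bucket_list}


keyed = to_keyed(buckets)
print(keyed["2015-02-01"]["doc_count"])  # prints 2
```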

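Scripts

If the data in your documents doesn't exactly match what you'd like to aggregate, use a runtime field. For example, if the revenue for promoted sales should be recognized a day after the sale date: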

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "runtime_mappings": {
    "date.promoted_is_tomorrow": {
      "type": "date",
      "script": "long date = doc[\u0027date\u0027].value.toInstant().toEpochMilli();\nif (doc[\u0027promoted\u0027].value) {\n  date += 86400000;\n}\nemit(date);"
    }
  },
  "aggs": {
    "sales_over_time": {
      "date_histogram": {
        "field": "date.promoted_is_tomorrow",
        "calendar_interval": "1M"
      }
    }
  }
}
'
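
The logic of the runtime field script can be restated in Python. This sketch assumes the intent is to shift promoted sales by exactly one day; since toEpochMilli() yields milliseconds, one day corresponds to 86,400,000. The function name mirrors the runtime field name and is otherwise hypothetical.

```python
DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds (86,400,000)


def promoted_is_tomorrow(date_ms: int, promoted: bool) -> int:
    """Return the timestamp to aggregate on: one day later for promoted sales."""
    return date_ms + DAY_MS if promoted else date_ms


sale = 1420070400000  # 2015-01-01T00:00:00Z
print(promoted_is_tomorrow(sale, promoted=True))   # prints 1420156800000
print(promoted_is_tomorrow(sale, promoted=False))  # prints 1420070400000
```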

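Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but they can also be treated as if they had a value.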

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sale_date": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "year",
        "missing": "2000/01/01" 
      }
    }
  }
}
'

Order

By default, the returned buckets are sorted by their key in ascending order, but you can control the order with the order setting.
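
As an illustration, a request body that returns the newest buckets first could be built in Python as follows. The sketch assumes the order setting takes the same _key / _count form as in the histogram aggregation; the date field and the aggregation name are reused from the earlier examples.

```python
import json

# Request body for a monthly date histogram, sorted by bucket key descending.
body = {
    "aggs": {
        "sales_over_time": {
            "date_histogram": {
                "field": "date",
                "calendar_interval": "month",
                "order": {"_key": "desc"},  # assumption: same syntax as histogram
            }
        }
    }
}

print(json.dumps(body, indent=2))
```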

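Date range aggregation

A range aggregation dedicated to date values. The main difference from the normal range aggregation is that the from and to values can be expressed as Date Math expressions, and that you can specify a date format in which the from and to response fields are returned. Note that this aggregation includes the from value and excludes the to value of each range.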

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyyy",
        "ranges": [
          { "to": "now-10M/M" },  
          { "from": "now-10M/M" } 
        ]
      }
    }
  }
}
'

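The example above creates two buckets:

Documents dated before the start of the month 10 months ago
Documents dated from the start of the month 10 months ago onward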

GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "range": {
      "date_range": {
        "field": "birthday",
        "ranges": [
          {
            "from": "2015-01",
            "to": "2015-12"
          }
        ]
      }
    }
  }
}

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "range" : {
      "buckets" : [
        {
          "key" : "2015-01-01T00:00:00.000Z-2015-12-01T00:00:00.000Z",
          "from" : 1.4200704E12,
          "from_as_string" : "2015-01-01T00:00:00.000Z",
          "to" : 1.448928E12,
          "to_as_string" : "2015-12-01T00:00:00.000Z",
          "doc_count" : 3
        }
      ]
    }
  }
}
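
The from-inclusive, to-exclusive rule can be checked with a small Python sketch that uses the boundaries from this response. The helper names ms and in_range are assumptions made for illustration.

```python
from datetime import datetime, timezone


def ms(year: int, month: int, day: int = 1) -> int:
    """Epoch milliseconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


def in_range(value_ms, from_ms=None, to_ms=None) -> bool:
    """from is inclusive, to is exclusive; None means unbounded on that side."""
    if from_ms is not None and value_ms < from_ms:
        return False
    if to_ms is not None and value_ms >= to_ms:
        return False
    return True


lower, upper = ms(2015, 1), ms(2015, 12)
print(lower, upper)                       # prints 1420070400000 1448928000000
print(in_range(lower, lower, upper))      # prints True  (from is included)
print(in_range(upper, lower, upper))      # prints False (to is excluded)
```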

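Missing values

The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but they can also be treated as if they had a value. This is done by adding a set of fieldname : value mappings that specify a default value per field.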

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
   "aggs": {
       "range": {
           "date_range": {
               "field": "date",
               "missing": "1976/11/30",
               "ranges": [
                  {
                    "key": "Older",
                    "to": "2016/02/01"
                  }, 
                  {
                    "key": "Newer",
                    "from": "2016/02/01",
                    "to" : "now/d"
                  }
              ]
          }
      }
   }
}
'

Here "missing": "1976/11/30" means that documents without a date field are bucketed as if their date were 1976/11/30, so they fall into the Older bucket.
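
A minimal Python sketch of this behaviour follows. It works on whole days rather than timestamps, and the function name bucket_for and the sample documents in the usage lines are hypothetical; only the default date and the range boundary come from the request above.

```python
from datetime import date
from typing import Optional

MISSING = date(1976, 11, 30)  # value of the "missing" parameter
CUTOFF = date(2016, 2, 1)     # boundary between the Older and Newer ranges


def bucket_for(doc: dict, today: date) -> Optional[str]:
    """Return the bucket key for a document, or None if it matches no range."""
    value = doc.get("date", MISSING)  # documents without a date get the default
    if value < CUTOFF:
        return "Older"
    if value < today:  # "to": "now/d" is exclusive
        return "Newer"
    return None


today = date(2021, 7, 15)
print(bucket_for({}, today))                          # prints Older
print(bucket_for({"date": date(2020, 1, 1)}, today))  # prints Newer
```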

Keyed response

Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array:

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "to": "now-10M/M" },
          { "from": "now-10M/M" }
        ],
        "keyed": true
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "range": {
      "buckets": {
        "*-10-2015": {
          "to": 1.4436576E12,
          "to_as_string": "10-2015",
          "doc_count": 7
        },
        "10-2015-*": {
          "from": 1.4436576E12,
          "from_as_string": "10-2015",
          "doc_count": 0
        }
      }
    }
  }
}

You can also customize the key for each range:

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "range": {
      "date_range": {
        "field": "date",
        "format": "MM-yyy",
        "ranges": [
          { "from": "01-2015", "to": "03-2015", "key": "quarter_01" },
          { "from": "03-2015", "to": "06-2015", "key": "quarter_02" }
        ],
        "keyed": true
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "range": {
      "buckets": {
        "quarter_01": {
          "from": 1.4200704E12,
          "from_as_string": "01-2015",
          "to": 1.425168E12,
          "to_as_string": "03-2015",
          "doc_count": 5
        },
        "quarter_02": {
          "from": 1.425168E12,
          "from_as_string": "03-2015",
          "to": 1.4331168E12,
          "to_as_string": "06-2015",
          "doc_count": 2
        }
      }
    }
  }
}

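Filter aggregation

The filter aggregation is a single-bucket aggregation: documents are filtered first, those that match are placed in one bucket, and sub-aggregations are then computed on that bucket. The following example computes the average age over all documents and over only the documents that match the filter: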

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0, 
  "aggs": {
    "avg_age":{
      "avg": {
        "field": "age"
      }
    },
    "filter_agg": {
      "filter": {
        "term": {
          "email": "123456@qq.com"
        }
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_age" : {
      "value" : 39.75
    },
    "filter_agg" : {
      "doc_count" : 2,
      "avg_age" : {
        "value" : 60.0
      }
    }
  }
}
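
To make the semantics concrete, the same computation can be sketched in plain Python. The sample documents below are invented for illustration and are not the article's actual index contents; they are merely chosen so that the results coincide with the response above. Documents without an age field are skipped when averaging.

```python
# Hypothetical sample documents (not the real index contents).
DOCS = [
    {"email": "123456@qq.com", "age": 70},
    {"email": "123456@qq.com", "age": 50},
    {"email": "110@qq.com", "age": 13},
    {"email": "a@example.com", "age": 26},
    {"email": "b@example.com"},  # no age field: skipped when averaging
]


def avg_age(docs):
    """Average of the age field over the documents that have one."""
    ages = [doc["age"] for doc in docs if "age" in doc]
    return sum(ages) / len(ages) if ages else None


def filter_agg(docs, email):
    """Single bucket of documents matching the term filter, with a sub-average."""
    bucket = [doc for doc in docs if doc.get("email") == email]
    return {"doc_count": len(bucket), "avg_age": {"value": avg_age(bucket)}}


print({"avg_age": {"value": avg_age(DOCS)}})
print({"filter_agg": filter_agg(DOCS, "123456@qq.com")})
```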

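Use a top-level query to limit all aggregations

To restrict every aggregation in a search to the same set of documents, use a top-level query. This is faster than a single filter aggregation with sub-aggregations. For example: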

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0, 
  "query": {
    "term": {
      "email": {
        "value": "123456@qq.com"
      }
    }
  },
  "aggs": {
    "age_avg": {
      "avg": {
        "field": "age"
      }
    }
  }
}'

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_avg" : {
      "value" : 60.0
    }
  }
}

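Use the filters aggregation for multiple filters

To group documents using multiple filters, use the filters aggregation. This is faster than multiple filter aggregations. For example: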

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0, 
  "aggs": {
    "myfilters": {
      "filters": {
        "filters": {
          "a": {"term": {
            "email": "123456@qq.com"
          }},
          "b":{
            "term": {
              "email": "110@qq.com"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

The following request is equivalent to the one above, but the filters version above performs better:

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "a": {
      "filter": {"term": {
        "email": "123456@qq.com"
      }},
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    },
    "b":{
      "filter": {"term": {
        "email": "110@qq.com"
      }},
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

The first request (the one using filters) returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "myfilters" : {
      "buckets" : {
        "a" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 60.0
          }
        },
        "b" : {
          "doc_count" : 1,
          "age_avg" : {
            "value" : 13.0
          }
        }
      }
    }
  }
}
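
The grouping performed by the filters aggregation can be mimicked in Python with one pass over the documents. This sketch only illustrates the bucket semantics; it says nothing about how Elasticsearch evaluates filters internally or why the filters form is faster. The documents and the filters_agg helper are hypothetical.

```python
# Hypothetical sample documents (not the real index contents).
DOCS = [
    {"email": "123456@qq.com", "age": 70},
    {"email": "123456@qq.com", "age": 50},
    {"email": "110@qq.com", "age": 13},
    {"email": "a@example.com", "age": 26},
]

# Named filters, expressed as predicates on a document.
FILTERS = {
    "a": lambda doc: doc.get("email") == "123456@qq.com",
    "b": lambda doc: doc.get("email") == "110@qq.com",
}


def filters_agg(docs, filters):
    """Group ages into one bucket per named filter and average each bucket."""
    ages_by_name = {name: [] for name in filters}
    for doc in docs:  # single pass over the documents
        for name, matches in filters.items():
            if matches(doc):
                ages_by_name[name].append(doc["age"])
    return {
        name: {
            "doc_count": len(ages),
            "age_avg": {"value": sum(ages) / len(ages) if ages else None},
        }
        for name, ages in ages_by_name.items()
    }


print(filters_agg(DOCS, FILTERS))
```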

Filters aggregation

A multi-bucket aggregation in which each bucket contains the documents that match a query. For example:

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0, 
  "aggs": {
    "my_filters": {
      "filters": {
        "filters": {
          "li": {
            "match": {
              "name": "li"
            }
          },
          "wang":{
            "match":{
              "name":"wang"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : {
        "li" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        "wang" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        }
      }
    }
  }
}

Anonymous filters

The filters field can also be provided as an array of filters, as in the following request:

GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "filters": [
          {"match":{"name":"li"}},
          {"match":{"name":"wang"}}
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

Response (the buckets are returned in the same order as the filters in the request):

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : [
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        }
      ]
    }
  }
}

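Other bucket

The other_bucket parameter adds a bucket to the response that contains all documents matching none of the given filters. It accepts the following values:

false: the other bucket is not computed.
true: the other bucket is returned, either as a named bucket (_other_ by default) when named filters are used, or as the last bucket when anonymous filters are used.

The other_bucket_key parameter sets the key of the other bucket to something other than the default _other_. Setting it implicitly sets other_bucket to true.

In the following example, the other bucket is named other_messages.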

GET my-index-000001/_search
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "other_bucket_key": "other_messages",
        "filters": {
          "li": {
            "match": {
              "name": "li"
            }
          },
          "wang": {
            "match": {
              "name": "wang"
            }
          }
        }
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : {
        "li" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        "wang" : {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        },
        "other_messages" : {
          "doc_count" : 0,
          "age_avg" : {
            "value" : null
          }
        }
      }
    }
  }
}

The following example uses anonymous filters, so the other bucket is returned as the last bucket:

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "my_filters": {
      "filters": {
        "other_bucket_key": "other_messages",
        "filters": [
          {
            "match": {
              "name": "li"
            }
          },
          {
            "match": {
              "name": "wang"
            }
          }
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}'

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_filters" : {
      "buckets" : [
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 27.5
          }
        },
        {
          "doc_count" : 2,
          "age_avg" : {
            "value" : 31.5
          }
        },
        {
          "doc_count" : 0,
          "age_avg" : {
            "value" : null
          }
        }
      ]
    }
  }
}
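
How the other bucket is filled with anonymous filters can be sketched in Python: each filter gets a bucket in request order, and documents that match no filter land in one extra bucket at the end. The documents, the token-based matching (a rough stand-in for the match query), and the helper names are all assumptions made for illustration.

```python
# Hypothetical sample documents (not the real index contents).
DOCS = [
    {"name": "li si", "age": 25},
    {"name": "li wu", "age": 30},
    {"name": "wang er", "age": 28},
    {"name": "wang liu", "age": 35},
    {"name": "zhao qi", "age": 40},
]


def name_has(token):
    """Crude stand-in for a match query: whole-token match on the name field."""
    return lambda doc: token in doc.get("name", "").split()


def filters_with_other(docs, predicates):
    """One bucket per anonymous filter, plus the other bucket as the last entry."""
    buckets = [[] for _ in predicates] + [[]]
    for doc in docs:
        matched = False
        for index, matches in enumerate(predicates):
            if matches(doc):
                buckets[index].append(doc)
                matched = True
        if not matched:
            buckets[-1].append(doc)  # matches none of the filters
    return buckets


result = filters_with_other(DOCS, [name_has("li"), name_has("wang")])
print([len(bucket) for bucket in result])  # prints [2, 2, 1]
```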