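A distributed statistics algorithm is judged on three criteria: data volume, real-time performance, and accuracy. Any algorithm can satisfy only two of the three, and Elasticsearch (ES) gives up some aggregation accuracy in exchange for real-time results. Because an index's data is spread across shards, the coordinating node never sees the complete data set: each shard returns only its own top terms, and the coordinating node merges those partial lists. To expose the resulting error, ES returns doc_count_error_upper_bound, an upper bound on the number of documents that may have been missed for the term buckets. The smaller this value, the more accurate the result, and 0 means the result is exact. There are two ways to get exact counts: (1) when the data volume is small, use a single shard; (2) increase the shard_size parameter until the missed-document count for the term buckets is 0, which means the aggregation result is exact.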
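
To see why merging per-shard top terms can undercount, here is a minimal Python sketch. It is not ES code. It assumes a simplified model in which each shard returns only its local top shard_size terms, and the error bound is the sum of the count of the last term returned by every shard that had to truncate its list. The function names and the weather counts are made up for illustration.

```python
from collections import Counter

# Hypothetical data: three shards, each holding one weather term per document.
SHARDS = [
    ["sunny"] * 6 + ["rain"] * 4 + ["cloudy"] * 3,
    ["sunny"] * 5 + ["cloudy"] * 4 + ["rain"] * 2,
    ["sunny"] * 4 + ["cloudy"] * 3 + ["rain"] * 2,
]


def shard_top_terms(shard_docs, shard_size):
    """Return a shard's local top terms and its contribution to the error bound."""
    counts = Counter(shard_docs)
    top = counts.most_common(shard_size)
    # A shard that returned every term it holds cannot have missed anything.
    truncated = len(counts) > shard_size
    last_count = top[-1][1] if (top and truncated) else 0
    return dict(top), last_count


def merged_terms(shards, size, shard_size):
    """Merge the per-shard lists the way a coordinating node would (simplified)."""
    partials = [shard_top_terms(docs, shard_size) for docs in shards]
    merged = Counter()
    for top, _ in partials:
        merged.update(top)  # add up the counts each shard reported
    error_upper_bound = sum(last for _, last in partials)
    return merged.most_common(size), error_upper_bound
```

With these numbers, merged_terms(SHARDS, size=2, shard_size=2) reports cloudy as 7 although its true count is 10, with an error bound of 11. With shard_size=3 every shard returns all of its terms, the counts are exact, and the bound drops to 0.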

Initialize the data
DELETE my_flights
PUT my_flights
{
  "settings": {
    "number_of_shards": 20
  },
  "mappings" : {
      "properties" : {
        "AvgTicketPrice" : {
          "type" : "float"
        },
        "Cancelled" : {
          "type" : "boolean"
        },
        "Carrier" : {
          "type" : "keyword"
        },
        "Dest" : {
          "type" : "keyword"
        },
        "DestAirportID" : {
          "type" : "keyword"
        },
        "DestCityName" : {
          "type" : "keyword"
        },
        "DestCountry" : {
          "type" : "keyword"
        },
        "DestLocation" : {
          "type" : "geo_point"
        },
        "DestRegion" : {
          "type" : "keyword"
        },
        "DestWeather" : {
          "type" : "keyword"
        },
        "DistanceKilometers" : {
          "type" : "float"
        },
        "DistanceMiles" : {
          "type" : "float"
        },
        "FlightDelay" : {
          "type" : "boolean"
        },
        "FlightDelayMin" : {
          "type" : "integer"
        },
        "FlightDelayType" : {
          "type" : "keyword"
        },
        "FlightNum" : {
          "type" : "keyword"
        },
        "FlightTimeHour" : {
          "type" : "keyword"
        },
        "FlightTimeMin" : {
          "type" : "float"
        },
        "Origin" : {
          "type" : "keyword"
        },
        "OriginAirportID" : {
          "type" : "keyword"
        },
        "OriginCityName" : {
          "type" : "keyword"
        },
        "OriginCountry" : {
          "type" : "keyword"
        },
        "OriginLocation" : {
          "type" : "geo_point"
        },
        "OriginRegion" : {
          "type" : "keyword"
        },
        "OriginWeather" : {
          "type" : "keyword"
        },
        "dayOfWeek" : {
          "type" : "integer"
        },
        "timestamp" : {
          "type" : "date"
        }
      }
    }
}

Reindex the sample data into my_flights
POST _reindex
{
  "source":{
    "index":"kibana_sample_data_flights"
  },
  "dest":{
    "index":"my_flights"
  }
}

Count the documents in the source index
GET kibana_sample_data_flights/_count

Count the documents in the new index
GET my_flights/_count


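Browse the sample documents
GET kibana_sample_data_flights/_search

Run a terms aggregation on the source index and show the error for each term
GET kibana_sample_data_flights/_search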
{
  "size":0,
  "aggs":{
    "weather":{
      "terms": {
        "field": "OriginWeather",
        "size": 5,
        "show_term_doc_count_error":true
      }
    }
  }
}

Increase shard_size until doc_count_error_upper_bound is 0 to fix the inaccurate aggregation
GET my_flights/_search
{
  "size":0,
  "aggs":{
    "weather":{
      "terms": {
        "field": "OriginWeather",
        "size": 5,
        "shard_size": 10, 
        "show_term_doc_count_error":true
      }
    }
  }
}
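
The "increase until the error is 0" procedure can be sketched as a small loop. This is only an illustration under stated assumptions: run_terms_agg is a hypothetical callable that sends the request above with the given shard_size and returns the "weather" object from the response's aggregations. The doubling strategy, the starting value, and the upper limit are arbitrary choices, not ES recommendations.

```python
def find_exact_shard_size(run_terms_agg, start=10, limit=10_000):
    """Double shard_size until the terms aggregation reports a zero error bound.

    run_terms_agg(shard_size) is a hypothetical helper that must return the
    aggregation object, i.e. a dict with "doc_count_error_upper_bound" and "buckets".
    Returns the first shard_size that gives an exact result, or None if the
    limit is reached first.
    """
    shard_size = start
    while shard_size <= limit:
        agg = run_terms_agg(shard_size)
        # With show_term_doc_count_error, each bucket also carries its own bound.
        per_bucket = [
            bucket.get("doc_count_error_upper_bound", 0) for bucket in agg["buckets"]
        ]
        if agg["doc_count_error_upper_bound"] == 0 and not any(per_bucket):
            return shard_size
        shard_size *= 2
    return None
```

A larger shard_size makes every shard return more terms, so the value found this way trades extra work per request for exact counts.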

 
