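The previous article, "Basic index and document operations in Elastic", covered Elasticsearch fundamentals along with index and document operations. This part covers the query and aggregation operations most often used in Elasticsearch.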

Search basics

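Elasticsearch analyzes (tokenizes) the content of text fields and builds an inverted index from the resulting tokens. To match the complete, unanalyzed value of a field, query its keyword sub-field ({field}.keyword), which the default dynamic mapping creates for string fields.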

All examples use documents with the following structure:

{
	"number": 1802,
	"name": "Name 36zou",
	"age": 28,
	"courses": [
		{
			"name": "maths",
			"hours": 160,
			"teacher": "mike"
		},
		{
			"name": "english",
			"hours": 120,
			"teacher": "tom"
		}
	]
}

Analyzers

An analyzer takes a string as input, splits it into individual terms or tokens (possibly discarding characters such as punctuation), and emits a token stream.

Built-in analyzers:

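| Analyzer | Description |
|----------|-------------|
| standard | Default analyzer. Splits on word boundaries, lowercases the tokens, and drops most punctuation (the underscore _ is kept as part of a word); Chinese text is split into single characters. Stop-word removal is available but disabled by default. |
| simple | Splits text at every non-letter character and lowercases the tokens; digits are therefore dropped. |
| whitespace | Splits on whitespace only, so it is of little use for Chinese text; tokens are not normalized and not lowercased. |
| pattern | Splits with a regular expression; the default is \W+ (any run of non-word characters). |
| keyword | No tokenization: the whole input is emitted as a single token (use {field}.keyword to query a value as a whole). |
| language | Language-specific analyzers. |
| custom | User-defined analyzer. |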
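
To make the difference between whitespace and simple concrete, here is a minimal plain-Java sketch that only approximates the two behaviours described in the table. It does not call Elasticsearch; the class and method names (AnalyzerSketch, whitespace, simple) are made up for illustration, and the real analyzers handle many more cases.

```java
import java.util.ArrayList;
import java.util.List;

class AnalyzerSketch {
    /** Rough stand-in for the whitespace analyzer: split on whitespace, keep case. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Rough stand-in for the simple analyzer: split at non-letters, lowercase. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token; the character itself is dropped
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespace("Name 36zou")); // [Name, 36zou]
        System.out.println(simple("Name 36zou"));     // [name, zou]
    }
}
```

With the sample name "Name 36zou", the digits survive under whitespace but vanish under simple, which is why the choice of analyzer decides which search terms can match.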

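Requests

A search is described by a SearchRequest; search() returns a SearchResponse, which holds the matching documents and the total number of matches. The SearchRequest is constructed with the index name(s):

  • several indices may be given, in which case all of them are searched;
  • * acts as a wildcard: test* matches every index whose name starts with test.

SearchHits (returned by response.getHits()) wraps the result set, and each SearchHit is one returned document:

  • SearchHits.getHits: the array of hits in this response; its length is the number of documents actually returned;
  • SearchHits.getTotalHits().value: the total number of documents in ES that match the query (in 7.x the count is only tracked up to 10000 unless trackTotalHits(true) is set);
  • SearchHit.getSourceAsString: the document source as a JSON string;
  • SearchHit.getSourceAsMap: the document source as a Map for convenient field access (the concrete numeric type, Integer or Long, depends on the value, so read numbers through Number rather than casting to a fixed type).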
public void searchQuery(String index, SearchSourceBuilder sourceBuilder) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchRequest reqSearch = new SearchRequest(index);
        reqSearch.source(sourceBuilder);

        SearchResponse respSearch = rhlClient.search(reqSearch, RequestOptions.DEFAULT);
        SearchHits gotHits = respSearch.getHits();
        // Hits returned in this response vs. total number of matching documents
        System.out.printf("get size: %d, total size: %d\n", gotHits.getHits().length, gotHits.getTotalHits().value);
        for (SearchHit hit : gotHits) {
            // System.out.println(hit.getSourceAsString());
            Map<String, Object> mapHit = hit.getSourceAsMap();
            String name = String.valueOf(mapHit.get("name"));
            // Numbers may arrive as Integer or Long, so go through Number
            int age = ((Number) mapHit.get("age")).intValue();
            System.out.printf("name: %s, age: %d\n", name, age);
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

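SearchSourceBuilder

SearchSourceBuilder is the source of a SearchRequest; it determines the query conditions, result count, sort order, returned fields, and so on:

  • from/size: pagination (return size hits starting at offset from); the defaults are 0 and 10. from + size must not exceed 10000 (the default of index.max_result_window); beyond that, use scroll;
  • sort: sort order;
  • query(QueryBuilder): the query;
  • aggregation(AggregationBuilder): an aggregation;
  • collapse(CollapseBuilder): collapses (deduplicates) results on a field;
  • suggest(SuggestBuilder): suggestions (input hints derived from matches);
  • postFilter(QueryBuilder): a filter applied after the aggregations are computed, so it narrows the hits without affecting the aggregations;
  • fetchSource: which _source fields to return. Includes should name fields that actually exist (an include that matches nothing yields an empty _source), while excludes may name fields that do not exist. fetchSource(false) skips _source and returns only metadata; metadata such as _index/_id/_score cannot be excluded;
  • timeout: search timeout;
  • terminateAfter(int n): stops collecting early once n documents have been collected on a shard;
  • highlighter(HighlightBuilder): highlighting.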
private SearchSourceBuilder buildTermQuerySource(String field, String word) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.termQuery(field, word));
    sourceBuilder.query(boolQuery);

    sourceBuilder.from(0);
    sourceBuilder.size(5);
    sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));
    sourceBuilder.sort("name.keyword");

    String[] excludeFields = new String[]{"@time", "@version"};
    sourceBuilder.fetchSource(null, excludeFields);

    return sourceBuilder;
}
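
The from + size limit is easy to trip over when a page number comes from user input. The following is a small plain-Java sketch of the arithmetic only; PageWindow, fromFor and fitsWindow are hypothetical helper names, and the constant assumes the default index.max_result_window of 10000 mentioned above.

```java
class PageWindow {
    // Assumption: the index uses the default index.max_result_window
    static final int MAX_RESULT_WINDOW = 10_000;

    /** Offset (from) for a 1-based page number and a page size. */
    static int fromFor(int page, int size) {
        if (page < 1 || size < 0) {
            throw new IllegalArgumentException("page must be >= 1 and size >= 0");
        }
        return Math.multiplyExact(page - 1, size);
    }

    /** True when from + size stays inside the window, so from/size paging is allowed. */
    static boolean fitsWindow(int from, int size) {
        return (long) from + size <= MAX_RESULT_WINDOW;
    }

    public static void main(String[] args) {
        int size = 20;
        System.out.println(fitsWindow(fromFor(500, size), size)); // true: 9980 + 20 = 10000
        System.out.println(fitsWindow(fromFor(501, size), size)); // false: 10000 + 20 > 10000
    }
}
```

A caller could check fitsWindow before calling sourceBuilder.from(...) and switch to scroll when it returns false.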

Queries

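By default Elasticsearch stores the analyzed tokens of a text field in lowercase, while field.keyword keeps the complete original value. A query that does not analyze its input (term, prefix, wildcard and similar) must therefore use lowercase terms to match tokens; queries that do analyze their input lowercase it automatically during analysis.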

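QueryBuilders

QueryBuilders offers factory methods for the common queries:

  • matchAllQuery: matches every document;
  • termQuery: exact, case-sensitive match against a single term (the input is not analyzed); termsQuery matches any of several values;
  • matchPhraseQuery: analyzes the input and requires the tokens to appear in the same order, which makes it handy for exact matching of Chinese text;
  • queryStringQuery: can match several fields and supports AND/OR in the query string;
  • fuzzyQuery: fuzzy match within a configurable edit distance (number of differing characters);
  • prefixQuery: prefix match;
  • wildcardQuery: wildcard match (* matches zero or more characters, ? matches exactly one); a literal backslash in the pattern must be escaped as \\ (written "\\\\" in a Java string literal);
  • rangeQuery: range match:
    • from/to: lower and upper bound, e.g. .from("fieldValue1").to("fieldValue2").includeUpper(false).includeLower(false);
    • gt/gte/lt/lte: greater-than and less-than comparisons (for equality, use termQuery);
  • boolQuery: combines clauses:
    • must: equivalent to AND;
    • mustNot: equivalent to NOT;
    • should: equivalent to OR;
    • filter: the document must satisfy the clause, but unlike must it does not contribute to the score.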
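
The wildcard rules above (* for any run of characters, ? for exactly one, backslash as escape) can be illustrated by translating a pattern into a regular expression. This is only a plain-Java approximation of the semantics, not how Elasticsearch evaluates wildcardQuery; WildcardSketch and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    /** Translates a wildcard pattern to a regex: * becomes .*, ? becomes ., \x is a literal x. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else if (c == '\\' && i + 1 < wildcard.length()) {
                // The backslash escapes the next character, which is then matched literally
                regex.append(Pattern.quote(String.valueOf(wildcard.charAt(++i))));
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean matches(String wildcard, String value) {
        return toRegex(wildcard).matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("Name*", "Name 36zou"));      // true
        System.out.println(matches("Name ??zou", "Name 36zou")); // true
        System.out.println(matches("name*", "Name 36zou"));      // false: case-sensitive
        System.out.println(matches("a\\*b", "a*b"));             // true: escaped * is literal
    }
}
```

The third call shows why case matters: against an unanalyzed value such as name.keyword the pattern has to use the original casing, while against analyzed tokens it has to be lowercase.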

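QueryStringQuery

QueryStringQuery can search several fields at once through fields (when none are given, all fields are searched). Multiple terms in the query string are combined with OR by default; default_operator changes the operator used.

With multiple fields and boolean operators, QueryStringQuery can select documents precisely. Inside the query string:

  • AND/OR/NOT perform boolean logic, e.g. big AND fat (the operators must be uppercase and surrounded by spaces);
  • + (must) and - (must not) are supported, e.g. +dog -cat (has dog, has no cat);
  • : restricts a term to a field, e.g.
    • age:18 matches documents whose age is 18;
    • name:?* matches documents whose name is not empty (? matches any single character, * matches zero or more characters);
    • age:18 AND NOT name:?* matches documents whose age is 18 and whose name is empty.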
private SearchSourceBuilder buildQueryStringSource(String field, String word) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // Build the query_string clause and attach it to the bool query
    // (the original built it but then discarded it in favor of a termQuery)
    QueryStringQueryBuilder queryString = QueryBuilders.queryStringQuery(word)
            .field(field)
            .analyzeWildcard(true)
            .defaultOperator(Operator.AND);
    boolQuery.must(queryString);
    sourceBuilder.query(boolQuery);

    sourceBuilder.from(0);
    sourceBuilder.size(5);
    sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));

    return sourceBuilder;
}

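SimpleQueryStringQuery is a simplified version of QueryStringQuery. It does not recognize the boolean keywords AND, OR and NOT, which are treated as ordinary words; it uses the operators +, | and - instead.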

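Sorting

By default ES sorts hits by _score. sourceBuilder.sort(field, SortOrder.DESC) sets a custom order; several sorts may be added, and the one added first has the highest priority. SortBuilder has four concrete implementations:

  • FieldSortBuilder: sorts by a field; to sort on a text field, use field.keyword as the sort field name;
  • ScoreSortBuilder: sorts by score;
  • GeoDistanceSortBuilder: sorts by geographic distance;
  • ScriptSortBuilder: sorts by a custom script;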
private SearchSourceBuilder buildSortSource(String ...field){
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.from(0);
    sourceBuilder.size(20);

    for(String f:field)
        sourceBuilder.sort(f, SortOrder.DESC);

    return sourceBuilder;
}

buildSortSource("age", "name.keyword");
// Sort by age first; documents with the same age are then sorted by name
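
The "first sort added wins, later sorts break ties" rule is the same idea as a chained Comparator. As a plain-Java analogy (no Elasticsearch involved; MultiSortSketch and Student are hypothetical names), the call above orders documents roughly like this:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class MultiSortSketch {
    static final class Student {
        final String name;
        final int age;

        Student(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    /** Mirrors buildSortSource("age", "name.keyword"): age DESC, ties broken by name DESC. */
    static List<Student> sortByAgeThenName(List<Student> students) {
        List<Student> sorted = new ArrayList<>(students);
        Comparator<Student> byAgeDesc = Comparator.comparingInt((Student s) -> s.age).reversed();
        Comparator<Student> byNameDesc = Comparator.comparing((Student s) -> s.name).reversed();
        // The first comparator has priority; the second only decides between equal ages
        sorted.sort(byAgeDesc.thenComparing(byNameDesc));
        return sorted;
    }

    public static void main(String[] args) {
        List<Student> input = List.of(
                new Student("tom", 28), new Student("mike", 30), new Student("amy", 28));
        for (Student s : sortByAgeThenName(input)) {
            System.out.println(s.age + " " + s.name); // 30 mike, 28 tom, 28 amy
        }
    }
}
```

For plain ASCII names String ordering and keyword ordering agree; for other text the two may differ, so treat this only as an illustration of the priority rule.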

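Scroll

A from/size search cannot reach past the first 10000 hits (from + size <= 10000); to read the data beyond that, use a scroll query. Once the scroll is finished it must be cleared, to release its resources and avoid affecting later queries.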

public void scrollSearch(String index) {
    // Keep-alive of the scroll context; every scroll request renews it
    final Scroll scroll = new Scroll(TimeValue.timeValueSeconds(60));
    SearchRequest searchRequest = new SearchRequest(index);
    searchRequest.scroll(scroll);

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.size(5);

    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.rangeQuery("age").gt(0));
    searchSourceBuilder.query(boolQuery);

    searchRequest.source(searchSourceBuilder);
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        SearchHits gotHits = response.getHits();
        while (gotHits.getHits().length>0){
            System.out.printf("ScrollId: %s, size: %d, total: %d\n", response.getScrollId(),
                    gotHits.getHits().length, gotHits.getTotalHits().value);
            for(SearchHit hit : gotHits.getHits()){
                System.out.println(hit.getSourceAsString());
            }

            // scroll query
            SearchScrollRequest scrollRequest = new SearchScrollRequest(response.getScrollId());
            scrollRequest.scroll(scroll);
            response = rhlClient.scroll(scrollRequest, RequestOptions.DEFAULT);
            gotHits = response.getHits();
            System.out.println();
        }

        // Clear the scroll context (required, so that it does not affect other queries)
        ClearScrollRequest clearRequest = new ClearScrollRequest();
        clearRequest.addScrollId(response.getScrollId());
        ClearScrollResponse clearResponse = rhlClient.clearScroll(clearRequest, RequestOptions.DEFAULT);
        System.out.println("clear scroll: " + clearResponse.isSucceeded());
    }catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

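Aggregations

With RestHighLevelClient, grouping conditions are built with AggregationBuilder:

  • Buckets: sets of documents that satisfy some condition; getDocCount() returns the number of documents in a bucket;
  • Metrics: statistics computed over the documents in a bucket.

An aggregation is a combination of buckets and metrics. It may consist of a single bucket, a single metric, or one of each; buckets can even contain further nested buckets.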
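
As a plain-Java analogy for buckets and metrics (no Elasticsearch involved), grouping a list by age produces the buckets, and summarizing number inside each group produces the metrics. BucketMetricSketch and Doc are hypothetical names, and the sample values are made up.

```java
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class BucketMetricSketch {
    static final class Doc {
        final int number;
        final int age;

        Doc(int number, int age) {
            this.number = number;
            this.age = age;
        }
    }

    /** One bucket per distinct age; each bucket carries count/sum/avg/min/max of number. */
    static Map<Integer, IntSummaryStatistics> statsByAge(List<Doc> docs) {
        return docs.stream().collect(Collectors.groupingBy(
                (Doc d) -> d.age,                              // bucket key
                TreeMap::new,                                  // buckets ordered by key
                Collectors.summarizingInt((Doc d) -> d.number) // metrics per bucket
        ));
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc(1802, 28), new Doc(1500, 28), new Doc(1600, 30));
        statsByAge(docs).forEach((age, stats) -> System.out.println(
                "bucket " + age + ": count=" + stats.getCount()
                        + ", sum=" + stats.getSum() + ", avg=" + stats.getAverage()));
    }
}
```

The grouping step corresponds to a bucket aggregation such as terms, and the summary corresponds to metric aggregations such as sum and avg attached as sub-aggregations.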

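AggregationBuilders

AggregationBuilders creates the aggregations. Each factory method takes a name that identifies the aggregation (the same name is needed later to read its result); the target field is set with .field(f), and sub-aggregations are attached with subAggregation:

  • count(name): number of values;
  • avg(name): average;
  • max(name): maximum;
  • min(name): minimum;
  • sum(name): sum;
  • stats(name): summary statistics (count, min, max, avg, sum); extendedStats adds variance and standard deviation;
  • filter(name, QueryBuilder): a single filter condition; use filters for several conditions;
  • range(name): range buckets; addUnboundedTo/addRange/addUnboundedFrom add a range with only an upper bound, with both bounds, or with only a lower bound; each range is [from, to);
  • missing(name): a bucket of the documents that lack the field;
  • terms(name): one bucket per distinct value of the field;
  • topHits(name): the documents inside an aggregation (bucket);
  • histogram(name): histogram, with the bucket width set through interval;
  • dateHistogram(name): date histogram; the field must be a date type. .dateHistogramInterval sets the granularity (hour, minute, second and so on; deprecated since 7.2 in favor of calendarInterval/fixedInterval), and format sets the date format (returned as the string key_as_string);
  • dateRange(name): date-range buckets; format sets the date format and the ranges are added with addRange;
  • ipRange(name): IP-address range buckets;
  • geoDistance(name, GeoPoint): buckets by distance from an origin point;
  • nested(name, path): aggregation over nested sub-objects.

Common parameters:

  • field: the field to aggregate on; for text, field.keyword is usually required;
  • size: number of buckets returned, default 10;
  • min_doc_count: buckets with fewer documents than this value are not returned;
  • order: bucket ordering;
  • missing: default value for documents that lack the field;
  • ranges: array of ranges, e.g. [{from:0}, {from:50, to:100}, {to:200}];
  • subAggregation: adds a sub-aggregation.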
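
The half-open [from, to) rule for range buckets decides where boundary values land. The sketch below is plain Java that only mimics the counting rule; RangeBucketSketch and countInRange are hypothetical names and the hours values are made up.

```java
class RangeBucketSketch {
    /** Counts the values v with from <= v < to (from inclusive, to exclusive). */
    static long countInRange(double[] values, double from, double to) {
        long count = 0;
        for (double v : values) {
            if (v >= from && v < to) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        double[] hours = {40, 50, 100, 120, 160};
        // Comparable to addUnboundedTo(50), addRange(50, 100), addUnboundedFrom(100)
        System.out.println(countInRange(hours, Double.NEGATIVE_INFINITY, 50));  // 1 (40)
        System.out.println(countInRange(hours, 50, 100));                       // 1 (50)
        System.out.println(countInRange(hours, 100, Double.POSITIVE_INFINITY)); // 3 (100, 120, 160)
    }
}
```

A value equal to a boundary, such as 50 or 100 here, belongs to the range that starts at it, not the one that ends at it.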

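Metric aggregations (count/avg/max/min/sum/stats) only compute statistics over the documents of the enclosing bucket; they do not group, so they create no new sub-buckets: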

private static void AggregateSumAndAvgByAge(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        TermsAggregationBuilder termAggregation = AggregationBuilders.terms("ageTerm").field("age")
                .subAggregation(AggregationBuilders.sum("sum").field("number"))
                .subAggregation(AggregationBuilders.avg("avg").field("number"))
                .subAggregation(AggregationBuilders.topHits("details").size(2));
        sourceBuilder.aggregation(termAggregation);

        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);

        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        // getAggregations returns the aggregation results
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString()+ ", count: " + bucket.getDocCount());

            Aggregations aggNumber = bucket.getAggregations();
            ParsedSum sumTerm = aggNumber.get("sum");
            ParsedAvg avgTerm = aggNumber.get("avg");
            System.out.println("\tsum: " + sumTerm.getValue() + ", avg: " + avgTerm.getValue());

            ParsedTopHits topHits = aggNumber.get("details");
            for(SearchHit detail : topHits.getHits()){
                System.out.println("\t" + detail.getSourceAsString());
            }
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

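Nested aggregation

A nested field holds sub-documents inside a document (similar to a child table). Queries and aggregations on it perform well; update performance is mediocre. In the example, courses is such a sub-document (provided it is mapped with type nested), so its content has to be handled with nested queries and aggregations.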

AggregationBuilder aggregation = AggregationBuilders.nested("course", "courses")
                .subAggregation(AggregationBuilders.terms("hour").field("courses.hours"));

The path is set on the nested aggregation, and the field names of the aggregations created under it must include that path.

Sorting

Buckets are ordered with order, whereas TopHits uses sort (as in a regular search):

private static void AggregateByAge(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        TermsAggregationBuilder termAggregation = AggregationBuilders
                .terms("ageTerm")
                .field("age")
                .order(BucketOrder.key(true))
                .subAggregation(
                        AggregationBuilders
                                .topHits("details")
                                .sort("name.keyword", SortOrder.ASC)
                                .size(10)
                );
        sourceBuilder.aggregation(termAggregation);

        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);

        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString() + ", count: " + bucket.getDocCount());

            Aggregations aggDetail = bucket.getAggregations();
            ParsedTopHits topHits = aggDetail.get("details");
            for (SearchHit detail : topHits.getHits()) {
                System.out.println("\t" + detail.getSourceAsString());
            }
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

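Query plus aggregation

A query and an aggregation can be used together, so that only the documents matching the query are aggregated: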

private static void FilterAndAggregate(String index) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.size(0); // the hits themselves are not needed, so return none

        BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
        boolQuery.must(QueryBuilders.rangeQuery("number").gt(1500));
        sourceBuilder.query(boolQuery);

        TermsAggregationBuilder termAggregation = AggregationBuilders
                .terms("ageTerm")
                .field("age")
                .order(BucketOrder.key(true));
        sourceBuilder.aggregation(termAggregation);

        SearchRequest searchRequest = new SearchRequest(index);
        searchRequest.source(sourceBuilder);

        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);
        // getAggregations returns the aggregation results
        Aggregations aggAge = response.getAggregations();
        Terms ageTerms = aggAge.get("ageTerm");
        for (Terms.Bucket bucket : ageTerms.getBuckets()) {
            System.out.println("bucket of " + bucket.getKeyAsString() + ", count: " + bucket.getDocCount());
        }
        
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}

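Collapse (deduplication)

Deduplicating with an aggregation returns counts by default, whereas collapse returns one document for each distinct value; collapse can also be combined with from/size for pagination:

  • getHits().length: number of hits in this response (after collapsing);
  • getTotalHits().value: total number of matching documents before collapsing; the number of distinct groups after collapsing is not reported.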
public void collapseSearch(String index, String field) {
    try (RestHighLevelClient rhlClient = ESClient.getClient()) {
        SearchRequest searchRequest = new SearchRequest(index);
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.collapse(new CollapseBuilder(field));
        sourceBuilder.from(0);
        sourceBuilder.size(20);
        searchRequest.source(sourceBuilder);

        SearchResponse response = rhlClient.search(searchRequest, RequestOptions.DEFAULT);

        SearchHits gotHits = response.getHits();
        System.out.printf("get size: %d, total size: %d\n", gotHits.getHits().length, gotHits.getTotalHits().value);
        for (SearchHit hit : gotHits) {
            System.out.println(hit.getSourceAsString());
        }
    } catch (Exception ex) {
        System.out.println(index + " query fail: " + ex);
    }
}
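
The relationship between the two numbers printed above can be pictured with a plain-Java analogy (no Elasticsearch involved): collapsing keeps the first hit per key in the current sort order, while the total still counts every match. CollapseSketch and collapse are hypothetical names, and the "name:age" strings are made-up sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class CollapseSketch {
    /** Keeps the first hit seen for each key, preserving the incoming order. */
    static <T, K> List<T> collapse(List<T> hits, Function<T, K> key) {
        Map<K, T> firstPerKey = new LinkedHashMap<>();
        for (T hit : hits) {
            firstPerKey.putIfAbsent(key.apply(hit), hit);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    public static void main(String[] args) {
        // Hits as "name:age", already in result order
        List<String> hits = List.of("tom:28", "mike:30", "amy:28");
        List<String> collapsed = collapse(hits, hit -> hit.split(":")[1]);
        System.out.println("get size: " + collapsed.size() + ", total size: " + hits.size());
        System.out.println(collapsed); // [tom:28, mike:30]
    }
}
```

Which document represents a group therefore depends on the sort order of the search, so a sort should be set explicitly when a particular representative is wanted.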