Elasticsearch Array, Object, and Nested Types: Usage and Differences

Ingenuity1992 · Published 2021-12-19 19:46:30

This article is based on Elasticsearch 6.4.2.

1. Data preparation

1.1 Creating the index

Create an index that contains compound data types:

PUT test_index_20211220
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "e-com": {
      "dynamic": "strict",
      "properties": {
        // Product ID
        "id": {
          "type": "keyword"
        },
        // Product name
        "name": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        // Brand
        "brand": {
          "type": "keyword"
        },
        // Price
        "price": {
          "type": "double"
        },
        // Description
        "desc": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        // Category, object type. Note that no type is declared here, so ES defaults to object
        "categoryObj": {
          // "type": "object",
          "properties": {
            "class1": {
              "type": "keyword"
            },
            "class2": {
              "type": "keyword"
            },
            "class3": {
              "type": "keyword"
            }
          }
        },
        // Category, nested type
        "categoryNst": {
          "type": "nested",
          "properties": {
            "class1": {
              "type": "keyword"
            },
            "class2": {
              "type": "keyword"
            },
            "class3": {
              "type": "keyword"
            }
          }
        },
        // Product reviews
        "comments": {
          "type": "keyword"
        }
      }
    }
  }
}

Note that when no type is declared, as with categoryObj, ES defaults to the object type, and the type is not shown even when you view the mapping:

GET test_index_20211220/_mapping
{
  "test_index_20211220": {
    "mappings": {
      "e-com": {
        "dynamic": "strict",
        "properties": {
          .....................
          // The mapping shows no type here, but the field really is an object
          "categoryObj": {
            "properties": {
              "class1": {
                "type": "keyword"
              },
              "class2": {
                "type": "keyword"
              },
              "class3": {
                "type": "keyword"
              }
            }
            .....................
          }
        }
      }
    }
  }
}

1.2 Adding a sample document

PUT test_index_20211220/e-com/1
{
  "id": "1",
  "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
  "brand": "欧莱雅",
  "price": 279,
  "desc": "补水 提拉紧致 淡化细纹",
  "categoryObj": {
    "class1": "欧莱雅",
    "class2": "补水",
    "class3": "面部护理"
  },
  "categoryNst": {
    "class1": "欧莱雅",
    "class2": "补水",
    "class3": "面部护理"
  },
  "comments": "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样"
}

2. Array type

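ES has no dedicated array type. By default any field can hold one or more values, but all values in an array must be of the same type. Take the product review field "comments" above as an example. In the sample document we just wrote, the comments field looks like this: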

GET test_index_20211220/e-com/1
{
  "_index": "test_index_20211220",
  "_type": "e-com",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "id": "1",
    "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
    "brand": "欧莱雅",
    "price": 279,
    "desc": "补水 提拉紧致 淡化细纹",
    "categoryObj": {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    "categoryNst": {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    "comments": "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样"
  }
}

At this point comments is not an array yet. Now we add one more review and overwrite the document:

PUT test_index_20211220/e-com/1
{
  "id": "1",
  "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
  "brand": "欧莱雅",
  "price": 279,
  "desc": "补水 提拉紧致 淡化细纹",
  "categoryObj": {
    "class1": "欧莱雅",
    "class2": "补水",
    "class3": "面部护理"
  },
  "categoryNst": {
    "class1": "欧莱雅",
    "class2": "补水",
    "class3": "面部护理"
  },
  "comments": [
    "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
    "只有这支玻璃尿酸水光充盈是真的"
  ]
}

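Querying again shows that when "comments" is indexed with multiple values it is automatically handled as an array, and the document version is incremented by 1: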

GET test_index_20211220/e-com/1 
{
  "_index": "test_index_20211220",
  "_type": "e-com",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "id": "1",
    "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
    "brand": "欧莱雅",
    "price": 279,
    "desc": "补水 提拉紧致 淡化细纹",
    "categoryObj": {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    "categoryNst": {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    "comments": [
      "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
      "只有这支玻璃尿酸水光充盈是真的"
    ]
  }
}

3. Object type

In the sample data, the "categoryObj" field defaults to the object type (no type was set explicitly). To query an object field, join the full field path with ".":

GET test_index_20211220/_search
{
  "query": {
    "term": {
      "categoryObj.class1": "欧莱雅"
    }
  }
}

{
  "took": 154,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test_index_20211220",
        "_type": "e-com",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "id": "1",
          "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
          "brand": "欧莱雅",
          "price": 279,
          "desc": "补水 提拉紧致 淡化细纹",
          "categoryObj": {
            "class1": "欧莱雅",
            "class2": "补水",
            "class3": "面部护理"
          },
          "categoryNst": {
            "class1": "欧莱雅",
            "class2": "补水",
            "class3": "面部护理"
          },
          "comments": [
            "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
            "只有这支玻璃尿酸水光充盈是真的"
          ]
        }
      }
    ]
  }
}

Likewise, if several objects are written at once, "categoryObj" automatically becomes an array of objects:

PUT test_index_20211220/e-com/1
{
  "id": "1",
  "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
  "brand": "欧莱雅",
  "price": 279,
  "desc": "补水 提拉紧致 淡化细纹",
  "categoryObj": [
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部精华"
    }
  ],
  "categoryNst": {
    "class1": "欧莱雅",
    "class2": "补水",
    "class3": "面部护理"
  },
  "comments": [
    "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
    "只有这支玻璃尿酸水光充盈是真的"
  ]
}

Fetching the document by ID shows that "categoryObj" is now an array of objects:

{
  "_index": "test_index_20211220",
  "_type": "e-com",
  "_id": "1",
  "_version": 3,
  "found": true,
  "_source": {
    "id": "1",
    "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
    "brand": "欧莱雅",
    "price": 279,
    "desc": "补水 提拉紧致 淡化细纹",
    "categoryObj": [
      {
        "class1": "欧莱雅",
        "class2": "补水",
        "class3": "面部护理"
      },
      {
        "class1": "欧莱雅",
        "class2": "补水",
        "class3": "面部精华"
      }
    ],
    "categoryNst": {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    "comments": [
      "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
      "只有这支玻璃尿酸水光充盈是真的"
    ]
  }
}

4. Nested type

We overwrite the sample document once more, this time turning the nested field "categoryNst" into an array:

PUT test_index_20211220/e-com/1
{
  "id": "1",
  "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
  "brand": "欧莱雅",
  "price": 279,
  "desc": "补水 提拉紧致 淡化细纹",
  "categoryObj": [
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部精华"
    }
  ],
  "categoryNst": [
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部精华"
    }
  ],
  "comments": [
    "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
    "只有这支玻璃尿酸水光充盈是真的"
  ]
}

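Fetching the document by ID shows that "categoryNst" is now an array. Judging only from the structure of the returned document, a nested array and an object array are indistinguishable: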

GET test_index_20211220/e-com/1
{
  "_index": "test_index_20211220",
  "_type": "e-com",
  "_id": "1",
  "_version": 4,
  "found": true,
  "_source": {
    "id": "1",
    "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
    "brand": "欧莱雅",
    "price": 279,
    "desc": "补水 提拉紧致 淡化细纹",
    "categoryObj": [
      {
        "class1": "欧莱雅",
        "class2": "补水",
        "class3": "面部护理"
      },
      {
        "class1": "欧莱雅",
        "class2": "补水",
        "class3": "面部精华"
      }
    ],
    "categoryNst": [
      {
        "class1": "欧莱雅",
        "class2": "补水",
        "class3": "面部护理"
      },
      {
        "class1": "欧莱雅",
        "class2": "补水",
        "class3": "面部精华"
      }
    ],
    "comments": [
      "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
      "只有这支玻璃尿酸水光充盈是真的"
    ]
  }
}

The mapping does distinguish them, however: the field declared with "type": "nested" is the nested field:

GET test_index_20211220/_mapping
{
  "test_index_20211220": {
    "mappings": {
      "e-com": {
        "dynamic": "strict",
        "properties": {
          "brand": {
            "type": "keyword"
          },
          "categoryNst": {
            "type": "nested",
            "properties": {
              "class1": {
                "type": "keyword"
              },
              "class2": {
                "type": "keyword"
              },
              "class3": {
                "type": "keyword"
              }
            }
          },
          "categoryObj": {
            "properties": {
              "class1": {
                "type": "keyword"
              },
              "class2": {
                "type": "keyword"
              },
              "class3": {
                "type": "keyword"
              }
            }
          },
          "comments": {
            "type": "keyword"
          },
          "desc": {
            "type": "text",
            "analyzer": "ik_max_word"
          },
          "id": {
            "type": "keyword"
          },
          "name": {
            "type": "text",
            "analyzer": "ik_max_word"
          },
          "price": {
            "type": "double"
          }
        }
      }
    }
  }
}

4.1 Querying nested fields

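Given the structure of the returned document, it is tempting to query a nested field the same way as an object field, that is, by joining the field names with ".". Even Kibana's autocomplete suggests this form.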

Let's first check whether that works:

GET test_index_20211220/_search
{
  "query": {
    "term": {
      "categoryNst.class2": "补水"
    }
  }
}

Surprisingly, the query matches no documents at all:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

The correct way to query a nested field is the ES "nested" query:

GET test_index_20211220/_search
{
  "query": {
    "nested": {
      "path": "categoryNst",
      "query": {
        "term": {
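          // In earlier versions, writing just "class2": "补水" also worked, because the path is already declared outside.
          // It is unclear in which version this changed; now you must write "categoryNst.class2": "补水", otherwise an error is reported.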
          "categoryNst.class2": "补水"
        }
      }
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.18232156,
    "hits": [
      {
        "_index": "test_index_20211220",
        "_type": "e-com",
        "_id": "1",
        "_score": 0.18232156,
        "_source": {
          "id": "1",
          "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
          "brand": "欧莱雅",
          "price": 279,
          "desc": "补水 提拉紧致 淡化细纹",
          "categoryObj": [
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部护理"
            },
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部精华"
            }
          ],
          "categoryNst": [
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部护理"
            },
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部精华"
            }
          ],
          "comments": [
            "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
            "只有这支玻璃尿酸水光充盈是真的"
          ]
        }
      }
    ]
  }
}
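
When such requests are assembled in application code, it is easy to forget that the inner field name must carry the full path. The following is a minimal Python sketch of a request-body builder. The helper name nested_term_query is hypothetical and not part of any Elasticsearch client; it only builds a dict with the same shape as the query above.

```python
import json


def nested_term_query(path, field, value):
    """Build a nested term query body (hypothetical helper, not a client API).

    The inner field name is always prefixed with the nested path,
    e.g. "categoryNst.class2" rather than just "class2".
    """
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"term": {f"{path}.{field}": value}},
            }
        }
    }


body = nested_term_query("categoryNst", "class2", "补水")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted into the Kibana console as the body of GET test_index_20211220/_search.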

4.2 Characteristics of nested fields

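A nested field indexes each of its inner members as a separate document. What does that mean? In the data above, the "categoryNst" array has two object members, and behind the scenes ES indexes them as two separate documents, so ES indexes 3 documents in total (one outer document plus two documents for the nested objects). A terms aggregation on the nested field makes this visible: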

GET test_index_20211220/_search
{
  "query": {
    "nested": {
      "path": "categoryNst",
      "query": {
        "term": {
          "categoryNst.class2": "补水"
        }
      }
    }
  },
  "aggs": {
    "nestedAgg":{
      "nested": {
        "path": "categoryNst"
      },
      "aggs": {
        "termAgg": {
          "terms": {
            // Again, do not write just "class2" here: no error is raised, but the aggregation returns no buckets.
            "field": "categoryNst.class2"
          }
        }
      }
    }
  }
}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.18232156,
    "hits": [
      {
        "_index": "test_index_20211220",
        "_type": "e-com",
        "_id": "1",
        "_score": 0.18232156,
        "_source": {
          "id": "1",
          "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
          "brand": "欧莱雅",
          "price": 279,
          "desc": "补水 提拉紧致 淡化细纹",
          "categoryObj": [
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部护理"
            },
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部精华"
            }
          ],
          "categoryNst": [
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部护理"
            },
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部精华"
            }
          ],
          "comments": [
            "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
            "只有这支玻璃尿酸水光充盈是真的"
          ]
        }
      }
    ]
  },
  "aggregations": {
    "nestedAgg": {
      "doc_count": 2,
      "termAgg": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "补水",
            "doc_count": 2
          }
        ]
      }
    }
  }
}

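The query reports "total": 1, but the aggregation reports "doc_count": 2, which shows that the objects inside the nested field are treated as separate documents.
Cerebro's cluster overview tells the same story: we have indexed only one document so far, yet the index shows a document count of 3.

One might object that there is only one whole document, so an aggregation result of 2 must be wrong. How do we get the count we actually want? This is where reverse_nested comes in. Rewrite the aggregation part of the query above: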

GET test_index_20211220/_search
{
  "size":0,
  "query": {
    "nested": {
      "path": "categoryNst",
      "query": {
        "term": {
          "categoryNst.class2": "补水"
        }
      }
    }
  },
  "aggs": {
    "nestedAgg":{
      "nested": {
        "path": "categoryNst"
      },
      "aggs": {
        "termAgg": {
          "terms": {
            "field": "categoryNst.class2"
          },
          "aggs": {
            "reverseAgg": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "nestedAgg": {
      "doc_count": 2,
      "termAgg": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "补水",
            "doc_count": 2,
            "reverseAgg": {
              "doc_count": 1
            }
          }
        ]
      }
    }
  }
}

Now the result is correct: the number of documents matching the keyword "补水" is "doc_count": 1.
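
The two counts can be reproduced with a small Python model. This is only a sketch of the counting logic, not of how ES implements it: the function nested_terms_counts and the key "parents" are made up for illustration. "doc_count" counts nested objects per term, while "parents" counts each outer document at most once per term, which is the role reverse_nested plays above.

```python
def nested_terms_counts(docs, path, field):
    """Count nested objects and distinct parent documents per term."""
    counts = {}
    for doc in docs:
        seen_in_this_parent = set()
        for child in doc[path]:
            key = child[field]
            bucket = counts.setdefault(key, {"doc_count": 0, "parents": 0})
            bucket["doc_count"] += 1  # every nested object counts
            if key not in seen_in_this_parent:
                seen_in_this_parent.add(key)
                bucket["parents"] += 1  # the parent counts once per term
    return counts


docs = [
    {
        "id": "1",
        "categoryNst": [
            {"class1": "欧莱雅", "class2": "补水", "class3": "面部护理"},
            {"class1": "欧莱雅", "class2": "补水", "class3": "面部精华"},
        ],
    }
]

counts = nested_terms_counts(docs, "categoryNst", "class2")
print(counts)
```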

5. Differences between object fields and nested fields

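From section 4 we know that the objects in a nested field are stored by ES as separate documents. What about object fields? ES flattens object fields behind the scenes, so what is actually stored is a flat structure. Take the categoryObj field as an example: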

"categoryObj": [
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部精华"
    }
]

What is actually stored is:

{
	"categoryObj.class1": ["欧莱雅","欧莱雅"],
	"categoryObj.class2": ["补水","补水"],
	"categoryObj.class3": ["面部护理","面部精华"]
}
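
The flattening can be mimicked in a few lines of Python. This is a simplified model for illustration only (the function name flatten_object_array is hypothetical, and real Lucene storage differs in detail); it shows how the link between values of the same object disappears once each sub-field becomes a multi-valued list.

```python
from collections import defaultdict


def flatten_object_array(field, objects):
    """Turn an array of objects into one multi-valued list per sub-field."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{field}.{key}"].append(value)
    return dict(flat)


category_obj = [
    {"class1": "欧莱雅", "class2": "补水", "class3": "面部护理"},
    {"class1": "欧莱雅", "class2": "补水", "class3": "面部精华"},
]

flat = flatten_object_array("categoryObj", category_obj)
print(flat)
```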

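This sacrifices the independence of the individual objects, which sometimes has consequences: in some cases an "AND" query over an array of objects effectively behaves like an "OR" query.
For example, let's add one more object, "雅诗兰黛", to the categoryObj field: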

PUT test_index_20211220/e-com/1
{
  "id": "1",
  "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
  "brand": "欧莱雅",
  "price": 279,
  "desc": "补水 提拉紧致 淡化细纹",
  "categoryObj": [
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部精华"
    },
    {
      "class1": "雅诗兰黛",
      "class2": "美白",
      "class3": "面霜"
    }
  ],
  "categoryNst": [
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部护理"
    },
    {
      "class1": "欧莱雅",
      "class2": "补水",
      "class3": "面部精华"
    }
  ],
  "comments": [
    "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
    "只有这支玻璃尿酸水光充盈是真的"
  ]
}

Now we query for the two keywords "欧莱雅" and "美白" together. No document should match, because no single object in categoryObj contains both "欧莱雅" and "美白". Yet that is not what happens:

GET test_index_20211220/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "categoryObj.class1": "欧莱雅"
          }
        },
        {
          "term": {
            "categoryObj.class2": "美白"
          }
        }
      ]
    }
  }
}

Complex queries over large data sets can be placed in filter context to skip scoring. The query above can also be written like this:

GET test_index_20211220/_search
{
  "query": {
    "bool": {
      // filter context
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "categoryObj.class1": "欧莱雅"
              }
            },
            {
              "term": {
                "categoryObj.class2": "美白"
              }
            }
          ]
        }
      }
    }
  }
}

The document is returned anyway!

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": [
      {
        "_index": "test_index_20211220",
        "_type": "e-com",
        "_id": "1",
        "_score": 0,
        "_source": {
          "id": "1",
          "name": "L＇oreal/欧莱雅复颜玻尿酸水光充盈导入膨润精华液",
          "brand": "欧莱雅",
          "price": 279,
          "desc": "补水 提拉紧致 淡化细纹",
          "categoryObj": [
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部护理"
            },
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部精华"
            },
            {
              "class1": "雅诗兰黛",
              "class2": "美白",
              "class3": "面霜"
            }
          ],
          "categoryNst": [
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部护理"
            },
            {
              "class1": "欧莱雅",
              "class2": "补水",
              "class3": "面部精华"
            }
          ],
          "comments": [
            "还没有用，赠品跟欧莱雅旗舰店的同款赠品有差异。味道也不一样",
            "只有这支玻璃尿酸水光充盈是真的"
          ]
        }
      }
    ]
  }
}
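
Why the two term clauses both succeed can be seen in a short Python model of the two matching semantics. It is a sketch under simplifying assumptions: the function names are invented, and the same three-object list is evaluated both ways, whereas in the sample document only categoryObj contains the "雅诗兰黛" object.

```python
def matches_flattened(objects, conditions):
    """Object semantics: each condition may be satisfied by a different object."""
    return all(
        any(obj.get(key) == value for obj in objects)
        for key, value in conditions.items()
    )


def matches_nested(objects, conditions):
    """Nested semantics: one single object must satisfy every condition."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in objects
    )


categories = [
    {"class1": "欧莱雅", "class2": "补水", "class3": "面部护理"},
    {"class1": "欧莱雅", "class2": "补水", "class3": "面部精华"},
    {"class1": "雅诗兰黛", "class2": "美白", "class3": "面霜"},
]
conditions = {"class1": "欧莱雅", "class2": "美白"}

object_hit = matches_flattened(categories, conditions)
nested_hit = matches_nested(categories, conditions)
print(object_hit, nested_hit)
```

With flattened storage the two conditions are met by different objects, so the document matches. Under nested semantics no single object meets both, so it does not.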

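Summary of the differences between object fields and nested fields:

1. Mapping. A field that has sub-properties but no declared type defaults to the object type. A nested field must be declared explicitly with "type": "nested".
2. Storage. An array of objects is stored flattened, whereas each object in a nested array is stored as a separate document. As a result, an "AND" query on object data can sometimes return "OR" results, and aggregations on nested objects can count more than the number of outer documents (unless reverse_nested is added). To preserve the independence of the objects in an array, use the nested field type.
3. Querying. Object fields are queried directly by joining the field names of each level with ".". Nested fields require a nested query.
4. Aggregation. Nested fields require a nested aggregation. For object fields, only the field name needs to be joined with ".".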