I. Install the ingest attachment plugin

Installation instructions: https://blog.csdn.net/catoop/article/details/124468788

II. Define the text extraction pipeline

1. Single attachment (example)

PUT _ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "data",
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "data"
            }
        }
    ]
}

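The remove processor deletes the original attachment once the pipeline has processed it: only the extracted text is stored in ES, and the attachment's own base64 data is discarded.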
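
To show what the pipeline receives, here is a minimal Python sketch (standard library only) that builds the request for indexing one document through the attachment pipeline. The index name newdoc_dispatch is taken from the mapping in section III; the helper name build_index_request, the document ID, the title value and the sample bytes are assumptions for illustration. The attachment processor expects the data field to hold the file content as base64 text.

```python
import base64
import json


def build_index_request(index, doc_id, title, file_bytes, pipeline="attachment"):
    """Return (path, body) for a PUT request that indexes one document through the pipeline."""
    # The attachment processor reads base64 text from the "data" field.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    path = f"/{index}/_doc/{doc_id}?pipeline={pipeline}"
    body = {"title": title, "data": encoded}
    return path, body


# Hypothetical sample values; real code would read the bytes from a file.
path, body = build_index_request("newdoc_dispatch", "1", "Sample notice", b"hello attachment")
print("PUT", path)
print(json.dumps(body, ensure_ascii=False))
```

After the pipeline runs, the stored document should carry the extracted text under attachment.content, and data should be absent because of the remove processor.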

2. Multiple attachments (example)

PUT _ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "attachment": {
                        "field": "_ingest._value.data",
                        "target_field": "_ingest._value.attachment"
                    }
                }
            }
        },
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "remove": {
                        "field": "_ingest._value.data"
                    }
                }
            }
        }
    ]
}

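Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the processors cannot match the correct fields.
Starting with ES 8.0, dropping the binary file content only requires setting remove_binary to true on the attachment processor, so a separate remove processor like the one above is no longer needed.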
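
As a sketch of the ES 8.0+ variant, the following Python (standard library only) builds the multi-attachment pipeline definition with remove_binary set on the attachment processor, so the second foreach/remove step is dropped. The helper name build_pipeline is hypothetical; the field names follow the example in this section.

```python
import json


def build_pipeline(array_field="attachments"):
    """Build a multi-attachment pipeline body for ES 8.0+ using remove_binary."""
    return {
        "description": "Extract attachment information",
        "processors": [
            {
                "foreach": {
                    "field": array_field,
                    "processor": {
                        "attachment": {
                            # Inside foreach, the current array element is _ingest._value.
                            "field": "_ingest._value.data",
                            "target_field": "_ingest._value.attachment",
                            # Drops the base64 source once the text is extracted.
                            "remove_binary": True,
                        }
                    },
                }
            }
        ],
    }


# Body to send with: PUT _ingest/pipeline/attachment
print(json.dumps(build_pipeline(), indent=4))
```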

III. Define the document mapping

1. Single attachment (example)

PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId":{
        "type": "keyword"
      },
      "title":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser":{
        "type": "keyword"
      },
      "dispatchNO":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept":{
        "type": "keyword"
      },
      "dispatchTime":{
        "type": "date"
      },
      "abolish":{
        "type": "keyword"
      },
      "tenantId":{
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content":{
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}
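
With this mapping, the extracted text is searchable as attachment.content. Below is a minimal sketch of a full-text query body, built in Python with the standard library; the helper name build_search_body, the search keyword and the tenantId value are assumptions for illustration.

```python
import json


def build_search_body(keyword, tenant_id):
    """Build a _search body: full-text match on attachment text, exact filter on tenant."""
    return {
        "query": {
            "bool": {
                # match analyzes the keyword with the field's analyzer (ik_smart here).
                "must": [{"match": {"attachment.content": keyword}}],
                # tenantId is a keyword field, so term does an exact comparison.
                "filter": [{"term": {"tenantId": tenant_id}}],
            }
        },
        # Ask for highlighted fragments of the matching attachment text.
        "highlight": {"fields": {"attachment.content": {}}},
    }


# Body to send with: GET newdoc_dispatch/_search
print(json.dumps(build_search_body("通知", "t-001"), ensure_ascii=False, indent=2))
```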

2. Multiple attachments (example)

PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId":{
        "type": "keyword"
      },
      "title":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser":{
        "type": "keyword"
      },
      "dispatchNO":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept":{
        "type": "keyword"
      },
      "dispatchTime":{
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "abolish":{
        "type": "keyword"
      },
      "tenantId":{
        "type": "keyword"
      },
      "attachments" : {
        "properties" : {
          "attachment" : {
            "properties" : {
              "content" : {
                "type" : "text",
                "analyzer": "ik_smart"
              }
            }
          }        
        }
      }
    }
  }
}

The code in the project follows the multi-attachment example; for the object that this mapping corresponds to, see ESDispatchDocumentVo.
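
For the multi-attachment layout, each element of attachments carries its own base64 data field, which the foreach processor visits as _ingest._value. Here is a minimal Python sketch (standard library only) of such a document body; the helper name, the filename key and the sample names and bytes are assumptions, not fields taken from ESDispatchDocumentVo.

```python
import base64
import json


def build_multi_attachment_doc(title, files):
    """Build a document whose "attachments" array holds one base64 "data" entry per file.

    files: list of (filename, bytes) pairs.
    """
    return {
        "title": title,
        "attachments": [
            {
                # "filename" is an illustrative extra key, not required by the pipeline.
                "filename": name,
                "data": base64.b64encode(content).decode("ascii"),
            }
            for name, content in files
        ],
    }


doc = build_multi_attachment_doc(
    "Sample notice",
    [("a.txt", b"first file"), ("b.txt", b"second file")],
)
# Body to send with: PUT newdoc_dispatch/_doc/1?pipeline=attachment
print(json.dumps(doc, indent=2))
```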

Official reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
Other reference: https://www.cnblogs.com/ncore/p/10475909.html
Project code: https://gitee.com/catoop/es-attachment

