ES Basic Operations

Basic ES concepts and operations, plus data synchronization with Canal

GIS之路

GIS之路 · Published 2022-08-14 21:43:28

1. Software installation: all component versions must match

1.1 Download link:

https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html

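ES: the main program
ES-HEAD: management UI (outdated)
Kibana: management UI (recommended)
Logstash: data synchronization
IK Analysis: Chinese word segmentation plugin

1.2 Plugins:

logstash-integration-jdbc: JDBC integration plugin for Logstash, used to sync a database into ES
mysql-connector-java-8.0.26.jar: JDBC driver for syncing from MySQL
postgresql-42.2.5.jre7.jar: JDBC driver for syncing from PostgreSQL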

2. ES configuration file: configure nodes, ports, and network connections

Configuration reference:

https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html

3. Chinese word segmentation plugin: its version must match ES; install it under the ES plugins directory

Download link:

https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.9.3

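3.1 Why install a Chinese word segmentation plugin?

An analyzer splits a piece of text into terms according to its rules, and different analyzers use different rules.
The default analyzer in ES is the Standard Analyzer, which splits text on word boundaries and lowercases the terms.
It is designed mainly for English and handles Chinese poorly: Chinese text is broken into single characters, and mechanically cutting it into two- or three-character chunks does not work well either.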

3.2 Forward index:

Documents are numbered, and each ID maps to the document's content.

ID	Document content
1	es好啊
2	es好用吗？好用
3	es好在哪里呢？嗯…
4	es这个搜索引擎就是好

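3.3 Inverted index:

Term	Postings (document ID, {positions}, frequency)
es	(1,{1},1),(2,{1},1),(3,{1},1),(4,{1},1)
好	(1,{2},1),(2,{2,6},2),(3,{2},1),(4,{5},1)
哪里	(3,{4},1)
搜索引擎	(4,{4},1)

The ES analyzer splits the body text into terms and records, for each term, the ID of every document that contains it, the positions where it occurs, and how often it occurs, e.g. (1,{2,5},2).
At search time, ES matches documents by term: it first looks up the document IDs recorded for the keyword, then fetches the full records for those IDs.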
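
The idea can be sketched in a few lines of Python. This is only an illustration of the data structure, not how Lucene stores it. The token lists are written by hand and are an assumption; a real analyzer such as IK may split the sentences differently, so the positions need not match the table above.

```python
from collections import defaultdict

# Hand-written token lists standing in for analyzer output (an assumption;
# a real analyzer such as IK may split these sentences differently).
DOCS = {
    1: ["es", "好", "啊"],
    2: ["es", "好", "用", "吗", "好", "用"],
    3: ["es", "好", "在", "哪里", "呢"],
    4: ["es", "这个", "搜索引擎", "就是", "好"],
}


def build_inverted_index(docs):
    """Map each term to postings of (doc_id, positions, term_frequency)."""
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens, start=1):  # positions are 1-based
            positions[token][doc_id].append(pos)
    return {
        term: [(doc_id, pos_list, len(pos_list))
               for doc_id, pos_list in sorted(by_doc.items())]
        for term, by_doc in positions.items()
    }


def search(index, term):
    """Return the IDs of the documents that contain the term."""
    return [doc_id for doc_id, _, _ in index.get(term, [])]


INDEX = build_inverted_index(DOCS)
print(INDEX["好"])
print(search(INDEX, "搜索引擎"))
```

A search only touches the postings of the query term, which is why lookup stays cheap even when the documents themselves are long.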

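4. Logstash configuration

Configuration reference:

https://www.elastic.co/guide/en/logstash/7.15/plugins-filters-alter.html

Configure the logstash.conf file:

Input plugin: the JDBC connection and the database to sync. The SQL statement has to convert the geometry column, for example to GeoJSON.
Filter plugin: data filtering (which fields to keep and in what format).
Output plugin: creates the ES index and stores the synced data. A unique document id must be specified.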
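
Why the unique id matters can be shown with a small Python sketch that builds an Elasticsearch _bulk request body by hand. It is not what Logstash runs internally. The rows, the index name poi, and the choice of gid as the id column are assumptions made for illustration. In the Logstash elasticsearch output, the document_id option plays this role.

```python
import json


def to_bulk_body(rows, index, id_field):
    """Build an Elasticsearch _bulk request body (NDJSON) from database rows.

    Each row becomes an "index" action whose _id is taken from id_field, so
    re-running the sync overwrites documents instead of duplicating them.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index, "_id": str(row[id_field])}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


# Hypothetical rows; the column names and values are illustrative only.
ROWS = [
    {"gid": 1, "name": "Library", "geometry": "39.90,116.40"},
    {"gid": 2, "name": "Museum", "geometry": "39.92,116.38"},
]
BODY = to_bulk_body(ROWS, index="poi", id_field="gid")
print(BODY)
```

Without a stable _id, ES generates a new id for every indexed event, so each scheduled run of the same query would add another copy of every row.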

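5. Running

First start ES by double-clicking elasticsearch.bat in the ES bin directory.
Check that it is running:

http://localhost:9200/
Run the Logstash configuration: open the Logstash bin directory and run this command in cmd:

logstash -f logstash.conf --config.reload.automatic
Run Kibana: open its bin directory and double-click kibana.bat. Check that it is running:

http://localhost:5601/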

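6. Viewing the synced data (indices)

6.1 Check the cluster health: red (data missing), yellow, green (healthy)

Cluster states:

green: All primary and replica shards are allocated. The cluster is fully operational.
yellow: All primary shards are allocated, but at least one replica is missing. No data is lost, so search results are still complete. However, high availability is compromised to some degree, and if more shards disappear, data may be lost. Treat yellow as a warning that should be investigated promptly.
red: At least one primary shard (and all of its replicas) is missing. This means data is missing: searches return partial results, and indexing into that shard returns an exception.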
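
The status can also be read programmatically from GET /_cluster/health. The sketch below uses only the Python standard library. The base URL assumes a local, unsecured node, the helper names are made up for this example, and the sample response is trimmed and illustrative.

```python
import json
import urllib.request


def fetch_cluster_health(base_url="http://localhost:9200"):
    """Call GET /_cluster/health and return the parsed JSON response."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


def summarize_health(health):
    """Turn a cluster-health response into a one-line verdict."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "green: all primary and replica shards are allocated"
    if status == "yellow":
        return f"yellow: primaries allocated, {unassigned} replica shard(s) unassigned"
    if status == "red":
        return f"red: at least one primary shard is missing ({unassigned} unassigned)"
    return f"unknown status: {status!r}"


# Trimmed sample response (illustrative values). A single-node setup with
# number_of_replicas = 1 typically looks like this, because a replica is
# never allocated on the same node as its primary.
SAMPLE = {"cluster_name": "elasticsearch", "status": "yellow", "unassigned_shards": 1}
print(summarize_health(SAMPLE))
# Against a running node: print(summarize_health(fetch_cluster_health()))
```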

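6.2 Understanding the ES index structure

ES	Relational database
Index	Database
Document type	Table
Mapping	Column definitions (schema)
Document	Row

Similarities and differences between ES and relational databases

**Similarity:** Earlier versions of ES supported mapping types (recent versions keep only _doc), so one index could store several different kinds of entities, just as one database stores several tables.

**Difference:** Within an index, fields that have the same name in different types are backed by the same Lucene field. If type A stores a field as a number and type B stores a same-named field as a boolean, the two definitions conflict. Storing entities that share few or no fields in the same index also makes the data sparse, which hurts Lucene's ability to compress documents.

**Note:** To solve this, Elastic decided to deprecate mapping types, so each index now holds a single document type.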

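7. Importing POI data

7.1 Use PostGIS: PostGIS Bundle 3 for PostgreSQL x64 12…

7.2 Use the shp2pgsql program:

Open the PostgreSQL bin directory and run this command in cmd:

shp2pgsql -W "UTF-8" -g poi_geom -s 4326 E:\POI\POI.shp poi > E:\POI\poi1.sql

Parameters:

-W: character encoding of the shapefile's attribute data
-g: name of the geometry column to generate
-s: spatial reference (SRID)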
………
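
If the export needs to be scripted, the same command can be assembled in Python. This is a sketch that only wraps the three flags listed above. The function names are made up for this example, and it assumes shp2pgsql is on the PATH.

```python
import subprocess


def shp2pgsql_args(shp_path, table, geom_column="poi_geom", srid=4326, encoding="UTF-8"):
    """Build the argument list for the shp2pgsql command shown above."""
    return [
        "shp2pgsql",
        "-W", encoding,      # character encoding of the shapefile attributes
        "-g", geom_column,   # name of the geometry column to create
        "-s", str(srid),     # SRID (spatial reference) of the data
        shp_path,
        table,
    ]


def export_sql(shp_path, table, sql_path):
    """Run shp2pgsql and write the generated SQL to sql_path (like `> file`)."""
    with open(sql_path, "wb") as out:
        subprocess.run(shp2pgsql_args(shp_path, table), stdout=out, check=True)


print(" ".join(shp2pgsql_args(r"E:\POI\POI.shp", "poi")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.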

Reference:

http://postgis.net/workshops/postgis-intro/loading_data.html

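7.3 Things to watch out for:

The shapefile being imported should come with a .cpg file, because that file records the character encoding (UTF-8 by default). If it is missing, well, the import will not succeed in a hundred years.
Version issue: with PostgreSQL 12 and Navicat 12, the table did not show up after the shapefile was imported successfully; after switching to Navicat 15 it displayed normally.
Create the spatial extensions: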
- CREATE EXTENSION postgis;
- CREATE EXTENSION pgrouting;
- CREATE EXTENSION postgis_topology;
- CREATE EXTENSION fuzzystrmatch;
- CREATE EXTENSION postgis_tiger_geocoder;
- CREATE EXTENSION address_standardizer;

8. Syncing POI data

Once the configuration is complete, run ES and Logstash as described in section 5

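9. Code walkthrough

9.1 Reading index data

9.2 Highlighting keywords

9.3 Text suggestions

9.4 Geo search

**Note:** The geometry field of the synced POI data is stored as text in the index, but geo queries need the field to be a geo type (geo_point here).

Change the field type in the mapping

Create a template index in which the geometry field type is geo_point

Potential issue:

geo_shape format problems:

ignore_malformed:
If true, malformed GeoJSON or WKT shapes are ignored. If false (default), malformed GeoJSON and WKT shapes throw an exception and reject the entire document.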

{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "address": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_smart"
      },
      "category": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_smart"
      },
      "geometry": {
        "type": "geo_point"
      },
      "id": {
        "type": "long"
      },
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        "analyzer": "ik_smart"
      },
      "telephone": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Copy the index data into the new index:

POST /_reindex
{
  "source": {
    "index": "my_index"
  },
  "dest": {
    "index": "my_index2"
  }
}

But the problem was still not solved …
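
For reference, the two steps (create the target index with an explicit mapping, then reindex into it) can be scripted as below. This sketch only builds the requests and does not send them. The base URL and the shortened mapping are assumptions. Keep in mind that _reindex copies each document's _source unchanged, so the stored geometry value must already be in a format that geo_point accepts.

```python
import json
import urllib.request


def es_request(method, url, body):
    """Build (but do not send) a JSON request for the Elasticsearch REST API."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def reindex_requests(base_url, source, dest, mappings):
    """Requests that create dest with an explicit mapping, then copy source into it."""
    create = es_request("PUT", f"{base_url}/{dest}", {"mappings": mappings})
    copy = es_request(
        "POST",
        f"{base_url}/_reindex",
        {"source": {"index": source}, "dest": {"index": dest}},
    )
    return [create, copy]


# Shortened mapping for illustration; only the geometry field is declared.
MAPPINGS = {"properties": {"geometry": {"type": "geo_point"}}}
REQUESTS = reindex_requests("http://localhost:9200", "my_index", "my_index2", MAPPINGS)
for req in REQUESTS:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```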

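10. ES field types

10.1 ES geometry field types

1. Geopoint

A geo_point can be specified in the following five ways.
In the array and WKT forms longitude comes first ([longitude, latitude]); in the object and string forms latitude comes first (lat, lon). A geohash encodes both values in a single string.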

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "text": "Geopoint as an object",
  "location": { 
    "lat": 41.12,
    "lon": -71.34
  }
}

PUT my-index-000001/_doc/2
{
  "text": "Geopoint as a string",
  "location": "41.12,-71.34" 
}

PUT my-index-000001/_doc/3
{
  "text": "Geopoint as a geohash",
  "location": "drm3btev3e86" 
}

PUT my-index-000001/_doc/4
{
  "text": "Geopoint as an array",
  "location": [ -71.34, 41.12 ] 
}

PUT my-index-000001/_doc/5
{
  "text": "Geopoint as a WKT POINT primitive",
  "location" : "POINT (-71.34 41.12)" 
}

GET my-index-000001/_search
{
  "query": {
    "geo_bounding_box": { 
      "location": {
        "top_left": {
          "lat": 42,
          "lon": -72
        },
        "bottom_right": {
          "lat": 40,
          "lon": -74
        }
      }
    }
  }
}
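
The ordering rules are easy to mix up, so here is a small Python sketch that normalizes four of the five forms above to a (lat, lon) pair. It is an illustration, not ES code: the function name is made up, and geohash strings are deliberately not decoded and raise an error.

```python
import re

_WKT_POINT = re.compile(r"^\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*$", re.IGNORECASE)


def to_lat_lon(value):
    """Return (lat, lon) for the object, string, array, and WKT geo_point forms."""
    if isinstance(value, dict):              # {"lat": ..., "lon": ...}
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, (list, tuple)):     # [lon, lat]: longitude first
        lon, lat = value
        return float(lat), float(lon)
    if isinstance(value, str):
        wkt = _WKT_POINT.match(value)
        if wkt:                              # "POINT (lon lat)": longitude first
            return float(wkt.group(2)), float(wkt.group(1))
        if "," in value:                     # "lat,lon": latitude first
            lat, lon = value.split(",")
            return float(lat), float(lon)
    raise ValueError(f"unsupported geo_point value: {value!r}")


for sample in ({"lat": 41.12, "lon": -71.34}, "41.12,-71.34",
               [-71.34, 41.12], "POINT (-71.34 41.12)"):
    print(to_lat_lon(sample))
```

All four samples describe the same point, which is a quick way to check that the data coming out of the database uses the order the chosen format expects.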

2. GeoShape

GeoShape is used to index and query shapes such as rectangles and polygons

The mapping must explicitly declare the field as type geo_shape

3. Point

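11. Fixing geometry fields that end up as text when syncing spatial data to ES

Explicitly declare the geometry field as geo_point in the mapping.
The value delivered from the database column that stores the geometry must match one of the five geo_point formats.

11.1 Get the latitude/longitude coordinates and join them with concat

Works: get the coordinates as a String. In the string form, latitude comes first and longitude second.
The field is a String, which corresponds to the string form of an ES geo_point.

Note: when creating an ES geo_point from a string, the geometry alias in the SQL statement must be the same as the geometry field name in the ES mapping.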

"statement"  => "select gid,name,adress,telephone,category,concat_ws(',',st_y(poi_geom),st_x(poi_geom))as geometry from poi2"

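11.2 Get the latitude/longitude coordinates and join them with concat

This approach is the same as the previous one, with the x and y values also selected as separate columns.

Works: get the coordinates as a String; their types need to be converted.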

"statement"  => "select gid,name,adress,telephone,category,st_x(poi_geom)as x,st_y(poi_geom)as y,concat_ws(',',st_y(poi_geom),st_x(poi_geom))as location from poi2"

mutate {
    # Mind the order in which the options are written:
    #rename => {
        # Watch the latitude/longitude order here
        #"lat" => "[geometry][y]"
        #"lon" => "[geometry][x]"
    #}

    # Convert the coordinates to float
    #convert => {"x" => "float"}
    #convert => {"y" => "float"}

    # Convert the elements of the geometry array to float
    #convert => {
        #"[geometry][0]" => "float"
        #"[geometry][1]" => "float"
    #}

    remove_field => ["@timestamp", "@version"]
}

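11.3 Get the latitude/longitude coordinates as an array

This approach needs the json filter.
Get the coordinates as an array.
In the json filter, source is the field that holds the JSON string, and target is the destination field, which should be the geometry field of the ES mapping.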

"statement"  => "select gid,name,adress,telephone,category,st_asgeojson(geom)::json->>'coordinates' as location from poi3"

json {
    # Skip invalid JSON
    # skip_on_invalid_json => true
    # Field that holds the JSON string
    source => "location"

    # Field that receives the parsed JSON
    target => "geometry"

    remove_field => ["@timestamp","@version","geom"]
    remove_tag => ["@timestamp","@version"]
}
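
What the json filter does to an event here can be mimicked in a few lines of Python. This is a simplified sketch: the function name and the sample row are hypothetical, and the real filter has more options. The point is that the coordinates arrive as a JSON string and leave as a [longitude, latitude] array, which is the order the geo_point array form expects.

```python
import json


def apply_json_filter(event, source, target):
    """Mimic the json filter above: parse event[source] and store it in event[target]."""
    parsed = json.loads(event[source])
    result = dict(event)  # leave the incoming event untouched
    result[target] = parsed
    return result


# A hypothetical row as the JDBC input would deliver it: the coordinates
# arrive as a JSON string, longitude first (GeoJSON order).
EVENT = {"gid": 1, "name": "Library", "location": "[116.40, 39.90]"}
OUT = apply_json_filter(EVENT, source="location", target="geometry")
print(OUT["geometry"])
```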

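12. Syncing MySQL data to ES with Canal

Link:
https://github.com/alibaba/canal/wiki/QuickStart

12.1 Enable the MySQL binlog

1. Configure the MySQL binlog
Open the my.ini file and add the settings below. Note: this configuration file is under the ProgramData directory.

[mysqld]

log-bin=mysql-bin # enable binlog

binlog-format=ROW # use ROW format

server_id=1 # required for MySQL replication; must not be the same as Canal's slaveId

2. Restart the MySQL service and check that the binlog is enabled
Log in with the MySQL client and run the following command to check whether the binlog is enabled. Note the trailing semicolon.

SHOW VARIABLES LIKE '%log_bin%';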

12.2 Start Canal

1. Download Canal:

Canal releases

2. Configure the Canal deployer

Open the conf/example/instance.properties configuration file

# mysql serverId
canal.instance.mysql.slaveId = 1234

# position info; change to your own database settings
canal.instance.master.address = 127.0.0.1:3306
canal.instance.master.journal.name =
canal.instance.master.position =
canal.instance.master.timestamp =
#canal.instance.standby.address =
#canal.instance.standby.journal.name =
#canal.instance.standby.position =
#canal.instance.standby.timestamp =

# username/password; change to your own database settings
canal.instance.dbUsername = canal
canal.instance.dbPassword = canal
canal.instance.defaultDatabaseName =
canal.instance.connectionCharset = UTF-8
# table regex
canal.instance.filter.regex = .*\\..*
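
The table filter is a regular expression matched against the full name schema.table, and several expressions can be separated by commas. Written as .*\\..* in the properties file, it becomes .*\..* once the doubled backslash is resolved, which matches every table in every schema. The Python sketch below approximates that matching with the re module in order to try out a filter. Canal's own matcher may differ in details, and the schema and table names are examples.

```python
import re


def table_matches(filter_regex, schema, table):
    """Check a comma-separated filter against the full name "schema.table"."""
    full_name = f"{schema}.{table}"
    return any(
        re.fullmatch(pattern.strip(), full_name) is not None
        for pattern in filter_regex.split(",")
    )


ALL_TABLES = r".*\..*"        # what .*\\..* means after properties unescaping
ONE_SCHEMA = r"test\..*"      # every table in the schema "test"

print(table_matches(ALL_TABLES, "test", "person"))
print(table_matches(ONE_SCHEMA, "other", "person"))
```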

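3. Configure the Canal Adapter: Elasticsearch adapter

Edit the launcher configuration: application.yml
Adapter table-mapping file: edit conf/es/mytest_user.yml

4. After the configuration is done and before starting the sync, be sure to create the index first, including the mapping between the index and the database.

5. In the es7 .yml configuration file you can write an ES mapping, but it is not applied to the index, so the mapping still has to be created in ES.

Note the format of the .yml file: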

dataSourceKey: defaultDS
destination: example
groupId: g1
esMapping:
  _index: person
  _id: id
  sql: "select id,name,age,sexy from person"
  # etlCondition: "where t.c_time>={}"
  commitBatch: 3000
  "person": {
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "id": {
                "type" : "long"
            },
            "name": {
                "type" : "text"
            },
            "age": {
                "type" : "long"
            },
            "sexy": {
                "type": "text"
            }
        }
    }
   }

12.3 Full sync

curl http://localhost:8081/etl/es7/person.yml -X POST
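
The same call can be made from Python, for example from a deployment script. This sketch mirrors the curl command and nothing more: the host, port, adapter type, and task file name are taken from it, and the helper names are made up for this example.

```python
import urllib.request


def etl_request(task_file, adapter_url="http://localhost:8081", adapter_type="es7"):
    """Build the POST request that asks the Canal adapter for a full import."""
    url = f"{adapter_url}/etl/{adapter_type}/{task_file}"
    return urllib.request.Request(url, data=b"", method="POST")


def trigger_full_sync(task_file):
    """Send the request and return the adapter's raw response text."""
    with urllib.request.urlopen(etl_request(task_file), timeout=30) as resp:
        return resp.read().decode("utf-8")


REQ = etl_request("person.yml")
print(REQ.get_method(), REQ.full_url)
# With the adapter running: print(trigger_full_sync("person.yml"))
```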

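12.4 Incremental sync

Insert, update, or delete rows in MySQL; Canal picks the changes up from the binlog and syncs them to ES.

12.5 Pitfalls and fixes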

1. Error message

java.lang.ClassCastException: com.alibaba.druid.pool.DruidDataSource cannot be cast to com.alibaba.druid.pool.DruidDataSource

Fix:

Change the druid dependency: in the es core module under client-adapter, set the scope of the druid dependency in the pom file to provided

 <dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>druid</artifactId>
    <scope>provided</scope>
</dependency>

Download the client source code and rebuild the project,

then find the following jar under client-adapter\es7x\target:
client-adapter.es7x-1.1.5-jar-with-dependencies.jar

Replace \plugin\client-adapter.es7x-1.1.5-jar-with-dependencies.jar in the Canal adapter plugin directory
with the rebuilt es7x-1.1.5-jar-with-dependencies.jar from the es7x module of canal-client

2. It is best to leave the binlog file name empty at first:

# position info
# database address
canal.instance.master.address=127.0.0.1:3306
# binlog file name
#canal.instance.master.journal.name=mysql-bin
canal.instance.master.journal.name=
# binlog offset to start from when connecting to the MySQL master
canal.instance.master.position=
canal.instance.master.timestamp=
canal.instance.master.gtid=

3. It is best to include the scheme (http://) in the ES address