Reading Hive Data with Flink and Writing It to Kafka (Hive Connector and Kafka Connector)

Wu_Init · Published 2021-04-02 12:14:37

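I needed to read data from Kafka and join it with data stored in Hive. While looking into this I noticed that Flink has unified batch and stream processing as of 1.12.0, so I studied the new Hive connector and wrote a small example that reads from Hive and inserts the rows into Kafka (a plain walkthrough, nothing fancy).

References: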

https://www.jianshu.com/p/01c363f166c2

https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/dev/table/connectors/hive/

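1. Introduction

Flink's integration with Hive has two aspects.
The first is using the Hive Metastore as a persistent catalog: through HiveCatalog, users can store Flink metadata in the Hive Metastore and share it across sessions. For example, users can store their Kafka or Elasticsearch tables in the Hive Metastore with HiveCatalog and reuse them later in SQL queries.
The second is using Flink to read and write Hive tables.
HiveCatalog is designed to be compatible with Hive "out of the box", so users can access their existing Hive warehouse directly. You do not need to modify the existing Hive Metastore, or change the data location or partitioning of your tables.

In other words, HiveCatalog is all you need to connect to Hive, with no complicated setup, and tables created with Flink SQL can also be persisted in the Hive Metastore.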

2. Runtime environment

CDH 5.8.4
hive 1.1.0
flink 1.12.1
kafka 2.4.0

3. Project structure

[Figure: project structure screenshot (image not available)]

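4. Dependencies

To integrate with Hive, you need to add some extra dependencies to the /lib/ directory of the Flink distribution so that the Table API or SQL Client can interact with Hive. Alternatively, you can put these dependencies in a dedicated folder and add them to the classpath with the -C option (for Table API programs) or the -l option (for SQL Client).
Apache Hive is built on top of Hadoop, so you first need the Hadoop dependencies:

export HADOOP_CLASSPATH=$(hadoop classpath)

There are two ways to add the Hive dependencies. The first is to use the bundled Hive jar provided by Flink, choosing the one that matches your Metastore version. The second is to add each required jar separately, which is the better fit when your Hive version is not covered by the bundled jars.

Note: prefer the bundled Hive jars provided by Flink. Add the jars separately only when the bundled ones do not meet your needs.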

/flink-1.12.0
   /lib
       // Flink's Hive connector
       flink-connector-hive_2.11-1.12.0.jar

       // Hive dependencies
       hive-metastore-1.1.0.jar
       hive-exec-1.1.0.jar
       libfb303-0.9.2.jar // libfb303 is not packed into hive-exec in some versions, need to add it separately

       // Orc dependencies -- required by the ORC vectorized optimizations
       orc-core-1.4.3-nohive.jar
       aircompressor-0.8.jar // transitive dependency of orc-core
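
Since a single missing jar in this list tends to surface only at runtime, a quick check of the lib directory can save a debugging round. The following is a minimal sketch, not part of the original project: the class name HiveLibCheck and the idea of passing the lib directory as the first argument are assumptions, and the jar names are copied from the listing above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: reports which jars from the listing above are absent. */
final class HiveLibCheck {

    // Jar names copied from the listing above (Hive 1.1.0, Flink 1.12.0, Scala 2.11).
    static final List<String> REQUIRED_JARS = Arrays.asList(
            "flink-connector-hive_2.11-1.12.0.jar",
            "hive-metastore-1.1.0.jar",
            "hive-exec-1.1.0.jar",
            "libfb303-0.9.2.jar",
            "orc-core-1.4.3-nohive.jar",
            "aircompressor-0.8.jar");

    /** Returns the required jars that are not in presentFileNames, in listing order. */
    static List<String> missingJars(Set<String> presentFileNames) {
        List<String> missing = new ArrayList<>();
        for (String jar : REQUIRED_JARS) {
            if (!presentFileNames.contains(jar)) {
                missing.add(jar);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: args[0] is the Flink lib directory, e.g. /flink-1.12.0/lib
        File libDir = new File(args.length > 0 ? args[0] : "lib");
        String[] names = libDir.list(); // null when the directory does not exist
        List<String> found = names == null
                ? Collections.<String>emptyList()
                : Arrays.asList(names);
        Set<String> present = new HashSet<>(found);
        System.out.println("Missing jars: " + missingJars(present));
    }
}
```

Adjust REQUIRED_JARS if your Hive or Flink versions differ from the ones in the listing.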

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <properties>
        <flink.version>1.12.1</flink.version>
        <hive.version>1.1.0</hive.version>
    </properties>

    <groupId>org.apache.flink</groupId>
    <artifactId>flink-quickstart-java</artifactId>
    <version>1.12.0</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-java-bridge_2.11</artifactId>
            <version>${flink.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_2.11</artifactId>
            <version>${flink.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>${flink.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-hive_2.11</artifactId>
            <version>${flink.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>${hive.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.codehaus.janino</groupId>
                    <artifactId>janino</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.codehaus.janino</groupId>
                    <artifactId>commons-compiler</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-json</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hadoop-compatibility_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-shaded-hadoop-2-uber</artifactId>
            <version>2.7.5-9.0</version>
        </dependency>

    </dependencies>
</project>

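5. Code

The following code connects to Hive.

Note that although HiveCatalog does not require a particular planner, reading and writing Hive tables only works with the Blink planner. It is therefore strongly recommended to use the Blink planner when connecting to a Hive warehouse.
The example mixes the Table API and SQL: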

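import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

// The statements below go inside a main method.
StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner()
        .inStreamingMode().build();
StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);

// The cluster uses Kerberos, so authenticate before touching Hive
// (the login runs inside the KerberosAuth constructor)
new KerberosAuth(true);

/**
* name             name under which the Hive catalog is registered
* defaultDatabase  Hive database to connect to
* hiveConfDir      directory containing hive-site.xml; in IDEA or Eclipse it can be read from the resources directory, on a server it must point to the cluster's configuration directory
**/
String name = "myhive";
String defaultDatabase = "hive_database";
String hiveConfDir = System.getProperty("user.dir") + "/resources/";

HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir);
// Register the catalog
bsTableEnv.registerCatalog(name, hive);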

// First create a Kafka table in the default catalog
bsTableEnv.executeSql("create table kafka_table( " +
        "`clickhouse_time` STRING, " +
        "`op` STRING, " +
        "`deleted` INT, " +
        "`name` STRING, " +
        "`id` INT, " +
        "`table` STRING, " +
        "`db` STRING " +
        ") WITH ( " +
        "'connector.type'='kafka', " +
        "'connector.version'='universal', " +
        "'connector.topic'='flinkTopic', " +
        "'connector.properties.bootstrap.servers'='xxx.xxx.xxx.xxx:9092', " +
        "'connector.properties.group.id'='flink', " +
        "'connector.startup-mode'='earliest-offset', " +
        "'format.type'='json' " +
        ")");

// Switch to the Hive catalog before querying Hive; otherwise Flink's default catalog is used
bsTableEnv.useCatalog("myhive");

Table test = bsTableEnv.sqlQuery("select * from test");

bsTableEnv.useCatalog("default_catalog");

bsTableEnv.createTemporaryView("test", test);
// Insert the Hive rows into the Kafka table
bsTableEnv.executeSql("insert into kafka_table select * from test");
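
The Kafka table above is declared with the legacy 'connector.type' style options. The Flink 1.12 documentation also describes a newer option set ('connector', 'topic', 'properties.bootstrap.servers', 'scan.startup.mode', 'format'). As a rough sketch of how the same table could be declared with those keys, the hypothetical helper below (KafkaDdlSketch is not part of the original project) builds the WITH clause as a string; whether the newer keys behave identically in your setup is an assumption you should verify.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: renders a WITH clause using the newer Kafka option keys. */
final class KafkaDdlSketch {

    /** Wraps a value in single quotes, doubling embedded quotes as SQL literals require. */
    static String quote(String s) {
        return "'" + s.replace("'", "''") + "'";
    }

    /** Renders the options as WITH ('k' = 'v', ...) in insertion order. */
    static String withClause(Map<String, String> options) {
        StringJoiner joiner = new StringJoiner(", ", "WITH (", ")");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joiner.add(quote(e.getKey()) + " = " + quote(e.getValue()));
        }
        return joiner.toString();
    }

    /** Newer-style equivalents of the legacy options used in the example above. */
    static Map<String, String> kafkaOptions(String topic, String servers, String groupId) {
        Map<String, String> o = new LinkedHashMap<>();
        o.put("connector", "kafka");
        o.put("topic", topic);
        o.put("properties.bootstrap.servers", servers);
        o.put("properties.group.id", groupId);
        o.put("scan.startup.mode", "earliest-offset");
        o.put("format", "json");
        return o;
    }

    public static void main(String[] args) {
        // Column list shortened for illustration; use the full schema from the example.
        String ddl = "CREATE TABLE kafka_table (`id` INT, `name` STRING) "
                + withClause(kafkaOptions("flinkTopic", "xxx.xxx.xxx.xxx:9092", "flink"));
        System.out.println(ddl);
    }
}
```

The resulting string can be passed to executeSql in place of the hand-concatenated DDL.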

KerberosAuth.java

import org.apache.hadoop.security.UserGroupInformation;

public class KerberosAuth {
    // The Kerberos files are read from the resources directory under the working directory
    public static String path_krb5 = System.getProperty("user.dir") + "/resources/krb5.conf";
    public static String str_principal = "bdp/admin@HADOOP.COM";
    public static String path_keytab = System.getProperty("user.dir") + "/resources/bdp.keytab";

    public KerberosAuth(boolean debug) {
        try {
            // Use the declared path_krb5 instead of a second hard-coded location
            System.setProperty("java.security.krb5.conf", path_krb5);
            System.setProperty("javax.security.auth.useSubjectCredsOnly", "false");
            if (debug) {
                System.setProperty("sun.security.krb5.debug", "true");
            }
            // Log in regardless of the debug flag:
            // renew an existing keytab login, otherwise log in from the keytab
            if (UserGroupInformation.isLoginKeytabBased()) {
                UserGroupInformation.getLoginUser().reloginFromKeytab();
            } else {
                UserGroupInformation.loginUserFromKeytab(str_principal, path_keytab);
            }
            System.out.println("ticketCache---->" + UserGroupInformation.isLoginTicketBased());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
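
The class above, like hiveConfDir earlier, resolves its files relative to user.dir, which works in an IDE but usually not once the job is packaged and run on a server. One possible way to handle both cases is to let an explicit override win and fall back to the resources directory otherwise. This is only a sketch: ConfLocator and the system property names kerberos.keytab and kerberos.krb5 are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: picks an explicit override path, else <baseDir>/resources/<fileName>. */
final class ConfLocator {

    static Path locate(String override, String baseDir, String fileName) {
        if (override != null && !override.trim().isEmpty()) {
            // An explicit location (for example on the cluster) takes precedence
            return Paths.get(override.trim());
        }
        // IDE-style fallback used throughout this post
        return Paths.get(baseDir, "resources", fileName);
    }

    public static void main(String[] args) {
        String baseDir = System.getProperty("user.dir");
        // Assumed property names; pass them with -Dkerberos.keytab=... when running on a server
        Path keytab = locate(System.getProperty("kerberos.keytab"), baseDir, "bdp.keytab");
        Path krb5 = locate(System.getProperty("kerberos.krb5"), baseDir, "krb5.conf");
        System.out.println("keytab=" + keytab + ", krb5=" + krb5);
    }
}
```

The resolved paths could then be handed to the Kerberos login and to the HiveCatalog constructor instead of the hard-coded strings.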

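Run the program, then check the data in the Kafka topic.

The HiveCatalog parameters and the Hive connector configuration are shown below:
[Figure: HiveCatalog parameters and Hive connector options (image not available)]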

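6. Problems encountered

Dependency problems: to match the Hive version and to support Kerberos authentication, the following dependencies had to be added: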

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.codehaus.janino</groupId>
            <artifactId>janino</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.janino</groupId>
            <artifactId>commons-compiler</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-shaded-hadoop-2-uber</artifactId>
    <version>2.7.5-9.0</version>
</dependency>

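The JSON read from Kafka did not match the table definition.
When the table is created with Flink SQL, the declared columns and types must correspond to the JSON fields; they did not at first and had to be adjusted.
Reference: the Flink 1.12.0 JSON format mapping.
Various dependency exceptions when reading Hive ORC files: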

Caused by: java.lang.ClassNotFoundException: org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch

java.io.IOException: Problem reading file footer hdfs://xxx:8020/user/hive/warehouse/xxx.db/xxx/000000_0

read hive orc table exception. could not initialize class org.apache.orc.impl.ZlibCodec

Solution:
Put the jars that match your own Hive version into Flink's lib directory (Hive 1.1.0 in my case):

/flink-1.12.0
   /lib

       // Flink's Hive connector
       flink-connector-hive_2.11-1.12.0.jar

       // Hive dependencies
       hive-metastore-1.1.0.jar
       hive-exec-1.1.0.jar
       libfb303-0.9.2.jar // libfb303 is not packed into hive-exec in some versions, need to add it separately

       // Orc dependencies -- required by the ORC vectorized optimizations
       orc-core-1.4.3-nohive.jar
       aircompressor-0.8.jar // transitive dependency of orc-core
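
To confirm that the ORC classes named in the exceptions above are visible after the jars are in place, a small probe can be run with the same classpath as the job. This is a minimal sketch; ClasspathProbe is a hypothetical name, and the probe only checks that a class can be located, not that its static initialization succeeds (which is what "could not initialize class" complains about).

```java
/** Hypothetical helper: checks whether classes from the stack traces can be located. */
final class ClasspathProbe {

    static boolean isLoadable(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the exception messages above
        String[] names = {
            "org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch",
            "org.apache.orc.impl.ZlibCodec"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```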