ClickHouse Cluster Installation


This article uses version 21.12.3.32, with six virtual machines prepared for the cluster.

Single-node installation (link):
https://blog.csdn.net/xushijie89/article/details/122987121

The cluster depends on a ZooKeeper service; please set up a ZK cluster yourself first.

Modifying the Configuration Files


(1) Edit the configuration file on every machine; create it if it does not exist.

[root@172.16.131.56 /etc/clickhouse-server]$ vim /etc/clickhouse-server/config.d/metrika.xml
<clickhouse>
		<!-- Element names differ between ClickHouse versions; adjust accordingly -->
        <remote_servers>
                <!-- Cluster name; it can be customized and is referenced later when creating databases and tables -->
                <cluster_3shards_2replicas>
	                <shard>
	                	<!-- Internal replication (default false): when enabled, a write through the Distributed engine -->
	                	<!-- is sent to only one replica per shard instead of to every replica -->
	                	<internal_replication>true</internal_replication>
		                <replica>
		                    <host>hadoop1</host>
		                    <port>9001</port>
		                </replica>
		                <replica>
		                    <host>hadoop2</host>
		                    <port>9001</port>
		                </replica>
	            	</shard>
		            <shard>
		            	<internal_replication>true</internal_replication>
		                <replica>
		                    <host>hadoop3</host>
		                    <port>9001</port>
		                </replica>
		                <replica>
		                    <host>hadoop4</host>
		                    <port>9001</port>
		                </replica>
		            </shard>
		            <shard>
		            	<internal_replication>true</internal_replication>
		                <replica>
		                    <host>hadoop5</host>
		                    <port>9001</port>
		                </replica>
		                <replica>
		                    <host>hadoop6</host>
		                    <port>9001</port>
		                </replica>
		            </shard>
                </cluster_3shards_2replicas>
        </remote_servers>
		
		<!-- ZooKeeper configuration -->
        <zookeeper>
	        <node>
	            <host>hadoop1</host>
	            <port>2181</port>
	        </node>
	        <node>
	            <host>hadoop2</host>
	            <port>2181</port>
	        </node>
	        <node>
	            <host>hadoop3</host>
	            <port>2181</port>
	        </node>
    	</zookeeper>
    
    <!-- Global macros; they can be substituted with {} placeholders when creating replicated tables (example below) -->
    <macros>
    	<!-- Machine in shard 01 -->
        <shard>01</shard>
        <database>database_name</database>
        <replica>hadoop1</replica>
        <!-- A replica machine of shard 01 (e.g. hadoop2) would use: -->
        <!--<shard>01</shard>-->
        <!--<database>database_name</database>-->
        <!--<replica>hadoop2</replica>-->
    </macros>
    <!-- Compression settings -->
    <compression>
        <case>
            <min_part_size>10000000000</min_part_size>
            <min_part_size_ratio>0.01</min_part_size_ratio>
            <method>lz4</method> <!-- lz4 is faster than zstd but compresses less, so it uses more disk -->
        </case>
    </compression>
</clickhouse>
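
Note that the <macros> block is the one part of metrika.xml that must differ per host: each machine declares which shard it belongs to and its own replica name. As a sketch based on the cluster layout above, hadoop3 (the first replica of shard 02) would presumably use:

<macros>
    <shard>02</shard>
    <database>database_name</database>
    <replica>hadoop3</replica>
</macros>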

(2) Check that the ClickHouse config points to the right include file

[root@172.16.131.56 /etc/clickhouse-server]$ vim /etc/clickhouse-server/config.xml
Check this line:
<include_from>/etc/clickhouse-server/config.d/metrika.xml</include_from>

(3) Start the service

/etc/init.d/clickhouse-server start
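
After the service is up on every node, a quick sanity check (a sketch; the cluster name is the one defined in metrika.xml above) is to confirm that ClickHouse has actually loaded the cluster definition:

SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'cluster_3shards_2replicas';

All six hosts should appear, grouped into three shards with two replicas each.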

(4) Connect with the client and create the database

[root@172.16.131.56 /etc/clickhouse-server/config.d]$ clickhouse-client -m
CREATE DATABASE db_name on cluster cluster_3shards_2replicas COMMENT 'test database';
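
Because of the on cluster clause, the database should now exist on every node. One way to confirm this from a single session is sketched below, assuming the clusterAllReplicas table function is available in this version:

SELECT hostName() AS host, name
FROM clusterAllReplicas('cluster_3shards_2replicas', system.databases)
WHERE name = 'db_name';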

(5) Create the replicated table (the physical table that actually stores the data)

create table db_name.table_name on cluster cluster_3shards_2replicas (
 id UInt32,
 sku_id String,
 total_amount Decimal(16,2),
 create_time Datetime
) engine 
= ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/table_name','{replica}')
 partition by toYYYYMMDD(create_time)
 primary key (id)
 order by (id,sku_id);

on cluster cluster_3shards_2replicas:
Adding this clause to the statement makes it execute once on every node defined in cluster_3shards_2replicas.

Parameters of ReplicatedMergeTree():
zoo_path: path of the table in ZooKeeper
replica_name: name of the replica in ZooKeeper
other_parameters (optional): parameters of the engine used to create the replicated version, for example version in ReplacingMergeTree.
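
With the macros defined in metrika.xml, the {shard}, {database} and {replica} placeholders are expanded per host. For example, on hadoop1 the engine clause above would resolve to roughly:

ReplicatedMergeTree('/clickhouse/tables/01/database_name/table_name', 'hadoop1')

while on hadoop4 (the second replica of shard 02) it would become '/clickhouse/tables/02/database_name/table_name' with replica 'hadoop4'. All replicas of one shard share the same ZooKeeper path and differ only in the replica name.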

(6) Create the distributed table (a logical table; it stores no data itself)

CREATE TABLE db_name.table_name_all on cluster cluster_3shards_2replicas(
    id UInt32,
    sku_id String,
    total_amount Decimal(16,2),
    create_time Datetime
) ENGINE
= Distributed(cluster_3shards_2replicas, db_name, table_name, hiveHash(id));

Parameters of Distributed():
cluster: the cluster name from the configuration
database: database name
table: name of the local table
sharding_key (optional): sharding key
policy_name (optional): storage policy name, used for the temporary files of asynchronous sends
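
With hiveHash(id) as the sharding key and equal shard weights (the default), a row is routed by the remainder of the hash divided by the total shard weight. A small sketch to see where a given id would land, assuming the three shards above each have weight 1:

SELECT hiveHash(201) AS hash_value, hiveHash(201) % 3 AS shard_index;

A shard_index of 0 maps to the first shard in the configuration, 1 to the second, and 2 to the third.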

(7) Insert data through the distributed table first
insert into db_name.table_name_all values (201, 'sku_001', 1000.00, '2020-06-01 12:00:00');

Then check, machine by machine, which shard's physical table the data landed in:
select * from db_name.table_name;

Finally, check whether each shard's replica also holds the data.
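
Two optional checks that can be run from a single session (a sketch; column lists may vary by version):

-- which host served each row when reading through the distributed table
SELECT hostName() AS host, id, sku_id, total_amount
FROM db_name.table_name_all;

-- replication status of the local table, run on any replica
SELECT database, table, replica_name, total_replicas, active_replicas
FROM system.replicas
WHERE table = 'table_name';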
