Using Kafka in Python and stopping a Kafka consumer



ol_m_lo · published 2021-12-06 16:54:35

Notes on some of the main Kafka parameters:
consumer

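:param bootstrap_servers: broker address, either ip:port or [ip:port, ip:port, ip:port]
:param sasl_mechanism: set to "PLAIN" to authenticate with a user name and password; default None
:param username: user name (passed to kafka-python as sasl_plain_username)
:param password: password (passed to kafka-python as sasl_plain_password)
:param security_protocol: PLAINTEXT, SSL, SASL_PLAINTEXT or SASL_SSL; default PLAINTEXT.
SSL: authenticate with SSL; ssl_cafile, ssl_certfile and ssl_keyfile must be supplied.
SASL_PLAINTEXT: authenticate with a user name and password.
:param ssl_check_hostname: flag for whether the SSL handshake verifies that the certificate matches the broker hostname; True when using SSL (kafka-python's own default is True)
:param ssl_cafile: (str) CARoot certificate (optional filename of the CA file used to verify certificates. Default: None.)
:param ssl_certfile: (str) client certificate (optional filename of a PEM file containing the client certificate plus any CA certificates needed to establish its authenticity. Default: None.)
:param ssl_keyfile: (str) client key (optional filename of the file containing the client private key. Default: None.)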
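As a rough illustration of how the parameters above fit together, the sketch below maps them onto keyword arguments for kafka-python's KafkaConsumer. The helper name build_consumer_kwargs is hypothetical and not part of any library; the keyword names it emits are kafka-python constructor arguments. It only builds a dict, so it can be inspected without a running broker.

```python
def build_consumer_kwargs(bootstrap_servers, security_protocol="PLAINTEXT",
                          username=None, password=None,
                          ssl_cafile=None, ssl_certfile=None, ssl_keyfile=None):
    """Hypothetical helper: turn the parameters listed above into keyword
    arguments accepted by kafka.KafkaConsumer."""
    kwargs = {
        "bootstrap_servers": bootstrap_servers,
        "security_protocol": security_protocol,
    }
    if security_protocol in ("SASL_PLAINTEXT", "SASL_SSL"):
        if username is None or password is None:
            raise ValueError("SASL with PLAIN needs a user name and password")
        kwargs.update(
            sasl_mechanism="PLAIN",
            sasl_plain_username=username,
            sasl_plain_password=password,
        )
    if security_protocol in ("SSL", "SASL_SSL"):
        # Only pass the certificate files that were actually supplied.
        ssl_files = (
            ("ssl_cafile", ssl_cafile),
            ("ssl_certfile", ssl_certfile),
            ("ssl_keyfile", ssl_keyfile),
        )
        for name, value in ssl_files:
            if value is not None:
                kwargs[name] = value
    return kwargs
```

With kafka-python installed, the result would be used as KafkaConsumer("topic", **build_consumer_kwargs("localhost:9092", "SASL_PLAINTEXT", username="user", password="secret")).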

fetch_max_bytes (int): The maximum amount of data the server should return for a fetch request. This is not an absolute maximum: if the first message in the first non-empty partition of the fetch is larger than this value, the message will still be returned to ensure that the consumer can make progress. NOTE: the consumer performs fetches to multiple brokers in parallel, so memory usage will depend on the number of brokers containing partitions for the topic. Supported Kafka version >= 0.10.1.0. Default: 52428800 (50 MB).

max_partition_fetch_bytes (int): The maximum amount of data per partition the server will return. The maximum total memory used for a request = #partitions * max_partition_fetch_bytes. This size must be at least as large as the maximum message size the server allows, or else it is possible for the producer to send messages larger than the consumer can fetch. If that happens, the consumer can get stuck trying to fetch a large message on a certain partition. Default: 1048576.
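To make the memory formula concrete, here is a small worked calculation. The function names and the partition count of 12 are illustrative assumptions; broker_max_message_bytes stands for the broker's maximum message size (the message.max.bytes broker setting).

```python
def max_fetch_memory_bytes(num_partitions, max_partition_fetch_bytes=1048576):
    """Upper bound per request: #partitions * max_partition_fetch_bytes."""
    return num_partitions * max_partition_fetch_bytes


def fetch_size_is_safe(max_partition_fetch_bytes, broker_max_message_bytes):
    """True if the consumer can fetch the largest message the broker accepts."""
    return max_partition_fetch_bytes >= broker_max_message_bytes


# 12 partitions at the default of 1 MiB per partition
print(max_fetch_memory_bytes(12))  # 12582912 bytes, i.e. 12 MiB

# A broker that accepts 2 MiB messages needs a larger per-partition limit
print(fetch_size_is_safe(1048576, 2 * 1048576))  # False
```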


max_poll_records (int): The maximum number of records returned in a single call to KafkaConsumer.poll(). Default: 500

receive_buffer_bytes (int): The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. Default: None (relies on system defaults). The Java client defaults to 32768.

send_buffer_bytes (int): The size of the TCP send buffer (SO_SNDBUF) to use when sending data. Default: None (relies on system defaults). The Java client defaults to 131072.

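First, the problem:

Method 1

When a Kafka consumer is used in the usual way, the message loop never ends, because iteration blocks while waiting for new messages. Adding the consumer_timeout_ms parameter (a timeout in milliseconds) fixes this: if no message is received within that time, iteration stops and the consumer can be closed. Example:

consumer.py: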

from kafka import KafkaProducer, KafkaConsumer
import time


class KafkaClient(object):
    topic = "topic"  # Kafka topic to use
    client = "0.0.0.0:19823"  # address of the Kafka broker
    group_id = "test_consumer_group"  # Kafka consumer group

    @staticmethod
    def log(log_str):
        t = time.strftime(r"%Y-%m-%d_%H:%M:%S", time.localtime())
        print("[%s]%s" % (t, log_str))

    def info_send(self, key, info_str):
        """key: key of the message to send; info_str: content of the message"""
        producer = KafkaProducer(bootstrap_servers=[self.client])
        producer.send(self.topic, key=key.encode("utf-8"), value=info_str.encode("utf-8"))
        # when sending a batch, producer.flush() can be used to force delivery
        producer.close()

    def message_consumer(self):
        # consumer_timeout_ms: timeout in milliseconds; iteration stops when no message arrives within it
        consumer = KafkaConsumer(self.topic, group_id=self.group_id, bootstrap_servers=[self.client], consumer_timeout_ms=3000)
        for msg in consumer:
            # partition: partition of the message, offset: position within the partition, key: message key, value: message content
            print(msg.topic, msg.partition, msg.offset, msg.timestamp, msg.key, msg.value)
        # the loop has ended because of the timeout; release the connection
        consumer.close()
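A related way to get the same effect is to drive the consumer with poll() and stop once nothing has arrived for a while. The sketch below is a minimal version of that idea, not the approach used in consumer.py above. The function name consume_until_idle and its parameters are hypothetical; the only thing it assumes about the consumer is that poll(timeout_ms=...) returns a dict mapping partitions to lists of records, as kafka-python's KafkaConsumer.poll does.

```python
import time


def consume_until_idle(consumer, idle_ms=3000, poll_ms=500, handle=print):
    """Poll `consumer` until no record has arrived for `idle_ms` milliseconds.

    Returns the number of records passed to `handle`.
    """
    handled = 0
    last_seen = time.monotonic()
    while (time.monotonic() - last_seen) * 1000 < idle_ms:
        # poll() returns {partition: [records]}; an empty dict means nothing arrived
        batches = consumer.poll(timeout_ms=poll_ms)
        for records in batches.values():
            for record in records:
                handle(record)
                handled += 1
        if batches:
            last_seen = time.monotonic()
    return handled
```

Compared with consumer_timeout_ms, this keeps the stopping rule in your own code, so it is easy to add other exit conditions (a message count, a stop flag) to the same loop.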

Method 2

We can first look up the offset of the last message in the topic, then stop the loop once we reach that offset.

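# Variant 1: seek_to_end() + position()
from kafka import KafkaConsumer, TopicPartition

client = "localhost:9092"
# pass the broker via bootstrap_servers; a positional argument would be treated as a topic name
consumer = KafkaConsumer(bootstrap_servers=client)
topic = 'test'
tp = TopicPartition(topic, 0)
# register to the topic
consumer.assign([tp])

# obtain the last offset value (the offset of the next message to be written)
consumer.seek_to_end(tp)
lastOffset = consumer.position(tp)

consumer.seek_to_beginning(tp)

# note: if the partition is empty (lastOffset == 0) this loop blocks waiting for a first message
for message in consumer:
    print("Offset:", message.offset)
    print("Value:", message.value)
    if message.offset >= lastOffset - 1:
        break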
 
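# Variant 2: use end_offsets()
from kafka import KafkaConsumer, TopicPartition

# settings
client = "localhost:9092"
topic = 'test'

# prepare consumer (the broker goes in bootstrap_servers, not as a positional topic name)
tp = TopicPartition(topic, 0)
consumer = KafkaConsumer(bootstrap_servers=client)
consumer.assign([tp])
consumer.seek_to_beginning(tp)

# obtain the last offset value (the offset of the next message to be written)
lastOffset = consumer.end_offsets([tp])[tp]

# note: if the partition is empty (lastOffset == 0) this loop blocks waiting for a first message
for message in consumer:
    print("Offset:", message.offset)
    print("Value:", message.value)
    if message.offset >= lastOffset - 1:
        break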
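Both variants above read a single partition (partition 0). The same idea can be sketched for several partitions by comparing each partition's position with the end offset captured at the start. The function name read_to_end is hypothetical; it assumes a consumer that behaves like kafka-python's KafkaConsumer (end_offsets, assign, seek_to_beginning, position, poll) and a list of TopicPartition objects.

```python
def read_to_end(consumer, partitions, handle=print, poll_ms=1000):
    """Read every partition from the beginning up to the end offsets captured
    when the function starts, then return the number of records handled."""
    # {partition: offset of the next message that will be written}
    end_offsets = consumer.end_offsets(partitions)
    consumer.assign(partitions)
    consumer.seek_to_beginning(*partitions)

    handled = 0
    # position(tp) is the offset of the next record to fetch, so a partition
    # is finished once its position reaches the captured end offset.
    # An empty partition is finished immediately.
    while any(consumer.position(tp) < end_offsets[tp] for tp in partitions):
        for tp, records in consumer.poll(timeout_ms=poll_ms).items():
            for record in records:
                # skip records produced after the end offsets were captured
                if record.offset < end_offsets[tp]:
                    handle(record)
                    handled += 1
    return handled
```

Checking position() rather than the offset of the last message seen avoids the blocking case of an empty partition, since no message ever has to arrive for the loop condition to become false.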

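Using SSL authentication with Kafka

Extracting the keys
After an Apache Kafka instance has been configured, there are two JKS containers: kafka.client.keystore.jks and kafka.client.truststore.jks. The first contains the signed client certificate, its private key, and the "CARoot" certificate that was used to sign it. The second contains the certificate that was used to sign the client certificate and key. Everything we need is therefore in the kafka.client.keystore.jks file. To get an overview of its contents, run: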

keytool -list -rfc -keystore kafka.client.keystore.jks

Extracting the client certificate

First, we extract the client certificate:

keytool -exportcert -alias caroot -keystore kafka.client.keystore.jks -rfc -file certificate.pem

Note that the value for the -alias argument in the command above can be looked up with the following command:

keytool -list -rfc -keystore kafka.client.keystore.jks

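Extracting the client key

Next we extract the client key. keytool does not support this directly, so we first convert the keystore to PKCS12 format and then extract the private key from that: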

keytool -v -importkeystore -srckeystore kafka.client.keystore.jks -srcalias caroot -destkeystore cert_and_key.p12 -deststoretype PKCS12

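Once the p12 file has been generated, use the command below to print the key to STDOUT, from where it can be copied and pasted into key.pem (make sure to copy the whole block from -----BEGIN PRIVATE KEY----- to -----END PRIVATE KEY-----, marker lines included).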

openssl pkcs12 -in cert_and_key.p12 -nocerts -nodes

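However, after running the command above I could not find the printed key anywhere. In the end I looked up the OpenSSL commands at https://www.openssl.net.cn/docs/249.html, printed the result to the terminal with the command below, and copied it from the terminal into key.pem: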

openssl pkcs12 -in cert_and_key.p12 -nodes

As it turned out, though, I never used the key.pem file in the end; supplying the corresponding password was enough.
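The copy-and-paste step can also be scripted. The sketch below pulls the private key block out of text such as the output of the openssl pkcs12 command, which may also contain "Bag Attributes" lines and certificates. The function name extract_private_key is hypothetical, and the sample text is made up for illustration rather than being real key material.

```python
import re

# Matches a PEM block such as "PRIVATE KEY", "RSA PRIVATE KEY" or
# "ENCRYPTED PRIVATE KEY", marker lines included.
_KEY_BLOCK = re.compile(
    r"-----BEGIN (?P<kind>[A-Z ]*PRIVATE KEY)-----.*?-----END (?P=kind)-----",
    re.DOTALL,
)


def extract_private_key(openssl_output):
    """Return the first PEM private key block in `openssl_output`,
    including its BEGIN/END lines, or raise ValueError if there is none."""
    match = _KEY_BLOCK.search(openssl_output)
    if match is None:
        raise ValueError("no PEM private key block found")
    return match.group(0) + "\n"


# Illustrative input only; the base64 lines are placeholders, not a real key.
sample = (
    "Bag Attributes\n"
    "    friendlyName: caroot\n"
    "-----BEGIN PRIVATE KEY-----\n"
    "MIIBplaceholder\n"
    "-----END PRIVATE KEY-----\n"
    "-----BEGIN CERTIFICATE-----\n"
    "MIICplaceholder\n"
    "-----END CERTIFICATE-----\n"
)
print(extract_private_key(sample), end="")
```

The returned text can be written to key.pem as is, since a PEM key file needs the BEGIN and END lines as well as the lines between them.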

Extracting the CARoot certificate
Finally, we extract the CARoot certificate:

keytool -exportcert -alias CARoot -keystore kafka.client.keystore.jks -rfc -file CARoot.pem

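Creating a connection with kafka-python
We now have three files: certificate.pem, key.pem and CARoot.pem. With kafka-python they can be passed as arguments to the consumer and producer constructors: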

from kafka import KafkaConsumer, KafkaProducer, TopicPartition

TOPIC = "test"

consumer = KafkaConsumer(bootstrap_servers='my.server.com',
                          security_protocol='SSL',
                          ssl_check_hostname=True,
                          ssl_cafile='CARoot.pem',
                          ssl_certfile='certificate.pem',
                          ssl_keyfile='key.pem')

producer = KafkaProducer(bootstrap_servers='my.server.com',
                          security_protocol='SSL',
                          ssl_check_hostname=True,
                          ssl_cafile='CARoot.pem',
                          ssl_certfile='certificate.pem',
                          ssl_keyfile='key.pem')

# Write hello world to test topic (the value must be bytes)
producer.send(TOPIC, b"Hello World")
producer.flush()

# Read and print all messages from test topic
tp = TopicPartition(TOPIC, 0)
consumer.assign([tp])
consumer.seek_to_beginning(tp)
for msg in consumer:
    print(msg)

Creating a connection with pykafka
In a similar way, these files can also be passed as arguments to pykafka:

from pykafka import KafkaClient, SslConfig

config = SslConfig(cafile='CARoot.pem',
                   certfile='certificate.pem',
                   keyfile='key.pem')

client = KafkaClient(hosts='my.server.com',
                     ssl_config=config)

topic = client.topics["test"]

# Write hello world to test topic (the message must be bytes)
with topic.get_sync_producer() as producer:
    producer.produce(b'Hello World')

# Print all messages from test topic
consumer = topic.get_simple_consumer()
for message in consumer:
    if message is not None:
        print('{} {}'.format(message.offset, message.value))