[springboot, lettuce] io.lettuce.core.RedisCommandTimeoutException: Command timed out after

zzhongcy · published 2021-07-20 15:49:20

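Our environment runs Spring Boot 2.3.1, and the project has been in production for two years. Today we hit Lettuce's Redis "Command timed out" error for the first time, so I searched online for the causes and possible fixes.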

GitHub issues:

https://github.com/lettuce-io/lettuce-core/issues?q=is%3Aissue+Command+timed+out+after

https://github.com/lettuce-io/lettuce-core/issues/1362

Official reference:

https://lettuce.io/core/release/reference/index.html#faq.timeout

RedisCommandTimeoutException with a stack trace like:

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 minute(s)
    at io.lettuce.core.ExceptionFactory.createTimeoutException(ExceptionFactory.java:51)
    at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:114)
    at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:69)
    at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
    at com.sun.proxy.$Proxy94.set(Unknown Source)

Diagnosis:

- Check the debug log (log level DEBUG or TRACE for the logger io.lettuce.core.protocol)
- Take a Thread dump to investigate Thread activity

Cause:

Command timeouts are caused by the fact that a command was not completed within the configured timeout. Timeouts may be caused for various reasons:

- Redis server has crashed/network partition happened and your Redis service didn’t recover within the configured timeout
- Command was not finished in time. This can happen if your Redis server is overloaded or if the connection is blocked by a command (e.g. BLPOP 0, long-running Lua script). See also blpop(Duration.ZERO, …) gives RedisCommandTimeoutException.
- Configured timeout does not match Redis’s performance.
- If you block the EventLoop (e.g. calling blocking methods in a RedisFuture callback or in a Reactive pipeline). That can easily happen when calling Redis commands in a Pub/Sub listener or a RedisConnectionStateListener.

Action:

Check for the causes above. If the configured timeout does not match your Redis latency characteristics, consider increasing the timeout. Never block the EventLoop from your code.
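
The advice never to block the EventLoop can be illustrated with a small sketch. It relies only on the fact that Lettuce's RedisFuture is a CompletionStage; here a plain CompletableFuture and a thread named fake-event-loop stand in for the real reply and the Netty event loop, and the class and method names are made up for illustration. The idea is to attach potentially blocking callbacks with an ...Async(…, executor) variant so that they run on a worker pool rather than on the thread that completes the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class OffloadCallbackSketch {

    /**
     * Attaches a potentially blocking callback to a reply future so that it runs on
     * {@code workers} instead of on the thread that completes the future.
     * Returns the name of the thread that ran the callback.
     */
    static String runCallbackOffEventLoop(CompletableFuture<String> reply, ExecutorService workers) {
        CompletableFuture<String> callbackThread = reply.thenApplyAsync(value -> {
            // Blocking work (for example a synchronous Redis call) would go here.
            return Thread.currentThread().getName();
        }, workers);
        return callbackThread.join();
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newSingleThreadExecutor(r -> new Thread(r, "worker-1"));
        CompletableFuture<String> reply = new CompletableFuture<>();
        // Hypothetical stand-in for the event loop thread delivering the Redis reply.
        new Thread(() -> reply.complete("OK"), "fake-event-loop").start();
        System.out.println(runCallbackOffEventLoop(reply, workers)); // prints "worker-1"
        workers.shutdown();
    }
}
```

With thenApply instead of thenApplyAsync, the callback could run on the completing thread, which in a real application would be the event loop itself.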

Here are the possible fixes.

1. Adjust spring.redis.timeout and cluster.refresh (resolves most cases)

Environment: spring-boot-starter 2.x and spring-boot-starter-data-redis 2.x

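When calling

connection.bRPop(timeout, rawKey);

with a timeout greater than spring.redis.timeout in the Spring Boot configuration, the exception io.lettuce.core.RedisCommandTimeoutException: Command timed out after is thrown.

So the fix is simply to keep the bRPop timeout below the command timeout configured in spring.redis.timeout.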
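
As a minimal sketch of that rule, the helper below caps a requested BRPOP timeout so it always stays under the client command timeout. The class name, the method name, and the 5-second safety margin are assumptions made for illustration; only the relationship between the two timeouts comes from the text above. A requested value of 0 is also capped, because BRPOP treats 0 as "block forever".

```java
import java.time.Duration;

final class BlockingPopTimeout {

    /**
     * Returns a BRPOP timeout in whole seconds that stays below the client command
     * timeout by at least {@code margin}.
     */
    static int clampSeconds(int requestedSeconds, Duration commandTimeout, Duration margin) {
        if (requestedSeconds < 0) {
            throw new IllegalArgumentException("timeout must not be negative");
        }
        // Largest whole number of seconds that still leaves the safety margin.
        long ceiling = Math.min(commandTimeout.minus(margin).getSeconds(), Integer.MAX_VALUE);
        if (ceiling < 1) {
            throw new IllegalArgumentException("command timeout too short for a blocking pop");
        }
        // A timeout of 0 means "block forever" for BRPOP, which always outlasts the command timeout.
        if (requestedSeconds == 0 || requestedSeconds > ceiling) {
            return (int) ceiling;
        }
        return requestedSeconds;
    }

    public static void main(String[] args) {
        Duration commandTimeout = Duration.ofMillis(50_000); // matches spring.redis.timeout=50000ms
        Duration margin = Duration.ofSeconds(5);             // assumed safety margin
        System.out.println(clampSeconds(60, commandTimeout, margin)); // prints 45
        System.out.println(clampSeconds(10, commandTimeout, margin)); // prints 10
    }
}
```

The returned value would then be passed as the timeout argument of connection.bRPop(...).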

This is only the first step; observe whether it actually helps.

# Since version 2.0, Spring Boot uses Lettuce instead of Jedis as the default Redis client
spring.redis.lettuce.pool.max-active=8
spring.redis.lettuce.pool.max-wait=-1ms
spring.redis.lettuce.pool.max-idle=5
spring.redis.lettuce.pool.min-idle=0
spring.redis.timeout=50000ms

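Spring Boot 2.3.0 and later also support automatic cluster topology refresh. Adding the configuration below lets Lettuce refresh its view of the Redis cluster and its connections:

# Enable adaptive cluster topology refresh, plus a periodic refresh every 600 seconds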
spring.redis.lettuce.cluster.refresh.adaptive=true
spring.redis.lettuce.cluster.refresh.period=600000

Adjust these two values as appropriate, then test and observe to verify the effect.

2. Change tcp-keepalive in redis.conf

Set the Spring Boot Redis timeout lower than the timeout in redis.conf. For example, if the Spring Boot Redis timeout is 30 seconds, set the Redis server timeout to 40 seconds. Also change tcp-keepalive in redis.conf to 10:

# Unix socket.
#
# Specify the path for the Unix socket that will be used to listen for
# incoming connections. There is no default, so Redis will not listen
# on a unix socket when not specified.
#
# unixsocket /tmp/redis.sock
# unixsocketperm 700
 
# Close the connection after a client is idle for N seconds (0 to disable)
timeout 40
 
# TCP keepalive.
#
# If non-zero, use SO_KEEPALIVE to send TCP ACKs to clients in absence
# of communication. This is useful for two reasons:
#
# 1) Detect dead peers.
# 2) Take the connection alive from the point of view of network
#    equipment in the middle.
#
# On Linux, the specified value (in seconds) is the period used to send ACKs.
# Note that to close the connection the double of the time is needed.
# On other kernels the period depends on the kernel configuration.
#
# A reasonable value for this option is 300 seconds, which is the new
# Redis default starting with Redis 3.2.1.
tcp-keepalive 10

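Change tcp-keepalive to a value between roughly 1 and 50, for example 10. As far as I recall it was 0 or 300 before; a smaller value is all that is needed. For the underlying mechanism, search online or read the comments in the configuration file. Remember to restart Redis after the change.

The Spring Boot version here is 2.1 and Redis is 5. According to the redis.conf comment above, the default has been 300 since Redis 3.2.1, so a value between 1 and 20 is suggested to reduce stalls and errors.

For production, this approach still has drawbacks. For example, Tencent Cloud Redis does not allow tcp-keepalive to be changed. I think the real fix lies in the client's reconnection and liveness detection; with many applications, heartbeats alone already consume a fair amount of bandwidth.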

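3. Replace Lettuce with Jedis

Switch from Lettuce to Jedis. Wireshark captures show that Jedis sends heartbeat packets by default while Lettuce does not, so Jedis keeps the connection alive. The dependency changes are as follows: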

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
    <exclusions>
        <exclusion>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
        </exclusion>
        <exclusion>
            <artifactId>lettuce-core</artifactId>
            <groupId>io.lettuce</groupId>
        </exclusion>
    </exclusions>
</dependency>
<!-- Jedis client -->
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
</dependency>
<!-- commons-pool2 is needed for Redis integration in Spring Boot 2.x; it is required when using Jedis -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-pool2</artifactId>
</dependency>

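4. Add a heartbeat

Write your own scheduled task that polls Redis periodically to simulate a heartbeat. There is nothing wrong with this approach; it just means writing the code yourself: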

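// Runs every 10 seconds; reading any key keeps the connection from going idle.
// Requires @EnableScheduling on a configuration class.
@Scheduled(cron = "0/10 * * * * *")
public void timer() {
    redisTemplate.opsForValue().get("heartbeat");
}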
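
Outside Spring, the same heartbeat idea could be sketched with the JDK's ScheduledExecutorService. This is only an illustration: the class name, the method name, and the ping parameter are hypothetical, and ping stands for whatever cheap Redis call is used as the heartbeat. A ScheduledExecutorService stops rescheduling a task once it throws, so the sketch catches failures (such as a command timeout) to keep the heartbeat running.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class HeartbeatSketch {

    /** Starts a daemon thread that runs {@code ping} repeatedly with a fixed delay between runs. */
    static ScheduledExecutorService start(Runnable ping, long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "redis-heartbeat");
            t.setDaemon(true); // do not keep the JVM alive just for the heartbeat
            return t;
        });
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                ping.run();
            } catch (RuntimeException e) {
                // Swallow the failure; an uncaught exception would cancel all future runs.
                System.err.println("heartbeat failed: " + e);
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // The ping here only prints; a real one would issue a cheap Redis command.
        ScheduledExecutorService scheduler = start(() -> System.out.println("ping"), 100);
        Thread.sleep(350);
        scheduler.shutdownNow();
    }
}
```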

5. Upgrade the server hardware

Server hardware may also be the cause and should not be ruled out.

6. Redis blocking issues

Trim the data volume and optimize the size of Redis keys and values to reduce the time each command takes.

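Unreasonable use of Redis data structures or APIs, which can leave large objects that are then hit with high-complexity commands
- Running HGETALL or DEL on a hash with tens of millions of elements, or similar operations, will block Redis.
- Such large objects can be found with redis-cli -h {host} -p {port} --bigkeys. However, this command reports only the single largest key of each type. To find more than one, you can modify the redis-cli source code (Redis is written in C) or, if you would rather not, use SCAN instead.
- Be careful with SCAN: it only scans the data on a single Redis instance, so in a cluster it has to be run on every node. Some open-source clients (for example the Lettuce client for Java) already implement SCAN so that it covers the whole cluster.
- Then split the large objects. How to split them depends on the business logic.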
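
How to split a large object depends on the business, but one common approach for a big hash can be sketched as follows: spread its fields over a fixed number of smaller hashes, choosing the bucket from the field's hash code. The class name, the key naming scheme (baseKey:index), and the bucket count of 16 are assumptions made for illustration.

```java
final class BigHashSplitter {

    /** Maps a field of one logical hash to one of {@code buckets} smaller physical hashes. */
    static String bucketKey(String baseKey, String field, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(field.hashCode(), buckets);
        return baseKey + ":" + index;
    }

    public static void main(String[] args) {
        // "a".hashCode() is 97, and 97 mod 16 is 1.
        System.out.println(bucketKey("user:profile", "a", 16)); // prints user:profile:1
    }
}
```

Reads and writes for a field then go to bucketKey(...) instead of the original key, so a command such as HGETALL only ever touches one bucket at a time.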
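Redis CPU usage close to 100%
- A replica is synchronizing data from the master: after receiving the RDB file, the replica loads the data from disk.
- The master or replica is persisting data.
- CPU usage at 100% may simply reflect genuinely heavy traffic, with a single Redis instance handling 60,000+ requests per second. In that case the only option is to scale horizontally.
- If Redis is handling only a few hundred or a few thousand operations per second and CPU usage is still high, high-complexity commands such as HGETALL are probably in use. Another possibility is over-aggressive memory optimization. I have not run into this yet, but it is worth keeping in mind.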
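CPU contention
- Redis is a CPU-intensive application and should not be deployed alongside other CPU-intensive services.
- In production, each of our servers has 32 logical CPU cores and 256 GB of memory. Running a single Redis instance per machine would be wasteful, so several instances may share one machine, usually with each Redis process bound to a CPU. However, generating an RDB file or performing AOF persistence forks a child process, which then competes with the parent process for CPU. So binding to a CPU is not recommended for nodes with persistence enabled or for master nodes.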
Memory swapping
- Redis is an in-memory database that keeps all of its data in memory, so it is strongly recommended not to enable swap.
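Network problems
- If network latency between master and replica is high, the replica will frequently disconnect and reconnect. If it stays disconnected for long, it performs a full resynchronization when it reconnects, which affects both the master and the replica.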