SLAVEOF not allowed in cluster mode: the SLAVEOF command is rejected when Redis runs in cluster mode.
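In cluster mode the replication topology is managed through cluster commands, so the server rejects SLAVEOF outright; CLUSTER REPLICATE is the cluster-mode counterpart. A quick illustration (the port and node ID are placeholders):

127.0.0.1:7000> SLAVEOF 127.0.0.1 7001
(error) ERR SLAVEOF not allowed in cluster mode
127.0.0.1:7000> CLUSTER REPLICATE <master-node-id>
OK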

Master-slave failover logs

1:M 01 Dec 14:44:51.799 # Setting secondary replication ID to 63e58e3dc3d0d1fc4607a519e5a20383eeb70219, valid up to offset: 3823202378426. New replication ID is a8c16c15a0db9f55370096cb7460edba449d3b24

# 63e58e3dc3d0d1fc4607a519e5a20383eeb70219 is the replication ID of the original master before the switch
# 3823202378426 is the offset up to which the replication history under that ID is valid
# a8c16c15a0db9f55370096cb7460edba449d3b24 is the new master's replication ID
# After connecting to the new master, a replica sends the old master's replication ID and its own offset,
# and the new master uses them to decide whether a partial (incremental) resync is possible
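These IDs and offsets can be inspected at runtime with INFO replication. A sketch of the relevant fields on the promoted master, reusing the IDs from the log above (the host and the exact offset values are illustrative):

$ redis-cli -h <new-master> INFO replication
# Replication
role:master
master_replid:a8c16c15a0db9f55370096cb7460edba449d3b24     # new replication ID
master_replid2:63e58e3dc3d0d1fc4607a519e5a20383eeb70219    # old master's ID, kept to allow partial resyncs
second_repl_offset:3823202378426                           # history under master_replid2 is valid up to here
master_repl_offset:3823202378426                           # current offset, now advancing under the new ID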

1:M 01 Dec 14:44:51.799 # Connection with master lost.
1:M 01 Dec 14:44:51.800 * Caching the disconnected master state.

# The cached master state is discarded once a partial resync against it is no longer possible
1:M 01 Dec 14:44:51.800 * Discarding previously cached master state.

# Promoted by Sentinel: the node is told to stop replicating and become a master (Sentinel also disconnects the node's normal client connections so clients rediscover the topology; the request arrives over the sentinel's command connection, wrapped in MULTI/EXEC, hence flags=x and cmd=exec)
1:M 01 Dec 14:44:51.802 * MASTER MODE enabled (user request from 'id=36 addr=10.172.52.120:43644 fd=11 name=sentinel-33252760-cmd age=9084208 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')

# Replication is turned off and the node becomes a master; the configuration file is rewritten to persist the new role (slaveof no one)
1:M 01 Dec 14:44:51.802 # CONFIG REWRITE executed with success.

# After 3600 seconds without any connected replicas, the replication backlog is freed
1:M 01 Dec 14:44:52.430 * Replication backlog freed after 3600 seconds without connected slaves.
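That 3600-second window is the backlog TTL; both knobs are set in redis.conf (the values shown are the stock defaults):

repl-backlog-size 1mb    # ring buffer replicas read from during a partial resync
repl-backlog-ttl 3600    # seconds without connected replicas before the backlog is freed (0 = never free)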

# A replica asks to synchronize
1:M 01 Dec 14:44:53.036 * Slave 10.172.33.73:70001 asks for synchronization

# The partial resync is rejected: before the switch, 10.172.33.73 had replicated up to offset 3823202409576, but the new master's history is only valid up to 3823202378426. The replica is 31,150 bytes ahead of anything the new master ever received, so the histories diverge and a full resync is required
1:M 01 Dec 14:44:53.037 * Partial resynchronization not accepted: Requested offset for second ID was 3823202409576, but I can reply up to 3823202378426

# Start the RDB-based full sync
1:M 01 Dec 14:44:53.037 * Starting BGSAVE for SYNC with target: disk
1:M 01 Dec 14:44:53.054 * Background saving started by pid 19866
19866:C 01 Dec 14:44:54.857 * DB saved on disk
19866:C 01 Dec 14:44:54.867 * RDB: 29 MB of memory used by copy-on-write
1:M 01 Dec 14:44:54.966 * Background saving terminated with success

# RDB persistence is complete; the snapshot is sent over the network to the .73 node
1:M 01 Dec 14:44:55.330 * Synchronization with slave 10.172.33.73:70001 succeeded
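The whole sequence above was driven by Sentinel (note the sentinel-33252760-cmd client name). The same promotion can also be triggered on demand against a monitored master; the master name 'mymaster' and the sentinel host here are placeholders:

$ redis-cli -h <sentinel-host> -p 26379 SENTINEL failover mymaster
OK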

AOF rewrite logs

AOF write path:

If no rewrite is running:     buf --> aof_buf --(event loop start: flushAppendOnlyFile)--> AOF file

If a rewrite is in progress:  buf --> aof_buf --(event loop start: flushAppendOnlyFile)--> AOF file
                              buf --> aof_rewrite_buf_blocks --> merged into the new AOF once the rewrite finishes
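For reference, the redis.conf directives behind this path and behind the rewrite trigger seen in the log below (the values are the usual defaults):

appendonly yes                     # enable the AOF
appendfsync everysec               # fsync policy applied by flushAppendOnlyFile: always / everysec / no
auto-aof-rewrite-percentage 100    # rewrite once the AOF doubles relative to its size after the last rewrite
auto-aof-rewrite-min-size 64mb     # but never rewrite below this size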

17:S 26 Nov 10:20:17.340 * Starting automatic rewriting of AOF on 100% growth
# The AOF has grown 100% (doubled) since the last rewrite, triggering an automatic rewrite
17:S 26 Nov 10:20:17.551 * Background append only file rewriting started by pid 87134

# While a rewrite is running, incoming writes are also appended, in blocks, to aof_rewrite_buf; the lines below report the accumulated size of that rewrite buffer
17:M 26 Nov 10:32:58.086 * Background AOF buffer size: 71 MB
17:M 26 Nov 10:33:33.751 * Background AOF buffer size: 176 MB
17:M 26 Nov 10:34:07.827 * Background AOF buffer size: 270 MB
17:M 26 Nov 10:34:11.126 * Background AOF buffer size: 280 MB
17:M 26 Nov 10:34:50.452 * Background AOF buffer size: 370 MB

# A child process (pid 87134 above) has been forked to perform the AOF rewrite in the background
17:S 26 Nov 10:20:38.772 * AOF rewrite child asks to stop sending diffs.
# The rewrite child asks the parent to stop streaming diff data to it; any further diffs stay in the parent's rewrite buffer
87134:C 26 Nov 10:20:38.772 * Parent agreed to stop sending diffs. Finalizing AOF...
# The parent agrees to stop sending diffs; the child finalizes the new AOF
87134:C 26 Nov 10:20:38.773 * Concatenating 188.33 MB of AOF diff received from parent.
# The child appends the 188.33 MB of diff data it received from the parent to the rewritten AOF temp file
87134:C 26 Nov 10:20:39.699 * SYNC append only file rewrite performed
# The temporary file is flushed to disk with fsync
87134:C 26 Nov 10:20:39.792 * AOF rewrite: 1227 MB of memory used by copy-on-write
# The rewrite cost 1227 MB of copy-on-write memory: pages the parent modified while the child was iterating the dataset had to be duplicated (fork gives the child private, not shared, copies of written pages)
17:S 26 Nov 10:20:40.004 * Background AOF rewrite terminated with success
# The background rewrite child finished successfully
17:S 26 Nov 10:20:40.009 * Residual parent diff successfully flushed to the rewritten AOF (5.38 MB)
# The residual diff still held by the parent is flushed into the rewritten AOF (5.38 MB)
17:S 26 Nov 10:20:40.009 * Background AOF rewrite finished successfully
# The whole AOF rewrite is now complete
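Rewrite status can also be tracked from INFO persistence; a few of the relevant fields (output abbreviated):

$ redis-cli INFO persistence | grep aof_
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_bgrewrite_status:ok
aof_last_write_status:ok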

Cluster node failure logs (source: 深入剖析Redis集群故障, 爱站技术频道)

xx.x.xxx.200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master
xx.x.xxx.199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
xx.x.xxx.201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master
xx.x.xxx.201:8372(826607654f5ec81c3756a4a21f357e644efe605a) slave
xx.x.xxx.199:8370(462cadcb41e635d460425430d318f2fe464665c5) master
xx.x.xxx.200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave
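This ID-to-address mapping is what CLUSTER NODES prints. A trimmed, illustrative sample for two of the nodes (the epochs, timestamps, and slot ranges are made up):

$ redis-cli -h xx.x.xxx.200 -p 8371 CLUSTER NODES
bedab2c537fe94f8c0363ac4ae97d56832316e65 xx.x.xxx.200:8371 myself,master - 0 0 14 connected 0-5460
792020fe66c00ae56e27cd7a048ba6bb2b67adb6 xx.x.xxx.199:8373 slave bedab2c537fe94f8c0363ac4ae97d56832316e65 0 1473418600000 14 connected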

Step 1:
Master 8371 loses its connection to replica 8373:
46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.

Step 2:
Masters 8370/8375 mark 8371 as failing (quorum reached):
42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).

Step 3:
Replicas 8372/8373/8374 receive a FAIL message about 8371 from master 8375:
46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65

Step 4:
Masters 8370/8375 grant failover authorization, allowing 8373 to be promoted to master:
42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16

Step 5:
The old master 8371 updates its own configuration and becomes a replica of 8373:
46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6

Step 6:
Masters 8370/8375/8373 clear the FAIL state for 8371 (a master without slots is reachable again):
42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.

Step 7:
The new replica 8371 starts its first full resync from the new master 8373:
8373's log:
4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230
8371's log:
46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993

Step 8:
Masters 8370/8375 mark 8373 (the new master) as failing:
42645:M 09 Sep 18:58:00.320 * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).

Step 9:
Masters 8370/8375 decide that 8373 (the new master) is reachable again:
60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.

Step 10:
Master 8373 completes the BGSAVE needed for the full resync:
5230:C 09 Sep 18:59:01.474 * DB saved on disk
5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write
4255:M 09 Sep 18:59:01.877 * Background saving terminated with success

Step 11:
Replica 8371 starts receiving data from master 8373:
46590:S 09 Sep 18:59:02.263 * MASTER <-> SLAVE sync: receiving 2657606930 bytes from master

Step 12:
Master 8373 detects that the replica client for 8371 has overrun the output buffer limit and schedules the connection to be closed:
4255:M 09 Sep 19:00:19.014 # Client id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
4255:M 09 Sep 19:00:19.015 # Connection with slave xx.x.xxx.200:8371 lost.
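The disconnect comes from the master's client-output-buffer-limit for replica clients. With the stock defaults below, a replica client is dropped when it exceeds the 256 MB hard limit, or stays above the 64 MB soft limit for 60 seconds; the omem=95944066 (~91 MB) with age=148 in the log is consistent with the soft limit tripping:

client-output-buffer-limit slave 256mb 64mb 60
#                          class hard  soft soft-seconds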

Step 13:
Replica 8371's sync from master 8373 fails with a broken connection; the first full resync has failed:
46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost
46590:S 09 Sep 19:00:20.102 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:20.102 * MASTER <-> SLAVE sync started

Step 14:
Replica 8371 tries to resync but cannot connect: master 8373 has hit its client connection limit:
46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'
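The '-ERR max number of clients reached' reply corresponds to the maxclients setting, which can be inspected and (resources permitting) raised at runtime; 10000 is the usual default, capped by the process's open-file limit:

$ redis-cli -h xx.x.xxx.199 -p 8373 CONFIG GET maxclients
1) "maxclients"
2) "10000"
$ redis-cli -h xx.x.xxx.199 -p 8373 CONFIG SET maxclients 20000
OK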

Step 15:
Replica 8371 reconnects to master 8373 and the second full resync begins:
8371's log:
46590:S 09 Sep 19:00:49.175 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:49.175 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication can continue...
46590:S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)
46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
8373's log:
4255:M 09 Sep 19:00:49.176 * Slave xx.x.xxx.200:8371 asks for synchronization
4255:M 09 Sep 19:00:49.176 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413
18413:C 09 Sep 19:01:52.466 * DB saved on disk
18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
4255:M 09 Sep 19:01:53.186 * Background saving terminated with success

Step 16:
Replica 8371 receives the data successfully, flushes its old data, and starts loading the new dataset into memory:
46590:S 09 Sep 19:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master
46590:S 09 Sep 19:04:51.485 * MASTER <-> SLAVE sync: Flushing old data
46590:S 09 Sep 19:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory

Step 17:
The cluster returns to normal:
42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.

Step 18:
Replica 8371 finishes synchronizing successfully; the whole process took about 7 minutes:
46590:S 09 Sep 19:08:19.303 * MASTER <-> SLAVE sync: Finished with success

Root-cause analysis of 8371's loss of contact:

The machines sit in the same rack, so a network partition was unlikely. Checking the slow query log with SLOWLOG GET revealed a KEYS command that had taken 8.3 seconds, while the cluster node timeout was only 5 seconds (cluster-node-timeout 5000).
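A hypothetical slow log entry matching that finding (field 3 is the execution time in microseconds, so 8300000 is 8.3 s; the entry id, timestamp, and KEYS pattern are made up):

xx.x.xxx.200:8371> SLOWLOG GET 1
1) 1) (integer) 4            # entry id
   2) (integer) 1473418663   # unix timestamp: 2016/9/9 18:57:43
   3) (integer) 8300000      # execution time in microseconds
   4) 1) "KEYS"
      2) "*"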

Why the node lost contact:

A client executed a single command that took 8.3 s. Redis processes commands on a single thread, so the KEYS call blocked the event loop for longer than the 5-second cluster-node-timeout, and 8371 could not answer cluster pings until the command returned:

2016/9/9 18:57:43  KEYS command starts executing
2016/9/9 18:57:50  8371 is marked as failing (per the Redis log)
2016/9/9 18:57:51  KEYS command finishes
