INFO: rcu_sched detected stalls on CPUs/tasks



MIPSA · Published 2021-07-01 19:46:15

Environment:

ARCH: ARM
Kernel: 4.4.189

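Kernel message:

[Figure: screenshot of the kernel log containing the stall warning]
The screenshot shows the message "INFO: rcu_sched detected stalls on CPUs/tasks", which is an RCU-related warning.
Line 2: the number 4 identifies the stalled CPU, CPU 4.
Line 3: "detected by 0" means that CPU 0 detected the stall.
Line 5: 18322 and 18321 are the PID of the current task and the PID of its parent, respectively.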
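
When many such reports have to be triaged, the "detected by" line can be pulled apart mechanically. The following is a minimal sketch in user-space C, assuming the line has the shape "(detected by <cpu>, t=<n> jiffies, ...)" that 4.4-era kernels print. The struct `stall_info` and the function `parse_detected_by()` are hypothetical names made up for this illustration, not kernel APIs, and the fields after the jiffies count are deliberately ignored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields taken from the "(detected by ...)" line of a stall report. */
struct stall_info {
    int  detecting_cpu;  /* CPU that noticed the stall */
    long stall_jiffies;  /* the t= value: jiffies since the grace period began */
};

/*
 * Parse a line containing "(detected by <cpu>, t=<n> jiffies".
 * Returns 0 on success, -1 if the line does not have that shape.
 */
int parse_detected_by(const char *line, struct stall_info *out)
{
    assert(line != NULL && out != NULL);

    /* Skip any leading tab or log prefix and find the marker. */
    const char *p = strstr(line, "(detected by ");
    if (p == NULL)
        return -1;

    /* Both numeric fields must be converted for the parse to count. */
    if (sscanf(p, "(detected by %d, t=%ld jiffies",
               &out->detecting_cpu, &out->stall_jiffies) != 2)
        return -1;

    return 0;
}
```

Feeding it the third line of a report fills in the detecting CPU (0 in the log discussed above); any other line of the report makes it return -1, so a caller can run it over every line and keep only the matches.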

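About RCU

RCU (Read-Copy Update) is a synchronization mechanism that has been part of the Linux kernel since the 2.6 series. Unlike spinlock and rwlock, it is designed specifically for read-mostly workloads.

RCU is named after how it works. "Read" means that readers access RCU-protected shared data directly, without taking any lock. "Copy Update" means that a writer does not modify the data in place: it makes a copy, modifies the copy, and then publishes the result by repointing the shared pointer from the old data to the modified copy. Readers that were already inside their critical sections may still hold the old pointer, so the writer registers a callback with the reclaimer (the garbage collector), and the old data is freed only once every pre-existing reader has left its critical section. The interval spent waiting for that moment is called the grace period. Under RCU a writer never contends with readers for a lock; only when there are multiple writers do they need some lock to synchronize with one another. Frequent writes degrade RCU performance badly, which is why RCU suits only read-mostly workloads.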
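
The copy, publish, wait, reclaim sequence can be sketched in user-space C. This is a toy model under stated assumptions, not the kernel implementation: every name in it (`toy_read_lock()`, `toy_update()`, `toy_reclaim()`, `struct config`) is hypothetical, a single writer is assumed, and the grace period is approximated by a shared reader counter. Real RCU readers do not write to shared state at all, and a real grace period waits only for readers that started before the update. In kernel code the corresponding steps use `rcu_read_lock()`/`rcu_read_unlock()`, `rcu_dereference()`, `rcu_assign_pointer()` and `call_rcu()` or `synchronize_rcu()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct config { int value; };

static _Atomic(struct config *) current_cfg; /* the protected pointer */
static atomic_int active_readers;            /* toy stand-in for grace-period tracking */
static struct config *pending_free;          /* old version awaiting reclamation */

/* Reader: enter the critical section and fetch the current version. */
const struct config *toy_read_lock(void)
{
    atomic_fetch_add(&active_readers, 1);
    return atomic_load(&current_cfg);
}

/* Reader: leave the critical section; the fetched pointer must not be used afterwards. */
void toy_read_unlock(void)
{
    int prev = atomic_fetch_sub(&active_readers, 1);
    assert(prev > 0);
    (void)prev;
}

/*
 * Writer (single writer assumed): copy, modify the copy, publish it,
 * and defer freeing the old version. Returns false if allocation fails
 * or a previous old version is still waiting to be reclaimed.
 */
bool toy_update(int new_value)
{
    if (pending_free != NULL)
        return false;                 /* toy limit: one deferred free at a time */

    struct config *old = atomic_load(&current_cfg);
    struct config *copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return false;

    if (old != NULL)
        *copy = *old;                 /* Copy */
    copy->value = new_value;          /* Update the private copy */
    atomic_store(&current_cfg, copy); /* Publish: new readers now see the copy */
    pending_free = old;               /* the old version is freed later */
    return true;
}

/*
 * Reclaimer: free the old version once no reader remains.
 * Returns true when nothing is left pending, false while the
 * (toy) grace period is still in progress.
 */
bool toy_reclaim(void)
{
    if (pending_free == NULL)
        return true;
    if (atomic_load(&active_readers) != 0)
        return false;
    free(pending_free);
    pending_free = NULL;
    return true;
}
```

A reader that called `toy_read_lock()` before `toy_update()` keeps seeing the old value until it calls `toy_read_unlock()`, and `toy_reclaim()` returns false for exactly that long. A reader that never leaves its critical section keeps the grace period open forever, which is the situation the stall warning reports.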

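Tracing through the kernel source shows how the warning is triggered:
[Figure: code path in the kernel source that triggers the stall warning]

Causes of stalls

See Documentation/RCU/stallwarn.txt in the kernel source tree: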
The following problems can result in RCU CPU stall warnings:
o A CPU looping in an RCU read-side critical section.

o A CPU looping with interrupts disabled. This condition can result in RCU-sched and RCU-bh stalls.

o A CPU looping with preemption disabled. This condition can result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh stalls.

o A CPU looping with bottom halves disabled. This condition can result in RCU-sched and RCU-bh stalls.

o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel without invoking schedule(). Note that cond_resched() does not necessarily prevent RCU CPU stall warnings. Therefore, if the looping in the kernel is really expected and desirable behavior, you might need to replace some of the cond_resched() calls with calls to cond_resched_rcu_qs().

o Booting Linux using a console connection that is too slow to keep up with the boot-time console-message rate. For example, a 115Kbaud serial console can be -way- too slow to keep up with boot-time message rates, and will frequently result in RCU CPU stall warning messages. Especially if you have added debug printk()s.

o Anything that prevents RCU's grace-period kthreads from running. This can result in the "All QSes seen" console-log message. This message will include information on when the kthread last ran and how often it should be expected to run.

o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might happen to preempt a low-priority task in the middle of an RCU read-side critical section. This is especially damaging if that low-priority task is not permitted to run on any other CPU, in which case the next RCU grace period can never complete, which will eventually cause the system to run out of memory and hang. While the system is in the process of running itself out of memory, you might see stall-warning messages.

o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that is running at a higher priority than the RCU softirq threads. This will prevent RCU callbacks from ever being invoked, and in a CONFIG_PREEMPT_RCU kernel will further prevent RCU grace periods from ever completing. Either way, the system will eventually run out of memory and hang. In the CONFIG_PREEMPT_RCU case, you might see stall-warning messages.

o A hardware or software issue shuts off the scheduler-clock interrupt on a CPU that is not in dyntick-idle mode. This problem really has happened, and seems to be most likely to result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.

o A bug in the RCU implementation.

o A hardware failure. This is quite unlikely, but has occurred at least once in real life. A CPU failed in a running system, becoming unresponsive, but not causing an immediate crash. This resulted in a series of RCU CPU stall warnings, eventually leading to the realization that the CPU had failed.