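This article examines how the Linux kernel implements task switching on the AArch64 platform. The main debugging tools are gdb and a QEMU virtual machine, and the kernel under study is version 5.3.12. In my day-to-day work I ran into code that operates on mutexes: when a process or kernel thread tries to lock a mutex that is already held, it is put to sleep in a blocked state. That triggers a task switch: the schedule function in kernel/sched/core.c is called, and control passes step by step to another kernel task.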

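1. The 64-bit ARM calling convention: Procedure Call Standard for the ARM 64-bit Architecture
    The standard defines the rules that caller and callee must follow at a procedure call: parameter passing, return values, register preservation, and so on. The latest version of the document is available on GitHub. What matters here is the preservation rule for the general-purpose registers: x19 through x28 are callee-saved, x29 is the frame pointer, and x30 is the link register.

    [Screenshot: general-purpose register table from the standard]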
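
    As a rough summary of that register table, the roles can be written down as a small lookup. This is a sketch based on my reading of AAPCS64, not a quotation from it; the function names register_role and registers_kept_across_switch are made up for illustration.

```python
def register_role(n):
    """Return the AAPCS64 role of general-purpose register x<n> (0 <= n <= 30)."""
    if not 0 <= n <= 30:
        raise ValueError("x registers are numbered 0..30")
    if n <= 7:
        return "argument/result"
    if n == 8:
        return "indirect result location"
    if n <= 15:
        return "caller-saved temporary"
    if n <= 17:
        return "intra-procedure-call scratch (IP0/IP1)"
    if n == 18:
        return "platform register"
    if n <= 28:
        return "callee-saved"
    return "frame pointer" if n == 29 else "link register"


def registers_kept_across_switch():
    """Registers a called function has to hand back intact, plus sp.

    Assumption: this mirrors what the context switch stores per task;
    x30 is included because it holds the address the task resumes at.
    """
    keep = ("callee-saved", "frame pointer", "link register")
    return [f"x{n}" for n in range(31) if register_role(n) in keep] + ["sp"]
```

    The resulting list has 13 entries, which is why a per-task save area of 13 64-bit slots is enough for a voluntary switch.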

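    Why should task switching follow this calling convention? The essential part of a task switch is saving and restoring the context of a process or kernel thread, that is, the context switch; kernel/sched/core.c even defines an inline function named context_switch, which hints at this. A task switch differs from the interrupt handling that kernel driver developers often deal with: an interrupt is asynchronous, can hit the running code at any point, and is imposed on the interrupted code, whereas a task switch is initiated voluntarily and follows a fixed call path. Because the switch is entered through an ordinary function call from C, the compiler already treats every caller-saved register as clobbered, so only the callee-saved registers need to be preserved. The Linux kernel implements the context switch as an assembly function, so it naturally has to obey the Procedure Call Standard described above.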

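2. The core of task switching
    The assembly function cpu_switch_to, defined in arch/arm64/kernel/entry.S, is a concise and exact implementation of the calling convention described above. It first saves x19 through x30 and the stack pointer of the old task, then restores the values the new task saved for the same registers (the disassembly in section 3 shows the full instruction sequence).

    [Figure: source of cpu_switch_to in arch/arm64/kernel/entry.S]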

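    First, THREAD_CPU_CONTEXT is a macro whose header (include/generated/asm-offsets.h) is generated while the kernel is built. Its value is the byte offset of the thread_struct member, more precisely of thread.cpu_context, within task_struct (see the definition of task_struct in include/linux/sched.h). Adding x10 to x0, which holds the pointer to the old task, therefore leaves x8 pointing at the AArch64 thread_struct of that task.

    [Figure: definition of struct thread_struct for AArch64]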
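
    To make the address arithmetic concrete, the save area can be mirrored with ctypes. This is only a sketch: the field order (x19 to x28, fp, sp, pc) is my reading of struct cpu_context and of the stp/str sequence, and 0x7d0 is the value THREAD_CPU_CONTEXT happens to have in the build disassembled in section 3; other configurations will give a different number. The helper names are made up.

```python
import ctypes


class CpuContext(ctypes.Structure):
    """Mirror of the per-task save area filled by cpu_switch_to (assumed layout)."""
    _fields_ = [(f"x{n}", ctypes.c_uint64) for n in range(19, 29)] + [
        ("fp", ctypes.c_uint64),  # x29
        ("sp", ctypes.c_uint64),  # stored through the scratch register x9
        ("pc", ctypes.c_uint64),  # x30, the address the task resumes at
    ]


# Build-specific value taken from `mov x10, #0x7d0` in the disassembly.
THREAD_CPU_CONTEXT = 0x7D0


def cpu_context_address(task_struct_addr):
    """What `add x8, x0, x10` computes for a given task_struct pointer."""
    return task_struct_addr + THREAD_CPU_CONTEXT


def field_address(task_struct_addr, field):
    """Address of one saved register inside the task's cpu_context."""
    return cpu_context_address(task_struct_addr) + getattr(CpuContext, field).offset
```

    For the prev pointer seen in the debugging session, field_address(0xffffffc00e880c00, "sp") is the location that `stp x29, x9, [x8], #16` writes the stack pointer to.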

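    Since the first member of thread_struct is cpu_context, the stp/str instructions that follow store exactly the general-purpose registers that the 64-bit ARM calling convention requires to be preserved. One question about this code: why copy the stack pointer sp into x9 first, instead of executing stp x29, sp, [x8], #16 directly? The reason is that the instruction cannot be encoded. In the data-register fields of stp, register number 31 denotes the zero register xzr rather than sp, so the assembler rejects the instruction.

    [Figure: the assembler rejecting stp x29, sp, [x8], #16]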
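
    The encoding argument can be checked with a few lines of bit arithmetic. The sketch below encodes the 64-bit post-index form of stp/ldp as I understand it from the Arm Architecture Reference Manual; the function names are hypothetical and nothing here comes from the kernel or the assembler itself.

```python
def encode_pair_post_index(load, rt, rt2, rn, imm):
    """Encode 64-bit `stp/ldp Xt, Xt2, [Xn|SP], #imm` (post-index form)."""
    if imm % 8 or not -512 <= imm <= 504:
        raise ValueError("imm must be a multiple of 8 in [-512, 504]")
    for reg in (rt, rt2, rn):
        if not 0 <= reg <= 31:
            raise ValueError("register fields are 5 bits wide")
    imm7 = (imm // 8) & 0x7F  # signed 7-bit field, scaled by 8
    return 0xA8800000 | (int(load) << 22) | (imm7 << 15) | (rt2 << 10) | (rn << 5) | rt


def operand_names(word):
    """Name the operands of an encoded pair instruction.

    Field value 31 means xzr in the two data fields but sp in the base field,
    so no bit pattern stores sp as data.
    """
    def data(reg):
        return "xzr" if reg == 31 else f"x{reg}"

    rt, rn, rt2 = word & 31, (word >> 5) & 31, (word >> 10) & 31
    base = "sp" if rn == 31 else f"x{rn}"
    return data(rt), data(rt2), base
```

    Encoding the familiar epilogue `ldp x29, x30, [sp], #16` this way gives 0xa8c17bfd. Putting 31 into the second data field, as `stp x29, sp, ...` would need, decodes as xzr, which is why the kernel goes through x9.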

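    The ldp/ldr instructions that follow restore the new task's context into the registers named by the calling convention, just as cleanly. Finally x1 is written to sp_el0, because the Linux kernel uses sp_el0 as a fast and convenient way to obtain the task_struct pointer of the current process or kernel thread.

    [Figure: how the kernel reads the current task pointer from sp_el0]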
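
    Put together, the routine behaves like the toy model below. It is a simulation for illustration only: registers are dictionary entries, the Task class and current helper are invented, and the real function of course transfers control by executing ret with the restored x30.

```python
CALLEE_SAVED = [f"x{n}" for n in range(19, 29)]


class Task:
    """Stand-in for task_struct; cpu_context plays the role of thread.cpu_context."""

    def __init__(self, name):
        self.name = name
        self.cpu_context = {reg: 0 for reg in CALLEE_SAVED + ["fp", "sp", "pc"]}


def cpu_switch_to(cpu, prev, nxt):
    """Toy model of cpu_switch_to; `cpu` maps register names to values."""
    ctx = prev.cpu_context
    for reg in CALLEE_SAVED:                 # stp x19, x20 ... stp x27, x28
        ctx[reg] = cpu[reg]
    ctx["fp"], ctx["sp"], ctx["pc"] = cpu["x29"], cpu["sp"], cpu["x30"]

    ctx = nxt.cpu_context
    for reg in CALLEE_SAVED:                 # ldp x19, x20 ... ldp x27, x28
        cpu[reg] = ctx[reg]
    cpu["x29"], cpu["sp"], cpu["x30"] = ctx["fp"], ctx["sp"], ctx["pc"]

    cpu["sp_el0"] = nxt                      # msr sp_el0, x1
    return prev                              # x0 is never modified


def current(cpu):
    """The kernel's notion of the current task: whatever sp_el0 points at."""
    return cpu["sp_el0"]
```

    After the call, the stack pointer and return address belong to the new task, so the "return" lands wherever that task was when it last called cpu_switch_to.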

3. Debugging a kernel task switch
    With gdb and QEMU the Linux kernel is easy to debug. The debugging session is listed below:

(gdb) info address cpu_switch_to
Symbol "cpu_switch_to" is at 0xffffff801008552c in a file compiled without debugging.
(gdb) disassemble cpu_switch_to
Dump of assembler code for function cpu_switch_to:
   0xffffff801008552c <+0>:     mov     x10, #0x7d0                     // #2000
   0xffffff8010085530 <+4>:     add     x8, x0, x10
   0xffffff8010085534 <+8>:     mov     x9, sp
   0xffffff8010085538 <+12>:    stp     x19, x20, [x8], #16
   0xffffff801008553c <+16>:    stp     x21, x22, [x8], #16
   0xffffff8010085540 <+20>:    stp     x23, x24, [x8], #16
   0xffffff8010085544 <+24>:    stp     x25, x26, [x8], #16
   0xffffff8010085548 <+28>:    stp     x27, x28, [x8], #16
   0xffffff801008554c <+32>:    stp     x29, x9, [x8], #16
   0xffffff8010085550 <+36>:    str     x30, [x8]
   0xffffff8010085554 <+40>:    add     x8, x1, x10
   0xffffff8010085558 <+44>:    ldp     x19, x20, [x8], #16
   0xffffff801008555c <+48>:    ldp     x21, x22, [x8], #16
   0xffffff8010085560 <+52>:    ldp     x23, x24, [x8], #16
   0xffffff8010085564 <+56>:    ldp     x25, x26, [x8], #16
   0xffffff8010085568 <+60>:    ldp     x27, x28, [x8], #16
   0xffffff801008556c <+64>:    ldp     x29, x9, [x8], #16
   0xffffff8010085570 <+68>:    ldr     x30, [x8]
   0xffffff8010085574 <+72>:    mov     sp, x9
   0xffffff8010085578 <+76>:    msr     sp_el0, x1
   0xffffff801008557c <+80>:    ret
End of assembler dump.
(gdb) break *0xffffff801008552c
Breakpoint 1 at 0xffffff801008552c: file arch/arm64/kernel/entry.S, line 1138.
(gdb) break *0xffffff8010085578
Breakpoint 2 at 0xffffff8010085578: file arch/arm64/kernel/entry.S, line 1157.
(gdb) c
Continuing.
[Switching to Thread 1.2]

Thread 2 hit Breakpoint 1, cpu_switch_to () at arch/arm64/kernel/entry.S:1138
1138            mov     x10, #THREAD_CPU_CONTEXT
(gdb) bt
#0  cpu_switch_to () at arch/arm64/kernel/entry.S:1138
#1  0xffffff80100878dc in __switch_to (prev=0xffffffc00e880c00, next=0xffffffc00e83c800) at arch/arm64/kernel/process.c:509
#2  0xffffff80105638f8 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3254
#3  __schedule (preempt=<optimized out>) at kernel/sched/core.c:3921
#4  0xffffff8010563edc in schedule_idle () at kernel/sched/core.c:4016
#5  0xffffff80100d2604 in do_idle () at kernel/sched/idle.c:288
#6  0xffffff80100d27e4 in cpu_startup_entry (state=CPUHP_AP_ONLINE_IDLE) at kernel/sched/idle.c:355
#7  0xffffff80100944e8 in secondary_start_kernel () at arch/arm64/kernel/smp.c:259
#8  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.

Thread 2 hit Breakpoint 2, cpu_switch_to () at arch/arm64/kernel/entry.S:1157
1157            msr     sp_el0, x1
(gdb) info register lr
lr             0xffffff80100878dc  -549486823204
(gdb) break *0xffffff80100878dc
Breakpoint 3 at 0xffffff80100878dc: file arch/arm64/kernel/process.c, line 512.
(gdb) c
Continuing.

Thread 2 hit Breakpoint 3, __switch_to (prev=0xffffffc00e83c800, next=0xffffffc00e880c00) at arch/arm64/kernel/process.c:512
512     }
(gdb) bt
#0  __switch_to (prev=0xffffffc00e83c800, next=0xffffffc00e880c00) at arch/arm64/kernel/process.c:512
#1  0xffffff80105638f8 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3254
#2  __schedule (preempt=<optimized out>) at kernel/sched/core.c:3921
#3  0xffffff8010563b70 in schedule () at kernel/sched/core.c:3988
#4  0xffffff80100bee54 in worker_thread (__worker=0xffffffc00e862000) at kernel/workqueue.c:2436
#5  0xffffff80100c571c in kthread (_create=0xffffffc00e855180) at kernel/kthread.c:255
#6  0xffffff8010085590 in ret_from_fork () at arch/arm64/kernel/entry.S:1169
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) i r cpsr
cpsr           0x60000085          1610612869
(gdb) !bitdump 0x60000085
   0x60000085 (0x60000085): ->
   31    27    23    19    15    11     7     3
    0110  0000  0000  0000  0000  0000  1000  0101
       28    24    20    16    12    8     4     0
   -----------------------------------------------
(gdb) info register SP_EL0
SP_EL0         0xffffffc00e83c800  -274634389504

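    In the session above, the two backtrace (bt) commands print different call chains, which is direct evidence that the task switch succeeded: different tasks usually have different call stacks. The sp_el0 value read at the end, 0xffffffc00e83c800, also matches the next argument of the first __switch_to call. In addition, bit 7 of CPSR (the I bit) is set during the switch, which means IRQs are masked. The context switch therefore cannot be interrupted by an IRQ, which makes this delicate operation safer.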
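
    The bit dump can also be decoded field by field. The following is a small sketch that assumes the value gdb prints as cpsr uses the AArch64 layout (flags in bits 31 to 28, DAIF in bits 9 to 6, exception level in bits 3 to 2, stack selector in bit 0); the function name is my own.

```python
def decode_pstate(value):
    """Decode the AArch64 fields of the value gdb reports as cpsr."""
    def bit(n):
        return (value >> n) & 1

    if bit(4):
        raise ValueError("bit 4 set: AArch32 state is not handled by this sketch")
    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "D": bit(9), "A": bit(8), "I": bit(7), "F": bit(6),
        "EL": (value >> 2) & 3,   # current exception level
        "SPSel": bit(0),          # 1 means the EL's own stack pointer is in use
    }
```

    For 0x60000085 this reports I = 1 and F = 0 at EL1 with SPSel = 1, so IRQs are masked while FIQs are not.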

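4. Floating-point support in the kernel
    In the book Linux Kernel Development, the author lists "No (Easy) Use of Floating Point" among the characteristics of kernel code and advises against adding floating-point arithmetic to the kernel. 32-bit ARM processors evolved over many years, and early ARM cores had no floating-point coprocessor, so the kernel did not support floating-point operations there. On 64-bit ARM processors, by contrast, floating-point hardware is expected to be present. The 64-bit ARM calling convention also specifies which floating-point registers must be preserved across a procedure call (v8 through v15, of which only the low 64 bits have to be kept).

    [Screenshot: floating-point register table from the standard]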

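    On this point the kernel developers chose a straightforward design: on a task switch the Linux kernel saves all of the floating-point registers rather than only the callee-saved subset. The fpsimd_save assembly macro does the work.

    [Figure: the fpsimd_save assembly macro]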
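
    The size of what gets saved can be estimated with ctypes. The layout below is my recollection of struct user_fpsimd_state from the arm64 UAPI headers (32 vector registers of 128 bits, then fpsr and fpcr) and should be treated as an assumption; stp_offsets is a made-up helper that lists where each register pair would land.

```python
import ctypes


class UserFpsimdState(ctypes.Structure):
    """Assumed layout of the FPSIMD save area."""
    _fields_ = [
        ("vregs", (ctypes.c_ubyte * 16) * 32),  # q0..q31, 16 bytes each
        ("fpsr", ctypes.c_uint32),
        ("fpcr", ctypes.c_uint32),
        ("reserved", ctypes.c_uint32 * 2),      # padding in the real struct
    ]


def stp_offsets():
    """(first register index, byte offset) for each of the 16 pair stores."""
    return [(n, 16 * n) for n in range(0, 32, 2)]
```

    Under these assumptions the area is 528 bytes per task, about five times the 104 bytes that the general-purpose context takes.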

    Debugging confirms this:

(gdb) info address fpsimd_save_state
Symbol "fpsimd_save_state" is at 0xffffff8010086e20 in a file compiled without debugging.
(gdb) break *0xffffff8010086e20
Breakpoint 5 at 0xffffff8010086e20: file arch/arm64/kernel/entry-fpsimd.S, line 20.
(gdb) info address fpsimd_load_state
Symbol "fpsimd_load_state" is at 0xffffff8010086e74 in a file compiled without debugging.
(gdb) break *0xffffff8010086e74
Breakpoint 6 at 0xffffff8010086e74: file arch/arm64/kernel/entry-fpsimd.S, line 30.
(gdb) c
Continuing.
[Switching to Thread 1.1]

Thread 1 hit Breakpoint 5, fpsimd_save_state () at arch/arm64/kernel/entry-fpsimd.S:20
20              fpsimd_save x0, 8
(gdb) bt
#0  fpsimd_save_state () at arch/arm64/kernel/entry-fpsimd.S:20
#1  0xffffff80100859d8 in fpsimd_save () at arch/arm64/kernel/fpsimd.c:310
#2  0xffffff8010086264 in fpsimd_thread_switch (next=0xffffffc00e9cbc00) at arch/arm64/kernel/fpsimd.c:991
#3  0xffffff8010087738 in __switch_to (prev=0xffffffc00e9c1800, next=0xffffffc00e9cbc00) at arch/arm64/kernel/process.c:491
#4  0xffffff80105638f8 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3254
#5  __schedule (preempt=<optimized out>) at kernel/sched/core.c:3921
#6  0xffffff8010563b70 in schedule () at kernel/sched/core.c:3988
#7  0xffffff801008c94c in do_notify_resume (regs=0xffffff8010953ec0, thread_flags=2) at arch/arm64/kernel/signal.c:917
#8  0xffffff8010084060 in work_pending () at arch/arm64/kernel/entry.S:979
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.

Thread 1 hit Breakpoint 5, fpsimd_save_state () at arch/arm64/kernel/entry-fpsimd.S:20
20              fpsimd_save x0, 8
(gdb) bt
#0  fpsimd_save_state () at arch/arm64/kernel/entry-fpsimd.S:20
#1  0xffffff80100859d8 in fpsimd_save () at arch/arm64/kernel/fpsimd.c:310
#2  0xffffff8010086264 in fpsimd_thread_switch (next=0xffffff8010750340 <init_task>) at arch/arm64/kernel/fpsimd.c:991
#3  0xffffff8010087738 in __switch_to (prev=0xffffffc00e9c1800, next=0xffffff8010750340 <init_task>) at arch/arm64/kernel/process.c:491
#4  0xffffff80105638f8 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3254
#5  __schedule (preempt=<optimized out>) at kernel/sched/core.c:3921
#6  0xffffff8010563b70 in schedule () at kernel/sched/core.c:3988
#7  0xffffff8010567924 in schedule_hrtimeout_range_clock (expires=<optimized out>, delta=<optimized out>, mode=<optimized out>, clock_id=<optimized out>) at kernel/time/hrtimer.c:1926
#8  0xffffff8010567948 in schedule_hrtimeout_range (expires=<optimized out>, delta=<optimized out>, mode=<optimized out>) at kernel/time/hrtimer.c:1983
#9  0xffffff80101db87c in poll_schedule_timeout (state=<optimized out>, slack=<optimized out>, expires=<optimized out>, pwq=<optimized out>) at fs/select.c:243
#10 do_poll (end_time=<optimized out>, wait=<optimized out>, list=<optimized out>) at fs/select.c:951
#11 do_sys_poll (ufds=<optimized out>, nfds=<optimized out>, end_time=<optimized out>) at fs/select.c:1001
#12 0xffffff80101dc6d4 in __do_sys_ppoll (sigsetsize=<optimized out>, sigmask=<optimized out>, tsp=<optimized out>, nfds=<optimized out>, ufds=<optimized out>) at fs/select.c:1101
#13 __se_sys_ppoll (sigsetsize=<optimized out>, sigmask=<optimized out>, tsp=<optimized out>, nfds=<optimized out>, ufds=<optimized out>) at fs/select.c:1081
#14 __arm64_sys_ppoll (regs=<optimized out>) at fs/select.c:1081
#15 0xffffff80100952e4 in __invoke_syscall (syscall_fn=<optimized out>, regs=<optimized out>) at arch/arm64/kernel/syscall.c:36
#16 invoke_syscall (syscall_table=<optimized out>, sc_nr=<optimized out>, scno=<optimized out>, regs=<optimized out>) at arch/arm64/kernel/syscall.c:48
#17 el0_svc_common (regs=0xffffff8010953ec0, scno=<optimized out>, syscall_table=<optimized out>, sc_nr=<optimized out>) at arch/arm64/kernel/syscall.c:114
#18 0xffffff8010095444 in el0_svc_handler (regs=<optimized out>) at arch/arm64/kernel/syscall.c:160
#19 0xffffff8010084188 in el0_svc () at arch/arm64/kernel/entry.S:1009
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

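    There is therefore reason to believe that the kernel on AArch64 can support floating-point arithmetic, although this remains to be verified in practice. What is puzzling is that fpsimd_load_state, which would restore the new task's floating-point registers, was never hit during the task switch. Perhaps the restore does not happen inside the switch at all: fpsimd_thread_switch in arch/arm64/kernel/fpsimd.c updates the TIF_FOREIGN_FPSTATE flag of the next task, which suggests that the reload is deferred until that task returns to user space. I leave the confirmation for later analysis.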
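
    One hypothesis for the missing fpsimd_load_state call is a deferred restore: the switch only saves live state and marks the next task's registers as stale, and the reload happens later, on the way back to user space. The toy model below illustrates that idea. It is an assumption that the debugging session above does not confirm, and every class and function name in it is invented.

```python
class Cpu:
    def __init__(self, cpu_id):
        self.id = cpu_id
        self.last_loaded = None  # task whose FP state sits in the registers
        self.log = []


class Task:
    def __init__(self, name):
        self.name = name
        self.foreign_fpstate = True  # stands in for TIF_FOREIGN_FPSTATE
        self.fpsimd_cpu = None       # CPU that last loaded this task's state


def thread_switch(cpu, prev, nxt):
    """At switch time: save prev's live state, never load next's."""
    if not prev.foreign_fpstate:
        cpu.log.append(("save", prev.name))
    wrong_task = cpu.last_loaded is not nxt
    wrong_cpu = nxt.fpsimd_cpu != cpu.id
    nxt.foreign_fpstate = wrong_task or wrong_cpu


def return_to_user(cpu, task):
    """On the way back to user space: reload only if the registers are stale."""
    if task.foreign_fpstate:
        cpu.log.append(("load", task.name))
        cpu.last_loaded = task
        task.fpsimd_cpu = cpu.id
        task.foreign_fpstate = False
```

    In this model a breakpoint on the load path would never fire inside the switch itself, which matches what the session shows, but the match alone does not prove that the kernel works this way.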
