[Linux Clock System]

wulaladamowang · published 2022-04-21 23:44:27

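Time devices

RTC clock system

1. CPU time is not necessarily generated this way; the RTC is used here only to illustrate how hardware produces ticks.
2. The RTC keeps running independently when main power is off: as long as the chip's backup supply stays powered, the RTC time keeps advancing, and the RTC can still generate a frequency.
3. An oscillator serves as the clock source that drives the system clock. There are two secondary clock sources: a 40 kHz low-speed internal RC oscillator and a 32.768 kHz low-speed external crystal.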
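The 32.768 kHz crystal is a convenient choice because 32768 = 2^15, so a 15-bit binary divider turns it into a 1 Hz signal. The sketch below models that division in user-space C; `rtc_model` and `rtc_model_advance` are hypothetical names made up for illustration, not part of any RTC driver.

```c
#include <stdint.h>

#define RTC_CRYSTAL_HZ 32768u  /* 32.768 kHz = 2^15 Hz */

/* Hypothetical model of an RTC prescaler: it counts crystal cycles
 * and carries into a seconds counter once per 32768 cycles. */
struct rtc_model {
    uint32_t prescaler;  /* cycles accumulated within the current second */
    uint32_t seconds;    /* whole seconds counted so far */
};

/* Feed `cycles` oscillator cycles into the model. */
static void rtc_model_advance(struct rtc_model *rtc, uint32_t cycles)
{
    /* 64-bit sum so that a large `cycles` value cannot overflow */
    uint64_t total = (uint64_t)rtc->prescaler + cycles;

    rtc->seconds += (uint32_t)(total / RTC_CRYSTAL_HZ);
    rtc->prescaler = (uint32_t)(total % RTC_CRYSTAL_HZ);
}
```

Feeding the model exactly 32768 cycles advances `seconds` by one and leaves the prescaler at zero.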

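Timekeeping devices (from Wu Bin's personal homepage)

1. The TSC (time stamp counter) is a counter inside each CPU that increments at a fixed frequency tied to the CPU clock. On a 2400 MHz CPU, for example, the counter advances by 2400M counts per second. The counter is reset to zero when the CPU starts, so if the CPU start time is known, the current time is the counter value divided by the frequency, plus the start time.
2. The RTC (real time clock) is a timekeeping device located in the CMOS circuitry. Its advantage over the TSC is an independent battery supply: the RTC keeps counting even when the computer is powered off. Its drawback is a low counting frequency, and therefore poor time precision. The RTC is usually initialized with the current time expressed as an offset from January 1, 1970, so the current time can be read directly through the RTC's register interface.
3. ACPI_PM is a timekeeping device usually provided by the ACPI power management module in the southbridge. Its precision is low, and it is generally not recommended.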
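The TSC arithmetic in item 1 can be written out as a small user-space sketch. It assumes a constant counter frequency, and the helper names `tsc_to_ns` and `now_ns` are invented for illustration; the kernel's real conversion uses the mult/shift scheme of struct clocksource instead of a division.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Convert a counter reading to nanoseconds since the counter was zero.
 * tsc_hz is the (assumed constant) counter frequency in Hz. Splitting
 * into whole seconds and a remainder keeps the multiplication within
 * 64 bits for frequencies up to roughly 18 GHz. */
static uint64_t tsc_to_ns(uint64_t tsc, uint64_t tsc_hz)
{
    uint64_t secs = tsc / tsc_hz;
    uint64_t rem = tsc % tsc_hz;

    return secs * NSEC_PER_SEC + rem * NSEC_PER_SEC / tsc_hz;
}

/* Current time = start time + elapsed time derived from the counter. */
static uint64_t now_ns(uint64_t boot_ns, uint64_t tsc, uint64_t tsc_hz)
{
    return boot_ns + tsc_to_ns(tsc, tsc_hz);
}
```

With the 2400 MHz example from the text, a reading of 2,400,000,000 counts corresponds to exactly one second after start.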

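Timer notification devices (from Wu Bin's personal homepage)

1. Overview: these devices notify the system of an expiry event, either periodically or after a given interval. Common ones are the Local APIC Timer, the PIT (Programmable Interval Timer) and the HPET (High Precision Event Timer). The system first writes an expiry count into the device's counter; the device then decrements the counter at a fixed frequency and, when it reaches zero, raises an interrupt to tell the CPU that the event has occurred.
2. The Local APIC Timer is the timer inside each CPU's local interrupt controller (APIC). Its precision is high, and it is the notification device used during normal system operation.
3. The PIT is a standalone timer notification device outside the CPU. It is a global device with low precision and is usually not used.
4. The HPET is also a global timer notification device. Its precision is high, but it requires dedicated hardware in the system.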

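The clock system during boot

1. LAPIC: the local tick device of each CPU.
2. PIT: the PIT is essentially a global clock event device, that is, it is not bound to any particular CPU. It supports both oneshot and periodic modes.
3. Broadcast device: when a local tick device stops working because its CPU has entered a power-saving state, the broadcast device can still raise an interrupt to wake that CPU so it resumes work. The local tick device cannot do this itself, because it has gone to sleep together with the CPU.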

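SMP initialization phase

1. On an x86 SMP system, the Local APIC of every CPU contains a high-precision clock event device (the LAPIC Timer). setup_APIC_timer is therefore called to initialize the LAPIC Timer both at the end of BSP (bootstrap processor) initialization and during the initialization of each AP (application processor).
2. setup_APIC_timer in turn registers the device through clockevents_register_device. On the BSP, the LAPIC timer replaces the PIT as the local tick device and the PIT becomes the broadcast device; on an AP, the LAPIC timer is used as the local tick device directly.
3. The handler entry point for the LAPIC timer is smp_apic_timer_interrupt, which is installed during interrupt subsystem initialization (start_kernel->init_IRQ->apic_intr_init).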

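The Linux clock interrupt

1. The physical origin of the Linux OS clock is the output pulse of a programmable timer/counter. The pulse is fed to the CPU and raises an interrupt request, which is called the clock interrupt.
2. The operating system initializes the programmable timer/counter, which then counts (divides) the input pulses. Its three output pulses Out0, Out1 and Out2 each serve a different purpose, as most interfacing textbooks describe. Only Out0 matters here: it is wired to pin 0 of interrupt controller 8259A_1 and triggers a periodic interrupt, the clock interrupt. The period of that interrupt, which is the period of the pulse signal, is called a "tick".
3. The 8253 is the PIT clock interrupt device described above. Once the BSP has initialized the PIT and enabled its interrupt signal, the BSP periodically receives interrupts from the PIT and runs the corresponding handler. The handler goes through the following call chain: timer_interrupt->tick_handle_periodic->tick_periodic->do_timer->update_process_times.
4. The LAPIC Timer interrupt is handled in a similar way: smp_apic_timer_interrupt->local_apic_timer_interrupt->tick_handle_periodic->tick_periodic->do_timer->update_process_times.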

static struct irqaction irq0  = {
    .handler    = timer_interrupt,
    .flags      = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER,
    .name       = "timer"
};

void __init setup_default_timer_irq(void)
{
    setup_irq(0, &irq0); /* install the interrupt handler object and enable the interrupt; IRQ 0 is the clock interrupt */
}
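The frequency division that produces the tick can be made concrete with a little arithmetic. The sketch below assumes the 1,193,182 Hz input clock conventionally used for the 8253/8254 on PCs and rounds the reload value to the nearest integer; the constant and the helper names (`SKETCH_PIT_TICK_RATE`, `pit_latch_for`, `pit_period_ns`) are illustrative, not taken from the code above.

```c
#include <stdint.h>

#define SKETCH_PIT_TICK_RATE 1193182u  /* PIT input clock in Hz (assumed PC value) */

/* Reload value that makes Out0 fire about `hz` times per second,
 * rounded to the nearest integer. */
static uint32_t pit_latch_for(uint32_t hz)
{
    return (SKETCH_PIT_TICK_RATE + hz / 2) / hz;
}

/* Actual tick period, in nanoseconds, produced by a given reload value. */
static uint64_t pit_period_ns(uint32_t latch)
{
    return (uint64_t)latch * 1000000000ull / SKETCH_PIT_TICK_RATE;
}
```

Because the input clock is not an exact multiple of common tick rates, the resulting period is only approximately the nominal one: for a 100 Hz tick the reload value is 11932, which gives a period very close to, but not exactly, 10 ms.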

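Physical generation of the clock interrupt
Service routines triggered by the clock interrupt

Using low-resolution timers

Timer initialization

1. The three key pieces of information in a timer are its expiry time, its expiry handler, and the argument passed to that handler.
2. init_timer_key points the timer at the tvec_base structure of the CPU that performs the initialization. The kernel allocates one struct tvec_base object per CPU to record the global timer state of that CPU.

Adding a timer

1. add_timer -> __internal_add_timer

Triggering timers

Each clock interrupt invokes the corresponding function, which processes the expired timers and rearranges the lists.

Code

1. add_timer adds the timer to the timer wheel lists of the tvec_base of the CPU it runs on.
2. Based on the difference between the timer's expiry time and the current time in jiffies (a smaller difference means an earlier expiry), the kernel hangs the timer on one of five levels of lists (tv1-tv5); the lower the level, the earlier the expiry.

// Timer structure
//linux/include/linux/timer.h: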

#define init_timer(timer)                       \
    __init_timer((timer), 0)

#define __init_timer(_timer, _flags)            \
    init_timer_key((_timer), (_flags), NULL, NULL)

struct timer_list {
    /*
     * All fields that change during normal runtime grouped to the
     * same cacheline
     */
    struct list_head entry; /* links this timer into the CPU's tvec_base lists */
    unsigned long expires; /* expiry time of the timer */
    struct tvec_base *base; /* tvec_base this timer belongs to */

    void (*function)(unsigned long); /* expiry handler */
    unsigned long data; /* argument passed to the expiry handler */

    int slack; /* allowed deviation */

    ...
};

// How a timer is attached to a CPU's struct tvec_base
//linux/kernel/timer.c:

/**
 * init_timer_key - initialize a timer
 * @timer: the timer to be initialized
 * @flags: timer flags
 * @name: name of the timer
 * @key: lockdep class key of the fake lock used for tracking timer
 *       sync lock dependencies
 *
 * init_timer_key() must be done to a timer prior calling *any* of the
 * other timer functions.
 */
void init_timer_key(struct timer_list *timer, unsigned int flags,
    const char *name, struct lock_class_key *key)
{
    debug_init(timer);
    do_init_timer(timer, flags, name, key);
}

static void do_init_timer(struct timer_list *timer, unsigned int flags,
    const char *name, struct lock_class_key *key)
{
    struct tvec_base *base = __raw_get_cpu_var(tvec_bases);

    timer->entry.next = NULL;
    timer->base = (void *)((unsigned long)base | flags);
    timer->slack = -1;
    ...
}

struct tvec_base {
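    spinlock_t lock; /* serializes list operations on this tvec_base */
    struct timer_list *running_timer; /* timer currently running (expired and firing) */
    unsigned long timer_jiffies; /* current time used to decide whether timers have expired; normally equal to the system jiffies */
    unsigned long next_timer; /* expiry time of the next timer to expire */
    unsigned long active_timers; /* number of active timers */
    struct tvec_root tv1; /* tv1-tv5 are the lists that hold added timers, also called the timer wheel */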
    struct tvec tv2;
    struct tvec tv3;
    struct tvec tv4;
    struct tvec tv5;
} ____cacheline_aligned;

/*
 * per-CPU timer vector definitions:
 */
#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)
#define MAX_TVAL ((unsigned long)((1ULL << (TVR_BITS + 4*TVN_BITS)) - 1))

struct tvec {
    struct list_head vec[TVN_SIZE];
};

struct tvec_root {
    struct list_head vec[TVR_SIZE];
};
// Add a timer to the list of the corresponding CPU
static void
__internal_add_timer(struct tvec_base *base, struct timer_list *timer)
{
    unsigned long expires = timer->expires;
    unsigned long idx = expires - base->timer_jiffies; /* idx is the time difference */
    struct list_head *vec;

    if (idx < TVR_SIZE) {
        int i = expires & TVR_MASK; /* index by the expiry time itself (not the difference idx), which simplifies later expiry processing */
        vec = base->tv1.vec + i;
    } else if (idx < 1 << (TVR_BITS + TVN_BITS)) {
        int i = (expires >> TVR_BITS) & TVN_MASK;
        vec = base->tv2.vec + i;
    } else if (idx < 1 << (TVR_BITS + 2 * TVN_BITS)) {
        int i = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
        vec = base->tv3.vec + i;
    } else if (idx < 1 << (TVR_BITS + 3 * TVN_BITS)) {
        int i = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
        vec = base->tv4.vec + i;
    } else if ((signed long) idx < 0) {
        /*
         * Can happen if you add a timer with expires == jiffies,
         * or you set a timer to go off in the past
         */
        vec = base->tv1.vec + (base->timer_jiffies & TVR_MASK);
    } else {
        int i;
        /* If the timeout is larger than MAX_TVAL (on 64-bit
         * architectures or with CONFIG_BASE_SMALL=1) then we
         * use the maximum timeout.
         */
        if (idx > MAX_TVAL) {
            idx = MAX_TVAL;
            expires = idx + base->timer_jiffies;
        }
        i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
        vec = base->tv5.vec + i;
    }
    /*
     * Timers are FIFO:
     */
    list_add_tail(&timer->entry, vec);
}
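To see which list a given timer lands on, the level selection in __internal_add_timer can be replayed in user space. The sketch below assumes the CONFIG_BASE_SMALL=0 sizes (TVR_BITS = 8, TVN_BITS = 6) and returns the level and bucket index instead of touching any list; `wheel_slot` and `wheel_slot_for` are hypothetical names. It checks the "already expired" case first, which is equivalent to the kernel's ordering because a negative difference fails every unsigned size comparison.

```c
#define TVN_BITS 6
#define TVR_BITS 8
#define TVN_MASK ((1UL << TVN_BITS) - 1)
#define TVR_MASK ((1UL << TVR_BITS) - 1)
#define MAX_TVAL ((unsigned long)((1ULL << (TVR_BITS + 4 * TVN_BITS)) - 1))

struct wheel_slot {
    int level;            /* 1..5, standing for tv1..tv5 */
    unsigned long index;  /* bucket within that level */
};

static struct wheel_slot wheel_slot_for(unsigned long expires,
                                        unsigned long timer_jiffies)
{
    unsigned long idx = expires - timer_jiffies;  /* time difference */
    struct wheel_slot s;

    if ((long)idx < 0) {
        /* expiry lies in the past: queue for the current tick */
        s.level = 1;
        s.index = timer_jiffies & TVR_MASK;
    } else if (idx < (1UL << TVR_BITS)) {
        s.level = 1;
        s.index = expires & TVR_MASK;
    } else if (idx < (1UL << (TVR_BITS + TVN_BITS))) {
        s.level = 2;
        s.index = (expires >> TVR_BITS) & TVN_MASK;
    } else if (idx < (1UL << (TVR_BITS + 2 * TVN_BITS))) {
        s.level = 3;
        s.index = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
    } else if (idx < (1UL << (TVR_BITS + 3 * TVN_BITS))) {
        s.level = 4;
        s.index = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
    } else {
        if (idx > MAX_TVAL)  /* clamp very long timeouts */
            expires = MAX_TVAL + timer_jiffies;
        s.level = 5;
        s.index = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
    }
    return s;
}
```

With timer_jiffies at 1000, a timer expiring at 1010 goes to tv1, one expiring at 1300 goes to tv2, and one expiring at 21000 goes to tv3. In every case the bucket index comes from bits of the absolute expiry time, not from the difference.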

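High-resolution timers

1. Low-resolution timers are referenced to jiffies and have millisecond-level resolution; high-resolution timers can take any of several clocks as their reference, such as those provided by the timekeeper.
2. High-resolution timers are kept in red-black trees, which makes sorting, insertion, removal, lookup and modification efficient.

Initializing a high-resolution timer

1. Determine the timer's mode and attach the timer to the corresponding red-black tree of the CPU; different modes correspond to different red-black trees.

Starting a high-resolution timer

1. hrtimer_start -> __hrtimer_start_range_ns.
2. Set the callback function.
3. Set the timer's internal expiry time.
4. Insert the timer into the corresponding red-black tree.
5. Check whether this timer is now the earliest one in the tree. If it is, reprogram the oneshot count of the clock event device; high-resolution timers always operate in oneshot mode.
6. Note: the hrtimer_clock_base object the timer points to must be locked during this process.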
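Steps 4 and 5 above can be sketched in user space. To keep the example short, a sorted singly linked list stands in for the kernel's red-black tree, and writing to a `programmed_ns` field stands in for reprogramming the clock event device; all names here (`sketch_timer`, `sketch_base`, `sketch_enqueue`) are hypothetical and locking is omitted.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_timer {
    int64_t expires_ns;         /* absolute expiry time */
    struct sketch_timer *next;
};

struct sketch_base {
    struct sketch_timer *first; /* earliest timer, NULL when empty */
    int64_t programmed_ns;      /* expiry currently programmed into the event device */
};

/* Insert the timer in expiry order. Returns 1 when the new timer became
 * the earliest one, in which case the (modelled) oneshot event device is
 * reprogrammed; returns 0 otherwise. */
static int sketch_enqueue(struct sketch_base *base, struct sketch_timer *t)
{
    struct sketch_timer **pos = &base->first;

    /* timers with equal expiry keep their insertion order */
    while (*pos && (*pos)->expires_ns <= t->expires_ns)
        pos = &(*pos)->next;
    t->next = *pos;
    *pos = t;

    if (base->first == t) {
        base->programmed_ns = t->expires_ns;
        return 1;
    }
    return 0;
}
```

The design point this illustrates is that only a new earliest timer touches the hardware; timers queued behind it cost nothing until they reach the front.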

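Switching to high-resolution mode

1. After a normal boot the kernel first runs in low-resolution mode; while handling clock interrupts, however, it checks whether the conditions for switching to high-resolution mode are met.
2. When they are, the switch is carried out by hrtimer_run_pending() while the clock interrupt is processing low-resolution timers.
3. The clock interrupt handler switches from tick_handle_periodic to hrtimer_interrupt, and the tick_device switches its clock mode to oneshot.
4. During the switch, hrtimer_switch_to_hres calls tick_setup_sched_timer() to set up a dedicated scheduling timer that handles scheduling work.

Expiry handling for high-resolution timers

1. Update the current system time from the clock source.
2. Go through the objects for each of the time reference models (clock bases).
3. In each model's red-black tree, process the timers in order of expiry time and call their expiry functions.
4. Handle the special case where an expiry callback runs so long that the next timer has already expired.
5. The dedicated per-CPU scheduling timer takes care of process scheduling.

Code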

/**
 * struct hrtimer - the basic hrtimer structure
 * @node:   timerqueue node, which also manages node.expires,
 *          the absolute expiry time in the hrtimers internal
 *          representation. The time is related to the clock on
 *          which the timer is based. Is setup by adding
 *          slack to the _softexpires value. For non range timers
 *          identical to _softexpires.
 * @_softexpires: the absolute earliest expiry time of the hrtimer.
 *          The time which was given as expiry time when the timer
 *          was armed.
 * @function:   timer expiry callback function
 * @base:   pointer to the timer base (per cpu and per clock)
 * @state:  state information (See bit values above)
 *
 * The hrtimer structure must be initialized by hrtimer_init()
 */
struct hrtimer {
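    struct timerqueue_node      node; /* timerqueue (red-black tree) node the timer is queued on */
    ktime_t                     _softexpires; /* earliest (soft) expiry time */
    enum hrtimer_restart        (*function)(struct hrtimer *); /* expiry callback */
    struct hrtimer_clock_base   *base; /* which tree of which CPU the timer belongs to; depends on the clock in use */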
    unsigned long               state;
    ...
};

enum hrtimer_mode {
    HRTIMER_MODE_ABS = 0x0,		/* Time value is absolute */
    HRTIMER_MODE_REL = 0x1,		/* Time value is relative to now */
    HRTIMER_MODE_PINNED = 0x02,	/* Timer is bound to CPU */
    HRTIMER_MODE_ABS_PINNED = 0x02,
    HRTIMER_MODE_REL_PINNED = 0x03,
};
static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
    enum hrtimer_mode mode)
{
    struct hrtimer_cpu_base *cpu_base;
    int base;

    memset(timer, 0, sizeof(struct hrtimer));

    cpu_base = &__raw_get_cpu_var(hrtimer_bases); /* hrtimer_cpu_base object of the current CPU */
	/*
	 * POSIX magic: Relative CLOCK_REALTIME timers are not affected by
	 * clock modifications, so they needs to become CLOCK_MONOTONIC to
	 * ensure POSIX compliance.
	 */
    if (clock_id == CLOCK_REALTIME && mode != HRTIMER_MODE_ABS) /* only absolute timers stay on CLOCK_REALTIME */
        clock_id = CLOCK_MONOTONIC;

    base = hrtimer_clockid_to_base(clock_id); /* index of the time reference (clock base) */
    timer->base = &cpu_base->clock_base[base];
    timerqueue_init(&timer->node); /* initialize the red-black tree node */

    ...
}
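The clock selection in __hrtimer_init is easy to misread, so here it is isolated as a pure function. This is a user-space sketch that mirrors the condition exactly as it appears in the excerpt above; the `SK_` enumerators and `effective_clock` are hypothetical stand-ins for the kernel's constants, with the same numeric mode values as the enum shown earlier.

```c
enum sketch_clock {
    SK_CLOCK_REALTIME = 0,
    SK_CLOCK_MONOTONIC = 1
};

enum sketch_mode {
    SK_MODE_ABS = 0x0,
    SK_MODE_REL = 0x1,
    SK_MODE_ABS_PINNED = 0x2,
    SK_MODE_REL_PINNED = 0x3
};

/* Clock base a timer ends up on, following the check in the excerpt:
 * a CLOCK_REALTIME timer whose mode is anything other than plain ABS
 * is moved to CLOCK_MONOTONIC, so later clock changes cannot shift it. */
static enum sketch_clock effective_clock(enum sketch_clock id,
                                         enum sketch_mode mode)
{
    if (id == SK_CLOCK_REALTIME && mode != SK_MODE_ABS)
        return SK_CLOCK_MONOTONIC;
    return id;
}
```

One consequence of the comparison as written is that the pinned absolute mode (0x02) also differs from HRTIMER_MODE_ABS, so in this version it is remapped as well.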

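Software architecture of the time subsystem (wowotech)

1. Users cannot use these time devices directly; applications and users obtain timekeeping and timer notification services only through the kernel running on the CPU. The time subsystem can therefore be split into a hardware part and a software part: the time devices are the hardware part, and the kernel modules that drive them (the kernel time subsystem) are the software part. The kernel calls a timekeeping device a clock source (Clock Source) and a timer notification device a clock event device (Clock Event Device).
2. In older kernels, clock events were delivered by the interrupt handler of the timer hardware, and the tick module was built on top of that. The tick module maintains the system tick: with a 10 ms tick, for example, the timekeeping module advances the system time each time a tick arrives. If timekeeping were driven purely by ticks, its precision would be only 10 ms; to do better, the clock source module provides an interface that returns the time offset between ticks.
3. With the introduction of multi-core, the functions of the old HW timer were split into two parts. One is a free-running system counter, which is global and belongs to no particular CPU. The other is the HW block that generates timed events, called the timer; this hardware is embedded in each CPU core, so it is more accurately called the CPU local timer, and all of these timers run off one global counter. At the driver level, a clock source chip driver module drives the hardware; this module is architecture-specific. If the system contains several HW timer and counter blocks, it may contain several clock source chip drivers.
4. To cope with the wide variety of timer and counter hardware, the Linux kernel abstracts a generic clock event layer and a generic clock source module, both hardware independent. The low-level clock source chip driver calls the interfaces of these generic modules to register clock source and clock event devices.
5. A clocksource is a timeline, and a clock event device generates clock events at specified points on that timeline. It can produce such asynchronous events because it is built on the interrupt subsystem: the clock source chip driver requests the interrupt and calls the generic clock event module's callback to report the event.
6. The tick device layer works on top of clock event devices. In general each CPU forms a small system of its own, with its own scheduling, its own process accounting and so on, and each such system has exactly one tick device. Clock event devices are different: one clock event device is registered for every piece of timer hardware, and each CPU's tick device picks the clock event device that suits it best.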

struct clocksource

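Overview

1. The kernel describes a physical timekeeping device with a clocksource object. On x86 the most common timekeeping device is the TSC; its clocksource is shown here, and the current TSC count is obtained through read_tsc (which ultimately uses the rdtsc instruction):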

//linux/arch/x86/kernel/tsc.c:

static struct clocksource clocksource_tsc = {
    .name                   = "tsc",
    .rating                 = 300,
    .read                   = read_tsc,
    .resume                 = resume_tsc,
    .mask                   = CLOCKSOURCE_MASK(64),
    .flags                  = CLOCK_SOURCE_IS_CONTINUOUS |
                              CLOCK_SOURCE_MUST_VERIFY,
};

/**
 * struct clocksource - hardware abstraction for a free running counter
 *	Provides mostly state-free accessors to the underlying hardware.
 *	This is the structure used for system time.
 *
 * @name:		ptr to clocksource name
 * @list:		list head for registration
 * @rating:		rating value for selection (higher is better)
 *			To avoid rating inflation the following
 *			list should give you a guide as to how
 *			to assign your clocksource a rating
 *			1-99: Unfit for real use
 *				Only available for bootup and testing purposes.
 *			100-199: Base level usability.
 *				Functional for real use, but not desired.
 *			200-299: Good.
 *				A correct and usable clocksource.
 *			300-399: Desired.
 *				A reasonably fast and accurate clocksource.
 *			400-499: Perfect
 *				The ideal clocksource. A must-use where
 *				available.
 * @read:		returns a cycle value, passes clocksource as argument
 * @enable:		optional function to enable the clocksource
 * @disable:		optional function to disable the clocksource
 * @mask:		bitmask for two's complement
 *			subtraction of non 64 bit counters
 * @mult:		cycle to nanosecond multiplier
 * @shift:		cycle to nanosecond divisor (power of two)
 * @max_idle_ns:	max idle time permitted by the clocksource (nsecs)
 * @maxadj:		maximum adjustment value to mult (~11%)
 * @max_cycles:		maximum safe cycle value which won't overflow on multiplication
 * @flags:		flags describing special properties
 * @archdata:		arch-specific data
 * @suspend:		suspend function for the clocksource, if necessary
 * @resume:		resume function for the clocksource, if necessary
 * @mark_unstable:	Optional function to inform the clocksource driver that
 *			the watchdog marked the clocksource unstable
 * @owner:		module reference, must be set by clocksource in modules
 *
 * Note: This struct is not used in hotpathes of the timekeeping code
 * because the timekeeper caches the hot path fields in its own data
 * structure, so no line cache alignment is required,
 *
 * The pointer to the clocksource itself is handed to the read
 * callback. If you need extra information there you can wrap struct
 * clocksource into your own struct. Depending on the amount of
 * information you need you should consider to cache line align that
 * structure.
 */
struct clocksource {
	u64 (*read)(struct clocksource *cs);
	u64 mask;
	u32 mult;
	u32 shift;
	u64 max_idle_ns;
	u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
	struct arch_clocksource_data archdata;
#endif
	u64 max_cycles;
	const char *name;
	struct list_head list;
	int rating;
	int (*enable)(struct clocksource *cs);
	void (*disable)(struct clocksource *cs);
	unsigned long flags;
	void (*suspend)(struct clocksource *cs);
	void (*resume)(struct clocksource *cs);
	void (*mark_unstable)(struct clocksource *cs);
	void (*tick_stable)(struct clocksource *cs);

	/* private: */
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
	/* Watchdog related data, used by the framework */
	struct list_head wd_list;
	u64 cs_last;
	u64 wd_last;
#endif
	struct module *owner;
};

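1. This structure is the software abstraction of a real clock source.
2. The abstract clock source does not generate interrupts itself. To get the current count, the caller must actively invoke the read callback, which returns the current counter value (cycles); the mult and shift members are then used to convert it to time.
3. During boot the kernel creates a jiffies-based clocksource and registers it with the system; note that at registration time its member variables have not yet been assigned.
4. The clock generator is a hardware device; the cycles it produces can be read out as a cycle count, from which the time is derived. (This is one way to understand where time comes from.)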
struct timekeeper

Overview

1. The timekeeper is the core kernel object responsible for timekeeping. It provides time services by using the best clocksource currently available in the system.
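"Best" here is decided by the rating field, where higher is better. A minimal sketch of that selection, assuming a plain array of candidates instead of the kernel's registration list (`sketch_clocksource` and `pick_best` are hypothetical names):

```c
#include <stddef.h>

struct sketch_clocksource {
    const char *name;
    int rating;  /* higher is better */
};

/* Return the candidate with the highest rating, or NULL for an empty
 * list. On a tie the earlier entry wins. */
static const struct sketch_clocksource *
pick_best(const struct sketch_clocksource *list, size_t n)
{
    const struct sketch_clocksource *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (best == NULL || list[i].rating > best->rating)
            best = &list[i];
    }
    return best;
}
```

Given candidates rated 1, 200, 300 and 250, the function returns the one rated 300, which matches the rating that clocksource_tsc declares.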

/**
 * struct timekeeper - Structure holding internal timekeeping values.
 * @tkr_mono:		The readout base structure for CLOCK_MONOTONIC
 * @tkr_raw:		The readout base structure for CLOCK_MONOTONIC_RAW
 * @xtime_sec:		Current CLOCK_REALTIME time in seconds
 * @ktime_sec:		Current CLOCK_MONOTONIC time in seconds
 * @wall_to_monotonic:	CLOCK_REALTIME to CLOCK_MONOTONIC offset
 * @offs_real:		Offset clock monotonic -> clock realtime
 * @offs_boot:		Offset clock monotonic -> clock boottime
 * @offs_tai:		Offset clock monotonic -> clock tai
 * @tai_offset:		The current UTC to TAI offset in seconds
 * @clock_was_set_seq:	The sequence number of clock was set events
 * @cs_was_changed_seq:	The sequence number of clocksource change events
 * @next_leap_ktime:	CLOCK_MONOTONIC time value of a pending leap-second
 * @raw_sec:		CLOCK_MONOTONIC_RAW  time in seconds
 * @cycle_interval:	Number of clock cycles in one NTP interval
 * @xtime_interval:	Number of clock shifted nano seconds in one NTP
 *			interval.
 * @xtime_remainder:	Shifted nano seconds left over when rounding
 *			@cycle_interval
 * @raw_interval:	Shifted raw nano seconds accumulated per NTP interval.
 * @ntp_error:		Difference between accumulated time and NTP time in ntp
 *			shifted nano seconds.
 * @ntp_error_shift:	Shift conversion between clock shifted nano seconds and
 *			ntp shifted nano seconds.
 * @last_warning:	Warning ratelimiter (DEBUG_TIMEKEEPING)
 * @underflow_seen:	Underflow warning flag (DEBUG_TIMEKEEPING)
 * @overflow_seen:	Overflow warning flag (DEBUG_TIMEKEEPING)
 *
 * Note: For timespec(64) based interfaces wall_to_monotonic is what
 * we need to add to xtime (or xtime corrected for sub jiffie times)
 * to get to monotonic time.  Monotonic is pegged at zero at system
 * boot time, so wall_to_monotonic will be negative, however, we will
 * ALWAYS keep the tv_nsec part positive so we can use the usual
 * normalization.
 *
 * wall_to_monotonic is moved after resume from suspend for the
 * monotonic time not to jump. We need to add total_sleep_time to
 * wall_to_monotonic to get the real boot based time offset.
 *
 * wall_to_monotonic is no longer the boot time, getboottime must be
 * used instead.
 */
struct timekeeper {
	struct tk_read_base	tkr_mono;
	struct tk_read_base	tkr_raw;
	u64			xtime_sec;
	unsigned long		ktime_sec;
	struct timespec64	wall_to_monotonic;
	ktime_t			offs_real;
	ktime_t			offs_boot;
	ktime_t			offs_tai;
	s32			tai_offset;
	unsigned int		clock_was_set_seq;
	u8			cs_was_changed_seq;
	ktime_t			next_leap_ktime;
	u64			raw_sec;

	/* The following members are for timekeeping internal use */
	u64			cycle_interval;
	u64			xtime_interval;
	s64			xtime_remainder;
	u64			raw_interval;
	/* The ntp_tick_length() value currently being used.
	 * This cached copy ensures we consistently apply the tick
	 * length for an entire tick, as ntp_tick_length may change
	 * mid-tick, and we don't want to apply that new value to
	 * the tick in progress.
	 */
	u64			ntp_tick;
	/* Difference between accumulated time and NTP time in ntp
	 * shifted nano seconds. */
	s64			ntp_error;
	u32			ntp_error_shift;
	u32			ntp_err_mult;
	/* Flag used to avoid updating NTP twice with same second */
	u32			skip_second_overflow;
#ifdef CONFIG_DEBUG_TIMEKEEPING
	long			last_warning;
	/*
	 * These simple flag variables are managed
	 * without locks, which is racy, but they are
	 * ok since we don't really care about being
	 * super precise about how many events were
	 * seen, just that a problem was observed.
	 */
	int			underflow_seen;
	int			overflow_seen;
#endif
};

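1. Kinds of time
	- RTC time: implemented by dedicated hardware; millisecond-level precision, and access is fairly slow if the RTC is an external chip.
	- xtime: wall time, a variable kept in memory; high precision, down to nanoseconds; CLOCK_REALTIME.
	- monotonic time: increases continuously from system boot and does not include time spent in suspend; CLOCK_MONOTONIC.
	- raw monotonic time: the time as counted by the clock hardware alone, unaffected by NTP adjustment; CLOCK_MONOTONIC_RAW.
2. All of these depend on the clock source that the structure's clock member points to; when the clocksource is updated, the timekeeper is told through a notifier chain to switch clock sources.
3. The default clocksource is the jiffies-based clocksource_jiffies. The current RTC time is used to fill in timekeeper members such as xtime, raw_time and wall_to_monotonic; the offs_real field, which holds the offset between real time and monotonic time, is then initialized, and total_sleep_time is initialized to 0.
4. Because xtime lives in memory, the time is lost when the system loses power, so on every boot timekeeping_init synchronizes the correct time from the RTC.
5. Once xtime has been synchronized from the RTC, the timekeeper runs independently of the RTC and advances time from its associated clocksource. Each update goes through do_timer, which also increments the jiffies count.
6. How often the time is updated depends on the kernel configuration. Without the NO_HZ option, do_timer is normally called once per tick interrupt period; with NO_HZ, do_timer may be called only once every several ticks.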
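From user space, the clocks listed in item 1 are visible through the POSIX clock_gettime() call. The helper below is a small sketch for reading one of them in nanoseconds; `clock_now_ns` is a made-up name, and the feature-test macro is there only so the declarations are visible under strict compiler modes.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Read a POSIX clock and return its value in nanoseconds,
 * or -1 if the clock cannot be read. */
static long long clock_now_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Calling it with CLOCK_REALTIME returns wall time since the epoch, while CLOCK_MONOTONIC returns a value that never goes backwards between two reads. On Linux, CLOCK_MONOTONIC_RAW can be passed in the same way.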

struct clock_event_device

/**
 * struct clock_event_device - clock event device descriptor
 * @event_handler:	Assigned by the framework to be called by the low
 *			level handler of the event source; clock events are handled through this callback
 * @set_next_event:	set next event function using a clocksource delta
 * @set_next_ktime:	set next event function using a direct ktime value
 * @next_event:		local storage for the next event in oneshot mode
 * @max_delta_ns:	maximum delta value in ns
 * @min_delta_ns:	minimum delta value in ns
 * @mult:		nanosecond to cycles multiplier
 * @shift:		nanoseconds to cycles divisor (power of two)
 * @state_use_accessors:current state of the device, assigned by the core code
 * @features:		features
 * @retries:		number of forced programming retries
 * @set_state_periodic:	switch state to periodic
 * @set_state_oneshot:	switch state to oneshot
 * @set_state_oneshot_stopped: switch state to oneshot_stopped
 * @set_state_shutdown:	switch state to shutdown
 * @tick_resume:	resume clkevt device
 * @broadcast:		function to broadcast events
 * @min_delta_ticks:	minimum delta value in ticks stored for reconfiguration
 * @max_delta_ticks:	maximum delta value in ticks stored for reconfiguration
 * @name:		ptr to clock event name
 * @rating:		variable to rate clock event devices
 * @irq:		IRQ number (only for non CPU local devices)
 * @bound_on:		Bound on CPU
 * @cpumask:		cpumask to indicate for which CPUs this device works
 * @list:		list head for the management code
 * @owner:		module reference
 */
struct clock_event_device {
	void			(*event_handler)(struct clock_event_device *);
	int			(*set_next_event)(unsigned long evt, struct clock_event_device *);
	int			(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
	ktime_t			next_event;
	u64			max_delta_ns;
	u64			min_delta_ns;
	u32			mult;
	u32			shift;
	enum clock_event_state	state_use_accessors;
	unsigned int		features;
	unsigned long		retries;

	int			(*set_state_periodic)(struct clock_event_device *);
	int			(*set_state_oneshot)(struct clock_event_device *);
	int			(*set_state_oneshot_stopped)(struct clock_event_device *);
	int			(*set_state_shutdown)(struct clock_event_device *);
	int			(*tick_resume)(struct clock_event_device *);

	void			(*broadcast)(const struct cpumask *mask);
	void			(*suspend)(struct clock_event_device *);
	void			(*resume)(struct clock_event_device *);
	unsigned long		min_delta_ticks;
	unsigned long		max_delta_ticks;

	const char		*name;
	int			rating;
	int			irq;
	int			bound_on;
	const struct cpumask	*cpumask;
	struct list_head	list;
	struct module		*owner;
} ____cacheline_aligned;

struct sys_timer {
	void			(*init)(void);
	void			(*suspend)(void);
	void			(*resume)(void);
#ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
	unsigned long		(*offset)(void);
#endif
};

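1. Difference between clock_event_device and clocksource: a clocksource cannot be programmed and cannot generate events; it is mainly used by the timekeeper to keep accurate track of real time. A clock_event_device is programmable: it can work in periodic or oneshot mode, and the system programs it to set when the next event fires. clock_event_device is mainly used to implement ordinary timers and high-resolution timers, and also to generate the tick events consumed by the process scheduler.
2. tick_device is a further wrapper around clock_event_device. It replaces the original clock tick interrupt and supplies the kernel with tick events, which drive process scheduling, process accounting, load balancing and time updates.
3. The upper-layer abstraction of the clock event device is kept separate from the hardware-specific code.
4. The core data structure is clock_event_device, which represents one clock hardware device. The device behaves like a clocksource that can also trigger events (usually meaning interrupts): it counts continuously, and at the moment the count reaches the programmed value it raises a clock event interrupt, which in turn invokes the device's event handler callback to process the clock event.
5. Every machine defines its own machine_desc structure describing its most basic properties, including a pointer to a sys_timer structure that the machine-level code must define. The main job of its init callback is to perform the hardware initialization of the system's clocksource and clock_event_device, and to register the notifier chain.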
struct tick_device

struct tick_device {
	struct clock_event_device *evtdev;
	enum tick_device_mode mode;
};

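1. When the kernel is not configured with high-resolution timer support, the system tick is produced by the tick_device. A tick_device is a thin wrapper around clock_event_device: it holds a clock_event_device pointer and the device's operating mode.
2. DEFINE_PER_CPU(struct tick_device, tick_cpu_device);
3. When a clock_event_device is updated, the notifier chain calls tick_check_new_device to decide whether the device can be used for the current CPU.
4. The mode is either TICKDEV_MODE_PERIODIC or TICKDEV_MODE_ONESHOT. Only oneshot mode supports NO_HZ and HRTIMER; the mode is set according to the capabilities of the tick_device.
5. tick events are handled once tick_check_new_device has set up the device.
6. do_timer does the following: it updates jiffies, updates the wall time, and periodically updates the CPU load information (every LOAD_FREQ ticks, roughly 5 seconds, not on every tick). update_process_times then does the following: 1) updates the time accounting of the process; 2) raises the TIMER_SOFTIRQ software interrupt so the system can process the traditional low-resolution timers; 3) checks the RCU callbacks; 4) calls scheduler_tick to let the scheduler do its process accounting and scheduling work.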

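Power management in the timekeeping module (wowotech)

1. Initialization registers the callbacks through syscore_ops, a mechanism similar to a notifier chain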

static struct syscore_ops timekeeping_syscore_ops = {
    .resume        = timekeeping_resume,
    .suspend    = timekeeping_suspend,
};

static int __init timekeeping_init_ops(void)
{
    register_syscore_ops(&timekeeping_syscore_ops);
    return 0;
}

device_initcall(timekeeping_init_ops);

2. The suspend callback

static int timekeeping_suspend(void)
{
    struct timekeeper *tk = &timekeeper;
    unsigned long flags;
    struct timespec        delta, delta_delta;
    static struct timespec    old_delta;

    read_persistent_clock(&timekeeping_suspend_time); /* read the RTC time; hardware has to track the passage of time during suspend */
    if (timekeeping_suspend_time.tv_sec || timekeeping_suspend_time.tv_nsec)
        persistent_clock_exist = true;

    raw_spin_lock_irqsave(&timekeeper_lock, flags);
    write_seqcount_begin(&timekeeper_seq);
    timekeeping_forward_now(tk); /* last update of the timekeeper's system time; after this the underlying hardware counter and hardware timer stop */
    timekeeping_suspended = 1; /* mark the suspended state */

    delta = timespec_sub(tk_xtime(tk), timekeeping_suspend_time); /* offset between the current system time and the persistent clock */
    delta_delta = timespec_sub(delta, old_delta);
    if (abs(delta_delta.tv_sec)  >= 2) {
        old_delta = delta;
    } else {
        timekeeping_suspend_time =
            timespec_add(timekeeping_suspend_time, delta_delta);
    }

    timekeeping_update(tk, TK_MIRROR); /* update the shadow timekeeper, which can be thought of as a temporary copy */
    write_seqcount_end(&timekeeper_seq);
    raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

    clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL); /* notify the clock event layer */
    clocksource_suspend(); /* suspend the clocksources */
    clockevents_suspend(); /* suspend the clock event devices */

    return 0;
}

3. The resume callback

static void timekeeping_resume(void)
{
    struct timekeeper *tk = &timekeeper;
    struct clocksource *clock = tk->clock;
    unsigned long flags;
    struct timespec ts_new, ts_delta;
    cycle_t cycle_now, cycle_delta;
    bool suspendtime_found = false;

    read_persistent_clock(&ts_new); /* record the wake-up time from the persistent clock */

    clockevents_resume(); /* resume all clock event devices in the system */
    clocksource_resume(); /* resume all clocksource devices in the system */

    raw_spin_lock_irqsave(&timekeeper_lock, flags);
    write_seqcount_begin(&timekeeper_seq);

    cycle_now = clock->read(clock);
    if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
        cycle_now > clock->cycle_last) {
        /*
         * With CLOCK_SOURCE_SUSPEND_NONSTOP set, the clocksource keeps
         * counting across suspend, so this more precise clocksource is
         * used to compute the sleep time.
         */
        u64 num, max = ULLONG_MAX;
        u32 mult = clock->mult;
        u32 shift = clock->shift;
        s64 nsec = 0;

        cycle_delta = (cycle_now - clock->cycle_last) & clock->mask; /* cycles elapsed during this suspend */
        do_div(max, mult);
        if (cycle_delta > max) {
            num = div64_u64(cycle_delta, max);
            nsec = (((u64) max * mult) >> shift) * num;
            cycle_delta -= num * max;
        }
        nsec += ((u64) cycle_delta * mult) >> shift; /* convert the suspend time from cycles to ns */

        ts_delta = ns_to_timespec(nsec); /* convert the suspend time from ns to a timespec */
        suspendtime_found = true;
    } else if (timespec_compare(&ts_new, &timekeeping_suspend_time) > 0) { /* otherwise fall back to the persistent clock */
        ts_delta = timespec_sub(ts_new, timekeeping_suspend_time);
        suspendtime_found = true;
    }

    if (suspendtime_found)
        __timekeeping_inject_sleeptime(tk, &ts_delta); /* add the sleep time to the timekeeper */

    tk->cycle_last = clock->cycle_last = cycle_now; /* update the last cycle value */
    tk->ntp_error = 0;
    timekeeping_suspended = 0; /* mark the suspend/resume cycle as complete */
    timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET); /* update the shadow timekeeper */
    write_seqcount_end(&timekeeper_seq);
    raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

    touch_softlockup_watchdog();

    clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL); /* tell the clock event layer about the resume */
    hrtimers_resume(); /* high-resolution timer handling, described elsewhere */
}
// Update the timekeeper
static void __timekeeping_inject_sleeptime(struct timekeeper *tk,  struct timespec *delta)
{
    tk_xtime_add(tk, delta); /* add the suspend time to the real time clock */
    tk_set_wall_to_mono(tk, timespec_sub(tk->wall_to_monotonic, *delta));
    tk_set_sleep_time(tk, timespec_add(tk->total_sleep_time, *delta));
    tk_debug_account_sleep_time(delta);
}
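The effect of __timekeeping_inject_sleeptime is that wall time jumps forward by the sleep duration while monotonic time stays where it was, because wall_to_monotonic moves back by the same amount. The user-space sketch below reproduces that bookkeeping with a simplified timespec; every `sk_` name is hypothetical, and only the three fields touched by the function above are modelled.

```c
#include <stdint.h>

#define SK_NSEC_PER_SEC 1000000000LL

struct sk_timespec {
    int64_t sec;
    int64_t nsec;  /* always kept in [0, SK_NSEC_PER_SEC) */
};

/* Bring nsec back into range, carrying into or borrowing from sec. */
static struct sk_timespec sk_normalize(int64_t sec, int64_t nsec)
{
    struct sk_timespec r;

    while (nsec >= SK_NSEC_PER_SEC) { nsec -= SK_NSEC_PER_SEC; sec++; }
    while (nsec < 0) { nsec += SK_NSEC_PER_SEC; sec--; }
    r.sec = sec;
    r.nsec = nsec;
    return r;
}

static struct sk_timespec sk_add(struct sk_timespec a, struct sk_timespec b)
{
    return sk_normalize(a.sec + b.sec, a.nsec + b.nsec);
}

static struct sk_timespec sk_sub(struct sk_timespec a, struct sk_timespec b)
{
    return sk_normalize(a.sec - b.sec, a.nsec - b.nsec);
}

struct sk_timekeeper {
    struct sk_timespec xtime;              /* CLOCK_REALTIME */
    struct sk_timespec wall_to_monotonic;  /* added to xtime to get CLOCK_MONOTONIC */
    struct sk_timespec total_sleep_time;   /* accumulated time spent suspended */
};

/* Account for `delta` of sleep: realtime advances, monotonic does not. */
static void sk_inject_sleeptime(struct sk_timekeeper *tk, struct sk_timespec delta)
{
    tk->xtime = sk_add(tk->xtime, delta);
    tk->wall_to_monotonic = sk_sub(tk->wall_to_monotonic, delta);
    tk->total_sleep_time = sk_add(tk->total_sleep_time, delta);
}

static struct sk_timespec sk_monotonic(const struct sk_timekeeper *tk)
{
    return sk_add(tk->xtime, tk->wall_to_monotonic);
}
```

Reading `sk_monotonic()` before and after an injection gives the same value, which is the property described earlier: monotonic time does not count time spent in suspend.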