Scheduler classes
Depending on the scheduling policy, the kernel implements five scheduler classes. A scheduler class can schedule a certain category of processes using one or more scheduling policies, and can also be used for special cases or for scheduling processes with special roles.
Ordered by the priority of the processes they manage, the classes are:
stop_sched_class -> dl_sched_class -> rt_sched_class -> fair_sched_class -> idle_sched_class
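For reference, here is a sketch of struct sched_class as it looks in v4.x-era kernels (most hooks omitted; details vary by version). The classes are chained through ->next in exactly the priority order above, and for_each_class() walks that chain starting from stop_sched_class:

/* Abridged sketch based on kernel/sched/sched.h (v4.x era); most hooks omitted. */
struct sched_class {
        const struct sched_class *next;         /* next lower-priority class */

        void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
        void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
        struct task_struct *(*pick_next_task)(struct rq *rq,
                                              struct task_struct *prev,
                                              struct rq_flags *rf);
        void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
        /* ... */
};

#define sched_class_highest (&stop_sched_class)
#define for_each_class(class) \
        for (class = sched_class_highest; class; class = class->next)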
Three scheduling entities
The scheduler is not limited to scheduling processes; it can also schedule larger entities, for example to implement group scheduling.
This generality requires that the scheduler not manipulate processes directly but operate on schedulable entities, so a generic data structure is needed to describe such an entity: the sched_entity structure. It actually represents a scheduling object, which can be a single process or a process group.
For the real-time and non-real-time processes that can currently be scheduled, Linux defines three kinds of scheduling entities:
- sched_dl_entity: real-time scheduling entity scheduled with the EDF algorithm
- sched_rt_entity: real-time scheduling entity scheduled with the Round-Robin or FIFO algorithm
- sched_entity: scheduling entity for ordinary non-real-time processes, scheduled with the CFS algorithm
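Each task_struct embeds all three entity types; which one the scheduler actually uses depends on the task's scheduling class. An abridged sketch (surrounding fields omitted):

struct task_struct {
        /* ... */
        const struct sched_class *sched_class;
        struct sched_entity      se;    /* CFS */
        struct sched_rt_entity   rt;    /* Round-Robin / FIFO */
        struct sched_dl_entity   dl;    /* EDF (SCHED_DEADLINE) */
        /* ... */
};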
task_tick
event_handler() -> tick_handle_periodic() -> tick_periodic() -> update_process_times() -> scheduler_tick()
void scheduler_tick(void)
{
        int cpu = smp_processor_id();
        struct rq *rq = cpu_rq(cpu);
        struct task_struct *curr = rq->curr;

        sched_clock_tick();                             //(1)

        raw_spin_lock(&rq->lock);
        update_rq_clock(rq);                            //(2)
        curr->sched_class->task_tick(rq, curr, 0);      //(3)
        update_cpu_load_active(rq);                     //(4)
        raw_spin_unlock(&rq->lock);

        perf_event_task_tick();

#ifdef CONFIG_SMP
        rq->idle_balance = idle_cpu(cpu);
        trigger_load_balance(rq);
#endif
        rq_last_tick_reset(rq);
}
The key step in this function is the call:
curr->sched_class->task_tick(rq, curr, 0);
Note the argument it passes in: the curr field of the current runqueue, i.e., the process that is currently running. This invokes the task_tick callback of the scheduling class; we will use the CFS scheduler as the main example.
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);        //(1)
        }

        if (numabalancing_enabled)
                task_tick_numa(rq, curr);

        update_rq_runnable_avg(rq, 1);                  //(2)
}
The two key steps here:
(1) Run the scheduling entity's tick handler to update statistics and vruntime.
(2) Update the runqueue's load-average statistics and load.
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);                            //(1)

        /*
         * Ensure that runnable average is periodically updated.
         */
        update_entity_load_avg(curr, 1);                //(2)
        update_cfs_rq_blocked_load(cfs_rq, 1);
        update_cfs_shares(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
        /*
         * queued ticks are scheduled to match the slice, so don't bother
         * validating it and just reschedule.
         */
        if (queued) {
                resched_curr(rq_of(cfs_rq));
                return;
        }
        /*
         * don't let the period tick interfere with the hrtick preemption
         */
        if (!sched_feat(DOUBLE_TICK) &&
            hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
                return;
#endif

        if (cfs_rq->nr_running > 1)
                check_preempt_tick(cfs_rq, curr);       //(3)
}
(1) update_curr updates the runtime bookkeeping of the current scheduling entity, both the actual execution time (exec time) and the virtual runtime (vruntime); see the sketch after this list.
(2) Update the scheduling entity's load average, which feeds into the later runqueue load calculation.
(3) check_preempt_tick decides whether a reschedule is needed in the current situation; this is the key function for the scheduling decision.
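For step (1), the essence of the vruntime update can be sketched in a few lines. This is a conceptual sketch only (the real update_curr()/calc_delta_fair() path uses fixed-point arithmetic with precomputed inverse weights), but the proportionality is the point: a nice-0 task (weight 1024) accrues vruntime at wall-clock rate, while heavier (lower-nice) tasks accrue it more slowly.

#include <stdint.h>

#define NICE_0_LOAD 1024UL

/* Conceptual sketch of the scaling calc_delta_fair() performs:
 * vruntime advances by delta_exec * NICE_0_LOAD / weight. */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, unsigned long weight)
{
        return delta_exec_ns * NICE_0_LOAD / weight;
}

This is why the leftmost (smallest-vruntime) task in the red-black tree is always the one that has received the least weighted CPU time so far.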
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
                resched_curr(rq_of(cfs_rq));
                /*
                 * The current task ran long enough, ensure it doesn't get
                 * re-elected due to buddy favours.
                 */
                clear_buddies(cfs_rq, curr);
                return;
        }

        /*
         * Ensure that a task that missed wakeup preemption by a
         * narrow margin doesn't have to wait for a full slice.
         * This also mitigates buddy induced latencies under load.
         */
        if (delta_exec < sysctl_sched_min_granularity)
                return;

        se = __pick_first_entity(cfs_rq);
        delta = curr->vruntime - se->vruntime;

        if (delta < 0)
                return;

        if (delta > ideal_runtime)
                resched_curr(rq_of(cfs_rq));
}
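The ideal_runtime above comes from sched_slice(), which splits the scheduling period among the runnable entities in proportion to their weight. As a hypothetical worked example (assuming the default sched_latency of 6 ms and ignoring the CPU-count scaling factor): with three runnable nice-0 tasks, each of weight 1024,

        ideal_runtime = 6 ms * 1024 / (3 * 1024) = 2 ms

so once a task has run more than roughly 2 ms since it was last picked, the next tick triggers resched_curr().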
resched_curr
resched_curr() is called to set the TIF_NEED_RESCHED flag on the current process.
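For reference, a lightly annotated version of resched_curr() from kernel/sched/core.c (v4.x era; details vary across versions):

void resched_curr(struct rq *rq)
{
        struct task_struct *curr = rq->curr;
        int cpu;

        lockdep_assert_held(&rq->lock);

        /* Already marked for rescheduling: nothing to do. */
        if (test_tsk_need_resched(curr))
                return;

        cpu = cpu_of(rq);

        /* Local CPU: just set TIF_NEED_RESCHED on the current task. */
        if (cpu == smp_processor_id()) {
                set_tsk_need_resched(curr);
                set_preempt_need_resched();
                return;
        }

        /* Remote CPU: set the flag and, if that CPU isn't polling the flag,
         * kick it with a reschedule IPI. */
        if (set_nr_and_not_polling(curr))
                smp_send_reschedule(cpu);
        else
                trace_sched_wake_idle_without_ipi(cpu);
}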
TIF_NEED_RESCHED scheduling points
After the TIF_NEED_RESCHED flag is set, there are only two kinds of occasions on which schedule() is actually invoked. The first is on return from a system call or an interrupt, where the TIF_NEED_RESCHED flag decides whether schedule() is called (for efficiency: while still in kernel mode, finish whatever needs handling). The second is when the current task needs to sleep for some reason; the task calls schedule() immediately after going to sleep. This case is also common in the kernel, for example in disk and NIC device drivers.
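A minimal sketch of the second case, using the classic wait-queue sleep pattern (my_wq, my_cond and wait_for_data are hypothetical names): setting the task state to TASK_INTERRUPTIBLE and then calling schedule() is exactly this kind of voluntary scheduling point.

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);  /* hypothetical wait queue */
static int my_cond;                     /* hypothetical wakeup condition */

static void wait_for_data(void)
{
        DEFINE_WAIT(wait);

        prepare_to_wait(&my_wq, &wait, TASK_INTERRUPTIBLE);
        if (!my_cond)
                schedule();     /* voluntary scheduling point: sleep here */
        finish_wait(&my_wq, &wait);
}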
When a sleeping task is woken up
When the event a sleeping task is waiting for arrives, the kernel (for example, a driver's interrupt handler) calls wake_up() on the relevant task, which eventually ends up in try_to_wake_up(). It does three things: re-add the task to the runqueue, set its state to TASK_RUNNING, and, if the woken task can preempt the currently running task, set the current task's TIF_NEED_RESCHED flag.
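Continuing the hypothetical example above, the wakeup side typically lives in an interrupt handler. wake_up() walks the wait queue and ends up in try_to_wake_up(), which performs the three steps just described:

#include <linux/interrupt.h>

static irqreturn_t my_irq_handler(int irq, void *dev)   /* hypothetical ISR */
{
        my_cond = 1;            /* the awaited event has arrived */
        wake_up(&my_wq);        /* re-enqueue the sleeper; may set TIF_NEED_RESCHED */
        return IRQ_HANDLED;
}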
A small experiment to see the Linux kernel scheduling mechanism in action
After process 0 (the idle process) finishes a series of initializations, it enters a while loop:
while (1)
        do_idle();

static void do_idle(void)
{
        ....
        schedule_idle();
        ....
}
void __sched schedule_idle(void)
{
        do {
                __schedule(false);
        } while (need_resched());
}
It then calls schedule_idle(), which voluntarily calls __schedule(false), letting the scheduler pick the next task to run.
The experiment itself is simple: comment out the schedule_idle() call in do_idle().
This way the system should never manage to call schedule(), and it should stall. The experiment confirms this: the console stops responding entirely. If the following kernel options are enabled:
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_CPU_STALL_TIMEOUT=10 (manually set to 10 s)
the system prints information like the following:
INFO: rcu_sched detected stalls on CPUs/tasks:
RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
This is because even though schedule() can no longer be called, timer interrupts keep arriving, and the system uses them for detection.
The fact that timer interrupts keep firing means scheduler_tick() is still working normally. Its job is to call the task_tick() method of the current process's scheduling class; unfortunately, in the idle scheduling class that method is an empty function:
/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
        curr->sched_class->task_tick(rq, curr, 0);
}

const struct sched_class idle_sched_class = {
        .task_tick = task_tick_idle,
};

static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
{
}
If, when the interrupt arrives, the current process is not the idle process but one that belongs to the fair scheduling class, then scheduler_tick() instead reaches the fair class's task_tick_fair():
const struct sched_class fair_sched_class = {
        .task_tick = task_tick_fair,
};

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }
}

static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        check_preempt_tick(cfs_rq, curr);
}

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        if (delta_exec > ideal_runtime) {
                resched_curr(rq_of(cfs_rq));
        }
        ...
        if (delta > ideal_runtime)
                resched_curr(rq_of(cfs_rq));
}
Roughly speaking: if the running task's virtual runtime exceeds the minimum virtual runtime on the same runqueue's red-black tree by more than ideal_runtime, or its time slice is used up, resched_curr() is called to set the TIF_NEED_RESCHED flag on the current process.
Even after the flag is set, the process is not switched out right away; it has to wait for a scheduling point. Where is that point? One arrives as soon as the interrupt completes. (Scheduling points are not expanded on here.)
ENTRY(ret_to_user_from_irq)
ldr r2, [tsk, #TI_ADDR_LIMIT]
cmp r2, #TASK_SIZE
blne addr_limit_check_failed
ldr r1, [tsk, #TI_FLAGS]
tst r1, #_TIF_WORK_MASK
bne slow_work_pending
slow_work_pending:
        mov     r0, sp                          @ 'regs'
        mov     r2, why                         @ 'syscall'
bl do_work_pending
        cmp     r0, #0
        beq     no_work_pending
        movlt   scno, #(__NR_restart_syscall - __NR_SYSCALL_BASE)
        ldmia   sp, {r0 - r6}                   @ have to reload r0 - r6
        b       local_restart                   @ ... and off we go

asmlinkage int
do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
{
        do {
                if (likely(thread_flags & _TIF_NEED_RESCHED)) {
                        schedule();
                ...
}
In this way, when the interrupt exits, the kernel checks the current process's _TIF_NEED_RESCHED flag, then finds the next suitable process and performs the process switch.
Inside schedule(), it is pick_next_task() that does the choosing:
/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
        const struct sched_class *class;
        struct task_struct *p;

        /*
         * Optimization: we know that if all tasks are in the fair class we can
         * call that function directly, but only if the @prev task wasn't of a
         * higher scheduling class, because otherwise those loose the
         * opportunity to pull in more work from other CPUs.
         */
        if (likely((prev->sched_class == &idle_sched_class ||
                    prev->sched_class == &fair_sched_class) &&
                   rq->nr_running == rq->cfs.h_nr_running)) {

                p = fair_sched_class.pick_next_task(rq, prev, rf);
                if (unlikely(p == RETRY_TASK))
                        goto again;

                /* Assumes fair_sched_class->next == idle_sched_class */
                if (unlikely(!p))
                        p = idle_sched_class.pick_next_task(rq, prev, rf);

                return p;
        }

again:
        for_each_class(class) {
                p = class->pick_next_task(rq, prev, rf);
                if (p) {
                        if (unlikely(p == RETRY_TASK))
                                goto again;
                        return p;
                }
        }

        /* The idle class should always have a runnable task: */
        BUG();
}
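To close the loop, a heavily trimmed sketch of where pick_next_task() sits inside __schedule() (based on v4.x-era kernel/sched/core.c; many steps are omitted, so treat this as orientation rather than the exact code):

static void __sched notrace __schedule(bool preempt)
{
        struct task_struct *prev, *next;
        struct rq_flags rf;
        struct rq *rq;
        ...
        next = pick_next_task(rq, prev, &rf);
        clear_tsk_need_resched(prev);   /* TIF_NEED_RESCHED has done its job */

        if (likely(prev != next)) {
                ...
                rq = context_switch(rq, prev, next, &rf);       /* switch to next */
        }
        ...
}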