The schedule() function

2008-01-12 20:36

Direct invocation

The scheduler is invoked directly when the current process must be blocked right away because the resource it needs is not available. In this case, the kernel routine that wants to block it proceeds as follows:
1. Inserts current in the proper wait queue.
2. Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE.
3. Invokes schedule( ).
4. Checks whether the resource is available; if not, goes to step 2.
5. Once the resource is available, removes current from the wait queue.
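
The five steps above can be sketched as a small user-space model. This is only an illustration, not kernel code: toy_schedule(), toy_ctx and the wakeups_until_ready counter are hypothetical stand-ins for schedule(), the current process, and the resource becoming available.

```c
#include <stdbool.h>

enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE };

struct toy_ctx {
    enum toy_state state;
    bool on_wait_queue;
    int wakeups_until_ready;  /* hypothetical: resource is free after this many wake-ups */
    int schedule_calls;       /* how many times the task gave up the CPU */
};

/* Stand-in for schedule(): the task sleeps, is woken later, and runs again. */
static void toy_schedule(struct toy_ctx *c)
{
    c->schedule_calls++;
    if (c->wakeups_until_ready > 0)
        c->wakeups_until_ready--;
    c->state = TOY_RUNNING;   /* a wake-up puts the task back in the running state */
}

static bool toy_resource_available(const struct toy_ctx *c)
{
    return c->wakeups_until_ready == 0;
}

static void toy_wait_for_resource(struct toy_ctx *c)
{
    c->on_wait_queue = true;               /* step 1: join the wait queue   */
    do {
        c->state = TOY_INTERRUPTIBLE;      /* step 2: mark as sleeping      */
        toy_schedule(c);                   /* step 3: give up the CPU       */
    } while (!toy_resource_available(c));  /* step 4: re-check, maybe loop  */
    c->on_wait_queue = false;              /* step 5: leave the wait queue  */
}
```

The state is reset inside the loop because each wake-up leaves the task in the running state, so it must be marked as sleeping again before the next call.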

Lazy invocation
The scheduler can also be invoked in a lazy way by setting the TIF_NEED_RESCHED flag of current to 1. Because a check on the value of this flag is always made before resuming the execution of a User Mode process, schedule( ) will definitely be invoked at some time in the near future.
Typical examples of lazy invocation of the scheduler are:

When current has used up its quantum of CPU time; this is done by the scheduler_tick( ) function.
When a process is woken up and its priority is higher than that of the current process; this task is performed by the try_to_wake_up( ) function.
When a sched_setscheduler( ) system call is issued.
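
The first case (quantum exhausted) can be modelled in user space as follows. This is a rough sketch under stated assumptions: toy_lazy_task, toy_scheduler_tick() and toy_return_to_user() are hypothetical names, and the boolean field only stands in for the TIF_NEED_RESCHED flag.

```c
#include <stdbool.h>

struct toy_lazy_task {
    int time_slice;     /* remaining ticks in the quantum */
    bool need_resched;  /* stand-in for TIF_NEED_RESCHED  */
};

/* Called on every timer tick: only sets the flag, never switches tasks. */
static void toy_scheduler_tick(struct toy_lazy_task *t)
{
    if (t->time_slice > 0 && --t->time_slice == 0)
        t->need_resched = true;
}

/* Called before resuming User Mode: returns true if schedule() would run. */
static bool toy_return_to_user(struct toy_lazy_task *t)
{
    if (!t->need_resched)
        return false;
    t->need_resched = false;  /* schedule() clears the flag of prev */
    return true;
}
```

The point of the split is that the tick handler stays cheap: the actual rescheduling is deferred to the next return to User Mode.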

Overview of schedule()

The act of picking the next task to run and switching to it is implemented via the schedule() function.
The schedule() function is relatively simple for all it must accomplish. The following code determines the highest priority task:

struct task_struct *prev, *next;
struct list_head *queue;
struct prio_array *array;
int idx;

prev = current;
array = rq->active;
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, struct task_struct, run_list);

    First, the active priority array is searched to find the first set bit. This bit corresponds to the highest priority task that is runnable. Next, the scheduler selects the first task in the list at that priority. This is the highest priority runnable task on the system and is the task the scheduler will run.

    If prev does not equal next, then a new task has been selected to run. The function context_switch() is called to switch from prev to next.

Implementation details of schedule()

    The goal of the schedule( ) function is to replace the currently executing process with another one. Thus, the key outcome of the function is to set a local variable called next, so that it points to the descriptor of the process selected to replace current. If no runnable process in the system has priority greater than the priority of current, at the end, next coincides with current and no process switch takes place.

asmlinkage void __sched schedule(void)
{
    long *switch_count;
    task_t *prev, *next;
    runqueue_t *rq;
    prio_array_t *array;
    struct list_head *queue;
    unsigned long long now;
    unsigned long run_time;
    int cpu, idx;

    if (likely(!current->exit_state)) {
        if (unlikely(in_atomic())) {
            printk(KERN_ERR "scheduling while atomic: "
                "%s/0x%08x/%d\n",
                current->comm, preempt_count(), current->pid);
            dump_stack();
        }
    }
    profile_hit(SCHED_PROFILING, __builtin_return_address(0));

Actions performed by schedule( ) before a process switch

Disable kernel preemption; initialize prev and rq
The schedule( ) function starts by disabling kernel preemption and initializing a few local variables:
|---------------------------------|
|need_resched:                    |
|    preempt_disable();           |
|    prev = current;              |
|    release_kernel_lock(prev);   |
|need_resched_nonpreemptible:     |
|    rq = this_rq();              |
|---------------------------------|

    if (unlikely(prev == rq->idle) && prev->state != TASK_RUNNING) {
        printk(KERN_ERR "bad: scheduling from the idle thread!\n");
        dump_stack();
    }

    schedstat_inc(rq, sched_cnt);

Compute how long prev has just run (run_time):
The run time is normally capped at 1 second (expressed in nanoseconds)
    The sched_clock( ) function is invoked to read the TSC and convert its value to nanoseconds; the timestamp obtained is saved in the now local variable. Then, schedule( ) computes the duration of the CPU time slice used by prev:
    now = sched_clock();
    run_time = now - prev->timestamp;
    if (run_time > 1000000000)
        run_time = 1000000000;
|-----------------------------------------------------------------------|
|   now = sched_clock();                                                |
|   if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {|
|       run_time = now - prev->timestamp;                               |
|       if (unlikely((long long)(now - prev->timestamp) < 0))           |
|           run_time = 0;                                               |
|   } else                                                              |
|       run_time = NS_MAX_SLEEP_AVG;                                    |
|-----------------------------------------------------------------------|
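
The clamping logic in the box above can be tried out in user space. The sketch below assumes a cap of 1 second in nanoseconds (the value used in the short version above); TOY_MAX_RUN_NS and toy_run_time() are hypothetical names, not kernel symbols.

```c
#include <stdint.h>

/* Assumed cap: 1 second expressed in nanoseconds. */
#define TOY_MAX_RUN_NS 1000000000LL

/*
 * Returns the time prev has run, clamped to [0, TOY_MAX_RUN_NS].
 * A negative difference (timestamp ahead of now) is treated as 0,
 * mirroring the (long long) cast in the kernel excerpt.
 */
static uint64_t toy_run_time(uint64_t now, uint64_t timestamp)
{
    /* The unsigned-to-signed conversion wraps on common compilers. */
    int64_t delta = (int64_t)(now - timestamp);

    if (delta < 0)
        return 0;
    if (delta >= TOY_MAX_RUN_NS)
        return (uint64_t)TOY_MAX_RUN_NS;
    return (uint64_t)delta;
}
```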

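Scale down the run time according to the previous average sleep time (CURRENT_BONUS):
In principle, the average sleep time of prev should be updated to:
    previous average sleep time - run time of this slice;
However, schedule() rewards processes whose previous average sleep time is long, that is, whose CURRENT_BONUS(prev) is large. The division below shrinks run_time, which reduces the effect of this slice on the new average sleep time: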
|--------------------------------------------|
|   run_time /= (CURRENT_BONUS(prev) ? : 1); |
|--------------------------------------------|
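
The expression `x ? : 1` is the GNU C shorthand for `x ? x : 1`, so a bonus of zero never causes a division by zero. A portable sketch of the same step, with the bonus passed in as a plain integer (toy_scale_run_time() is a hypothetical helper, and how CURRENT_BONUS is derived is not modelled here):

```c
/*
 * Divide run_time by the bonus, treating a bonus of 0 as 1.
 * Equivalent to: run_time /= (bonus ? : 1);
 */
static unsigned long toy_scale_run_time(unsigned long run_time,
                                        unsigned int bonus)
{
    return run_time / (bonus ? bonus : 1);
}
```

For example, a process with bonus 10 is charged only one tenth of the time it actually ran, while a process with bonus 0 or 1 is charged the full amount.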

Disable local interrupts; protect the runqueue with a spin lock
Before starting to look at the runnable processes, schedule( ) must disable the local interrupts and acquire the spin lock that protects the runqueue:
|------------------------------------|
|   spin_lock_irq(&rq->lock);        |
|------------------------------------|

To detect whether the current process has terminated, schedule() checks the PF_DEAD flag:
|----------------------------------------|
|   if (unlikely(prev->flags & PF_DEAD)) |
|       prev->state = EXIT_DEAD;         |
|----------------------------------------|

    switch_count = &prev->nivcsw;

If prev called schedule() to give up the CPU while waiting for some event, schedule() decides from the state of the process (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE) whether it stays in the active array or is removed from it.
    If prev is not runnable and has not been preempted in Kernel Mode, it entered a wait queue and went to sleep before calling schedule() because it was waiting for some event. If its state is TASK_INTERRUPTIBLE and a signal has arrived (the signal is not handled at this point), the process goes back to the TASK_RUNNING state and handles the signal once it is scheduled again.
    schedule( ) examines the state of prev. If it is not runnable and it has not been preempted in Kernel Mode, then it should be removed from the runqueue. However, if it has nonblocked pending signals and its state is TASK_INTERRUPTIBLE, the function sets the process state to TASK_RUNNING and leaves it in the runqueue. This action is not the same as assigning the processor to prev; it just gives prev a chance to be selected for execution:
|---------------------------------------------------------------|
|   if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {   |
|       switch_count = &prev->nvcsw;                            |
|       if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&      |
|               unlikely(signal_pending(prev))))                |
|           prev->state = TASK_RUNNING;                         |
|       else {                                                  |
|           if (prev->state == TASK_UNINTERRUPTIBLE)            |
|               rq->nr_uninterruptible++;                       |
|           deactivate_task(prev, rq);                          |
|       }                                                       |
|   }                                                           |
|---------------------------------------------------------------|
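
The three possible outcomes of the test above can be written as a small decision function. This is a user-space sketch: the TOY_ constants mirror the usual 2.6 values of the task states, and toy_prev_action() and its enum are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>

#define TOY_TASK_RUNNING         0
#define TOY_TASK_INTERRUPTIBLE   1
#define TOY_TASK_UNINTERRUPTIBLE 2

enum toy_action {
    TOY_KEEP_QUEUED,    /* runnable or preempted: stays in the runqueue */
    TOY_MAKE_RUNNABLE,  /* sleeping but signalled: back to TASK_RUNNING */
    TOY_DEACTIVATE      /* really sleeping: removed from the runqueue   */
};

static enum toy_action toy_prev_action(long state,
                                       bool preempted_in_kernel,
                                       bool signal_pending)
{
    /* Mirrors: if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) */
    if (state == TOY_TASK_RUNNING || preempted_in_kernel)
        return TOY_KEEP_QUEUED;
    if ((state & TOY_TASK_INTERRUPTIBLE) && signal_pending)
        return TOY_MAKE_RUNNABLE;
    return TOY_DEACTIVATE;
}
```

Note that a pending signal does not rescue a process in TASK_UNINTERRUPTIBLE: it is deactivated regardless.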

    cpu = smp_processor_id();

Actions performed by schedule( ) to make the process switch

Check the number of runnable processes in the runqueue and balance the load accordingly:
schedule( ) checks the number of runnable processes left in the runqueue.
    If no runnable process exists, the function invokes idle_balance( ) to move some runnable process from another runqueue to the local runqueue; idle_balance( ) is similar to load_balance( ).
    If idle_balance( ) fails to move any process into the local runqueue, schedule( ) invokes wake_sleeping_dependent( ) to reschedule runnable processes on idle CPUs (that is, on every CPU that runs the swapper process). This case typically arises when the kernel supports hyper-threading (see the dependent_sleeper( ) function). However, on uniprocessor systems, or when all attempts to move a runnable process into the local runqueue have failed, the function picks the swapper process as next and continues with the next phase.
    If there are some runnable processes, the function invokes the dependent_sleeper( ) function. In most cases, this function immediately returns zero.
|-------------------------------------------------|
|   if (unlikely(!rq->nr_running)) {              |
|go_idle:                                         |
|       idle_balance(cpu, rq);                    |
|       if (!rq->nr_running) {                    |
|           next = rq->idle;                      |
|           rq->expired_timestamp = 0;            |
|           wake_sleeping_dependent(cpu, rq);     |
|           if (!rq->nr_running)                  |
|               goto switch_tasks;                |
|       }                                         |
|-------------------------------------------------|
|   } else {                                      |
|       if (dependent_sleeper(cpu, rq)) {         |
|           next = rq->idle;                      |
|           goto switch_tasks;                    |
|       }                                         |
|       if (unlikely(!rq->nr_running))            |
|           goto go_idle;                         |
|   }                                             |
|-------------------------------------------------|

If the active array of the runqueue (runqueue.active) holds no processes, swap the active and expired arrays:
Let's suppose that the schedule( ) function has determined that the runqueue includes some runnable processes; now it has to check that at least one of these runnable processes is active. If not, the function exchanges the contents of the active and expired fields of the runqueue data structure; thus, all expired processes become active, while the empty set is ready to receive the processes that will expire in the future.
|------------------------------------------|
|   array = rq->active;                    |
|   if (unlikely(!array->nr_active)) {     |
|       schedstat_inc(rq, sched_switch);   |
|       rq->active = rq->expired;          |
|       rq->expired = array;               |
|       array = rq->active;                |
|       rq->expired_timestamp = 0;         |
|       rq->best_expired_prio = MAX_PRIO;  |
|   }                                      |
|------------------------------------------|
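
The exchange is only a swap of two pointers, so its cost does not depend on how many processes have expired. A reduced user-space sketch (toy_array, toy_rq and toy_select_array() are hypothetical names that keep just the fields needed to show the swap):

```c
struct toy_array {
    unsigned int nr_active;      /* number of processes in this array */
};

struct toy_rq {
    struct toy_array *active;    /* processes that still have time left */
    struct toy_array *expired;   /* processes that used up their quantum */
    struct toy_array arrays[2];  /* backing storage for both sets */
};

/* Returns the array to pick from, swapping the two sets if needed. */
static struct toy_array *toy_select_array(struct toy_rq *rq)
{
    struct toy_array *array = rq->active;

    if (array->nr_active == 0) {
        rq->active = rq->expired;  /* expired processes become active  */
        rq->expired = array;       /* the empty set collects new ones  */
        array = rq->active;
    }
    return array;
}
```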

Pick the highest-priority process next from the priority array:
It is time to look up a runnable process in the active prio_array_t data structure. First of all, schedule( ) searches for the first nonzero bit in the bitmask of the active set. Remember that a bit in the bitmask is set when the corresponding priority list is not empty. Thus, the index of the first nonzero bit indicates the list containing the best process to run. Then, the first process descriptor in that list is retrieved:
|-----------------------------------------------------|
|   idx = sched_find_first_bit(array->bitmap);        |
|   queue = array->queue + idx;                       |
|   next = list_entry(queue->next, task_t, run_list); |
|-----------------------------------------------------|
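
The lookup can be reproduced in user space with a bitmap plus one FIFO list per priority. The sketch below is a simplified model, not the kernel data structure: the toy_ names are hypothetical, 140 priority levels are assumed, and a plain loop stands in for sched_find_first_bit(). As in the kernel, a lower index means a higher priority.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_MAX_PRIO 140  /* assumed number of priority levels */
#define TOY_WORD_BITS (CHAR_BIT * sizeof(unsigned long))
#define TOY_WORDS ((TOY_MAX_PRIO + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

struct toy_task {
    int prio;               /* 0 .. TOY_MAX_PRIO - 1, lower is better */
    struct toy_task *next;  /* next task at the same priority */
};

struct toy_prio_array {
    unsigned long bitmap[TOY_WORDS];      /* bit set <=> list non-empty */
    struct toy_task *head[TOY_MAX_PRIO];
    struct toy_task *tail[TOY_MAX_PRIO];
};

/* Append a task to the list of its priority and mark the list non-empty. */
static void toy_enqueue(struct toy_prio_array *a, struct toy_task *t)
{
    t->next = NULL;
    if (a->tail[t->prio])
        a->tail[t->prio]->next = t;
    else
        a->head[t->prio] = t;
    a->tail[t->prio] = t;
    a->bitmap[t->prio / TOY_WORD_BITS] |= 1UL << (t->prio % TOY_WORD_BITS);
}

/* Index of the first set bit, or -1 if the array is empty. */
static int toy_find_first_bit(const struct toy_prio_array *a)
{
    for (size_t w = 0; w < TOY_WORDS; w++) {
        if (a->bitmap[w] == 0)
            continue;
        for (size_t b = 0; b < TOY_WORD_BITS; b++)
            if (a->bitmap[w] & (1UL << b))
                return (int)(w * TOY_WORD_BITS + b);
    }
    return -1;
}

/* First task of the best non-empty list, or NULL if nothing is runnable. */
static struct toy_task *toy_pick_next(const struct toy_prio_array *a)
{
    int idx = toy_find_first_bit(a);
    return idx < 0 ? NULL : a->head[idx];
}
```

Because the number of words in the bitmap is fixed, the search time does not grow with the number of runnable processes.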

Compute the average sleep time of next:
If next is a conventional process and it is being awakened from the TASK_INTERRUPTIBLE or TASK_STOPPED state, the scheduler adds to the average sleep time of the process the nanoseconds elapsed since the process was inserted into the runqueue. In other words, the increase is not simply the time spent sleeping before the wake-up: the sleep time of the process also covers the time spent in the runqueue waiting for the CPU:
|-------------------------------------------------------------------|
|   if (!rt_task(next) && next->activated > 0) {                    |
|       unsigned long long delta = now - next->timestamp;           |
|       if (unlikely((long long)(now - next->timestamp) < 0))       |
|           delta = 0;                                              |
|       if (next->activated == 1)                                   |
|           delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; |
|       array = next->array;                                        |
|       dequeue_task(next, array);                                  |
|       recalc_task_prio(next, next->timestamp + delta);            |
|       enqueue_task(next, array);                                  |
|   }                                                               |
|   next->activated = 0;                                            |
|-------------------------------------------------------------------|
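
When activated is 1, the waiting time is not credited in full but scaled by ON_RUNQUEUE_WEIGHT using integer arithmetic. A quick way to see the effect, assuming a weight of 30 (the value is an assumption here, and toy_weight_runqueue_wait() is a hypothetical helper):

```c
/* Assumed value of the weight, as a percentage. */
#define TOY_ON_RUNQUEUE_WEIGHT 30

/*
 * Same arithmetic as the kernel excerpt: the percentage is first turned
 * into a fraction of 128 (30 * 128 / 100 == 38 with integer division),
 * then applied with a multiply and a divide by 128.
 */
static unsigned long long toy_weight_runqueue_wait(unsigned long long delta)
{
    return delta * (TOY_ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
}
```

With these numbers, 1,000,000 ns spent waiting in the runqueue is credited as 296,875 ns, slightly under 30% because of the integer rounding of 38.4 down to 38.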

switch_tasks:
    if (next == rq->idle)
       schedstat_inc(rq, sched_goidle);

Access the thread_info of next
Now the schedule( ) function has determined the next process to run. In a moment, the kernel will access the thread_info data structure of next, whose address is stored close to the top of next's process descriptor (task_struct.thread_info):

|-----------------------|
|   prefetch(next);     |
|-----------------------|

Clear the TIF_NEED_RESCHED flag
Before replacing prev, the scheduler should do some administrative work:
The clear_tsk_need_resched( ) function clears the TIF_NEED_RESCHED flag of prev, just in case schedule( ) has been invoked in the lazy way. Then, the function records that the CPU is going through a quiescent state:
|-----------------------------------|
|   clear_tsk_need_resched(prev);   |
|   rcu_qsctr_inc(task_cpu(prev));  |
|-----------------------------------|

    update_cpu_clock(prev, rq, now);
Compute the average sleep time (sleep_avg) of prev
prev ran for run_time before the context switch, so its sleep_avg is decreased by run_time; the timestamp recording when the process stopped running is updated as well.
The schedule( ) function must also decrease the average sleep time of prev, charging to it the slice of CPU time used by the process:
|-------------------------------------------|
|   prev->sleep_avg -= run_time;            |
|   if ((long)prev->sleep_avg <= 0)         |
|       prev->sleep_avg = 0;                |
|   prev->timestamp = prev->last_ran = now; |
|-------------------------------------------|
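
The subtraction saturates at zero: sleep_avg never becomes negative. A portable sketch of the same step (toy_charge_run_time() is a hypothetical helper; it compares before subtracting instead of relying on the unsigned wrap-around and (long) cast used in the kernel excerpt):

```c
/*
 * Charge run_time against sleep_avg, never going below zero.
 * Gives the same result as the kernel excerpt for values that
 * fit in a signed long.
 */
static unsigned long toy_charge_run_time(unsigned long sleep_avg,
                                         unsigned long run_time)
{
    return run_time >= sleep_avg ? 0 : sleep_avg - run_time;
}
```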


    sched_info_switch(prev, next);

Perform the context switch:
At this point, prev and next are different processes, and the process switch is for real:

|-----------------------------------------------|
|   if (likely(prev != next)) {                 |
|       next->timestamp = now;                  |
|       rq->nr_switches++;                      |
|       rq->curr = next;                        |
|       ++*switch_count;                        |
|       prepare_arch_switch(rq, next);          |
|       prev = context_switch(rq, prev, next);  |
|-----------------------------------------------|

Actions performed by schedule( ) after a process switch

|---------------------------------|
|       barrier();                |
|       finish_task_switch(prev); |
|---------------------------------|

If prev and next are the same process:
It is quite possible that prev and next are the same process: this happens if no other higher or equal priority active process is present in the runqueue. In this case, the function skips the process switch:
|-----------------------------------|
|   } else                          |
|       spin_unlock_irq(&rq->lock); |
|-----------------------------------|

    prev = current;
    if (unlikely(reacquire_kernel_lock(prev) < 0))
        goto need_resched_nonpreemptible;
    preempt_enable_no_resched();
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;
}


Original article: https://www.cnblogs.com/yuzaipiaofei/p/4124200.html