Scheduler 23: EAS


    Based on Linux-5.10

    I. EAS Overview

    EAS acts in the CPU scheduler when selecting a cpu for a task, aiming to save as much power as possible while still guaranteeing performance. It is built on the Energy Model (EM) framework, a generic interface module that connects drivers supporting different perf levels to any other module in the system that wants to be aware of energy consumption. EAS, i.e. the pairing of the CPU scheduler with the CPU frequency driver, is a typical example: the scheduler wants to perceive the energy cost of the underlying CPUs so it can make better core-selection decisions. For CPU devices, each cluster has its own independent frequency scaling, and all CPUs within a cluster run at the same frequency (Qualcomm has modified this so that each CPU can have its own frequency point). Each cluster therefore forms one performance domain, and through the EM framework interface the scheduler can obtain the CPU energy cost at each performance level.
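    As a quick illustration of that interface, here is a hedged sketch of an energy-aware client querying the EM framework. dump_pd_of_cpu() is a hypothetical helper, while em_cpu_get() and the em_perf_domain fields are the actual 5.10 interfaces; the debug module in section VIII dumps the same data via root_domain instead:

    #include <linux/energy_model.h>
    #include <linux/printk.h>

    static void dump_pd_of_cpu(int cpu) /* hypothetical helper */
    {
        struct em_perf_domain *em_pd = em_cpu_get(cpu);
        int i;

        if (!em_pd)
            return; /* no energy model registered for this cpu */

        /* one line per perf level: frequency (KHz), power (mW), precomputed cost */
        for (i = 0; i < em_pd->nr_perf_states; i++)
            pr_info("freq=%lu power=%lu cost=%lu\n",
                    em_pd->table[i].frequency,
                    em_pd->table[i].power,
                    em_pd->table[i].cost);
    }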

    II. Related Data Structures

    1. struct perf_domain

    struct perf_domain {
        struct em_perf_domain *em_pd;
        struct perf_domain *next; //links the pds into a singly linked list
        struct rcu_head rcu; //RCU head protecting this list
    };

    The perf_domain structure represents one CPU performance domain; every performance domain is abstracted by a perf_domain. perf_domain corresponds one-to-one with a cpufreq policy, so on a 4+3+1 platform the system has three perf domains in total, linked into a list whose head lives in the global root_domain. The related root_domain members are listed here as well:

    struct root_domain {
        ...
        int        overload; //whether this root domain, i.e. the system, is in the overload state
        int        overutilized; //whether this root domain, i.e. the system, is in the overutilized state
        unsigned long    max_cpu_capacity; //capacity of the highest-capacity cpu in the system
        struct perf_domain __rcu *pd; //head of the perf_domain singly linked list
    };

    To clarify the two concepts overload and overutilized: on top of the per-cpu overload/overutilized states, overload and overutilized are additionally defined for the root domain, i.e. the whole system.

    (1) A CPU is in the overload state when its rq holds two or more tasks, or holds just one task that is a misfit task.
    (2) A CPU is in the overutilized state when its utility exceeds its capacity (by default 20% of the capacity is held in reserve; the capacity here is the part available to cfs tasks).
    (3) For the root domain, overload means at least one cpu is overloaded; overutilized means at least one cpu is overutilized.

    The overutilized state is very important: it decides whether the scheduler enables EAS at all, and EAS takes effect only while the system is not overutilized. overload governs the pacing of newidle balance: newidle balance only runs while the system is overloaded. The per-cpu check is sketched below.
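    For reference, the per-cpu check behind the 20% reserve in (2) boils down to the following in kernel/sched/fair.c (5.10); the 1280/1024 ratio is where the margin comes from:

    //kernel/sched/fair.c (5.10): util "fits" only while it stays below
    //max * 1024/1280, i.e. 80% of the capacity, leaving a 20% reserve
    #define fits_capacity(cap, max)    ((cap) * 1280 < (max) * 1024)

    static inline bool cpu_overutilized(int cpu)
    {
        return !fits_capacity(cpu_util(cpu), capacity_of(cpu));
    }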

    2. struct em_perf_domain

    struct em_perf_domain {
        struct em_perf_state *table; //performance states; frequencies must be in ascending order, em_cpu_energy() depends on this
        int nr_perf_states; //number of entries in table
        int milliwatts; //flag indicating whether power values are in milliwatts or some other scale
        unsigned long cpus[]; //which cpus this performance domain contains
    };

    This structure is stored in the cpu's struct device, which makes it effectively a per-cpu structure! Within the EM framework, em_perf_domain is the abstraction of one performance domain.

    3. struct em_perf_state

    struct em_perf_state {
        unsigned long frequency; //in KHz, consistent with CPUFreq
        unsigned long power; //in milliwatts; the power at this frequency point, possibly the total: static + dynamic
        unsigned long cost; //cost coefficient of this frequency point, used in the energy computation; equals power * max_frequency / frequency
    };

    Each performance domain has several perf levels, each with a different energy cost; struct em_perf_state describes the energy information of one perf level.
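    The cost member is filled in once when the EM is registered; the loop in em_create_perf_table() (kernel/power/energy_model.c, 5.10) does roughly the following:

    /* Sketch of the cost precomputation: fmax is the highest frequency of the
     * domain, so cost scales each state's power up to a common reference. */
    fmax = (u64) table[nr_states - 1].frequency;
    for (i = 0; i < nr_states; i++) {
        table[i].cost = div64_u64(fmax * table[i].power,
                                  table[i].frequency);
    }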

    III. Energy Calculation

    1. Overview of the energy computation

    The basic formula: energy = power × time

    For a CPU the formula needs refining (the energy consumed in the idle state is omitted): CPU energy at a frequency point = CPU power at that frequency point × CPU running time at that frequency point.

    The EM records the CPU power at every frequency point in em_perf_state::power, precomputed by the SoC vendor. The running time is represented by the cpu utility. One inconvenience is that CPU utility is normalized to 1024, losing the absolute running time at a given frequency point, but it can be converted back: running time at a frequency point = cpu_util / cpu_current_capacity. Note that the energy is computed only for comparison, so the period is dropped.

    The capacity of a CPU at a given perf-state (i.e. frequency point): ps->cap = scale_cpu * (ps->freq / cpu_max_freq) ----(1). scale_cpu is the cpu's capacity scaled to 1024 at its maximum frequency.

    Ignoring idle-state power, the estimated CPU energy at a perf-state: cpu_nrg = ps->power * (cpu_util / ps->cap) ----(2)

    Substituting (1) into (2): cpu_nrg = (ps->power * cpu_max_freq / ps->freq) * (cpu_util / scale_cpu) ----(3)

    The first factor is a constant, stored in the cost member of em_perf_state, so the estimated CPU energy at a perf-state becomes: cpu_nrg = ps->cost * cpu_util / scale_cpu ----(4)

    Since all CPUs in a perf domain share the same micro-architecture, their scale_cpu values are identical and so are their costs; factoring them out gives the energy formula of the whole perf domain (cpu cluster): pd_nrg = ps->cost * \Sum cpu_util / scale_cpu ----(5)
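    A toy user-space check of formula (5), with made-up numbers (cost, scale_cpu and the util values are illustrative only, not taken from any real platform):

    #include <stdio.h>

    int main(void)
    {
        unsigned long cost = 243000;      /* hypothetical ps->cost at the chosen frequency */
        unsigned long scale_cpu = 512;    /* little-cpu capacity scaled against 1024 */
        unsigned long cpu_util[] = {100, 80, 60, 40}; /* util of the 4 cpus in the pd */
        unsigned long sum_util = 0, pd_nrg;
        int i;

        for (i = 0; i < 4; i++)
            sum_util += cpu_util[i];

        /* formula (5): pd_nrg = ps->cost * \Sum cpu_util / scale_cpu */
        pd_nrg = cost * sum_util / scale_cpu;
        printf("pd energy estimate = %lu (abstract units)\n", pd_nrg);
        return 0;
    }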

    IV. Building the Energy Model

    1. Building the perf domains

    During CPU topology initialization, build_perf_domains() creates each perf domain and attaches them to the root domain as its perf domain list.

    (1) Implementation

    //kernel/sched/topology.c
    static bool build_perf_domains(const struct cpumask *cpu_map)
    {
        int i, nr_pd = 0, nr_ps = 0, nr_cpus = cpumask_weight(cpu_map);
        struct perf_domain *pd = NULL, *tmp;
        int cpu = cpumask_first(cpu_map);
        struct root_domain *rd = cpu_rq(cpu)->rd;
        bool eas_check = false;
    
        if (!sysctl_sched_energy_aware) //if EAS is not enabled, skip the build entirely and bail out
            goto free;
        ...
        for_each_cpu(i, cpu_map) {
            /* Skip already covered CPUs. */
            if (find_pd(pd, i)) //skip cpus already covered by some pd; only the first cpu of each cluster gets past here
                continue;
    
            /* Create the new pd and add it to the local list. */
            tmp = pd_init(i);
            tmp->next = pd; //the list ends up with one element per cluster (find_pd() above skips covered cpus)
            pd = tmp; //head insertion: the last-probed cluster sits at the head, pd points at the head
    
            /* Count performance domains and performance states for the complexity check. */
            nr_pd++;
            //total number of perf states across all pds
            nr_ps += em_pd_nr_perf_states(pd->em_pd); //return pd->nr_perf_states;
        }
    
        /* Bail out if the Energy Model complexity is too high. */
        if (nr_pd * (nr_ps + nr_cpus) > EM_MAX_COMPLEXITY) { //2048: the energy model must not be too complex
            WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n", cpumask_pr_args(cpu_map));
            goto free;
        }
    
        //print the whole pd list; only emitted with debug enabled
        perf_domain_debug(cpu_map, pd);
    
        /* Attach the new list of performance domains to the root domain. */
        tmp = rd->pd;
        rcu_assign_pointer(rd->pd, pd); //the global root_domain::pd now points at the head of the perf_domain list
        if (tmp)
            call_rcu(&tmp->rcu, destroy_perf_domain_rcu); //RCU update: root_domain::pd points at the new list, the old one is destroyed
        
        pr_info("nr_pd = %d\n", nr_pd); //cpu7没有isolate就是3,否则就是2
    
        return !!pd;
    
    free:
        free_pd(pd);
        tmp = rd->pd;
        rcu_assign_pointer(rd->pd, NULL);
        if (tmp)
            call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
    
        return false;
    }
    
    static struct perf_domain *pd_init(int cpu)
    {
        struct em_perf_domain *obj = em_cpu_get(cpu);
        struct perf_domain *pd = kzalloc(sizeof(*pd), GFP_KERNEL);
        pd->em_pd = obj; //just store the pointer
        return pd;
    }
    //kernel/power/energy_model.c
    struct em_perf_domain *em_cpu_get(int cpu)
    {
        struct device *cpu_dev = get_cpu_device(cpu); //return per_cpu(cpu_sys_devices, cpu);
        return em_pd_get(cpu_dev); //return dev->em_pd, stored directly in the cpu's device structure
    }

    The perf-domain list hanging off root_domain's pd member is built by head insertion, so the pds appear in the order root_domain->pd --> cluster2's pd --> cluster1's pd --> cluster0's pd. A pd is per-cluster, not per-cpu. A cluster's pd is removed from the list only when all of that cluster's cpus are isolated. A sketch of walking this list follows.
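    A minimal sketch of how a reader safely walks this list (walk_perf_domains() is a hypothetical helper; find_energy_efficient_cpu() below uses the same rcu_dereference() pattern):

    static void walk_perf_domains(void) /* hypothetical helper */
    {
        struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
        struct perf_domain *pd;

        rcu_read_lock();
        //one iteration per cluster, newest-probed cluster first
        for (pd = rcu_dereference(rd->pd); pd; pd = pd->next) {
            /* e.g. inspect pd->em_pd here */
        }
        rcu_read_unlock();
    }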

    (2) Call paths:

    init_cpu_capacity_callback //arch_topology.c, runs when cpu capacities are updated at init
        schedule_work(&update_topology_flags_work);
            update_topology_flags_workfn //arch_topology.c
                cpuset_hotplug_workfn //cpuset.c, common tail below
    //handler for /proc/sys/kernel/sched_energy_aware
    sched_energy_aware_handler //topology.c
        rebuild_sched_domains //cpuset.c
    //cpu.c: .startup.single callback of "sched:active" in cpuhp_hp_states[]
    resume_cpus //cpu.c; pause_cpus also takes this path when it fails
        sched_cpus_activate //core.c
            sched_cpu_activate //core.c
                cpuset_cpu_active //core.c
    //cpu.c: .teardown.single callback of "sched:active" in cpuhp_hp_states[]
    pause_cpus //cpu.c
        sched_cpus_deactivate_nosync //core.c
            _sched_cpu_deactivate //core.c
                cpuset_cpu_inactive
                    cpuset_update_active_cpus //cpuset.c
    cpuset_track_online_nodes_nb.notifier_call //cpu.c
        cpuset_track_online_nodes //cpuset.c
            schedule_work(&cpuset_hotplug_work);
    resume_cpus //cpu.c
        cpuset_update_active_cpus_affine //cpuset.c
            schedule_work_on(cpu, &cpuset_hotplug_work); //queued on the specified cpu
    //write handler of the /dev/cpuset/[<group>/]cpus and mems files
    cpuset_write_resmask //cpuset.c
        flush_work(&cpuset_hotplug_work); //only flushes work already queued
    //common tail:
    cpuset_hotplug_workfn //workqueue handler
        rebuild_sched_domains_locked
            partition_sched_domains_locked
                build_perf_domains //always called with cpu0-6 or cpu0-7, never cluster by cluster

    The portion below cpuset_hotplug_workfn was captured by adding dump_stack(); kernel boot, online/offline and isolate/unisolate all follow the same path. The portion above it was worked out by reading the code.

    Every cpu online/offline and isolate/unisolate triggers the domain rebuild flow.

    Writes to the cpus file of a cgroup cpuset group do not trigger the rebuild flow.

    em_pd->cpus merely records which cpus a cluster contains; isolate/unisolate and online/offline of cpus do not change its value. See the macro below.
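    This is visible in the perf_domain_span() helper that the selection code in section VI relies on; in kernel/sched/sched.h (5.10) it resolves straight to em_pd->cpus:

    //kernel/sched/sched.h (5.10): the span of a pd is just em_pd->cpus, which is
    //why find_energy_efficient_cpu() ANDs it with sched_domain_span(sd) to drop
    //cpus outside the current search scope
    #define perf_domain_span(pd) (to_cpumask(((pd)->em_pd->cpus)))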

    V. The Global EAS Control Switch

    The global sysctl control variable is sysctl_sched_energy_aware; its control file is /proc/sys/kernel/sched_energy_aware.

    //kernel/sched/topology.c
    int sched_energy_aware_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos)
    {
        int ret, state;
    
        if (write && !capable(CAP_SYS_ADMIN))
            return -EPERM;
    
        ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
        if (!ret && write) {
            state = static_branch_unlikely(&sched_energy_present);
            if (state != sysctl_sched_energy_aware) {
                mutex_lock(&sched_energy_mutex);
                sched_energy_update = 1; //its only user is partition_sched_domains_locked
                rebuild_sched_domains();
                sched_energy_update = 0;
                mutex_unlock(&sched_energy_mutex);
            }
        }
    
        return ret;
    }

    Updating sched_energy_present:

    /*
     * kernel/sched/topology.c
     * partition_sched_domains_locked --> sched_energy_set
     */
    static void sched_energy_set(bool has_eas)
    {
        if (!has_eas && static_branch_unlikely(&sched_energy_present)) {
            static_branch_disable_cpuslocked(&sched_energy_present);
        } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
            static_branch_enable_cpuslocked(&sched_energy_present);
        }
    }

    sched_energy_enabled() tests this static key. It has two call sites:
    (1) In the load-balancing path, find_busiest_group() aborts the balance when EAS is enabled and the system is not overutilized.
    (2) In the task-placement path, select_task_rq_fair() calls find_energy_efficient_cpu() for EAS core selection only when EAS is enabled.

    VI. Where EAS Acts: EAS Core Selection

    1. Conditions for entering EAS core selection on wakeup

    A task in the blocked state is woken when an asynchronous event or another thread calls try_to_wake_up(); task placement then runs, i.e. a cpu is selected for the woken task. With EAS enabled, EAS selection is tried first. Of course, EAS only engages under light load (the system is not overutilized); under heavy load (even one cpu overutilized) the traditional kernel selection algorithm is used instead.

    static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
    {
        int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
        ...
        trace_android_rvh_select_task_rq_fair(p, prev_cpu, sd_flag, wake_flags, &target_cpu);
        if (target_cpu >= 0)
            return target_cpu;
        ...
        //only the wakeup path can take the EAS selection path
        if (sd_flag & SD_BALANCE_WAKE) {
            //the global sysctl controls whether EAS is on
            if (sched_energy_enabled()) {
                new_cpu = find_energy_efficient_cpu(p, prev_cpu, sync);
                if (new_cpu >= 0)
                    return new_cpu; //as soon as EAS picks a cpu, its result is used
            }
        }
        ...
    }

    As the code shows, EAS is used only for wakeup; fork and exec balancing never take the EAS selection algorithm. find_energy_efficient_cpu() is the main EAS selection routine, and entering it requires two conditions: this is the wakeup path and the EAS feature is enabled. If EAS picks a suitable CPU it is returned directly; if EAS selection fails, the default cpu falls back to prev_cpu and the traditional selection path reselects.

    2. EAS core-selection details

    The details of EAS core selection live in find_energy_efficient_cpu():

    static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu, int sync)
    {
        unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
        struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
        int max_spare_cap_cpu_ls = prev_cpu, best_idle_cpu = -1;
        unsigned long max_spare_cap_ls = 0, target_cap;
        unsigned long cpu_cap, util, base_energy = 0;
        bool boosted, latency_sensitive = false;
        unsigned int min_exit_lat = UINT_MAX;
        int cpu, best_energy_cpu = prev_cpu;
        struct cpuidle_state *idle;
        struct sched_domain *sd;
        struct perf_domain *pd;
        int new_cpu = INT_MAX;
    
        //update the task's load tracking
        sync_entity_load_avg(&p->se);
        //a vendor or ODM hook may be registered here and bypass this function's logic
        trace_android_rvh_find_energy_efficient_cpu(p, prev_cpu, sync, &new_cpu); //hook
        if (new_cpu != INT_MAX)
            return new_cpu;
    
        rcu_read_lock();
        //fetch the perf_domain list from rd
        pd = rcu_dereference(rd->pd);
        if (!pd || READ_ONCE(rd->overutilized)) //if even a single cpu in the system is overutilized, abandon EAS selection
            goto fail;
    
        cpu = smp_processor_id(); //the cpu this code is currently running on
        //on a sync wakeup, if the current cpu is running only this task, the woken task is allowed on this cpu, and this cpu's capacity fits the woken task, pick the current cpu directly
        if (sync && cpu_rq(cpu)->nr_running == 1 && cpumask_test_cpu(cpu, p->cpus_ptr) && task_fits_capacity(p, capacity_of(cpu))) { //uclamp'ed util must be below 80% of the cpu's capacity
            rcu_read_unlock();
            return cpu;
        }
    
        /* Energy-aware wake-up happens on the lowest sched_domain starting from sd_asym_cpucapacity spanning over this_cpu and prev_cpu. */
        sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity)); //on a phone this returns the DIE-level sd of this cpu
        /*
         * Starting from the lowest-level sd that contains CPUs of different
         * capacities, walk upward until the sd spans both this_cpu and prev_cpu.
         * On phone platforms that is the DIE-level sd, so if prev_cpu is a valid
         * cpu id this walk is effectively redundant there. An sd with asymmetric
         * CPUs is required because EAS saves no energy on identical cpus.
         */
        while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
            sd = sd->parent;
        if (!sd)
            goto fail;

        //max(util, util_est): if task p's util is 0, goto unlock ends up returning prev_cpu
        if (!task_util_est(p))
            goto unlock;

        //whether the cgroup of the to-be-woken task p has cpu.uclamp.latency_sensitive set
        latency_sensitive = uclamp_latency_sensitive(p);
        //whether p's uclamp min, after the global and cgroup limits, is still above 0
        boosted = uclamp_boosted(p);
        target_cap = boosted ? 0 : ULONG_MAX; //initial value chosen to match the logic below

        //iterate from the big cluster down: big --> mid --> little. The order is
        //irrelevant, since the decision is made only after all pds are visited
        for (; pd; pd = pd->next) {
            //loop-local variables, fresh for every pd
            unsigned long cur_delta, spare_cap, max_spare_cap = 0;
            unsigned long base_energy_pd;
            int max_spare_cap_cpu = -1;

            /* Compute the 'base' energy of the pd, without @p */
            //energy of this pd without p, used as the baseline; note dst_cpu is -1,
            //so p's util is also subtracted from the cpu it previously ran on
            base_energy_pd = compute_energy(p, -1, pd);
            //total system energy without p
            base_energy += base_energy_pd;

            /*
             * Surprisingly there is no check for active cpus here! pd->em_pd->cpus
             * only records which cpus a cluster contains; offline cpus are removed
             * from sd->span, but isolated cpus are not. The 'base' energy computed
             * above may therefore also be off.
             */
            for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
                if (!cpumask_test_cpu(cpu, p->cpus_ptr)) //skip cpus p is not allowed to run on
                    continue;

                util = cpu_util_next(cpu, p, cpu); //util of this cpu if p were placed on it
                cpu_cap = capacity_of(cpu);
                spare_cap = cpu_cap;
                lsub_positive(&spare_cap, util); //capacity left on this cpu after placing p

                /*
                 * Skip CPUs that cannot satisfy the capacity request. IOW, placing the task there would make the CPU
                 * overutilized. Take uclamp into account to see how much capacity we can get out of the CPU; this is
                 * aligned with schedutil_cpu_util().
                 */
                //apply uclamp to util; if the clamped util no longer fits the cpu, stop probing this cpu
                util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
                if (!fits_capacity(util, cpu_cap))
                    continue;

                /* Always use prev_cpu as a candidate. */
                if (!latency_sensitive && cpu == prev_cpu) { //not latency sensitive, and this cpu is the one p last ran on
                    prev_delta = compute_energy(p, prev_cpu, pd); //energy of the whole pd with p on prev_cpu
                    prev_delta -= base_energy_pd; //energy added to the pd by placing p on prev_cpu
                    best_delta = min(best_delta, prev_delta); //the minimum is kept here as well
                }

                /*
                 * Find the CPU with the maximum spare capacity in the performance domain
                 */
                //record the cpu with the most capacity left after placing p, and that capacity
                if (spare_cap > max_spare_cap) {
                    max_spare_cap = spare_cap;
                    max_spare_cap_cpu = cpu;
                }

                if (!latency_sensitive) //not latency sensitive: done probing this cpu
                    continue;

                /*--- everything below runs only for latency-sensitive tasks ---*/
                if (idle_cpu(cpu)) {
                    cpu_cap = capacity_orig_of(cpu);
                    //if boosted, target_cap starts at 0: prefer higher-capacity CPUs
                    if (boosted && cpu_cap < target_cap)
                        continue;
                    //if not boosted, target_cap starts at ULONG_MAX: prefer lower-capacity CPUs
                    if (!boosted && cpu_cap > target_cap)
                        continue;
                    idle = idle_get_state(cpu_rq(cpu)); //return rq->idle_state;
                    //with equal capacity, pick the smallest idle exit latency; using ">=" on
                    //exit_latency would favor starting from the first CPU of the cluster
                    if (idle && idle->exit_latency > min_exit_lat && cpu_cap == target_cap)
                        continue;
                    if (idle) //the NULL test only avoids a crash; this records the chosen idle
                              //cpu's exit latency, not a true minimum
                        min_exit_lat = idle->exit_latency;
                    target_cap = cpu_cap; //capacity of the chosen idle cpu
                    best_idle_cpu = cpu; //record the currently best idle cpu
                } else if (spare_cap > max_spare_cap_ls) { //latency sensitive, non-idle cpu
                    max_spare_cap_ls = spare_cap; //most spare capacity seen so far
                    max_spare_cap_cpu_ls = cpu; //and the cpu holding it
                }
            }

            /*--- processing after all cpus of one cluster have been visited ---*/
            /* Evaluate the energy impact of using this CPU. */
            if (!latency_sensitive && max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
                //energy delta of placing p on this cluster's max-spare-capacity cpu;
                //compare against the candidates of the other clusters and keep the smaller
                cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
                cur_delta -= base_energy_pd;
                if (cur_delta < best_delta) {
                    best_delta = cur_delta;
                    best_energy_cpu = max_spare_cap_cpu;
                }
            }
        }

        //the traversal is complete:
    unlock:
        rcu_read_unlock();

        if (latency_sensitive)
            return best_idle_cpu >= 0 ? best_idle_cpu : max_spare_cap_cpu_ls;

        /*
         * Pick the best CPU if prev_cpu cannot be used, or if it saves at least 6% of the energy used by prev_cpu.
         */
        if (prev_delta == ULONG_MAX)
            return best_energy_cpu;
        //the delta of prev_cpu minus the best delta must exceed 6.25% of the total
        //energy with p on prev_cpu for best_energy_cpu to be picked
        if ((prev_delta - best_delta) > ((prev_delta + base_energy) >> 4))
            return best_energy_cpu;

        return prev_cpu;

    fail:
        rcu_read_unlock();
        return -1;
    }

    Under the stock logic a full traversal performs only about 2 * nr_cluster + 1 energy computations: the baseline energy of each pd without task p, the energy with p on prev_cpu, and, for each cluster, the energy with p on that cluster's max-spare-capacity cpu. The core comparison, in the non-latency_sensitive case, is to place task p on the max-spare-capacity cpu of every cluster and pick, among those candidates, the one with the smallest energy increase.

    The selection logic of this function, summarized:

    (1) If task p is latency_sensitive: return best_idle_cpu if it exists, otherwise return the cpu with the most spare capacity. best_idle_cpu is filtered by:
    a. It must be an idle cpu.
    b. If p's uclamp min is nonzero it is considered boosted and higher-capacity CPUs are preferred; otherwise lower-capacity CPUs are preferred.
    c. Among CPUs of equal capacity, the one with the shorter exit latency, i.e. the shallower idle state, wins.
    The max-spare-capacity cpu is filtered by:
    a. It must be a non-idle cpu.
    b. Among those, it is the cpu with the most capacity left after p is placed on it.

    (2) If task p is not latency_sensitive and prev_cpu is unusable (p's affinity excludes prev_cpu, or prev_cpu's remaining capacity can no longer hold p), return best_energy_cpu directly.
    best_energy_cpu is chosen as follows:
    a. It defaults to prev_cpu.
    b. The max-spare-capacity CPUs of the clusters compete: whichever shows the smaller pd energy increase after task p is placed on it wins.

    (3) If task p is not latency_sensitive and prev_cpu is usable, and the energy delta of prev_cpu minus that of best_energy_cpu is at most 6.25% of the total energy with p on prev_cpu, pick prev_cpu: the energy saving is marginal, and staying on prev_cpu reduces cache misses. A possible optimization would be to additionally require cache_hot() before preferring prev_cpu.

    The compute_energy() function used above:

    /*
     * Purpose: compute the energy of the whole pd, i.e. this cluster, after task p
     * migrates to dst_cpu. Passing dst_cpu == -1 means p runs on none of the pd's
     * cpus, which yields the pd's base energy.
     */
    static long compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
    {
        struct cpumask *pd_mask = perf_domain_span(pd);
        unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask)); //return per_cpu(cpu_scale, cpu); the capacity of this pd's cpus
        unsigned long max_util = 0, sum_util = 0;
        unsigned long energy = 0;
        int cpu;
    
        //run for every online cpu of this pd
        for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
            //util of each cpu of this pd after p moves to dst_cpu
            unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
            struct task_struct *tsk = cpu == dst_cpu ? p : NULL; //note the argument: this can be NULL throughout
    
            //sum of the cpu capacity consumed by cfs+irq+rt+dl; note ENERGY_UTIL is passed here
            sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap, ENERGY_UTIL, NULL);
            //this computation honors uclamp (which usually clamps util upward); dl util is also computed differently here, from bandwidth
            cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap, FREQUENCY_UTIL, tsk); //whether tsk is NULL only affects the clamp range
            //maximum of cpu_util across all cpus of this pd
            max_util = max(max_util, cpu_util);
        }
        energy = em_cpu_energy(pd->em_pd, max_util, sum_util); //returns the energy of the whole pd
    
        return energy;
    }

    em_cpu_energy() computes the energy from the summed util of all cpus in the cluster, while the util of the busiest cpu is what drives the cluster's frequency selection.

    /*
     * em_cpu_energy() - Estimates the energy consumed by the CPUs of a performance domain
     * @pd         : performance domain for which energy has to be estimated
     * @max_util : highest utilization among CPUs of the domain
     * @sum_util : sum of the utilization of all CPUs in the domain
     */
    /*
     * Purpose: compute the pd's energy. max_util selects the frequency of this
     * cluster; sum_util yields the energy of the cluster, i.e. the pd.
     */
    static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, unsigned long max_util, unsigned long sum_util)
    {
        unsigned long freq, scale_cpu;
        struct em_perf_state *ps;
        int i, cpu;
    
        if (!sum_util)
            return 0;
    
        cpu = cpumask_first(to_cpumask(pd->cpus));
        scale_cpu = arch_scale_cpu_capacity(cpu); //capacity of the cpus in this pd
        ps = &pd->table[pd->nr_perf_states - 1]; //entries are ascending, so this is the highest perf-state
        freq = map_util_freq(max_util, ps->frequency, scale_cpu); //return (freq + (freq >> 2)) * util / cap = 1.25 * max_freq * util / cap
    
        /*
         * Find the lowest performance state of the Energy Model above the requested frequency.
         */
        //pick the em_perf_state whose frequency is the lowest one >= the computed freq
        for (i = 0; i < pd->nr_perf_states; i++) {
            ps = &pd->table[i];
            if (ps->frequency >= freq)
                break;
        }
    
        /*
         * The capacity of a CPU in the domain at the performance state (ps)
         * can be computed as:
         *
         *             ps->freq * scale_cpu
         *   ps->cap = --------------------                          (1)
         *                 cpu_max_freq
         *
         * So, ignoring the costs of idle states (which are not available in
         * the EM), the energy consumed by this CPU at that performance state
         * is estimated as:
         *
         *             ps->power * cpu_util
         *   cpu_nrg = --------------------                          (2)
         *                   ps->cap
         *
         * since 'cpu_util / ps->cap' represents its percentage of busy time.
         *
         *   NOTE: Although the result of this computation actually is in
         *         units of power, it can be manipulated as an energy value
         *         over a scheduling period, since it is assumed to be
         *         constant during that interval.
         *
         * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
         * of two terms:
         *
         *             ps->power * cpu_max_freq   cpu_util
         *   cpu_nrg = ------------------------ * ---------          (3)
         *                    ps->freq            scale_cpu
         *
         * The first term is static, and is stored in the em_perf_state struct
         * as 'ps->cost'.
         *
         * Since all CPUs of the domain have the same micro-architecture, they
         * share the same 'ps->cost', and the same CPU capacity. Hence, the
         * total energy of the domain (which is the simple sum of the energy of
         * all of its CPUs) can be factorized as:
         *
         *            ps->cost * \Sum cpu_util
         *   pd_nrg = ------------------------                       (4)
         *                  scale_cpu
         */
        return ps->cost * sum_util / scale_cpu; //formula (5) from earlier: the energy of the whole pd
    }

    cpu_util_next() computes the util a given cpu of the pd would have after task p is placed on dst_cpu; iterating over every cpu of the pd yields sum_util, from which the pd's energy is computed.

    /*
     * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued) to @dst_cpu.
     * Purpose: predict the util on @cpu if task p were migrated to @dst_cpu.
     */
    //compute_energy passes (cpu, p, -1): cpu is some cpu of the pd; note dst_cpu is -1, so util can only be subtracted, never added
    static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
    {
        struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
        unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg); //the cfs_rq's util; nothing is written back
    
        /*
         * If @p migrates from @cpu to another, remove its contribution. Or,
         * if @p migrates from another CPU to @cpu, add its contribution. In
         * the other cases, @cpu is not impacted by the migration, so the
         * util_avg should already be correct.
         */
        //this cpu is the one p previously ran on but not the one p is moving to
        if (task_cpu(p) == cpu && dst_cpu != cpu)
            sub_positive(&util, task_util(p)); //subtract p's util from the cfs_rq util
        //this cpu is not the one p previously ran on but is the one p is moving to
        else if (task_cpu(p) != cpu && dst_cpu == cpu)
            util += task_util(p); //add p's util to the cfs_rq util
    
        if (sched_feat(UTIL_EST)) {
            util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
    
            /*
             * During wake-up, the task isn't enqueued yet and doesn't
             * appear in the cfs_rq->avg.util_est.enqueued of any rq,
             * so just add it (if needed) to "simulate" what will be
             * cpu_util() after the task has been enqueued.
             */
            //if this cpu is the one task p will run on
            if (dst_cpu == cpu)
                util_est += _task_util_est(p);
    
            util = max(util, util_est);
        }
    
        //return the adjusted cfs_rq util
        return min(util, capacity_orig_of(cpu));
    }
    
    
    /*
     * For a non-dst cpu, compute_energy passes (cpu, util_cfs, cpu_cap, ENERGY_UTIL, NULL):
     * cpu is some cpu of the pd, util_cfs is that cpu's util after p migrates to
     * dst_cpu, and cpu_cap is the capacity of a single cpu of the pd.
     *
     * For the dst cpu, compute_energy passes (cpu, util_cfs, cpu_cap, FREQUENCY_UTIL, tsk).
     *
     * To keep this short, many comments have been stripped from the function below.
     */
    /*
     * This function computes an effective utilization for the given CPU, to be
     * used for frequency selection given the linear relation: f = u * f_max.
     */
    //Purpose: compute the effective util of the given cpu
    unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs, unsigned long max, enum schedutil_type type, struct task_struct *p)
    {
        unsigned long dl_util, util, irq;
        struct rq *rq = cpu_rq(cpu);
    
        if (!uclamp_is_used() && type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
            return max;
        }
    
        irq = cpu_util_irq(rq);
        if (unlikely(irq >= max))
            return max;
    
        util = util_cfs + cpu_util_rt(rq); //return rq->avg_rt.util_avg
        if (type == FREQUENCY_UTIL) //only FREQUENCY_UTIL considers uclamp; the EAS energy computation does not
            util = uclamp_rq_util_with(rq, util, p);
    
        dl_util = cpu_util_dl(rq); //return rq->avg_dl.util_avg
    
        if (util + dl_util >= max) //CFS+RT+DL already exceed the cpu's capacity
            return max;
    
        /*
         * OTOH, for energy computation we need the estimated running time, so
         * include util_dl and ignore dl_bw.
         */
        if (type == ENERGY_UTIL)
            util += dl_util;
    
        util = scale_irq_capacity(util, irq, max);
        util += irq; //util = util * (1 - irq/max) + irq
    
        if (type == FREQUENCY_UTIL)
            util += cpu_bw_dl(rq);
    
        return min(max, util); //the cpu capacity consumed by cfs+irq+rt+dl
    }

    VII. SoC Vendor Changes to the Stock Logic

    Note the hooks in the code: a vendor may register them and thereby bypass the stock EAS core-selection logic entirely.

    VIII. Related Debug Files

    1. A debug module for the perf_domain list

    /* place under kernel/sched */
    
    #define pr_fmt(fmt) "perf_domain_debug: " fmt
    
    #include <linux/fs.h>
    #include <linux/sched.h>
    #include <linux/proc_fs.h>
    #include <linux/seq_file.h>
    #include <linux/string.h>
    #include <linux/printk.h>
    #include <asm/topology.h>
    #include <linux/cpumask.h>
    #include <linux/sched/topology.h>
    #include "sched.h"
    
    
    struct perf_domain_debug_t {
        int cmd;
    };
    
    static struct perf_domain_debug_t pdd;
    
    
    static void perf_domain_debug(struct seq_file *m, struct perf_domain *pd)
    {
        int i;
        struct em_perf_domain *em_pd = pd->em_pd;
    
        seq_printf(m, "em_pd->nr_perf_states=%d, em_pd->milliwatts=%d, em_pd->cpus==%*pbl \n",
            em_pd->nr_perf_states, em_pd->milliwatts, cpumask_pr_args(to_cpumask(em_pd->cpus)));
    
        for (i = 0; i < em_pd->nr_perf_states; i++) {
            seq_printf(m, "[%d]: frequency=%lu, power=%lu, cost=%ld\n",
                    i, em_pd->table[i].frequency, em_pd->table[i].power, em_pd->table[i].cost);
        }
    
        seq_printf(m, "-------------------------------------------------------------------\n");
    }
    
    static int perf_domain_debug_show(struct seq_file *m, void *v)
    {
        struct root_domain *rd = cpu_rq(0)->rd;
        struct perf_domain *pd = rd->pd;
    
        while (pd) {
            perf_domain_debug(m, pd);
    
            pd = pd->next;
        }
    
        return 0;
    }
    
    static int perf_domain_debug_open(struct inode *inode, struct file *file)
    {
        return single_open(file, perf_domain_debug_show, NULL);
    }
    
    static ssize_t perf_domain_debug_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
    {
    
        int ret, cmd_value;
        char buffer[32] = {0};
    
        if (count >= sizeof(buffer)) {
            count = sizeof(buffer) - 1;
        }
        if (copy_from_user(buffer, buf, count)) {
            pr_info("copy_from_user failed\n");
            return -EFAULT;
        }
        ret = sscanf(buffer, "%d", &cmd_value);
        if(ret <= 0){
            pr_info("sscanf dec failed\n");
            return -EINVAL;
        }
        pr_info("cmd_value=%d\n", cmd_value);
    
        pdd.cmd = cmd_value;
    
        return count;
    }
    
    //Linux 5.10 changed file_operations to proc_ops for /proc entries
    static const struct proc_ops perf_domain_debug_fops = {
        .proc_open    = perf_domain_debug_open,
        .proc_read    = seq_read,
        .proc_write   = perf_domain_debug_write,
        .proc_lseek  = seq_lseek,
        .proc_release = single_release,
    };
    
    
    static int __init perf_domain_debug_init(void)
    {
        proc_create("perf_domain_debug", S_IRUGO | S_IWUGO, NULL, &perf_domain_debug_fops);
    
        pr_info("domain_topo_debug probed\n");
    
        return 0;
    }
    fs_initcall(perf_domain_debug_init);

    2. Test results

    # cat /proc/perf_domain_debug
    em_pd->nr_perf_states=28, em_pd->milliwatts=1, em_pd->cpus==7
    [0]: frequency=1300000, power=308, cost=722615
    [1]: frequency=1400000, power=353, cost=769035
    [2]: frequency=1500000, power=393, cost=799100
    [3]: frequency=1600000, power=444, cost=846375
    [4]: frequency=1700000, power=490, cost=879117
    [5]: frequency=1800000, power=538, cost=911611
    [6]: frequency=1900000, power=588, cost=943894
    [7]: frequency=2000000, power=651, cost=992775
    [8]: frequency=2050000, power=691, cost=1028073
    [9]: frequency=2100000, power=732, cost=1063142
    [10]: frequency=2150000, power=785, cost=1113604
    [11]: frequency=2200000, power=830, cost=1150681
    [12]: frequency=2250000, power=876, cost=1187466
    [13]: frequency=2300000, power=922, cost=1222652
    [14]: frequency=2350000, power=971, cost=1260234
    [15]: frequency=2400000, power=1020, cost=1296250
    [16]: frequency=2450000, power=1088, cost=1354448
    [17]: frequency=2500000, power=1144, cost=1395680
    [18]: frequency=2550000, power=1198, cost=1432901
    [19]: frequency=2600000, power=1239, cost=1453442
    [20]: frequency=2650000, power=1299, cost=1495075
    [21]: frequency=2700000, power=1340, cost=1513703
    [22]: frequency=2750000, power=1403, cost=1556054
    [23]: frequency=2800000, power=1448, cost=1577285
    [24]: frequency=2850000, power=1511, cost=1617035
    [25]: frequency=2900000, power=1559, cost=1639637
    [26]: frequency=3000000, power=1674, cost=1701900
    [27]: frequency=3050000, power=1746, cost=1746000
    -------------------------------------------------------------------
    em_pd->nr_perf_states=32, em_pd->milliwatts=1, em_pd->cpus==4-6
    [0]: frequency=200000, power=21, cost=299250
    [1]: frequency=300000, power=31, cost=294500
    [2]: frequency=400000, power=41, cost=292125
    [3]: frequency=500000, power=55, cost=313500
    [4]: frequency=600000, power=70, cost=332500
    [5]: frequency=700000, power=87, cost=354214
    [6]: frequency=800000, power=104, cost=370500
    [7]: frequency=900000, power=125, cost=395833
    [8]: frequency=1000000, power=145, cost=413250
    [9]: frequency=1100000, power=169, cost=437863
    [10]: frequency=1200000, power=192, cost=456000
    [11]: frequency=1300000, power=215, cost=471346
    [12]: frequency=1400000, power=245, cost=498750
    [13]: frequency=1500000, power=272, cost=516800
    [14]: frequency=1600000, power=300, cost=534375
    [15]: frequency=1700000, power=335, cost=561617
    [16]: frequency=1800000, power=379, cost=600083
    [17]: frequency=1900000, power=420, cost=630000
    [18]: frequency=2000000, power=470, cost=669750
    [19]: frequency=2050000, power=496, cost=689560
    [20]: frequency=2100000, power=523, cost=709785
    [21]: frequency=2150000, power=543, cost=719790
    [22]: frequency=2200000, power=572, cost=741000
    [23]: frequency=2250000, power=602, cost=762533
    [24]: frequency=2300000, power=623, cost=771978
    [25]: frequency=2350000, power=645, cost=782234
    [26]: frequency=2400000, power=666, cost=790875
    [27]: frequency=2450000, power=690, cost=802653
    [28]: frequency=2550000, power=736, cost=822588
    [29]: frequency=2650000, power=783, cost=842094
    [30]: frequency=2750000, power=832, cost=862254
    [31]: frequency=2850000, power=880, cost=880000
    -------------------------------------------------------------------
    em_pd->nr_perf_states=30, em_pd->milliwatts=1, em_pd->cpus==0-3
    [0]: frequency=200000, power=14, cost=126000
    [1]: frequency=250000, power=19, cost=136800
    [2]: frequency=300000, power=23, cost=138000
    [3]: frequency=350000, power=28, cost=144000
    [4]: frequency=400000, power=32, cost=144000
    [5]: frequency=450000, power=37, cost=148000
    [6]: frequency=500000, power=43, cost=154800
    [7]: frequency=550000, power=47, cost=153818
    [8]: frequency=600000, power=53, cost=159000
    [9]: frequency=650000, power=59, cost=163384
    [10]: frequency=700000, power=63, cost=162000
    [11]: frequency=750000, power=70, cost=168000
    [12]: frequency=800000, power=76, cost=171000
    [13]: frequency=850000, power=81, cost=171529
    [14]: frequency=900000, power=87, cost=174000
    [15]: frequency=950000, power=94, cost=178105
    [16]: frequency=1000000, power=99, cost=178200
    [17]: frequency=1050000, power=108, cost=185142
    [18]: frequency=1100000, power=115, cost=188181
    [19]: frequency=1150000, power=125, cost=195652
    [20]: frequency=1200000, power=132, cost=198000
    [21]: frequency=1250000, power=140, cost=201600
    [22]: frequency=1300000, power=150, cost=207692
    [23]: frequency=1350000, power=158, cost=210666
    [24]: frequency=1400000, power=166, cost=213428
    [25]: frequency=1450000, power=177, cost=219724
    [26]: frequency=1500000, power=185, cost=222000
    [27]: frequency=1600000, power=205, cost=230625
    [28]: frequency=1700000, power=222, cost=235058
    [29]: frequency=1800000, power=243, cost=243000
    -------------------------------------------------------------------

    If cpu7 is isolated, cluster3 simply has no pd; the order on the pd singly linked list is cluster3 --> cluster2 --> cluster1.

    ps->cost = ps->power * cpu_max_freq / ps->freq; for the first frequency point of the little cluster this gives 14 * 1800000 / 200000 = 126, yet the dumped cost is 126000, so evidently a factor of 1000 is also applied.

    IX. Summary

    EAS core selection is entered from the wakeup selection path, with the system not overutilized and EAS enabled. The required selection scope must contain asymmetric-capacity cpus, which on phone platforms means selecting among the sched groups of the DIE level. First the cpu with the most spare capacity in each cluster is taken as that cluster's candidate; then, for each candidate, the energy increase of its perf-domain with task p placed on it is computed, and the cpu with the smallest increase becomes best_energy_cpu. Finally best_energy_cpu competes against prev_cpu: it is chosen only if its energy saving over prev_cpu exceeds 1/16 of the total; otherwise prev_cpu is kept.

    Reference: https://blog.csdn.net/feelabclihu/article/details/122007603?spm=1001.2014.3001.5501
