• Linux source code reading (16): red-black trees in the kernel — virtual memory management


      1. Another very important kernel subsystem that relies on the fast, stable insert/delete/lookup of red-black trees is virtual memory management. Earlier installments covered the buddy and slab allocators, which manage physical pages. Historically, physical memory was far smaller than the virtual address space, and physical pages only need to be allocated, freed and coalesced, so the kernel manages them with plain linked lists rather than a tree. Virtual memory is different: on a 32-bit system the virtual address space is 4GB and can be carved into a great many regions, and those regions must support fast insertion, deletion and lookup (code is fetched, data is read and written, shared libraries are loaded, all of it constantly). That makes a red-black tree a natural fit. As usual, the structures first:

      task_struct embeds a pointer to an mm_struct, and that structure is where the interesting parts live:

    struct task_struct {
    .......
        struct mm_struct *mm;
    .......
    };

      Going one level deeper, we find the root of a red-black tree! And what are the two vm_area_struct pointers for?

    struct mm_struct {
        struct vm_area_struct *mmap;        /* list of VMAs */
        struct rb_root mm_rb;               /* root of a red-black tree */
        struct vm_area_struct *mmap_cache;  /* last find_vma result */
    .......
    };

      Deeper still:

    • See the rb_node structure? That is clearly a red-black tree node, and together with the rb_root above it forms a red-black tree.
    • A process's virtual address space is divided into a number of regions, each with its own attributes and purpose; every valid address falls inside exactly one region, and regions never overlap. In the Linux kernel such a region is called a virtual memory area (VMA). A VMA is the abstraction of one contiguous range of linear addresses, with its own permissions (readable, writable, executable, and so on).
    struct vm_area_struct {
        struct mm_struct *vm_mm;   /* the mm_struct this VMA belongs to */
        unsigned long vm_start;    /* start address of the VMA */
        unsigned long vm_end;      /* end address of the VMA */
     
        /* previous and next VMA in the per-process VMA list;
           the list is sorted by address */
        struct vm_area_struct *vm_next, *vm_prev;
     
        pgprot_t vm_page_prot;     /* access permissions of the VMA */
        unsigned long vm_flags;    /* flags */
     
        struct rb_node vm_rb;      /* this VMA's node in the red-black tree */
            ...............
    };

       To show how these structures relate to one another, I drew a diagram for reference; the key points:

    • vm_area_struct carries vm_start and vm_end, the start and end addresses of the virtual memory region;
    • the vm_rb nodes of the vm_area_struct instances form a red-black tree, so a target region can be located quickly by address;
    • the vm_area_struct instances are also chained into a linked list, used mainly for plain traversal: a list walk needs none of the pre-order/in-order/post-order machinery of a tree (recursion or an explicit stack/queue, with O(N) extra space) and is faster for that purpose.

      

      2. (1) With the structures defined, the next step is to operate on them. Since a red-black tree manages the VMAs, the first job is building the tree and the list (all of these operations live in mm/mmap.c); the most direct API is the __vma_link function:

    /* link a newly created vm_area_struct into the mm_struct's VMA list and red-black tree */
    static void
    __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
        struct vm_area_struct *prev, struct rb_node **rb_link,
        struct rb_node *rb_parent)
    {
        __vma_link_list(mm, vma, prev, rb_parent);
        __vma_link_rb(mm, vma, rb_link, rb_parent);
    }

      It calls two functions whose names say it all: one links the list, the other links the red-black tree. First, __vma_link_list: it inserts the vma into the list headed by mm->mmap (after prev when there is one) and wires up the vm_next/vm_prev pointers, completing the list:

    void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
            struct vm_area_struct *prev, struct rb_node *rb_parent)
    {
        struct vm_area_struct *next;
    
        vma->vm_prev = prev;
        if (prev) {
            next = prev->vm_next;
            prev->vm_next = vma;
        } else {
            mm->mmap = vma;
            if (rb_parent)
                next = rb_entry(rb_parent,
                        struct vm_area_struct, vm_rb);
            else
                next = NULL;
        }
        vma->vm_next = next;
        if (next)
            next->vm_prev = vma;
    }

      The other function, __vma_link_rb, inserts the node into the red-black tree:

    void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
            struct rb_node **rb_link, struct rb_node *rb_parent)
    {
        /* Update tracking information for the gap following the new vma. */
        if (vma->vm_next)
            vma_gap_update(vma->vm_next);
        else
            mm->highest_vm_end = vma->vm_end;
    
        /*
         * vma->vm_prev wasn't known when we followed the rbtree to find the
         * correct insertion point for that vma. As a result, we could not
         * update the vma vm_rb parents rb_subtree_gap values on the way down.
         * So, we first insert the vma with a zero rb_subtree_gap value
         * (to be consistent with what we did on the way down), and then
         * immediately update the gap to the correct value. Finally we
         * rebalance the rbtree after all augmented values have been set.
         */
        rb_link_node(&vma->vm_rb, rb_parent, rb_link);
        vma->rb_subtree_gap = 0;
        vma_gap_update(vma);
        /* insert the vma into the red-black tree */
        vma_rb_insert(vma, &mm->mm_rb);
    }

      (2) Once the code above has run, the red-black tree is built. Next comes lookup. The tree manages VMAs, and the key used when building it is the VMA's linear address, so every VMA in a node's left subtree has a lower address than the node and every VMA in its right subtree a higher one. The lookup code:

    /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
        struct rb_node *rb_node;
        struct vm_area_struct *vma;
    
        /* Check the cache first. */
        vma = vmacache_find(mm, addr);
        if (likely(vma))
            return vma;
        /* root of the red-black tree */
        rb_node = mm->mm_rb.rb_node;
    
        while (rb_node) {
            struct vm_area_struct *tmp;
    
            tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
    
            if (tmp->vm_end > addr) {
                vma = tmp;
                /* addr also lies at or above vm_start: it falls inside this vma, found it */
                if (tmp->vm_start <= addr)
                    break;
                /* addr is below vm_start too: keep searching in the left subtree */
                rb_node = rb_node->rb_left;
            } else /* addr >= vm_end: keep searching in the right subtree */
                rb_node = rb_node->rb_right;
        }
    
        if (vma)
            vmacache_update(addr, vma);
        return vma;
    }

      (3) The code above finds the first matching VMA for a user-supplied linear address. In practice we also need the opposite lookup: a free, unmapped block of virtual memory in which to place new data. How is that implemented? Linux does it in arch_get_unmapped_area, shown below with the core lines commented. The idea: look up the vma for addr; if there is no such vma (or the requested range ends before it starts) and the size and boundary checks pass, the block at addr can be used directly.

    unsigned long
    arch_get_unmapped_area(struct file *filp, unsigned long addr,
            unsigned long len, unsigned long pgoff, unsigned long flags)
    {
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma;
        struct vm_unmapped_area_info info;
    
        if (len > TASK_SIZE - mmap_min_addr)
            return -ENOMEM;
    
        if (flags & MAP_FIXED)
            return addr;
    
        if (addr) {
        addr = PAGE_ALIGN(addr); /* round addr up to a multiple of the page size */
        /* look up whether a vma covers addr; if not, the region is unused,
           and if the other checks also pass this address is used directly */
            vma = find_vma(mm, addr);
            if (TASK_SIZE - len >= addr && addr >= mmap_min_addr &&
                (!vma || addr + len <= vma->vm_start))
                return addr;
        }
    
        info.flags = 0;
        info.length = len;
        info.low_limit = mm->mmap_base;
        info.high_limit = TASK_SIZE;
        info.align_mask = 0;
        return vm_unmapped_area(&info);
    }

      (4) Memory must eventually be released, and to avoid fragmentation a region should be merged with adjacent regions where possible. Free virtual memory is not organized in the red-black tree, so this step involves no tree operations. The idea: first check whether the end address of the prev region coincides with the start address of the region in question, or whether its end address coincides with the start of the next region; then check that the regions to be merged carry the same flags. If the regions map files, also check that they map the same file and that the offsets within the file are contiguous. The idea is simple; the code:

    /*
     * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
     * whether that can be merged with its predecessor or its successor.
     * Or both (it neatly fills a hole).
     *
     * In most cases - when called for mmap, brk or mremap - [addr,end) is
     * certain not to be mapped by the time vma_merge is called; but when
     * called for mprotect, it is certain to be already mapped (either at
     * an offset within prev, or at the start of next), and the flags of
     * this area are about to be changed to vm_flags - and the no-change
     * case has already been eliminated.
     *
     * The following mprotect cases have to be considered, where AAAA is
     * the area passed down from mprotect_fixup, never extending beyond one
     * vma, PPPPPP is the prev vma specified, and NNNNNN the next vma after:
     *
     *     AAAA             AAAA                AAAA          AAAA
     *    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPNNNNXXXX
     *    cannot merge    might become    might become    might become
     *                    PPNNNNNNNNNN    PPPPPPPPPPNN    PPPPPPPPPPPP 6 or
     *    mmap, brk or    case 4 below    case 5 below    PPPPPPPPXXXX 7 or
     *    mremap move:                                    PPPPXXXXXXXX 8
     *        AAAA
     *    PPPP    NNNN    PPPPPPPPPPPP    PPPPPPPPNNNN    PPPPNNNNNNNN
     *    might become    case 1 below    case 2 below    case 3 below
     *
     * It is important for case 8 that the vma NNNN overlapping the
     * region AAAA is never going to be extended over XXXX. Instead XXXX must
     * be extended in region AAAA and NNNN must be removed. This way in
     * all cases where vma_merge succeeds, the moment vma_adjust drops the
     * rmap_locks, the properties of the merged vma will be already
     * correct for the whole merged range. Some of those properties like
     * vm_page_prot/vm_flags may be accessed by rmap_walks and they must
     * be correct for the whole merged range immediately after the
     * rmap_locks are released. Otherwise if XXXX would be removed and
     * NNNN would be extended over the XXXX range, remove_migration_ptes
     * or other rmap walkers (if working on addresses beyond the "end"
     * parameter) may establish ptes with the wrong permissions of NNNN
     * instead of the right permissions of XXXX.
     */
    struct vm_area_struct *vma_merge(struct mm_struct *mm,
                struct vm_area_struct *prev, unsigned long addr,
                unsigned long end, unsigned long vm_flags,
                struct anon_vma *anon_vma, struct file *file,
                pgoff_t pgoff, struct mempolicy *policy,
                struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
    {
        pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
        struct vm_area_struct *area, *next;
        int err;
    
        /*
         * We later require that vma->vm_flags == vm_flags,
         * so this tests vma->vm_flags & VM_SPECIAL, too.
         */
        if (vm_flags & VM_SPECIAL)
            return NULL;
    
        if (prev)
            next = prev->vm_next;
        else
            next = mm->mmap;
        area = next;
        if (area && area->vm_end == end)        /* cases 6, 7, 8 */
            next = next->vm_next;
    
        /* verify some invariant that must be enforced by the caller */
        VM_WARN_ON(prev && addr <= prev->vm_start);
        VM_WARN_ON(area && end > area->vm_end);
        VM_WARN_ON(addr >= end);
    
        /*
         * Can it merge with the predecessor?
         */
        if (prev && prev->vm_end == addr &&
                mpol_equal(vma_policy(prev), policy) &&
                can_vma_merge_after(prev, vm_flags,
                            anon_vma, file, pgoff,
                            vm_userfaultfd_ctx)) {
            /*
             * OK, it can.  Can we now merge in the successor as well?
             */
            if (next && end == next->vm_start &&
                    mpol_equal(policy, vma_policy(next)) &&
                    can_vma_merge_before(next, vm_flags,
                                 anon_vma, file,
                                 pgoff+pglen,
                                 vm_userfaultfd_ctx) &&
                    is_mergeable_anon_vma(prev->anon_vma,
                                  next->anon_vma, NULL)) {
                                /* cases 1, 6 */
                err = __vma_adjust(prev, prev->vm_start,
                         next->vm_end, prev->vm_pgoff, NULL,
                         prev);
            } else                    /* cases 2, 5, 7 */
                err = __vma_adjust(prev, prev->vm_start,
                         end, prev->vm_pgoff, NULL, prev);
            if (err)
                return NULL;
            khugepaged_enter_vma_merge(prev, vm_flags);
            return prev;
        }
    
        /*
         * Can this new request be merged in front of next?
         */
        if (next && end == next->vm_start &&
                mpol_equal(policy, vma_policy(next)) &&
                can_vma_merge_before(next, vm_flags,
                             anon_vma, file, pgoff+pglen,
                             vm_userfaultfd_ctx)) {
            if (prev && addr < prev->vm_end)    /* case 4 */
                err = __vma_adjust(prev, prev->vm_start,
                         addr, prev->vm_pgoff, NULL, next);
            else {                    /* cases 3, 8 */
                err = __vma_adjust(area, addr, next->vm_end,
                         next->vm_pgoff - pglen, NULL, next);
                /*
                 * In case 3 area is already equal to next and
                 * this is a noop, but in case 8 "area" has
                 * been removed and next was expanded over it.
                 */
                area = next;
            }
            if (err)
                return NULL;
            khugepaged_enter_vma_merge(area, vm_flags);
            return area;
        }
    
        return NULL;
    }

      (5) Since the memory is being released, the corresponding VMAs must of course also be removed from the red-black tree; this is done in detach_vmas_to_be_unmapped:

    /*
     * Create a list of vma's touched by the unmap, removing them from the mm's
     * vma list as we go. Remove the vma's, and unmap the actual pages.
     */
    static void
    detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
        struct vm_area_struct *prev, unsigned long end)
    {
        struct vm_area_struct **insertion_point;
        struct vm_area_struct *tail_vma = NULL;
    
        insertion_point = (prev ? &prev->vm_next : &mm->mmap);
        vma->vm_prev = NULL;
        do {
        vma_rb_erase(vma, &mm->mm_rb); /* remove the vma from the red-black tree */
            mm->map_count--;
            tail_vma = vma;
            vma = vma->vm_next;
        } while (vma && vma->vm_start < end);
        *insertion_point = vma;
        if (vma) {
            vma->vm_prev = prev;
            vma_gap_update(vma);
        } else
            mm->highest_vm_end = prev ? prev->vm_end : 0;
        tail_vma->vm_next = NULL;
    
        /* Kill the cache */
        vmacache_invalidate(mm);
    }

    Summary:

    1. Here is a diagram of the mappings between virtual and physical memory, and between process memory and operating-system memory, to help make the picture concrete.

     

     2. Searching the kernel source for the keyword rb_node turns up more than 3,000 hits: red-black trees really are used all over the Linux kernel!

     

     3. AVL trees are similar to red-black trees, but AVL insertion and deletion may require several rotations and repeated backtracking toward the root, so under heavy insert/delete workloads an AVL tree is less efficient. The red-black tree is a not-fully-balanced binary search tree whose lookup efficiency is second only to the AVL tree; by dropping AVL's strict balance requirement it still inserts and deletes in O(log n), and rebalancing after an insert or delete needs at most two or three rotations. Although the two have the same asymptotic complexity, the red-black tree's cheaper worst-case insertion and deletion make it the better fit for the large volume of VMA additions, deletions and lookups in the Linux kernel.

    References:

    1. https://stephenzhou.blog.csdn.net/article/details/89501437?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1.pc_relevant_default&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1.pc_relevant_default&utm_relevant_index=2 (a brief analysis of virtual memory VMAs)

    2. http://edsionte.com/techblog/archives/3564 (virtual memory operations)

    3. http://edsionte.com/techblog/archives/3586 (virtual memory operations)

  • Original post: https://www.cnblogs.com/theseventhson/p/15820092.html