QEMU Memory Analysis (4): Building the EPT Page Tables


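    For virtualized environments, Intel CPUs provide processor-level support for memory virtualization in the form of Extended Page Tables (EPT); AMD has an analogous feature called NPT. Before these existed, the main technique for memory virtualization was shadow page tables.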

    I: The EPT translation mechanism



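    Note that whether the guest is 32-bit or 64-bit, addressing here uniformly uses 64-bit physical addresses. EPT is a 4-level page table. Each table still occupies one 4 KB page, but each entry is 8 bytes, so a table holds only 512 entries and 9 bits are needed to select one. The low 48 bits of the guest physical address do this work. As the figure above shows, a 48-bit guest physical address is divided into five fields: four 9-bit indices followed by a 12-bit offset within the page.

    When a CPU in non-root mode accesses a guest virtual address, it first translates the address through the guest page tables to obtain a guest physical address (GPA), and then looks that GPA up in the EPT. The VMCS contains an EPTP (EPT pointer) field whose bits 12-51 point to the top-level EPT table, the PML4 table. The highest 9-bit index of the GPA (bits 39-47) therefore selects a PML4 entry. In theory one PML4 entry covers a 512 GB region; that is not the focus here, so we will not dwell on it. The format of a PML4 entry is as follows: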

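    1. All we really need to know here is that bits 12-51 of a PML4 entry hold the address of the next-level table. In practice not all of these 40 bits are used; how many are used depends on the physical address width (MAXPHYADDR) that the CPU implements, as follows: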

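    When MAXPHYADDR is 36: Intel desktop processors, i.e. ordinary personal computers, have commonly implemented a 36-bit maximum physical address, which can address 64 GB;
    When MAXPHYADDR is 40: Intel server products and AMD platforms have commonly implemented a 40-bit maximum physical address, which can address 1 TB;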

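    ① When MAXPHYADDR is 52, bits 12~51 of the upper-level table entry provide the high 40 bits of the next-level table's physical base address; the low 12 bits are zero, so the base address is aligned on a 4 KB boundary;
    ② When MAXPHYADDR is 40, bits 12~39 of the upper-level table entry provide the high 28 bits of the next-level table's physical base address; bits 40~51 are reserved and must be 0; the low 12 bits are zero, so the base address is aligned on a 4 KB boundary;
    ③ When MAXPHYADDR is 36, bits 12~35 of the upper-level table entry provide the high 24 bits of the next-level table's physical base address; bits 36~51 are reserved and must be 0; the low 12 bits are zero, so the base address is aligned on a 4 KB boundary.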
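    The three cases above reduce to a single mask computation. The following is a minimal sketch in C, not code taken from HAXM; the helper names (ept_addr_mask, ept_next_table_base, ept_has_reserved_addr_bits) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Mask selecting the address bits (12 .. maxphyaddr-1) of a non-leaf EPT entry.
// Hypothetical helper, for illustration only.
static uint64_t ept_addr_mask(unsigned maxphyaddr)
{
    assert(maxphyaddr > 12 && maxphyaddr <= 52);
    return ((UINT64_C(1) << maxphyaddr) - 1) & ~UINT64_C(0xFFF);
}

// Physical base address of the next-level table referenced by |entry|.
// The low 12 bits are always zero, so the base is 4 KB aligned.
static uint64_t ept_next_table_base(uint64_t entry, unsigned maxphyaddr)
{
    return entry & ept_addr_mask(maxphyaddr);
}

// Nonzero if |entry| sets any reserved address bit (maxphyaddr .. 51).
static int ept_has_reserved_addr_bits(uint64_t entry, unsigned maxphyaddr)
{
    uint64_t all_addr_bits = ept_addr_mask(52);
    return (entry & all_addr_bits & ~ept_addr_mask(maxphyaddr)) != 0;
}
```

    With MAXPHYADDR = 36 the mask keeps 24 address bits, with 40 it keeps 28, and with 52 it keeps all 40, matching cases ③, ② and ① respectively.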


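    2. Use those address bits to locate the base of the next-level table, the EPT Page-Directory-Pointer Table (PDPT), and use bits 30-38 of the guest physical address to select an entry in it (a PDPT entry). Note that if bit 7 of this entry is 1, the entry maps a 1 GB page; if it is 0, the entry points to the next-level table. Below we consider only the case where it points to a table.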

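    3. Next, bits 12-51 of that entry locate the base of the third-level table, the EPT Page Directory (PD), and bits 21-29 of the guest physical address select an EPT page-directory entry. If bit 7 of this entry is 1, the entry maps a 2 MB page; if it is 0, it points to the next-level table.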

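    4. Bits 12-51 of that entry locate the base of the fourth-level table, the EPT Page Table (PT), and bits 12-20 of the guest physical address select a page-table entry (PTE). The PTE holds the host physical address of the 4 KB page, and bits 0-11 of the guest physical address give the offset within that page.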
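    The bit ranges used in steps 1 to 4 can be sketched in C as follows. This is an illustrative sketch only: the type gpa_fields and the functions split_gpa and offset_in_leaf_page are hypothetical names, not part of QEMU or HAXM.

```c
#include <assert.h>
#include <stdint.h>

// Index fields of a 48-bit guest physical address (hypothetical helper type).
typedef struct {
    unsigned pml4;   // GPA bits 39..47: index into the PML4 table
    unsigned pdpt;   // GPA bits 30..38: index into the PDPT
    unsigned pd;     // GPA bits 21..29: index into the page directory
    unsigned pt;     // GPA bits 12..20: index into the page table
    unsigned offset; // GPA bits 0..11: offset within a 4 KB page
} gpa_fields;

// Split a GPA into the four 9-bit table indices and the 12-bit page offset.
static gpa_fields split_gpa(uint64_t gpa)
{
    gpa_fields f;
    f.pml4 = (unsigned)((gpa >> 39) & 0x1FF);
    f.pdpt = (unsigned)((gpa >> 30) & 0x1FF);
    f.pd = (unsigned)((gpa >> 21) & 0x1FF);
    f.pt = (unsigned)((gpa >> 12) & 0x1FF);
    f.offset = (unsigned)(gpa & 0xFFF);
    return f;
}

// Offset of |gpa| inside the page mapped by a leaf entry at |leaf_level|:
// 0 = PT entry (4 KB page), 1 = PD entry with bit 7 set (2 MB page),
// 2 = PDPT entry with bit 7 set (1 GB page).
static uint64_t offset_in_leaf_page(uint64_t gpa, int leaf_level)
{
    assert(leaf_level >= 0 && leaf_level <= 2);
    unsigned bits = 12 + 9 * (unsigned)leaf_level;
    return gpa & ((UINT64_C(1) << bits) - 1);
}
```

    When a walk ends early at a large page, the index fields of the skipped levels simply become part of the page offset, which is what offset_in_leaf_page models.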

    II: EPT initialization

        2.1 ept-tree initialization

    hax_accel_init()         //hax-all.c
       hax_init()            //hax-all.c
         hax_vm_create()    //hax-all.c
          hax_create_vm()    //vm.c
            ept_tree_init()        //vm.c

         At this point we have an empty EPT. So when does the EPT actually get populated?


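         The EPT starts out empty. When the guest runs, each GVA it uses is translated to a GPA, and the CPU must then look the GPA up in the EPT to find the corresponding HPA. Because the EPT is still empty, the lookup fails with an EPT violation (the EPT counterpart of a page fault), which causes a VM exit. The CPU then enters root mode and runs the VMM (here, HAXM). HAXM defines an array of handler functions, indexed by exit reason, to handle each kind of VM exit: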

    static int (*handler_funcs[])(struct vcpu_t *vcpu, struct hax_tunnel *htun) = {
        [VMX_EXIT_EPT_VIOLATION]      = exit_ept_violation,
        // (entries for the other exit reasons omitted in this excerpt)
    };
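
    To illustrate how such a table is used, here is a small self-contained sketch of dispatch through an array of function pointers indexed by the basic exit reason. All names (demo_vcpu, handlers, dispatch_exit) are hypothetical; only the value 48 for the EPT-violation exit reason comes from the Intel SDM, and the table size of 64 is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <stddef.h>

// Basic exit reason 48 is "EPT violation"; the table size is arbitrary here.
enum { EXIT_EPT_VIOLATION = 48, EXIT_REASON_COUNT = 64 };

// Hypothetical, minimal stand-in for a vCPU object.
struct demo_vcpu {
    unsigned ept_violations; // number of EPT violations handled so far
};

typedef int (*exit_handler)(struct demo_vcpu *vcpu);

static int handle_ept_violation(struct demo_vcpu *vcpu)
{
    vcpu->ept_violations++;
    return 1;
}

static int handle_unknown(struct demo_vcpu *vcpu)
{
    (void)vcpu;
    return -1;
}

// Designated initializers leave every unlisted slot NULL.
static const exit_handler handlers[EXIT_REASON_COUNT] = {
    [EXIT_EPT_VIOLATION] = handle_ept_violation,
};

// Look up the handler for |basic_reason| and call it, falling back to
// handle_unknown() for out-of-range or unregistered reasons.
static int dispatch_exit(unsigned basic_reason, struct demo_vcpu *vcpu)
{
    assert(vcpu != NULL);
    if (basic_reason >= EXIT_REASON_COUNT || handlers[basic_reason] == NULL) {
        return handle_unknown(vcpu);
    }
    return handlers[basic_reason](vcpu);
}
```

    The bounds check and NULL check matter because designated initializers leave gaps in the table for every exit reason that has no handler.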

        2.2 Populating the EPT

    main_impl()                //vl.c
      machine_class->init()::pc_init1()    //pc_piix.c
        pc_new_cpu()            //pc_piix.c
        x86_cpu_realizefn()        //cpu.c
            qemu_init_vcpu()    //cpus.c
              qemu_hax_start_vcpu()    //cpus.c
                qemu_hax_cpu_thread_fn()    //cpus.c
                  hax_init_vcpu()        //hax-all.c
                    hax_vcpu_create()    //hax-all.c
                      hax_host_create_vcpu()    //hax-windows.c
                    hax_vcpu_exec()    //hax-all.c
                      hax_vcpu_hax_exec()        //hax-all.c
                        hax_vcpu_run()        //hax-windows.c


         On the HAXM side the vCPU runs in vcpu_execute(), so we briefly list the call flow and then focus on the handler function exit_ept_violation():
    vcpu_execute()            //vcpu.c
      cpu_vmx_execute()        //vcpu.c
        cpu_vmexit_handler()    //cpu.c
          vcpu_vmexit_handler()    //vcpu.c
            handler_funcs[basic_reason] ==  exit_ept_violation


    static int exit_ept_violation(struct vcpu_t *vcpu, struct hax_tunnel *htun)
    {
        // (local variable declarations omitted in this excerpt)
        htun->_exit_reason = vmx(vcpu, exit_reason).basic_reason;
        if (qual->ept.gla1 == 0 && qual->ept.gla2 == 1) {
            hax_log(HAX_LOGPANIC, "Incorrect EPT setting\n");
            return HAX_RESUME;
        }
        gpa = vmx(vcpu, exit_gpa);
        ret = ept_handle_access_violation(&vcpu->vm->gpa_space, &vcpu->vm->ept_tree,
                                          *qual, gpa, &fault_gfn, &first_access);
        // (handling of nonzero ret values omitted in this excerpt)
        // ret == 0: The EPT violation is due to MMIO
        return vcpu_emulate_insn(vcpu);
    }
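
    The gla1/gla2 check above, and the permission bits examined in the next function, all come from the exit qualification of an EPT violation. The following sketch decodes those bits according to the layout in the Intel SDM (bits 0..2: access type, bits 3..5: permissions of the mapping, bit 7: guest linear address valid, bit 8: meaningful only when bit 7 is set). The type and function names are hypothetical and are not HAXM's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Decoded view of an EPT-violation exit qualification (hypothetical type).
typedef struct {
    bool read;           // bit 0: the access was a data read
    bool write;          // bit 1: the access was a data write
    bool fetch;          // bit 2: the access was an instruction fetch
    unsigned perm;       // bits 3..5: R/W/X permissions of the faulting GPA
    bool gla_valid;      // bit 7: guest linear address field is valid
    bool gla_translated; // bit 8: only defined when bit 7 is set
} ept_qual_info;

static ept_qual_info decode_ept_qual(uint64_t raw)
{
    ept_qual_info q;
    q.read = (raw >> 0) & 1;
    q.write = (raw >> 1) & 1;
    q.fetch = (raw >> 2) & 1;
    q.perm = (unsigned)((raw >> 3) & 7);
    q.gla_valid = (raw >> 7) & 1;
    q.gla_translated = (raw >> 8) & 1;
    return q;
}

// Bit 8 is reserved (cleared) when bit 7 is 0, so "bit 7 == 0 and bit 8 == 1"
// is the inconsistent combination that the handler treats as a fatal error.
static bool ept_qual_is_consistent(const ept_qual_info *q)
{
    assert(q != NULL);
    return q->gla_valid || !q->gla_translated;
}
```

    A perm value of 0 means the GPA had no mapping at all, which is the case the rest of this article follows.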


    int ept_handle_access_violation(hax_gpa_space *gpa_space, hax_ept_tree *tree,
                                    exit_qualification_t qual, uint64_t gpa,
                                    uint64_t *fault_gfn, bool *first_access)
    {
        // Shift right by 12 bits to get the 4 KB page index, used as the gfn
        // (guest frame number)
        gfn = gpa >> PG_ORDER_4K;
        hax_assert(gpa_space != NULL);
        slot = memslot_find(gpa_space, gfn);
        if (!slot) {
            // The faulting GPA is reserved for MMIO
            hax_log(HAX_LOGD, "%s: gpa=0x%llx is reserved for MMIO\n",
                    __func__, gpa);
            return 0;
        }
        // Extract bits 5..3 from Exit Qualification
        // Check the EPT permissions. Since this is an EPT violation, |qual|
        // (read from the vCPU's exit_qualification) is interpreted through its
        // |ept| member, which is laid out as follows:
        //     struct {
        //         uint32_t r           : 1;
        //         uint32_t w           : 1;
        //         uint32_t x           : 1;
        //         uint32_t _r          : 1;
        //         uint32_t _w          : 1;
        //         uint32_t _x          : 1;
        //         uint32_t res1        : 1;
        //         uint32_t gla1        : 1;
        //         uint32_t gla2        : 1;
        //         uint32_t res2        : 3;
        //         uint32_t nmi_block   : 1;
        //         uint32_t res3        : 19;
        //         uint32_t res4        : 32;
        //     } ept;
        combined_perm = (uint) ((qual.raw >> 3) & 7);
        if (combined_perm != HAX_EPT_PERM_NONE) {
            if ((qual.raw & HAX_EPT_ACC_W) && !(combined_perm & HAX_EPT_PERM_W) &&
                (slot->flags == HAX_MEMSLOT_READONLY)) {
                // Handle a write to ROM/ROM device as MMIO
                hax_log(HAX_LOGD, "%s: write to a read-only gpa=0x%llx\n",
                        __func__, gpa);
                return 0;
            }
            // See IA SDM Vol. 3C 27.2.1 Table 27-7, especially note 2
            hax_log(HAX_LOGE, "%s: Cannot handle the case where the PTE "
                    "corresponding to the faulting GPA is present: qual=0x%llx, "
                    "gpa=0x%llx\n", __func__, qual.raw, gpa);
            return -EACCES;
        }
        // Ideally we should call gpa_space_is_page_protected() and ask user space
        // to unprotect just the host virtual page that |gfn| maps to. But since we
        // pin host RAM one chunk (rather than one page) at a time, if the chunk
        // that |gfn| maps to contains any other host virtual page that is protected
        // (by means of a VirtualProtect() or mprotect() call from user space), we
        // will not be able to pin the chunk when we handle the next EPT violation
        // caused by the same |gfn|.
        // For now, we ask user space to unprotect all host virtual pages in the
        // chunk, so our next hax_pin_user_pages() call will not fail. This is a
        // dirty hack.
        // TODO: Make chunks more flexible, so we can pin host RAM in finer
        // granularity (as small as one page) and hide chunks from user space.
        if (gpa_space_is_chunk_protected(gpa_space, gfn, fault_gfn)) {
            hax_log(HAX_LOGE, "%s: gfn = 0x%llx(fault_gfn = 0x%llx) "
                    "is in protected chunk\n",
                    __func__, gfn, *fault_gfn);
            return -EFAULT;
        }
        // The faulting GPA maps to RAM/ROM
        // Compute start_gpa and size here; they are passed to the next function
        is_rom = slot->flags & HAX_MEMSLOT_READONLY;
        offset_within_slot = gpa - (slot->base_gfn << PG_ORDER_4K);
        hax_assert(offset_within_slot < (slot->npages << PG_ORDER_4K));
        block = slot->block;
        hax_assert(block != NULL);
        offset_within_block = slot->offset_within_block + offset_within_slot;
        hax_assert(offset_within_block < block->size);
        chunk = ramblock_get_chunk(block, offset_within_block, true, first_access);
        // Compute the union of the UVA ranges covered by |slot| and |chunk|
        chunk_offset_low = chunk->base_uva - block->base_uva;
        start_gpa = slot->base_gfn << PG_ORDER_4K;
        if (chunk_offset_low > slot->offset_within_block) {
            start_gpa += chunk_offset_low - slot->offset_within_block;
            offset_within_chunk = 0;
        } else {
            offset_within_chunk = slot->offset_within_block - chunk_offset_low;
        }
        chunk_offset_high = chunk_offset_low + chunk->size;
        slot_offset_high = slot->offset_within_block +
                           (slot->npages << PG_ORDER_4K);
        size = chunk->size - offset_within_chunk;
        // |chunk| and |slot| are both 4 KB aligned. This branch runs only when
        // the chunk extends past the end of the slot, in which case |size| is
        // trimmed back to the slot boundary
        if (chunk_offset_high > slot_offset_high) {
            size -= chunk_offset_high - slot_offset_high;
        }
        ret = ept_tree_create_entries(tree, start_gpa >> PG_ORDER_4K,
                                      size >> PG_ORDER_4K, chunk,
                                      offset_within_chunk, slot->flags);
        return 1;
    }
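
    The start_gpa/size computation at the end of this function clips the chunk to the part that lies inside the slot. The sketch below reproduces just that arithmetic on plain integers so it can be tried in isolation. The struct and function names are hypothetical, and the 2 MB chunk size used in the usage note is an assumption made for the example rather than something stated in this article.

```c
#include <assert.h>
#include <stdint.h>

// Result of clipping a chunk to a slot (hypothetical type).
typedef struct {
    uint64_t start_gpa;           // first GPA to map
    uint64_t size;                // number of bytes to map
    uint64_t offset_within_chunk; // where the mapped range starts in the chunk
} map_range;

// slot_base_gpa:  GPA at which the slot starts
// slot_block_off: offset of the slot within its RAM block
// slot_size:      size of the slot in bytes
// chunk_off:      offset of the chunk within the same RAM block
// chunk_size:     size of the chunk in bytes
static map_range clip_chunk_to_slot(uint64_t slot_base_gpa,
                                    uint64_t slot_block_off,
                                    uint64_t slot_size,
                                    uint64_t chunk_off,
                                    uint64_t chunk_size)
{
    map_range r;
    uint64_t chunk_high = chunk_off + chunk_size;
    uint64_t slot_high = slot_block_off + slot_size;

    // The chunk was looked up from an address inside the slot, so they overlap
    assert(chunk_off < slot_high && slot_block_off < chunk_high);

    r.start_gpa = slot_base_gpa;
    if (chunk_off > slot_block_off) {
        // The chunk starts after the slot does: skip the leading part of the slot
        r.start_gpa += chunk_off - slot_block_off;
        r.offset_within_chunk = 0;
    } else {
        // The slot starts inside the chunk: skip the leading part of the chunk
        r.offset_within_chunk = slot_block_off - chunk_off;
    }
    r.size = chunk_size - r.offset_within_chunk;
    if (chunk_high > slot_high) {
        // The chunk runs past the end of the slot: trim the tail
        r.size -= chunk_high - slot_high;
    }
    return r;
}
```

    For example, a 3 MB slot at block offset 0x1000 and a 2 MB chunk at block offset 0x200000 overlap in the block range [0x200000, 0x301000), so the function yields a size of 0x101000 bytes starting at the slot's base GPA plus 0x1FF000.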


    // Given a GFN and a pointer (KVA) to an EPT page table at a non-leaf level
    // (PML4, PDPT or PD) that covers the GFN, returns a pointer (KVA) to the next-
    // level page table that covers the GFN. This function can be used to walk a
    // |hax_ept_tree| from root to leaf.
    // |tree|: The |hax_ept_tree| to walk.
    // |gfn|: The GFN from which to obtain EPT page table indices.
    // |current_level|: The EPT level to which |current_table| belongs. Must be a
    //                  non-leaf level (PML4, PDPT or PD).
    // |current_table|: The KVA of the current EPT page table. Must not be NULL.
    // |kmap|: A buffer to store a host-specific KVA mapping descriptor, which may
    //         be created if the next-level EPT page table is not a frequently-used
    //         page. The caller must call hax_unmap_page_frame() to destroy the KVA
    //         mapping when it is done with the returned pointer.
    // |create|: If true and the next-level EPT page table does not yet exist,
    //           creates it and updates the corresponding |hax_epte| in
    //           |current_table|.
    // |visit_current_epte|: An optional callback to be invoked on the |hax_epte|
    //                       that belongs to |current_table| and covers |gfn|. May
    //                       be NULL.
    // |opaque|: An arbitrary pointer passed as-is to |visit_current_epte|.
    static hax_epte * ept_tree_get_next_table(hax_ept_tree *tree, uint64_t gfn,
                                              int current_level,
                                              hax_epte *current_table,
                                              hax_kmap_phys *kmap, bool create,
                                              epte_visitor visit_current_epte,
                                              void *opaque)
    {
        int next_level = current_level - 1;
        hax_ept_page_kmap *freq_page;
        uint index;
        hax_epte *epte;
        hax_epte *next_table = NULL;
        hax_assert(tree != NULL);
        hax_assert(next_level >= HAX_EPT_LEVEL_PT && next_level <= HAX_EPT_LEVEL_PDPT);
        // Index selection, in terms of GPA bits (|gfn| is the GPA already shifted
        // right by 12): a PT is indexed by bits 12..20, a PD by bits 21..29, a
        // PDPT by bits 30..38 and the PML4 by bits 39..47
        index = (uint) ((gfn >> (HAX_EPT_TABLE_SHIFT * current_level)) &
                        (HAX_EPT_TABLE_SIZE - 1));
        hax_assert(current_table != NULL);
        epte = &current_table[index];
        // On the EPT violation path the caller passes NULL here, so this branch
        // is skipped
        if (visit_current_epte) {
            visit_current_epte(tree, gfn, current_level, epte, opaque);
        }
        // On the EPT violation path |create| is true, so this branch is skipped
        if (epte->perm == HAX_EPT_PERM_NONE && !create) {
            return NULL;
        }
        // Only HAX_EPT_FREQ_PAGE_COUNT EPT pages are considered frequently-used,
        // whose KVA mappings are cached in tree->freq_pages[]. They are:
        // a) The EPT PML4 table, covering the entire GPA space. Cached in
        //    freq_pages[0].
        // b) The first EPT PDPT table, pointed to by entry 0 of a), covering the
        //    first 512GB of the GPA space. Cached in freq_pages[1].
        // c) The first n EPT PD tables (n = HAX_EPT_FREQ_PAGE_COUNT - 2), pointed
        //    to by entries 0..(n - 1) of b), covering the first nGB of the GPA
        //    space. Cached in freq_pages[2..(n + 1)].
        freq_page = ept_tree_get_freq_page(tree, gfn, next_level);
        if (hax_cmpxchg64(0, INVALID_EPTE.value, &epte->value)) {
            // epte->value was 0, implying epte->perm == HAX_EPT_PERM_NONE, which
            // means the EPT entry pointing to the next-level page table is not
            // present, i.e. the next-level table does not exist
            hax_ept_page *page;
            uint64_t pfn;
            hax_epte temp_epte = { 0 };
            void *kva;
            page = ept_tree_alloc_page(tree);
            if (!page) {
                epte->value = 0;
                hax_log(HAX_LOGE, "%s: Failed to create EPT page table: gfn=0x%llx,"
                        " next_level=%d\n", __func__, gfn, next_level);
                return NULL;
            }
            page->level = next_level;
            pfn = hax_get_pfn_phys(&page->memdesc);
            hax_assert(pfn != INVALID_PFN);
            temp_epte.perm = HAX_EPT_PERM_RWX;
            // This is a non-leaf |hax_epte|, so ept_mt and ignore_pat_mt are
            // reserved (see IA SDM Vol. 3C 28.2.2 Figure 28-1)
            temp_epte.pfn = pfn;
            kva = hax_get_kva_phys(&page->memdesc);
            hax_assert(kva != NULL);
            if (freq_page) {
                // The next-level EPT table is frequently used, so initialize its
                // KVA mapping cache
                freq_page->page = page;
                freq_page->kva = kva;
            }
            // Create this non-leaf EPT entry
            epte->value = temp_epte.value;
            next_table = (hax_epte *) kva;
            hax_log(HAX_LOGD, "%s: Created EPT page table: gfn=0x%llx, "
                    "next_level=%d, pfn=0x%llx, kva=%p, freq_page_index=%ld\n",
                    __func__, gfn, next_level, pfn, kva,
                    freq_page ? freq_page - tree->freq_pages : -1);
        } else {  // !hax_cmpxchg64(0, INVALID_EPTE.value, &epte->value)
            // epte->value != 0, which could mean epte->perm != HAX_EPT_PERM_NONE,
            // i.e. the EPT entry pointing to the next-level EPT page table is
            // present. But there is another case: *epte == INVALID_EPTE, which
            // means the next-level page table is being created by another thread
            void *kva;
            int i = 0;
            while (epte->value == INVALID_EPTE.value) {
                // Eventually the other thread will set epte->pfn to either a valid
                // PFN or 0
                if (!(++i % 10000)) {  // 10^4
                    hax_log(HAX_LOGI, "%s: In iteration %d of while loop\n",
                            __func__, i);
                    if (i == 100000000) {  // 10^8 (< INT_MAX)
                        hax_log(HAX_LOGE, "%s: Breaking out of infinite loop: "
                                "gfn=0x%llx, next_level=%d\n", __func__, gfn,
                                next_level);
                        return NULL;
                    }
                }
            }
            if (!epte->value) {
                // The other thread has cleared epte->value, indicating it could not
                // create the next-level page table
                hax_log(HAX_LOGE, "%s: Another thread tried to create the same EPT "
                        "page table first, but failed: gfn=0x%llx, next_level=%d\n",
                        __func__, gfn, next_level);
                return NULL;
            }
            if (freq_page) {
                // The next-level EPT table is frequently used, so its KVA mapping
                // must have been cached
                kva = freq_page->kva;
                hax_assert(kva != NULL);
            } else {
                // The next-level EPT table is not frequently used, which means a
                // temporary KVA mapping needs to be created
                hax_assert(epte->pfn != INVALID_PFN);
                hax_assert(kmap != NULL);
                kva = hax_map_page_frame(epte->pfn, kmap);
                if (!kva) {
                    hax_log(HAX_LOGE, "%s: Failed to map pfn=0x%llx into "
                            "KVA space\n", __func__, epte->pfn);
                }
            }
            next_table = (hax_epte *) kva;
        }
        return next_table;
    }
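
    The compare-and-swap on the entry is what lets several vCPU threads race to create the same next-level table safely: exactly one thread swaps 0 for a sentinel, creates the table and publishes the result, while the others wait for the sentinel to disappear. The sketch below shows the same pattern with C11 atomics. It is an illustration under stated assumptions, not HAXM code: ENTRY_PENDING is a hypothetical stand-in for INVALID_EPTE, and the allocator callbacks are demo functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRY_EMPTY   UINT64_C(0)
#define ENTRY_PENDING UINT64_MAX   // hypothetical stand-in for INVALID_EPTE
#define MAX_SPINS     100000000L   // give up after 10^8 polls

// Returns the value to publish in the entry, or ENTRY_EMPTY on failure.
typedef uint64_t (*table_allocator)(void);

// Demo allocators for this sketch only.
static uint64_t demo_alloc_ok(void)   { return UINT64_C(0x1007); }
static uint64_t demo_alloc_fail(void) { return ENTRY_EMPTY; }

// Returns the published entry value, or ENTRY_EMPTY if creation failed
// (in this thread or in the thread that won the race) or the wait timed out.
static uint64_t get_or_create_entry(_Atomic uint64_t *entry,
                                    table_allocator alloc)
{
    uint64_t expected = ENTRY_EMPTY;
    long spins = 0;

    assert(entry != NULL && alloc != NULL);
    if (atomic_compare_exchange_strong(entry, &expected, ENTRY_PENDING)) {
        // This thread won the race: create the table, then publish the
        // result. Publishing ENTRY_EMPTY tells waiters that creation failed.
        uint64_t value = alloc();
        atomic_store(entry, value);
        return value;
    }
    // The exchange failed, so |expected| now holds the value that was seen.
    // If it is the sentinel, another thread is still creating the table.
    while (expected == ENTRY_PENDING) {
        if (++spins == MAX_SPINS) {
            return ENTRY_EMPTY;
        }
        expected = atomic_load(entry);
    }
    return expected;
}
```

    As in the function above, a failed creator resets the entry to 0, so a waiting thread can tell "created by someone else" apart from "someone else tried and failed".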

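    Setting aside the cases where another thread is creating the next-level table or where that table already exists, assume the next-level table has not been created yet. In that case ept_tree_alloc_page() is called to allocate a page that will hold the next-level table's entries. Let us look at that function next.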

    // The bookkeeping structure for one EPT page, shown here for reference
    typedef struct hax_ept_page {
        hax_memdesc_phys memdesc;
        // Turns this object into a list node
        hax_list_node entry;
        int level;
    } hax_ept_page;

    // Allocates a |hax_ept_page| for the given |hax_ept_tree|. Returns the
    // allocated |hax_ept_page|, whose underlying host page frame is filled with
    // zeroes, or NULL on error.
    static hax_ept_page * ept_tree_alloc_page(hax_ept_tree *tree)
    {
        hax_ept_page *page;
        int ret;
        page = (hax_ept_page *) hax_vmalloc(sizeof(*page), 0);
        if (!page) {
            hax_log(HAX_LOGE, "%s: hax_vmalloc for page fail\n", __func__);
            return NULL;
        }
        ret = hax_alloc_page_frame(HAX_PAGE_ALLOC_ZEROED, &page->memdesc);
        if (ret) {
            hax_log(HAX_LOGE, "%s: hax_alloc_page_frame() returned %d\n",
                    __func__, ret);
            hax_vfree(page, sizeof(*page));
            return NULL;
        }
        hax_assert(tree != NULL);
        hax_list_add(&page->entry, &tree->page_list);
        return page;
    }


    int hax_alloc_page_frame(uint8_t flags, hax_memdesc_phys *memdesc)
    {
        PHYSICAL_ADDRESS low_addr, high_addr, skip_bytes;
        ULONG options;
        PMDL pmdl;
        // Fill in the PHYSICAL_ADDRESS structures
        low_addr.QuadPart = 0;
        high_addr.QuadPart = (int64_t)-1;
        skip_bytes.QuadPart = 0;
        // |options| must be initialized before flags are OR-ed into it
        options = MM_ALLOCATE_FULLY_REQUIRED;
        if (!(flags & HAX_PAGE_ALLOC_ZEROED)) {
            options |= MM_DONT_ZERO_ALLOCATION;
        }
        // This call may block
        pmdl = MmAllocatePagesForMdlEx(low_addr, high_addr, skip_bytes,
                                       PAGE_SIZE_4K, MmCached, options);
        if (!pmdl) {
            hax_log(HAX_LOGE, "%s: Failed to allocate 4KB of nonpaged memory\n", __func__);
            return -ENOMEM;
        }
        memdesc->pmdl = pmdl;
        return 0;
    }
    Original post: https://www.cnblogs.com/edver/p/14662609.html