• What should the reference count be for a page in the page cache when no process is reading or writing it?


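    While looking into how to reduce page cache usage, I happened to walk through the code of invalidate_mapping_pages:

    It calls __pagevec_lookup to collect a batch of pages from the radix tree, then tries to release each page by calling invalidate_inode_page.

    I mainly looked at how __pagevec_lookup modifies the reference count:

    __pagevec_lookup --> __find_get_pages --> page_cache_get_speculative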

    static inline int page_cache_get_speculative(struct page *page)
    {/* try to increment the reference count, provided the page is not free */
        VM_BUG_ON(in_interrupt());
    
    #ifdef CONFIG_TINY_RCU
    # ifdef CONFIG_PREEMPT_COUNT
        VM_BUG_ON(!in_atomic());
    # endif
        /*
         * Preempt must be disabled here - we rely on rcu_read_lock doing
         * this for us.
         *
         * Pagecache won't be truncated from interrupt context, so if we have
         * found a page in the radix tree here, we have pinned its refcount by
         * disabling preempt, and hence no need for the "speculative get" that
         * SMP requires.
         */
        VM_BUG_ON_PAGE(page_count(page) == 0, page);
        page_ref_inc(page);
    
    #else
        if (unlikely(!get_page_unless_zero(page))) { /* <-- this is the branch taken */
            /*
             * Either the page has been freed, or will be freed.
             * In either case, retry here and the caller should
             * do the right thing (see comments above).
             */
            return 0;
        }
    #endif
        VM_BUG_ON_PAGE(PageTail(page), page);
    
        return 1;
    }

    static inline int get_page_unless_zero(struct page *page)
    {
        return page_ref_add_unless(page, 1, 0);
    }

    
    

    static inline int page_ref_add_unless(struct page *page, int nr, int u)
    {
        return atomic_add_unless(&page->_count, nr, u);
    }

     
    /**
     * atomic_add_unless - add unless the number is already a given value
     * @v: pointer of type atomic_t
     * @a: the amount to add to v...
     * @u: ...unless v is equal to u.
     *
     * Atomically adds @a to @v, so long as @v was not already @u.
     * Returns non-zero if @v was not @u, and zero otherwise.
     */
    static inline int atomic_add_unless(atomic_t *v, int a, int u)
    {
        return __atomic_add_unless(v, a, u) != u;
    }

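    The comment on the last function is clear. As called from get_page_unless_zero (a = 1, u = 0), it increments page->_count unless the count is already 0. So when page->_count was originally 1, it is raised to 2 and the function returns non-zero.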

    Now the release path:

                if (!trylock_page(page))
                    continue;
                WARN_ON(page->index != index);
                ret = invalidate_inode_page(page);
                unlock_page(page);

    The main call chain is:

    invalidate_inode_page --> invalidate_complete_page --> remove_mapping --> __remove_mapping

    /*
     * Same as remove_mapping, but if the page is removed from the mapping, it
     * gets returned with a refcount of 0.
     */
    static int __remove_mapping(struct address_space *mapping, struct page *page,
                    bool reclaimed)
    {
        BUG_ON(!PageLocked(page));
        BUG_ON(mapping != page_mapping(page));
    
        spin_lock_irq(&mapping->tree_lock);
        /*
         * The non racy check for a busy page.
         *
         * Must be careful with the order of the tests. When someone has
         * a ref to the page, it may be possible that they dirty it then
         * drop the reference. So if PageDirty is tested before page_count
         * here, then the following race may occur:
         *
         * get_user_pages(&page);
         * [user mapping goes away]
         * write_to(page);
         *                !PageDirty(page)    [good]
         * SetPageDirty(page);
         * put_page(page);
         *                !page_count(page)   [good, discard it]
         *
         * [oops, our write_to data is lost]
         *
         * Reversing the order of the tests ensures such a situation cannot
         * escape unnoticed. The smp_rmb is needed to ensure the page->flags
         * load is not satisfied before that of page->_count.
         *
         * Note that if SetPageDirty is always performed via set_page_dirty,
         * and thus under tree_lock, then this ordering is not required.
         */
        if (!page_ref_freeze(page, 2))
            goto cannot_free;
        /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
        if (unlikely(PageDirty(page))) {
            page_ref_unfreeze(page, 2);
            goto cannot_free;
        }
    
        if (PageSwapCache(page)) {
            swp_entry_t swap = { .val = page_private(page) };
            __delete_from_swap_cache(page);
            spin_unlock_irq(&mapping->tree_lock);
            swapcache_free(swap, page);
        } else {
            void (*freepage)(struct page *);
            void *shadow = NULL;
    
            freepage = mapping->a_ops->freepage;
            /*
             * Remember a shadow entry for reclaimed file cache in
             * order to detect refaults, thus thrashing, later on.
             *
             * But don't store shadows in an address space that is
             * already exiting.  This is not just an optizimation,
             * inode reclaim needs to empty out the radix tree or
             * the nodes are lost.  Don't plant shadows behind its
             * back.
             *
             * We also don't store shadows for DAX mappings because the
             * only page cache pages found in these are zero pages
             * covering holes, and because we don't want to mix DAX
             * exceptional entries and shadow exceptional entries in the
             * same page_tree.
             */
            if (reclaimed && page_is_file_cache(page) &&
                !mapping_exiting(mapping) && !dax_mapping(mapping))
                shadow = workingset_eviction(mapping, page);
            __delete_from_page_cache(page, shadow);
            spin_unlock_irq(&mapping->tree_lock);
            mem_cgroup_uncharge_cache_page(page);
    
            if (freepage != NULL)
                freepage(page);
        }
    
        return 1;
    
    cannot_free:
        spin_unlock_irq(&mapping->tree_lock);
        return 0;
    }

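    From the comment at the top we can see that remove_mapping is a wrapper around __remove_mapping. If __remove_mapping succeeds in removing the page from the page cache, it returns with the page's reference count at 0.

    Let's focus on the most critical part: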

       if (!page_ref_freeze(page, 2))
            goto cannot_free;
    
    static inline int page_ref_freeze(struct page *page, int count)
    {
        return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
    }
    
    
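    atomic_cmpxchg implements an atomic compare-and-exchange. Atomic means the CPU either does not perform the operation at all or completes it entirely, with no intermediate state visible; here, the compare and the exchange happen as one step.

    There is plenty of material online about atomics and compare-and-exchange, so I won't repeat it. The gist is: if the page's count is 2, it is swapped to 0 and the old value 2 is returned.

    In other words, if page->_count is 2 the page can be freed; otherwise we take the cannot_free branch. For a page that nobody else is using, page->_count is 2 after __pagevec_lookup has run, so the page can ultimately be freed.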
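The freeze step can be modelled in user space with a single compare-and-exchange. This is a sketch under the assumption that the page is represented by a bare counter; `model_ref_freeze`, `model_ref_unfreeze` and `model_try_remove` are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of page_ref_freeze(): swap the count to 0 only if it equals `expected`. */
static bool model_ref_freeze(atomic_int *count, int expected)
{
    return atomic_compare_exchange_strong(count, &expected, 0);
}

/* Model of page_ref_unfreeze(): a frozen count (0) is set to `value`. */
static void model_ref_unfreeze(atomic_int *count, int value)
{
    assert(atomic_load(count) == 0);
    atomic_store(count, value);
}

/*
 * One removal attempt starting from count `start`.
 * Returns the count left behind afterwards.
 */
static int model_try_remove(int start)
{
    atomic_int count;

    atomic_init(&count, start);
    if (!model_ref_freeze(&count, 2))
        return atomic_load(&count);   /* cannot_free: count is untouched */
    model_ref_unfreeze(&count, 1);    /* what remove_mapping() does on success */
    return atomic_load(&count);
}
```

Starting from 2, the attempt succeeds and leaves 1 (the reference held by whoever looked the page up). Starting from 3, the freeze fails and the count stays at 3.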

    Back to the wrapper function remove_mapping:

    /*
     * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
     * someone else has a ref on the page, abort and return 0.  If it was
     * successfully detached, return 1.  Assumes the caller has a single ref on
     * this page.
     */
    int remove_mapping(struct address_space *mapping, struct page *page)
    {
        if (__remove_mapping(mapping, page, false)) {
            /*
             * Unfreezing the refcount with 1 rather than 2 effectively
             * drops the pagecache ref for us without requiring another
             * atomic operation.
             */
            page_ref_unfreeze(page, 1);
            return 1;
        }
        return 0;
    }

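    To summarize: the lookup stage raises page->_count from 1 to 2, the compare-and-exchange swaps that 2 for 0, and finally page_ref_unfreeze sets the count to 1. The remaining reference is the one taken by the lookup; when the pagevec is released that reference is dropped and the page is freed.

    Interested readers can dig further. Pages in the page cache are put on the LRU by default. To avoid adding to the LRU list too frequently, a pagevec array was introduced: pages are buffered in it and flushed to the LRU list either when the array fills up or when a drain function is called explicitly.

    To tell the two states apart, a page sitting in the LRU pagevec array holds one extra count, and when it is actually moved onto the LRU list the count is decremented again, restoring the previous value. So the LRU list itself does not hold a reference.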

    As we know, a page's reference count is raised by 1 (from 0 to 1) when it is allocated from the freelist.

    get_page_from_freelist -> buffered_rmqueue -> prep_new_page -> set_page_refcounted; this last function does the +1.

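    The reason for writing this post is that I used to assume that adding a page to the LRU and adding it to the radix tree each required a lasting increment of the reference count. Adding to the radix tree does add 1, but the caller usually drops 1 right afterwards, so the increase around the radix tree insertion is only temporary. The LRU behaves the same way: the only increment I found is when the page is put into the LRU pagevec array, and that count is dropped again when the page is actually moved onto the LRU list.

    So a page that is really on the LRU list keeps its count unchanged, although PG_lru is of course set.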

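    Likewise, when I read add_to_page_cache_lru I saw that it does increment the reference count of the page being added, so I had long believed that a page in the page cache has a reference count of at least 2, and 3 after __pagevec_lookup.

    So when I saw the following code, I simply could not make sense of it.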

     if (!page_ref_freeze(page, 2))
            goto cannot_free;

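    Then I went back and noticed that callers of add_to_page_cache_lru typically call put_page afterwards, whether the add succeeded or failed. On success the page is now in the page cache, and the put_page offsets the reference taken during the add, leaving the count at 1. On failure the put_page frees the memory, because the count drops to 0.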

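    Back to the question at the top: a page that nobody is accessing, that is in the page cache and also on the LRU list, has a reference count of 1. I think of this 1 as the one added when the page was taken off the freelist; the kernel comment in remove_mapping accounts for it as the page cache's own reference, with the allocation reference being the one the caller drops through put_page. Either way the number is the same.

    The count can also be 2: a page that nobody is accessing and that is in the page cache, but is still sitting in a per-CPU LRU pagevec rather than on the LRU list itself, holds one extra reference. That is why fadvise64_64 does the following: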

    case POSIX_FADV_DONTNEED:
            if (!bdi_write_congested(mapping->backing_dev_info))
                __filemap_fdatawrite_range(mapping, offset, endbyte,
                               WB_SYNC_NONE);
    
            /* First and last FULL page! */
            start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
            end_index = (endbyte >> PAGE_CACHE_SHIFT);
    
            if (end_index >= start_index) {
                unsigned long count = invalidate_mapping_pages(mapping,
                            start_index, end_index);
    
                /*
                 * If fewer pages were invalidated than expected then
                 * it is possible that some of the pages were on
                 * a per-cpu pagevec for a remote CPU. Drain all
                 * pagevecs and try again.
                 */
                if (count < (end_index - start_index + 1)) {
                    lru_add_drain_all();
                    invalidate_mapping_pages(mapping, start_index,
                            end_index);
                }
            }
            break;

    When fewer pages were released than requested, lru_add_drain_all is called. Without it, pages still sitting in an LRU pagevec could not be released; after the drain, most of them can be.

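    The stap (SystemTap) records from that session are shown below.

    Before page_cache_get_speculative is called the count is 1. This is the baseline count of a page in the page cache (leaving aside pages that are otherwise in use):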

    enter 1147=page=0xffffea000e955080,flags=0x6fffff00020068,mapcount=-1,_count=1===
    0xffffffff81183381 : __find_get_pages+0x81/0x170 [kernel]
    0xffffffff8118ff2e : __pagevec_lookup+0x1e/0x30 [kernel]
    0xffffffff81191243 : invalidate_mapping_pages+0x93/0x1f0 [kernel]
    0xffffffff811850b4 : SyS_fadvise64_64+0x1a4/0x290 [kernel]
    0xffffffff811851ae : SyS_fadvise64+0xe/0x10 [kernel]
    0xffffffff81698b09 : system_call_fastpath+0x16/0x1b [kernel]

    After page_cache_get_speculative is called the count is 2:
    enter 1163=page=0xffffea000e955080,flags=0x6fffff00020068,mapcount=-1,_count=2===
    0xffffffff811833b5 : __find_get_pages+0xb5/0x170 [kernel]
    0xffffffff8118ff2e : __pagevec_lookup+0x1e/0x30 [kernel]
    0xffffffff81191243 : invalidate_mapping_pages+0x93/0x1f0 [kernel]
    0xffffffff811850b4 : SyS_fadvise64_64+0x1a4/0x290 [kernel]
    0xffffffff811851ae : SyS_fadvise64+0xe/0x10 [kernel]
    0xffffffff81698b09 : system_call_fastpath+0x16/0x1b [kernel]
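Both stacks above enter the kernel through SyS_fadvise64. From user space this path is reached with `posix_fadvise` and `POSIX_FADV_DONTNEED`. A minimal sketch follows; the helper name `drop_file_cache` is made up, and whether pages are actually dropped depends on the kernel version and the state of the pages.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to drop the cached pages of `path`.
 * Returns 0 on success, or an errno value on failure.
 * (Hypothetical helper for illustration.)
 */
static int drop_file_cache(const char *path)
{
    int fd;
    int err;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;

    /* offset 0 and len 0 cover the whole file. */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return err;
}
```

Note that `posix_fadvise` returns the error number directly rather than setting `errno`, which is why the helper passes its return value through unchanged.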

    A colleague asked about the while loop in __generic_file_splice_read:

    	while (spd.nr_pages < nr_pages) {
    		/*
    		 * Page could be there, find_get_pages_contig() breaks on
    		 * the first hole.
    		 */
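    		page = find_get_page(mapping, index); /* look up this specific page, which the earlier contiguous lookup did not find */
    		if (!page) { /* still not found even after readahead */
    			/*
    			 * page didn't exist, allocate one.
    			 */
    			page = page_cache_alloc_cold(mapping); /* allocate a page */
    			if (!page)
    				break;
    			/* add to the radix tree; this mainly sets page->mapping and related fields */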
    			error = add_to_page_cache_lru(page, mapping, index,
    						GFP_KERNEL);
    			if (unlikely(error)) {
    				page_cache_release(page);
    				if (error == -EEXIST)
    					continue;
    				break;
    			}
    			/*
    			 * add_to_page_cache() locks the page, unlock it
    			 * to avoid convoluting the logic below even more.
    			 */
    			unlock_page(page);
    		}
    
    		spd.pages[spd.nr_pages++] = page; /* add the found or newly allocated page to spd */
    		index++;
    	}
    

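    A page that goes through page_cache_alloc_cold and then add_to_page_cache_lru is not decremented here. Doesn't that contradict the description above? There is no -1 here because the count is supposed to be 2: the page is stored in spd, which must hold a reference. Rather than doing -1 followed by +1, the code simply leaves the count alone. Pages that were put into spd earlier through find_get_pages_contig should also have a count of 2 at this point (leaving concurrent users aside; with them it would be greater than 2).

    Whether a page enters spd through find_get_pages_contig or through spd.pages[spd.nr_pages++] = page, every page in spd is guaranteed to have a count of at least 2. When spd is released, each page is decremented by 1 and the count is restored.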
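The two ways a page ends up in spd can be compared with a counter model. This is a sketch under simplifying assumptions: pages are bare counters, the temporary LRU pagevec reference is ignored, and all `model_*` names and `MODEL_SPD_MAX` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_SPD_MAX 4

struct model_spd {
    atomic_int *pages[MODEL_SPD_MAX];   /* each entry points at a page count */
    int nr_pages;
};

/* Path 1: the page is already cached (count 1); the lookup takes a reference. */
static void model_spd_add_found(struct model_spd *spd, atomic_int *count)
{
    assert(spd->nr_pages < MODEL_SPD_MAX);
    atomic_fetch_add(count, 1);             /* find_get_page() */
    spd->pages[spd->nr_pages++] = count;
}

/* Path 2: a fresh page; allocation ref plus page cache ref, and no put. */
static void model_spd_add_new(struct model_spd *spd, atomic_int *count)
{
    assert(spd->nr_pages < MODEL_SPD_MAX);
    atomic_init(count, 1);                  /* page_cache_alloc_cold() */
    atomic_fetch_add(count, 1);             /* add_to_page_cache_lru() */
    spd->pages[spd->nr_pages++] = count;    /* allocation ref now belongs to spd */
}

/* Releasing spd drops one reference per page. */
static void model_spd_release(struct model_spd *spd)
{
    for (int i = 0; i < spd->nr_pages; i++)
        atomic_fetch_sub(spd->pages[i], 1);
    spd->nr_pages = 0;
}
```

Both paths leave the page at 2 while it is held in spd, and both return to 1 once spd is released.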

    So, for a page that is in the page cache and on the LRU list, with no read or write in progress and kswapd not currently aging it, the reference count is exactly 1.

    My knowledge is limited, so if you spot a mistake, please let me know. All rights reserved; if you repost this article, please include the address of the original.
  • Original article: https://www.cnblogs.com/10087622blog/p/8919384.html