VFS open system call flow: overview

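I have recently been reading through the flow of the VFS open system call. It is a fairly complex flow, involving the parsing of the path of the file being opened and the four major structs (file, dentry, inode, super_block).

The open system call also establishes many relationships. For example, if a file has never been opened before, a dentry is created for it and added to the dentry hashtable (dcache), and the file's inode is added to the inode hashtable (icache). Later opens of the same file can then use the dentry/inode structs already in the dcache/icache directly, which shortens the flow considerably. Open also sets the f_op member of the file struct, so that read and write calls after the open go straight through this set of functions.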

Here is a summary of the code I traced, split into 4 articles:

1. VFS open system call flow: overview

2. VFS open system call flow: link_path_walk()

3. VFS open system call flow: do_last()

4. VFS open system call flow: lookup in a concrete filesystem (ext4 fs lookup)

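These 4 articles frequently use the expression "current directory/file". It means the directory/file that is currently being resolved, i.e. the one represented by nameidata.last, not the directory represented by nameidata.path, which has already been resolved; that directory is the parent of the "current directory/file".

The last_type member of nameidata describes the last component and takes one of the following values:

LAST_NORM: the last component is an ordinary filename
LAST_ROOT: the last component is "/" (i.e. the whole pathname is "/")
LAST_DOT: the last component is "."
LAST_DOTDOT: the last component is ".."
LAST_BIND: the last component is a symbolic link into a special filesystem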
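The name-based part of this classification can be sketched in user space. This is only an illustration: the enum values and the function classify_component() are made up for this example, and LAST_ROOT and LAST_BIND are left out of the decision because they cannot be derived from the component string alone.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the kernel's LAST_* names; the numeric values are illustrative only. */
enum last_type { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND };

/*
 * Classify one path component of length len (the component contains no '/').
 * Only the three types that depend purely on the name are decided here.
 */
enum last_type classify_component(const char *name, size_t len)
{
    assert(name != NULL);
    if (len == 1 && name[0] == '.')
        return LAST_DOT;
    if (len == 2 && name[0] == '.' && name[1] == '.')
        return LAST_DOTDOT;
    return LAST_NORM;
}
```

A name such as "..a" starts with two dots but has length 3, so it is still classified as LAST_NORM.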

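The rest of this article traces the open flow for a file that already exists on the filesystem, using ext4 as the example filesystem. The state of the lookup is kept in struct nameidata: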

#define EMBEDDED_LEVELS 2
struct nameidata {
    struct path    path;
    struct qstr    last;
    struct path    root;
    struct inode    *inode; /* path.dentry.d_inode */
    unsigned int    flags;
    unsigned    seq, m_seq;
    int        last_type;
    unsigned    depth;
    int        total_link_count;
    struct saved {
        struct path link;
        struct delayed_call done;
        const char *name;
        unsigned seq;
    } *stack, internal[EMBEDDED_LEVELS];
    struct filename    *name;
    struct nameidata *saved;
    struct inode    *link_inode;
    unsigned    root_seq;
    int        dfd;
} __randomize_layout;

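path member: contains struct vfsmount *mnt and struct dentry *dentry. While the complete filename is being resolved, the path struct is updated each time one level is resolved, and the next level below it is then resolved based on the updated path. The mnt pointer points to the vfsmount of the filesystem the path belongs to, and dentry is the dentry of the directory resolved so far. For example, for the path /data/test, where an ext4 filesystem is mounted on /data and test is a regular directory, mnt is the vfsmount of the filesystem mounted on /data, and dentry is the dentry of the /data/test directory (a directory is also a kind of file; a directory file stores the names and inode numbers of its subdirectories/files).
last member (struct qstr): its name field is the name of the component currently being resolved. Like path, it is updated each time one level is resolved. The name member of nameidata, by contrast, is the struct filename holding the complete pathname and does not change during the walk.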

Overview of the VFS open system call flow

struct file *do_filp_open(int dfd, struct filename *pathname,
        const struct open_flags *op)
{
    struct nameidata nd;
    int flags = op->lookup_flags;
    struct file *filp;

    set_nameidata(&nd, dfd, pathname);
    filp = path_openat(&nd, op, flags | LOOKUP_RCU);
    if (unlikely(filp == ERR_PTR(-ECHILD)))
        filp = path_openat(&nd, op, flags);
    if (unlikely(filp == ERR_PTR(-ESTALE)))
        filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
    restore_nameidata();
    return filp;
}

set_nameidata() assigns struct filename *name to the name member of nameidata. This name represents the complete filename (including the path) of the file being opened:

static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
    struct nameidata *old = current->nameidata;
    p->stack = p->internal;
    p->dfd = dfd;
    p->name = name;
    p->total_link_count = old ? old->total_link_count : 0;
    p->saved = old;
    current->nameidata = p;
}
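Besides storing the name, set_nameidata() chains the new nameidata to whatever was current before it, which allows lookups to nest. A minimal user-space sketch of that save/restore pattern follows. struct walk_state, current_state, set_state() and restore_state() are hypothetical stand-ins, and the restore side is my reading of what restore_nameidata() has to undo, not a copy of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nameidata, reduced to two fields. */
struct walk_state {
    int total_link_count;
    struct walk_state *saved;   /* the state that was active before this one */
};

/* Plays the role of current->nameidata. */
struct walk_state *current_state;

/* Push: remember the old state, inherit its link count, become current. */
void set_state(struct walk_state *p)
{
    struct walk_state *old = current_state;

    p->total_link_count = old ? old->total_link_count : 0;
    p->saved = old;
    current_state = p;
}

/* Pop: make the previous state current again and hand the link count back. */
void restore_state(void)
{
    struct walk_state *now = current_state;
    struct walk_state *old;

    assert(now != NULL);
    old = now->saved;
    current_state = old;
    if (old)
        old->total_link_count = now->total_link_count;
}
```

Carrying total_link_count across both directions means that the symlink budget is shared by nested lookups instead of being reset for each one.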

path_openat() calls path_init(), link_path_walk(), and do_last() in turn:

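1. path_init() initializes the path member of nameidata. If the path of the file being opened is an absolute path starting with the root directory /, path is initialized to the dentry and vfsmount of the root directory /, in preparation for the subsequent directory resolution. Its return value s points to the beginning of the complete path string of the file being opened;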

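2. link_path_walk(const char *name, struct nameidata *nd): the name parameter is the return value s of path_init(). link_path_walk() resolves the file path level by level until it reaches the last level, and finally saves the filename in the last member of nameidata so that do_last() can perform the final file open. When resolving each level, it first searches the dcache (dentry_hashtable), which is the fast lookup. If a dentry is found, it is saved in a path struct (mnt and dentry). If not, the dentry of this directory is not in the dcache (for example because the directory has not been looked up before), and one has to be created (slow path). Creating a dentry means first allocating a dentry, then calling the concrete filesystem's lookup function to find the directory's ext4_dir_entry_2 by name. That struct contains the inode number, which is used to search the inode hash list: if the inode is found, none needs to be allocated; otherwise an inode is allocated. d_splice_alias() is then called to associate the dentry with the inode, i.e. to assign the inode to the dentry's d_inode member. Whether on the fast path or the slow path, the dentry that was found or allocated is saved in a path struct (dentry/mnt) at the end, and step_into() is then called to copy that path struct into the path member of nameidata (path_to_nameidata). nameidata now points to the current directory, one level of resolution is complete, and control returns to link_path_walk() to resolve the next level.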

This stage resolves directories.
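The split between "directories walked by link_path_walk()" and "last component left for do_last()" can be illustrated with a small user-space sketch. It only handles the string side of the walk; struct component and walk_dirs() are hypothetical names, and no dcache or filesystem lookup is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the qstr stored in nameidata.last. */
struct component {
    const char *name;   /* points into the original path string */
    size_t len;
};

/*
 * Walk every component of path except the last one. Returns the number of
 * directories that would be stepped into and leaves the final component
 * in *last for the last-component stage.
 */
int walk_dirs(const char *path, struct component *last)
{
    int dirs = 0;

    assert(path != NULL && last != NULL);
    last->name = path;
    last->len = 0;

    while (*path == '/')            /* skip leading slashes */
        path++;

    while (*path) {
        size_t len = strcspn(path, "/");

        last->name = path;
        last->len = len;
        path += len;
        if (!*path)                 /* nothing follows: this was the last component */
            break;
        while (*path == '/')        /* skip the separator(s) */
            path++;
        if (!*path)                 /* only a trailing slash followed */
            break;
        dirs++;                     /* a directory: step into it and continue */
    }
    return dirs;
}
```

For "/data/test/test.txt" the function reports two directories (data and test) and leaves "test.txt" in last.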

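3. do_last() works from the final result of link_path_walk(). At this point the directory part of the path has been fully resolved and only the final filename remains. If the open flags do not contain O_CREAT, do_last() first runs lookup_fast() to check whether the file already has a dentry; if so, that dentry is saved in a path struct. If O_CREAT is set, or lookup_fast() finds nothing, lookup_open() is run. This function still searches the dcache first and, if nothing is found, creates a dentry in the same way as in the link_path_walk() flow. Both the lookup_fast path and the lookup_open path fill in the path struct with the dentry that was found or allocated for the current file (dentry/mnt), and then reach step_into(), which copies the path struct into nameidata.path (path_to_nameidata). nameidata now "points to" the current file, i.e. the complete file path. Finally vfs_open() is called to run the open function in the concrete filesystem's file_operations; for ext4 this is ext4_file_open().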

This stage resolves the final file.
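The choice between the two lookup paths for the last component can be written down as a toy decision function. Everything here is hypothetical (struct toy_cache, resolve_last(), the enum); "creating an entry" stands for allocating and caching a dentry, not for creating a file on disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_ENTRIES 8

/* Hypothetical toy cache standing in for the dcache: a flat array of names. */
struct toy_cache {
    const char *names[MAX_ENTRIES];
    int count;
};

enum how { VIA_LOOKUP_FAST, VIA_LOOKUP_OPEN_CACHED, VIA_LOOKUP_OPEN_NEW };

/* Return the index of name in the cache, or -1 if it is not cached. */
int cache_find(const struct toy_cache *c, const char *name)
{
    for (int i = 0; i < c->count; i++)
        if (strcmp(c->names[i], name) == 0)
            return i;
    return -1;
}

/*
 * Resolve the last component: the fast lookup is tried only when O_CREAT
 * is absent; otherwise, or on a miss, the lookup_open-style path checks
 * the cache again and caches a new entry if needed.
 */
enum how resolve_last(struct toy_cache *c, const char *name, bool o_creat)
{
    if (!o_creat && cache_find(c, name) >= 0)
        return VIA_LOOKUP_FAST;

    if (cache_find(c, name) >= 0)
        return VIA_LOOKUP_OPEN_CACHED;

    assert(c->count < MAX_ENTRIES);
    c->names[c->count++] = name;    /* "allocate" and cache a new entry */
    return VIA_LOOKUP_OPEN_NEW;
}
```

The first open of a name goes through the creating path, and a second open without O_CREAT is then served by the fast lookup.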

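Computing the dentry hash value

hash_len = hash_name(nd->path.dentry, name): the hash value of a dentry is computed by hash_name(). Its dentry parameter is the parent dentry, and name is the name of the current directory. The resulting hash is a 32-bit integer that depends on both the parent dentry and the name; changing either one will, in practice, give a different hash. hash_name() also computes the length of the current component's name string, splitting the path at the / separator. For example, if name points to "test/test.txt", the computed len is 4. How hash_name() computes hash/len can be checked with the following test function: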

/* Test helper: call it from within fs/namei.c, where hash_name() is defined. */
void hash_name_test(void)
{
    int a = 0x1000;
    int b = 0x2000;

    /* Two distinct addresses, used as the "parent" argument. */
    int *p0 = &a;
    int *p1 = &b;

    u64 hashlen;

    /* Same parent and same name twice: the results must be identical. */
    hashlen = hash_name(p0, "p0 pointer");
    pr_info("p0(p0 pointer) hash: %#x, len: %u.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
    hashlen = hash_name(p0, "p0 pointer");
    pr_info("p0(p0 pointer) hash: %#x, len: %u.\n", hashlen_hash(hashlen), hashlen_len(hashlen));

    /* Same parent, name shortened by one character. */
    hashlen = hash_name(p0, "p0 pointe");
    pr_info("p0(p0 pointe ) hash: %#x, len: %u.\n", hashlen_hash(hashlen), hashlen_len(hashlen));

    /* Different parent. */
    hashlen = hash_name(p1, "p1 pointer");
    pr_info("p1             hash: %#x, len: %u.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
}

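The result is shown below. With the same first argument, dropping just the single character r from the second argument gives a completely different hash value, and the computed len is the length of the second argument string (the strings in this example contain no /):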

[ 1762.282727] p0(p0 pointer) hash: 0x4272c576, len: 10.
[ 1762.287975] p0(p0 pointer) hash: 0x4272c576, len: 10.
[ 1762.293312] p0(p0 pointe ) hash: 0x08f34377, len: 9.
[ 1762.298429] p1             hash: 0x22eea1ab, len: 10.

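Matching rule when searching the dcache for the dentry of the current directory

There are two APIs for searching the dcache: __d_lookup_rcu() and __d_lookup(). The difference is that the former takes neither rcu_read_lock() itself (its caller is already in RCU-walk mode) nor the dentry's d_lock spinlock, relying on the d_seq sequence count instead, while the latter takes both. The former is therefore the more efficient lookup.

The matching rule is to compare the hash value in the qstr name with that of the candidate dentry; if they are equal, the parent dentry and the name are compared as well.

What looks a little odd is that in __d_lookup_rcu(), when the parent dentry of the current directory does not have the DCACHE_OP_COMPARE flag, only the name is compared. This seems to be the case for most filesystems: no d_compare function is provided and the flag is not set, ext4 being one example. In that branch d_name.hash_len, which holds the hash and the string length in one 64-bit value, is compared first, and only if it matches is the content of the name string compared. The parent dentry has already been compared before that, though: if it is the same, both entries are under the same parent directory, and comparing the name then looks sufficient to identify the current directory uniquely, since two directories/files with the same name cannot exist in one directory.

__d_lookup(), by contrast, compares the hash value, the parent dentry, and the name string in that order.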

struct dentry *__d_lookup_rcu(const struct dentry *parent,
                const struct qstr *name,
                unsigned *seqp)
{
    u64 hashlen = name->hash_len;
    const unsigned char *str = name->name;
    struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
    struct hlist_bl_node *node;
    struct dentry *dentry;

    /*
     * Note: There is significant duplication with __d_lookup_rcu which is
     * required to prevent single threaded performance regressions
     * especially on architectures where smp_rmb (in seqcounts) are costly.
     * Keep the two functions in sync.
     */

    /*
     * The hash list is protected using RCU.
     *
     * Carefully use d_seq when comparing a candidate dentry, to avoid
     * races with d_move().
     *
     * It is possible that concurrent renames can mess up our list
     * walk here and result in missing our dentry, resulting in the
     * false-negative result. d_lookup() protects against concurrent
     * renames using rename_lock seqlock.
     *
     * See Documentation/filesystems/path-lookup.txt for more details.
     */
    hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
        unsigned seq;

seqretry:
        /*
         * The dentry sequence count protects us from concurrent
         * renames, and thus protects parent and name fields.
         *
         * The caller must perform a seqcount check in order
         * to do anything useful with the returned dentry.
         *
         * NOTE! We do a "raw" seqcount_begin here. That means that
         * we don't wait for the sequence count to stabilize if it
         * is in the middle of a sequence change. If we do the slow
         * dentry compare, we will do seqretries until it is stable,
         * and if we end up with a successful lookup, we actually
         * want to exit RCU lookup anyway.
         *
         * Note that raw_seqcount_begin still *does* smp_rmb(), so
         * we are still guaranteed NUL-termination of ->d_name.name.
         */
        seq = raw_seqcount_begin(&dentry->d_seq);
        if (dentry->d_parent != parent)
            continue;
        if (d_unhashed(dentry))
            continue;

        if (unlikely(parent->d_flags & DCACHE_OP_COMPARE)) {
            int tlen;
            const char *tname;
            if (dentry->d_name.hash != hashlen_hash(hashlen))
                continue;
            tlen = dentry->d_name.len;
            tname = dentry->d_name.name;
            /* we want a consistent (name,len) pair */
            if (read_seqcount_retry(&dentry->d_seq, seq)) {
                cpu_relax();
                goto seqretry;
            }
            if (parent->d_op->d_compare(dentry,
                            tlen, tname, name) != 0)
                continue;
        } else {
            if (dentry->d_name.hash_len != hashlen)
                continue;
            if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0)
                continue;
        }
        *seqp = seq;
        return dentry;
    }
    return NULL;
}

struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
    unsigned int hash = name->hash;
    struct hlist_bl_head *b = d_hash(hash);
    struct hlist_bl_node *node;
    struct dentry *found = NULL;
    struct dentry *dentry;

    /*
     * Note: There is significant duplication with __d_lookup_rcu which is
     * required to prevent single threaded performance regressions
     * especially on architectures where smp_rmb (in seqcounts) are costly.
     * Keep the two functions in sync.
     */

    /*
     * The hash list is protected using RCU.
     *
     * Take d_lock when comparing a candidate dentry, to avoid races
     * with d_move().
     *
     * It is possible that concurrent renames can mess up our list
     * walk here and result in missing our dentry, resulting in the
     * false-negative result. d_lookup() protects against concurrent
     * renames using rename_lock seqlock.
     *
     * See Documentation/filesystems/path-lookup.txt for more details.
     */
    rcu_read_lock();
    
    hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {

        if (dentry->d_name.hash != hash)
            continue;

        spin_lock(&dentry->d_lock);
        if (dentry->d_parent != parent)
            goto next;
        if (d_unhashed(dentry))
            goto next;

        if (!d_same_name(dentry, parent, name))
            goto next;

        dentry->d_lockref.count++;
        found = dentry;
        spin_unlock(&dentry->d_lock);
        break;
next:
        spin_unlock(&dentry->d_lock);
     }
     rcu_read_unlock();

     return found;
}
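Stripped of locking and sequence counts, the matching rule of both functions comes down to a walk over one hash bucket. The following user-space sketch shows just that comparison order; struct toy_dentry and bucket_lookup() are hypothetical, and the test data deliberately gives two entries the same hash to show why the parent comparison is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, stripped-down dentry: only what the matching rule needs. */
struct toy_dentry {
    const struct toy_dentry *parent;
    const char *name;
    uint32_t hash;
    uint32_t len;
    struct toy_dentry *next;    /* next entry in the same hash bucket */
};

/*
 * Return the entry in one bucket that matches (hash, parent, len, name),
 * or NULL. The cheap integer and pointer comparisons come first; memcmp()
 * runs only on candidates that pass them.
 */
struct toy_dentry *bucket_lookup(struct toy_dentry *bucket,
                                 const struct toy_dentry *parent,
                                 const char *name, uint32_t hash, uint32_t len)
{
    for (struct toy_dentry *d = bucket; d != NULL; d = d->next) {
        if (d->hash != hash)
            continue;
        if (d->parent != parent)
            continue;
        if (d->len != len)
            continue;
        if (memcmp(d->name, name, len) != 0)
            continue;
        return d;
    }
    return NULL;
}
```

Two entries named "test" under different parents can share a bucket, and the parent pointer is what tells them apart.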

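path_openat() returning -ECHILD

If lookup_fast() does not find the dentry of the current directory in the dcache and the unlazy_walk() it then calls fails, it returns -ECHILD. This ends the current path_openat(), and path_openat() is called again, this time without LOOKUP_RCU. In the retried flow, lookup_fast() uses the RCU lock and the dentry spinlock when searching the dcache, which makes the search less efficient. The first path_openat() call carries LOOKUP_RCU, so lookup_fast() does not have to take these locks during the dcache lookup, which is faster.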

struct file *do_filp_open(int dfd, struct filename *pathname,
        const struct open_flags *op)
{
    struct nameidata nd;
    int flags = op->lookup_flags;
    struct file *filp;

    set_nameidata(&nd, dfd, pathname);
    filp = path_openat(&nd, op, flags | LOOKUP_RCU);
    if (unlikely(filp == ERR_PTR(-ECHILD)))
        filp = path_openat(&nd, op, flags);
    if (unlikely(filp == ERR_PTR(-ESTALE)))
        filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
    restore_nameidata();
    return filp;
}
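The three-step fallback in do_filp_open() is a general "optimistic first, then safer" pattern. A minimal user-space sketch, assuming a callback in place of path_openat(): the TOY_* flag values, open_attempt_fn, open_with_fallback() and rcu_always_bails() are all invented for this illustration.

```c
#include <assert.h>
#include <errno.h>

/* Flag values are illustrative; only their distinctness matters here. */
#define TOY_LOOKUP_RCU    0x1
#define TOY_LOOKUP_REVAL  0x2

/* Hypothetical callback standing in for path_openat(): returns 0 or -errno. */
typedef int (*open_attempt_fn)(unsigned int flags, void *ctx);

/* Try the lockless mode first, then the locked mode, then with revalidation. */
int open_with_fallback(open_attempt_fn attempt, unsigned int flags, void *ctx)
{
    int ret = attempt(flags | TOY_LOOKUP_RCU, ctx);

    if (ret == -ECHILD)
        ret = attempt(flags, ctx);
    if (ret == -ESTALE)
        ret = attempt(flags | TOY_LOOKUP_REVAL, ctx);
    return ret;
}

/* Example attempt: fails in RCU mode, succeeds otherwise; counts calls in *ctx. */
int rcu_always_bails(unsigned int flags, void *ctx)
{
    ++*(int *)ctx;
    return (flags & TOY_LOOKUP_RCU) ? -ECHILD : 0;
}
```

With rcu_always_bails() as the attempt, the helper makes exactly two calls: the optimistic one that returns -ECHILD and the retry that succeeds.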

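The f_op member of the file struct

For files on a filesystem, the f_op member of the file struct, i.e. its file_operations, is the i_fop member of the file's inode struct. i_fop is set when the filesystem initializes the inode after allocating it. For ext4 this happens in ext4_lookup()/__ext4_iget(), where it is set to ext4_file_operations: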

__ext4_iget()
    if (S_ISREG(inode->i_mode)) { // regular file
        inode->i_op = &ext4_file_inode_operations;
        inode->i_fop = &ext4_file_operations;
        ext4_set_aops(inode);
    } else if (S_ISDIR(inode->i_mode)) { // directory
        inode->i_op = &ext4_dir_inode_operations;
        inode->i_fop = &ext4_dir_operations;
    } else if (S_ISLNK(inode->i_mode)) { // symbolic link

do_dentry_open() then assigns inode->i_fop to file.f_op:

do_dentry_open
    f->f_op = fops_get(inode->i_fop); // for the filesystem (non-driver) case, fops_get() returns inode->i_fop
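The hand-over from i_fop to f_op is ordinary function-pointer dispatch, which a short user-space sketch can show. All toy_* types and functions below are hypothetical and reduced to a single read hook; they only mirror the two steps "pick the table when the inode is initialized" and "copy it when the file is opened".

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

/* Hypothetical, minimal file_operations: a single read hook. */
struct toy_file_operations {
    int (*read)(struct toy_file *f);
};

struct toy_inode {
    int is_dir;                                 /* stands in for the i_mode check */
    const struct toy_file_operations *i_fop;
};

struct toy_file {
    const struct toy_file_operations *f_op;
};

int toy_reg_read(struct toy_file *f) { (void)f; return 1; }
int toy_dir_read(struct toy_file *f) { (void)f; return 2; }

const struct toy_file_operations toy_reg_fops = { .read = toy_reg_read };
const struct toy_file_operations toy_dir_fops = { .read = toy_dir_read };

/* "iget" step: choose the operation table once, when the inode is set up. */
void toy_init_inode(struct toy_inode *inode, int is_dir)
{
    inode->is_dir = is_dir;
    inode->i_fop = is_dir ? &toy_dir_fops : &toy_reg_fops;
}

/* "open" step: the file simply inherits the inode's table. */
void toy_open(struct toy_file *f, const struct toy_inode *inode)
{
    assert(inode->i_fop != NULL);
    f->f_op = inode->i_fop;
}
```

After toy_open(), every call through f->f_op reaches the implementation chosen for that inode type, without the caller checking the type again.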

Original article: https://www.cnblogs.com/aspirs/p/15730173.html