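VFS open system call flow: overview
I have recently been reading through the flow of the VFS open system call. It is a fairly complex flow that involves parsing the path of the file being opened and the four core structs (file, dentry, inode, super_block).
The open system call also establishes many relationships. For example, if a file has not been opened before, open creates a dentry for it and adds that dentry to the dentry hashtable (dcache), and the file's inode is added to the inode hashtable (icache). When someone opens the file again later, the dentry/inode structs in the dcache/icache can be used directly, which makes the flow much shorter. Open also sets the f_op member of the file struct, so that read and write calls after the open go straight through this set of functions.
Here I summarize the code I have traced recently, split into 4 articles:
1. VFS open system call flow: overview
2. VFS open system call flow: link_path_walk()
3. VFS open system call flow: do_last()
4. VFS open system call flow: lookup in a concrete filesystem (ext4 fs lookup)
These 4 articles use the expression "current directory/file" a lot. It means the directory/file that is currently being resolved, i.e. the one represented by nameidata.last. It does not mean the directory represented by nameidata.path, which has already been resolved and is the parent directory of the "current directory/file".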
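The last_type member of nameidata describes the last component and takes one of the following values:
LAST_NORM: the last component is an ordinary file name
LAST_ROOT: the last component is "/" (that is, the whole path name is "/")
LAST_DOT: the last component is "."
LAST_DOTDOT: the last component is ".."
LAST_BIND: the last component is a symbolic link into a special filesystem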
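The three types that depend only on the component name can be told apart with a few character checks. The sketch below is a userspace illustration, not kernel code: classify_component() and the enum are hypothetical names, and LAST_ROOT and LAST_BIND are listed only for completeness because they cannot be derived from the name alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the LAST_* values listed above. */
enum last_type { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND };

/*
 * Classify one path component of length len (no '/' inside).
 * Only LAST_NORM, LAST_DOT and LAST_DOTDOT can be decided from the
 * name; the other two values are never returned by this sketch.
 */
enum last_type classify_component(const char *name, size_t len)
{
	assert(name != NULL && len > 0);
	if (name[0] == '.') {
		if (len == 1)
			return LAST_DOT;
		if (len == 2 && name[1] == '.')
			return LAST_DOTDOT;
	}
	return LAST_NORM;
}
```

A name such as ".git" starts with a dot but is still LAST_NORM, which is why the length is checked as well.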
The rest of this article traces the open flow for a file that already exists on the filesystem, using ext4 as the filesystem.
#define EMBEDDED_LEVELS 2
struct nameidata {
	struct path	path;
	struct qstr	last;
	struct path	root;
	struct inode	*inode; /* path.dentry.d_inode */
	unsigned int	flags;
	unsigned	seq, m_seq;
	int		last_type;
	unsigned	depth;
	int		total_link_count;
	struct saved {
		struct path link;
		struct delayed_call done;
		const char *name;
		unsigned seq;
	} *stack, internal[EMBEDDED_LEVELS];
	struct filename	*name;
	struct nameidata *saved;
	struct inode	*link_inode;
	unsigned	root_seq;
	int		dfd;
} __randomize_layout;
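- path member: contains struct vfsmount *mnt and struct dentry *dentry. While the full file name in filename is being resolved, the path struct is updated after each level is resolved, and the next level below it is resolved on the basis of the updated path. mnt points to the vfsmount of the filesystem that the current path belongs to, and dentry is the dentry of the current directory. For example, for the path /data/test, where an ext4 filesystem is mounted on /data and test is a regular directory, mnt is the vfsmount of the filesystem mounted on /data, and dentry is the dentry of the /data/test directory (a directory is also a kind of file; a directory file stores the names and inode numbers of its subdirectories/files).
- last member: its name member is the name of the directory component currently being resolved. Like path, it is updated each time one directory level is resolved. (The name member of nameidata itself is the struct filename holding the complete path and does not change during the walk.)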
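To make the relationship between the "already resolved" part and the "currently being resolved" part concrete, here is a minimal userspace sketch. Everything in it is hypothetical: struct toy_nameidata keeps the resolved directory as a plain string instead of a (mnt, dentry) pair, and toy_next_component() and toy_step_into() only imitate the order in which the two members are updated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for struct nameidata. */
struct toy_nameidata {
	char path[256];		/* directory already resolved (the parent) */
	const char *last;	/* component currently being resolved */
	size_t last_len;	/* length of that component */
};

/*
 * Point nd->last at the next component of *cursor and advance *cursor
 * past it. Returns 0 when no component is left, 1 otherwise.
 */
int toy_next_component(struct toy_nameidata *nd, const char **cursor)
{
	const char *s = *cursor;

	while (*s == '/')		/* skip separators */
		s++;
	if (*s == '\0')
		return 0;
	nd->last = s;
	nd->last_len = strcspn(s, "/");
	*cursor = s + nd->last_len;
	return 1;
}

/* "Step into" the current component: it becomes part of nd->path. */
void toy_step_into(struct toy_nameidata *nd)
{
	size_t used = strlen(nd->path);

	assert(used + 1 + nd->last_len < sizeof(nd->path));
	if (used == 0 || nd->path[used - 1] != '/')
		nd->path[used++] = '/';
	memcpy(nd->path + used, nd->last, nd->last_len);
	nd->path[used + nd->last_len] = '\0';
}
```

Starting with path set to "/" and the input "/data/test", the first call leaves path at "/" and last at "data"; only after toy_step_into() does path become "/data". Between those two calls, last names the "current directory" and path names its parent.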
Overview of the VFS open system call flow
struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op)
{
	struct nameidata nd;
	int flags = op->lookup_flags;
	struct file *filp;

	set_nameidata(&nd, dfd, pathname);
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(filp == ERR_PTR(-ECHILD)))
		filp = path_openat(&nd, op, flags);
	if (unlikely(filp == ERR_PTR(-ESTALE)))
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	return filp;
}
set_nameidata() assigns struct filename *name to the name member of nameidata. This name represents the complete file name (including the path) of the file being opened:
static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
	struct nameidata *old = current->nameidata;

	p->stack = p->internal;
	p->dfd = dfd;
	p->name = name;
	p->total_link_count = old ? old->total_link_count : 0;
	p->saved = old;
	current->nameidata = p;
}
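set_nameidata() also chains the new nameidata to the one the task was already using (p->saved = old), so lookups can nest. The following userspace sketch shows only that chaining. The names are hypothetical, toy_current_nd stands in for current->nameidata, and the restore function is an assumption that merely pops back to the saved entry; it is not a copy of the kernel's restore_nameidata().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of struct nameidata to the two fields involved. */
struct toy_nd_frame {
	struct toy_nd_frame *saved;	/* previous lookup of this task, if any */
	int total_link_count;
};

/* Stand-in for current->nameidata. */
struct toy_nd_frame *toy_current_nd;

/* Same ordering as set_nameidata(): inherit the count, save, install. */
void toy_set_nameidata(struct toy_nd_frame *p)
{
	struct toy_nd_frame *old = toy_current_nd;

	p->total_link_count = old ? old->total_link_count : 0;
	p->saved = old;
	toy_current_nd = p;
}

/* ASSUMPTION: restoring simply pops back to the saved lookup. */
void toy_restore_nameidata(void)
{
	assert(toy_current_nd != NULL);
	toy_current_nd = toy_current_nd->saved;
}
```

A nested lookup therefore starts with the link count of the outer one, and the outer nameidata becomes current again once the inner lookup is restored.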
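path_openat() calls path_init(), link_path_walk() and do_last():
1. path_init() initializes the path member of nameidata. If the path of the file being opened is an absolute path starting with the root directory /, path is initialized to the dentry and vfsmount of the root directory /, in preparation for the directory resolution that follows. Its return value s points to the start of the complete path string of the file being opened;
2. link_path_walk(const char *name, struct nameidata *nd): the name argument is the return value s of path_init(). link_path_walk() resolves the file path level by level until it reaches the last level, and finally saves the file name in the last member of nameidata so that do_last() can handle the final open. Each level is first looked up in the dcache (dentry_hashtable) (fast lookup). If it is found, the dentry found is saved in a path struct (mnt and dentry). If it is not found, the directory has not been looked up before (or its dentry is no longer cached) and a dentry has to be created (slow path). Creating a dentry first allocates one and then calls the lookup function of the concrete filesystem to find the ext4_dir_entry_2 of this directory by name. That struct holds the inode number, which is used to search the inode hash list: if the inode is found, no inode needs to be allocated; if not, an inode is allocated. d_splice_alias() is then called to associate the dentry with the inode, i.e. to assign the inode to the d_inode member of the dentry. Whether the fast path or the slow path is taken, at the end the dentry that was found or allocated is saved in a path struct (dentry/mnt), and step_into() is called to assign that path struct to the path member of nameidata (path_to_nameidata). nameidata now points to the current directory, one directory level has been resolved, and control returns to link_path_walk() to resolve the next level.
This stage resolves directories.
3. do_last() works from the final result of link_path_walk(): at this point the directory part of the path has been fully resolved and only the final file name remains. If the open flags do not include O_CREAT, do_last() first runs lookup_fast() to check whether the file already has a dentry, and if so saves it in a path struct. If O_CREAT is set or lookup_fast() finds nothing, lookup_open() runs. This function still searches the dcache first and, if nothing is found, creates a dentry in the same way as in the link_path_walk() flow. Both the lookup_fast() route and the lookup_open() route set up a path struct holding the dentry of the current file that was found or allocated (dentry/mnt), and then reach step_into(), which assigns the path struct to nameidata.path (path_to_nameidata). nameidata now "points to" the current file, i.e. the complete file path. Finally vfs_open() is called to run the open function of the concrete filesystem's file_operations; for ext4 this is ext4_file_open().
This stage resolves the final file.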
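The fast path / slow path split described in steps 2 and 3 is essentially "consult the cache, and populate it on a miss". The sketch below models just that idea in userspace. All names are hypothetical, the hash is a throwaway one, and the "filesystem lookup" is reduced to bumping a counter, so this is an illustration of the caching behaviour and not of the kernel's data structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BUCKETS 64

/* Hypothetical miniature dentry: a name under a parent, chained per bucket. */
struct toy_dentry {
	const struct toy_dentry *parent;
	char name[32];
	struct toy_dentry *next;
};

static struct toy_dentry *toy_dcache[TOY_BUCKETS];
int toy_slow_lookups;	/* how many times the "filesystem" was consulted */

/* Throwaway hash over (parent, name); not the kernel's hash. */
static unsigned toy_bucket(const struct toy_dentry *parent, const char *name)
{
	unsigned h = (unsigned)(uintptr_t)parent;

	while (*name)
		h = h * 31 + (unsigned char)*name++;
	return h % TOY_BUCKETS;
}

/* Fast path: search the cache only. */
struct toy_dentry *toy_lookup_fast(const struct toy_dentry *parent,
				   const char *name)
{
	struct toy_dentry *d = toy_dcache[toy_bucket(parent, name)];

	for (; d; d = d->next)
		if (d->parent == parent && strcmp(d->name, name) == 0)
			return d;
	return NULL;
}

/* Slow path: allocate a dentry, pretend to ask the filesystem, cache it. */
struct toy_dentry *toy_lookup_slow(const struct toy_dentry *parent,
				   const char *name)
{
	struct toy_dentry *d = calloc(1, sizeof(*d));
	unsigned b = toy_bucket(parent, name);

	assert(d != NULL && strlen(name) < sizeof(d->name));
	toy_slow_lookups++;
	d->parent = parent;
	strcpy(d->name, name);
	d->next = toy_dcache[b];
	toy_dcache[b] = d;
	return d;
}

/* Resolve one component: try the cache first, then fall back. */
struct toy_dentry *toy_walk_component(const struct toy_dentry *parent,
				      const char *name)
{
	struct toy_dentry *d = toy_lookup_fast(parent, name);

	return d ? d : toy_lookup_slow(parent, name);
}
```

Resolving "data" and then "test" under it costs two slow lookups the first time; resolving "test" under the same parent again returns the cached entry and the counter stays at two.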
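Computing the dentry hash value
hash_len = hash_name(nd->path.dentry, name): the hash value of a dentry is computed by hash_name(). Its dentry argument is the parent dentry and name is the name of the current directory. The resulting hash is a 32-bit integer that depends on both the parent dentry and the name; a change in either one gives a different hash in practice. hash_name() also computes the length of the current directory name, splitting components at the / separator. For example, if name points to "test/test.txt", the computed len is 4. How hash_name() computes hash/len can be checked with the following test function: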
/* hash_name() is static in fs/namei.c, so this test has to be built there. */
void hash_name_test(void)
{
	int a = 0x1000;
	int b = 0x2000;
	int *p0 = &a;	/* first salt (stands in for a parent dentry) */
	int *p1 = &b;	/* second salt */
	u64 hashlen;

	/* same salt, same name, twice: both results must be identical */
	hashlen = hash_name(p0, "p0 pointer");
	pr_info("p0(p0 pointer) hash: %#x, len: %d.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
	hashlen = hash_name(p0, "p0 pointer");
	pr_info("p0(p0 pointer) hash: %#x, len: %d.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
	/* same salt, name one character shorter */
	hashlen = hash_name(p0, "p0 pointe");
	pr_info("p0(p0 pointe ) hash: %#x, len: %d.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
	/* different salt */
	hashlen = hash_name(p1, "p1 pointer");
	pr_info("p1 hash: %#x, len: %d.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
}
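The output is shown below. With the same first argument, dropping just the single character r from the second argument gives a completely different hash. The computed len is the length of the second argument string (the strings in this example contain no /):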
[ 1762.282727] p0(p0 pointer) hash: 0x4272c576, len: 10.
[ 1762.287975] p0(p0 pointer) hash: 0x4272c576, len: 10.
[ 1762.293312] p0(p0 pointe ) hash: 0x08f34377, len: 9.
[ 1762.298429] p1 hash: 0x22eea1ab, len: 10.
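The two properties observed above (the length stops at the first /, and hash and length travel together in one 64-bit value) can be reproduced in userspace. In the sketch below the packing follows the hashlen_hash()/hashlen_len() usage shown in the test, with the hash in the low 32 bits and the length in the high 32 bits. The mixing step is an assumption: it is FNV-1a, picked for brevity, and it is not the kernel's hash_name() algorithm, so the values differ from the log above. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack hash (low 32 bits) and length (high 32 bits) into one value. */
uint64_t toy_hashlen_create(uint32_t hash, uint32_t len)
{
	return ((uint64_t)len << 32) | hash;
}

uint32_t toy_hashlen_hash(uint64_t hashlen)
{
	return (uint32_t)hashlen;
}

uint32_t toy_hashlen_len(uint64_t hashlen)
{
	return (uint32_t)(hashlen >> 32);
}

/*
 * Toy salted hash of the first path component of name.
 * ASSUMPTION: FNV-1a mixing, seeded with the salt pointer. This is NOT
 * the kernel's algorithm and produces different values.
 */
uint64_t toy_hash_name(const void *salt, const char *name)
{
	uint32_t hash = 2166136261u ^ (uint32_t)(uintptr_t)salt;
	uint32_t len = 0;

	assert(name != NULL);
	/* stop at the end of the string or at the component separator */
	while (name[len] != '\0' && name[len] != '/') {
		hash ^= (unsigned char)name[len];
		hash *= 16777619u;
		len++;
	}
	return toy_hashlen_create(hash, len);
}
```

With this sketch, "test/test.txt" and "test" give the same packed value under the same salt, because only the first component is hashed and its length is 4 in both cases.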
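Matching rules when looking up the current directory's dentry in the dcache
There are two APIs for looking up the dcache: __d_lookup_rcu() and __d_lookup(). The difference is that the former takes no lock of its own: it does not call rcu_read_lock() (it runs in rcu-walk mode, where the caller is already inside an RCU read-side critical section) and it does not take the dentry's d_lock spinlock, relying on the d_seq sequence count instead. The latter calls rcu_read_lock() and takes d_lock on each candidate. The former is therefore more efficient.
A candidate matches when its hash equals the hash in the qstr name; if it does, the parent dentry and the name are compared as well.
At first sight it looks odd that when the parent dentry of the current directory does not have the DCACHE_OP_COMPARE flag, __d_lookup_rcu() seems to compare only the name string. For most filesystems the dentry does not have this flag, i.e. no d_compare function is provided; ext4, for example, does not set it in the code traced here. In that branch the code compares d_name.hash_len with hashlen. hash_len is a single 64-bit value that packs the 32-bit hash together with the length, so the hash and the string length are checked in one comparison, and only then does dentry_cmp() compare the contents of the name string. The parent dentry has already been compared before that, so the candidate is known to be in the same parent directory. Together these checks identify the current directory uniquely, since two directories/files with the same name cannot exist in the same directory.
__d_lookup() compares the hash value, the parent dentry and the name string, in that order.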
struct dentry *__d_lookup_rcu(const struct dentry *parent,
				const struct qstr *name,
				unsigned *seqp)
{
	u64 hashlen = name->hash_len;
	const unsigned char *str = name->name;
	struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
	struct hlist_bl_node *node;
	struct dentry *dentry;

	/*
	 * Note: There is significant duplication with __d_lookup_rcu which is
	 * required to prevent single threaded performance regressions
	 * especially on architectures where smp_rmb (in seqcounts) are costly.
	 * Keep the two functions in sync.
	 */

	/*
	 * The hash list is protected using RCU.
	 *
	 * Carefully use d_seq when comparing a candidate dentry, to avoid
	 * races with d_move().
	 *
	 * It is possible that concurrent renames can mess up our list
	 * walk here and result in missing our dentry, resulting in the
	 * false-negative result. d_lookup() protects against concurrent
	 * renames using rename_lock seqlock.
	 *
	 * See Documentation/filesystems/path-lookup.txt for more details.
	 */
	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
		unsigned seq;

seqretry:
		/*
		 * The dentry sequence count protects us from concurrent
		 * renames, and thus protects parent and name fields.
		 *
		 * The caller must perform a seqcount check in order
		 * to do anything useful with the returned dentry.
		 *
		 * NOTE! We do a "raw" seqcount_begin here. That means that
		 * we don't wait for the sequence count to stabilize if it
		 * is in the middle of a sequence change. If we do the slow
		 * dentry compare, we will do seqretries until it is stable,
		 * and if we end up with a successful lookup, we actually
		 * want to exit RCU lookup anyway.
		 *
		 * Note that raw_seqcount_begin still *does* smp_rmb(), so
		 * we are still guaranteed NUL-termination of ->d_name.name.
		 */
		seq = raw_seqcount_begin(&dentry->d_seq);
		if (dentry->d_parent != parent)
			continue;
		if (d_unhashed(dentry))
			continue;

		if (unlikely(parent->d_flags & DCACHE_OP_COMPARE)) {
			int tlen;
			const char *tname;
			if (dentry->d_name.hash != hashlen_hash(hashlen))
				continue;
			tlen = dentry->d_name.len;
			tname = dentry->d_name.name;
			/* we want a consistent (name,len) pair */
			if (read_seqcount_retry(&dentry->d_seq, seq)) {
				cpu_relax();
				goto seqretry;
			}
			if (parent->d_op->d_compare(dentry,
						    tlen, tname, name) != 0)
				continue;
		} else {
			if (dentry->d_name.hash_len != hashlen)
				continue;
			if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0)
				continue;
		}
		*seqp = seq;
		return dentry;
	}
	return NULL;
}
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
	unsigned int hash = name->hash;
	struct hlist_bl_head *b = d_hash(hash);
	struct hlist_bl_node *node;
	struct dentry *found = NULL;
	struct dentry *dentry;

	/*
	 * Note: There is significant duplication with __d_lookup_rcu which is
	 * required to prevent single threaded performance regressions
	 * especially on architectures where smp_rmb (in seqcounts) are costly.
	 * Keep the two functions in sync.
	 */

	/*
	 * The hash list is protected using RCU.
	 *
	 * Take d_lock when comparing a candidate dentry, to avoid races
	 * with d_move().
	 *
	 * It is possible that concurrent renames can mess up our list
	 * walk here and result in missing our dentry, resulting in the
	 * false-negative result. d_lookup() protects against concurrent
	 * renames using rename_lock seqlock.
	 *
	 * See Documentation/filesystems/path-lookup.txt for more details.
	 */
	rcu_read_lock();

	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {

		if (dentry->d_name.hash != hash)
			continue;

		spin_lock(&dentry->d_lock);
		if (dentry->d_parent != parent)
			goto next;
		if (d_unhashed(dentry))
			goto next;

		if (!d_same_name(dentry, parent, name))
			goto next;

		dentry->d_lockref.count++;
		found = dentry;
		spin_unlock(&dentry->d_lock);
		break;
next:
		spin_unlock(&dentry->d_lock);
	}
	rcu_read_unlock();

	return found;
}
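Stripped of locking and sequence counts, the match in the branch without DCACHE_OP_COMPARE comes down to three comparisons. The userspace sketch below shows only that ordering. struct match_dentry and toy_dentry_matches() are hypothetical, and the hash value used in any call is supplied by the caller rather than computed the way the kernel does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-down dentry: only the fields the match needs. */
struct match_dentry {
	const struct match_dentry *parent;
	uint64_t hash_len;	/* hash in the low 32 bits, length in the high 32 */
	const char *name;
};

/*
 * Return 1 if candidate is the entry called name (packed as hash_len)
 * under parent. Order of checks: parent first, then hash and length in
 * a single 64-bit compare, then the bytes of the name.
 */
int toy_dentry_matches(const struct match_dentry *candidate,
		       const struct match_dentry *parent,
		       uint64_t hash_len, const char *name)
{
	assert(candidate != NULL && name != NULL);
	if (candidate->parent != parent)
		return 0;
	if (candidate->hash_len != hash_len)
		return 0;
	/* lengths are known to be equal here, so compare exactly that many bytes */
	return memcmp(candidate->name, name, (size_t)(hash_len >> 32)) == 0;
}
```

Because the length is part of hash_len, the byte comparison never has to look past the current component, which is why a name pointer such as "test/test.txt" can be compared against a cached "test" directly.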
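path_openat() returning -ECHILD
If lookup_fast() does not find the dentry of the current directory in the dcache and the subsequent unlazy_walk() returns -ECHILD, the current path_openat() is abandoned and path_openat() is called again, this time without LOOKUP_RCU. In the rerun, lookup_fast() takes the RCU read lock and the dentry spinlock when searching the dcache, which makes the lookup slower. The first path_openat() call carries LOOKUP_RCU, so lookup_fast() does not need to touch these locks during the dcache lookup and is more efficient.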
struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op)
{
	struct nameidata nd;
	int flags = op->lookup_flags;
	struct file *filp;

	set_nameidata(&nd, dfd, pathname);
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(filp == ERR_PTR(-ECHILD)))
		filp = path_openat(&nd, op, flags);
	if (unlikely(filp == ERR_PTR(-ESTALE)))
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	return filp;
}
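The three calls in do_filp_open() form a small escalation ladder: try the cheapest mode, and move to a stricter one only on the specific error code that asks for it. The userspace sketch below reproduces that control flow with a pluggable attempt function. The TOY_ flag values, the function names and the attempt counter are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values; they are not the kernel's LOOKUP_* values. */
#define TOY_LOOKUP_RCU   0x1u
#define TOY_LOOKUP_REVAL 0x2u

/* A lookup attempt returns 0 on success or a negative errno. */
typedef int (*toy_attempt_fn)(unsigned flags);

int toy_attempts;	/* number of attempts made by the last call */

/* Same shape as do_filp_open(): lock-free first, then two fallbacks. */
int toy_open_with_retries(toy_attempt_fn attempt, unsigned flags)
{
	int err;

	assert(attempt != NULL);
	toy_attempts = 1;
	err = attempt(flags | TOY_LOOKUP_RCU);	/* lock-free attempt */
	if (err == -ECHILD) {
		toy_attempts++;
		err = attempt(flags);		/* walk with locks */
	}
	if (err == -ESTALE) {
		toy_attempts++;
		err = attempt(flags | TOY_LOOKUP_REVAL); /* force revalidation */
	}
	return err;
}

/* Example attempt: the lock-free walk always has to bail out. */
int toy_rcu_always_bails(unsigned flags)
{
	return (flags & TOY_LOOKUP_RCU) ? -ECHILD : 0;
}

/* Example attempt: succeeds only once revalidation is requested. */
int toy_stale_until_reval(unsigned flags)
{
	if (flags & TOY_LOOKUP_RCU)
		return -ECHILD;
	return (flags & TOY_LOOKUP_REVAL) ? 0 : -ESTALE;
}
```

With toy_rcu_always_bails the call succeeds on the second attempt; with toy_stale_until_reval it needs all three. Any other error code is returned to the caller without a retry.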
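The f_op member of struct file
For files on a filesystem, the f_op member of struct file, i.e. its file_operations, is the i_fop member of the file's inode struct. i_fop is set when the filesystem initializes the inode after allocating it. For ext4 this happens in ext4_lookup()/__ext4_iget(), where for a regular file it is set to ext4_file_operations: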
__ext4_iget()
	if (S_ISREG(inode->i_mode)) {			/* regular file */
		inode->i_op = &ext4_file_inode_operations;
		inode->i_fop = &ext4_file_operations;
		ext4_set_aops(inode);
	} else if (S_ISDIR(inode->i_mode)) {		/* directory */
		inode->i_op = &ext4_dir_inode_operations;
		inode->i_fop = &ext4_dir_operations;
	} else if (S_ISLNK(inode->i_mode)) {		/* symbolic link */
do_dentry_open() then assigns inode->i_fop to file.f_op:
do_dentry_open()
	f->f_op = fops_get(inode->i_fop); /* for the filesystem (non-driver) case, fops_get() returns inode->i_fop */
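The chain "inode setup chooses i_fop, open copies it into f_op, read dispatches through f_op" can be modelled in a few lines of userspace C. The sketch below is only an illustration of that indirection. All toy_ types and functions are hypothetical, and the read handlers return fixed results instead of touching any storage.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Hypothetical miniature of struct file_operations. */
struct toy_file_operations {
	long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_inode {
	int is_dir;
	const struct toy_file_operations *i_fop;
};

struct toy_file {
	const struct toy_inode *inode;
	const struct toy_file_operations *f_op;
};

/* Regular-file read: hands back a fixed 4-byte payload. */
static long toy_reg_read(struct toy_file *f, char *buf, size_t len)
{
	(void)f;
	if (len < 5)
		return -EINVAL;
	memcpy(buf, "data", 5);	/* includes the trailing NUL */
	return 4;
}

/* Directory read: reading a directory as a file is refused. */
static long toy_dir_read(struct toy_file *f, char *buf, size_t len)
{
	(void)f;
	(void)buf;
	(void)len;
	return -EISDIR;
}

static const struct toy_file_operations toy_reg_fops = { .read = toy_reg_read };
static const struct toy_file_operations toy_dir_fops = { .read = toy_dir_read };

/* Inode setup picks the method table from the file type. */
void toy_init_inode(struct toy_inode *inode, int is_dir)
{
	inode->is_dir = is_dir;
	inode->i_fop = is_dir ? &toy_dir_fops : &toy_reg_fops;
}

/* Open copies the inode's table into the file object. */
void toy_open(struct toy_file *f, const struct toy_inode *inode)
{
	assert(inode->i_fop != NULL);
	f->inode = inode;
	f->f_op = inode->i_fop;
}

/* Later reads dispatch through f_op without consulting the inode again. */
long toy_read(struct toy_file *f, char *buf, size_t len)
{
	return f->f_op->read(f, buf, len);
}
```

The point of the copy at open time is that every later read or write is a single indirect call through the file object, whatever filesystem the inode came from.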