The Linux Virtual File System (VFS)

1. Why do we need a virtual file system?

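Files can be stored in many ways. There are different file systems, such as ext3, NFS, and XFS, and the data may live on different storage media, such as SSDs and HDDs. If every application had to deal with each file system directly, it would need a separate implementation for each one. To reduce this complexity, Linux places an abstraction layer between applications and the concrete file systems. This layer gives applications a common interface for file operations and file system operations and hides the differences between file systems, so applications do not notice which file system is underneath.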

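Through the VFS, Linux provides a common set of system calls that work across different file systems and storage media, which greatly simplifies access to them. In addition, new file systems and new kinds of storage media can be added to Linux dynamically, without recompiling the kernel.

"Everything is a file" is one of the basic philosophies of Linux: not only regular files but also directories, character devices, block devices, sockets, and so on can be treated as files. The VFS is the mechanism that makes this possible.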

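2. How the virtual file system works

The VFS can connect all kinds of file systems because it abstracts a common file system model and defines the conceptual interfaces that every file system is expected to support. A new file system only has to implement these interfaces and register itself with the Linux kernel; it can then be mounted and used.

For example, consider writing to a file on Linux: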
```
int ret = write(fd, buf, len);
```
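This invokes the write() system call, which proceeds roughly as follows:
- First, the generic VFS system call handler sys_write() is invoked.
- Next, sys_write() uses fd to find the write function provided by the file system that holds the file, say op_write().
- Finally, op_write() is called to actually write the data to the file.
(Figure: diagram of this call flow.)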
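
The three steps above can be sketched in user-space C. This is only an illustration of dispatch through a table of function pointers, not kernel code: the reduced `struct file` and `struct file_operations`, the `toyfs_write` function (standing in for op_write()), `fd_table`, and `vfs_write_sketch` (standing in for sys_write()) are all names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct file;

struct file_operations {
    ssize_t (*write)(struct file *f, const char *buf, size_t len);
};

struct file {
    const struct file_operations *f_op; /* set by the owning file system */
    char data[64];                      /* toy backing store */
    size_t pos;                         /* current write offset */
};

/* The "op_write" of an imaginary file system: append to the toy store. */
ssize_t toyfs_write(struct file *f, const char *buf, size_t len)
{
    if (len > sizeof(f->data) - f->pos)
        len = sizeof(f->data) - f->pos; /* short write when the store is full */
    memcpy(f->data + f->pos, buf, len);
    f->pos += len;
    return (ssize_t)len;
}

const struct file_operations toyfs_fops = { .write = toyfs_write };

#define MAX_FILES 8
static struct file *fd_table[MAX_FILES]; /* per-process table: fd -> file */

/* Generic entry point in the role of sys_write(): it knows nothing about
 * toyfs and simply follows the function pointer found through fd. */
ssize_t vfs_write_sketch(int fd, const char *buf, size_t len)
{
    if (fd < 0 || fd >= MAX_FILES || fd_table[fd] == NULL)
        return -1;                      /* bad descriptor */
    struct file *f = fd_table[fd];
    assert(f->f_op != NULL && f->f_op->write != NULL);
    return f->f_op->write(f, buf, len);
}
```

The generic function never names the concrete file system; supporting another one only requires a different `file_operations` table on the file object.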

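3. Components of the virtual file system

To implement the VFS, Linux follows an object-oriented design and abstracts four main object types:
- Superblock object: represents a mounted file system.
- Inode object: represents a specific file.
- Dentry (directory entry) object: represents a directory entry, one component of a file path.
- File object: represents a file opened by a process.
Each object carries a set of operations used to act on the underlying file system.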

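Note

Linux treats a directory as a file: it is simply another kind of file, one that contains one or more directory entries. A directory entry, in contrast, is a separate abstraction that mainly holds a file name and an inode number. Directories can be nested to form file paths, and each component of a path is a directory entry.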

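Superblock

A superblock manages the parameters of the actual file system behind a mount point, including the block size, the maximum file size the file system can handle, the file system type, and the underlying storage device. In the earlier overall structure diagram, the superblock had a files field pointing to all open files, but no such field appears in the data structure below. That field used to serve the umount logic, to make sure all files had been closed. I have not yet found how newer kernels handle this and will add it once I do. Superblock management happens mainly in the mount logic, which will be analyzed in detail in the part on mounting. The main job of the superblock is to manage inodes.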

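The superblock stores the file system's metadata, that is, its basic properties, such as:
- Inode information
- Mount flags
- The operations table s_op
- Mount permissions
- File system type, size, and block count
The operations table s_op matters a great deal to every file system: it points to the superblock's table of operation functions, which includes:
- Allocating an inode
- Destroying an inode
- Reading and writing an inode
- Synchronizing the file system
The super_block structure and its operations table are shown below: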
```
struct super_block {
    struct list_head    s_list;     /* Keep this first */
    dev_t           s_dev;      /* search index; _not_ kdev_t */
    unsigned char       s_blocksize_bits; // log2(block size in bytes)
    unsigned long       s_blocksize; // block size in bytes
    loff_t          s_maxbytes; /* Max file size */
    struct file_system_type *s_type; // file system type
    const struct super_operations   *s_op; // superblock operations
    const struct dquot_operations   *dq_op;
    const struct quotactl_ops   *s_qcop;
    const struct export_operations *s_export_op;
    unsigned long       s_flags;
    unsigned long       s_iflags;   /* internal SB_I_* flags */
    unsigned long       s_magic;
    struct dentry       *s_root; // root dentry; every path lookup starts here
    struct rw_semaphore s_umount;
    int         s_count;
    atomic_t        s_active;
#ifdef CONFIG_SECURITY
    void                    *s_security;
#endif
    const struct xattr_handler **s_xattr;
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    const struct fscrypt_operations *s_cop;
#endif
    struct hlist_bl_head    s_roots;    /* alternate root dentries for NFS */
    struct list_head    s_mounts;   /* list of mounts; _not_ for fs use */
    struct block_device *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info     *s_mtd;
    struct hlist_node   s_instances;
    unsigned int        s_quota_types;  /* Bitmask of supported quota types */
    struct quota_info   s_dquot;    /* Diskquota specific options */
 
    struct sb_writers   s_writers;
 
    /*
     * Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
     * s_fsnotify_marks together for cache efficiency. They are frequently
     * accessed and rarely modified.
     */
    void            *s_fs_info; /* Filesystem private info */
 
    /* Granularity of c/m/atime in ns (cannot be worse than a second) */
    u32         s_time_gran;
#ifdef CONFIG_FSNOTIFY
    __u32           s_fsnotify_mask;
    struct fsnotify_mark_connector __rcu    *s_fsnotify_marks;
#endif
 
    char            s_id[32];   /* Informational name */
    uuid_t          s_uuid;     /* UUID */
 
    unsigned int        s_max_links;
    fmode_t         s_mode;
 
    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;    /* Kludge */
 
    /*
     * Filesystem subtype.  If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;
 
    const struct dentry_operations *s_d_op; /* default d_op for dentries */
 
    /*
     * Saved pool identifier for cleancache (-1 means none)
     */
    int cleancache_poolid;
 
    struct shrinker s_shrink;   /* per-sb shrinker handle */
 
    /* Number of inodes with nlink == 0 but still referenced */
    atomic_long_t s_remove_count;
 
    /* Pending fsnotify inode refs */
    atomic_long_t s_fsnotify_inode_refs;
 
    /* Being remounted read-only */
    int s_readonly_remount;
 
    /* AIO completions deferred from interrupt context */
    struct workqueue_struct *s_dio_done_wq;
    struct hlist_head s_pins;
 
    /*
     * Owning user namespace and default context in which to
     * interpret filesystem uids, gids, quotas, device nodes,
     * xattrs and security labels.
     */
    struct user_namespace *s_user_ns;
 
    /*
     * The list_lru structure is essentially just a pointer to a table
     * of per-node lru lists, each of which has its own spinlock.
     * There is no need to put them into separate cachelines.
     */
    struct list_lru     s_dentry_lru; // dentry cache
    struct list_lru     s_inode_lru; // inode cache
    struct rcu_head     rcu;
    struct work_struct  destroy_work;
 
    struct mutex        s_sync_lock;    /* sync serialisation lock */
 
    /*
     * Indicates how deep in a filesystem stack this SB is
     */
    int s_stack_depth;
 
    /* s_inode_list_lock protects s_inodes */
    spinlock_t      s_inode_list_lock ____cacheline_aligned_in_smp;
    struct list_head    s_inodes;   /* all inodes */
 
    spinlock_t      s_inode_wblist_lock;
    struct list_head    s_inodes_wb;    /* writeback inodes */
} __randomize_layout;
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb); // allocate an inode on this sb
    void (*destroy_inode)(struct inode *); // free an inode of this sb
    void (*dirty_inode) (struct inode *, int flags); // mark the inode dirty
    int (*write_inode) (struct inode *, struct writeback_control *wbc);// write the inode back
    int (*drop_inode) (struct inode *); // called when the last reference to the inode is dropped
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);  // release the sb on unmount
    int (*sync_fs)(struct super_block *sb, int wait); 
    int (*freeze_super) (struct super_block *);
    int (*freeze_fs) (struct super_block *);
    int (*thaw_super) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *); // query file system statistics
    int (*remount_fs) (struct super_block *, int *, char *); // remount
    void (*umount_begin) (struct super_block *); // mainly used by NFS
        // hooks for reporting mount information
    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    struct dquot **(*get_dquots)(struct inode *);
#endif
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
    long (*nr_cached_objects)(struct super_block *,
                  struct shrink_control *);
    long (*free_cached_objects)(struct super_block *,
                    struct shrink_control *);
};
```
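When the VFS needs to operate on a superblock, it looks up the corresponding method in the superblock's s_op table and calls it. For example, to have a file system write out its own superblock (write_super existed in older kernels and is absent from the listing above):
```
sb->s_op->write_super(sb);
```
Creating a file system essentially means writing superblock information to a specific location on the storage medium; when a file system is unmounted, the VFS releases its superblock.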

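Linux supports many different file systems. The file_system_type structure describes the capabilities and behavior of each file system type, including:
- Its name, type flags, and so on
- The list of its superblock objects
- And so on
Registering a new file system with the kernel adds a file_system_type instance to the kernel's list of known file system types; the file system becomes part of the directory tree only when an instance of it is mounted.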
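
Registration can be pictured as maintaining a list of known file system types that is searched by name when something is mounted. The sketch below is a user-space illustration under that assumption. The reduced `struct file_system_type` and the functions `register_fs_sketch` and `get_fs_sketch` are hypothetical names for this example, not the kernel's API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct file_system_type. */
struct file_system_type {
    const char *name;
    struct file_system_type *next;  /* link in the list of known types */
};

static struct file_system_type *fs_types;  /* head of the registry */

/* Return the slot that holds `name`, or the terminating NULL slot. */
static struct file_system_type **find_fs(const char *name)
{
    struct file_system_type **p = &fs_types;
    while (*p != NULL && strcmp((*p)->name, name) != 0)
        p = &(*p)->next;
    return p;
}

/* Returns 0 on success, -1 if a type with this name already exists. */
int register_fs_sketch(struct file_system_type *fs)
{
    assert(fs != NULL && fs->name != NULL);
    struct file_system_type **p = find_fs(fs->name);
    if (*p != NULL)
        return -1;
    fs->next = NULL;
    *p = fs;                        /* append at the end of the list */
    return 0;
}

/* Look a type up by name, as a mount request would; NULL if unknown. */
struct file_system_type *get_fs_sketch(const char *name)
{
    return *find_fs(name);
}
```

Nothing is attached to the directory tree here; the registry only records which types exist.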

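Inode

An inode object contains all the information the Linux kernel needs to operate on a file or directory. This information is described by the inode structure, defined in <linux/fs.h>, which mainly includes:
- Superblock-related information
- Directory-related information
- File size, access times, and permissions
- Reference count
For more on inodes, see my other post, "Linux文件系统之INode" (Linux File Systems: INode).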

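An inode represents one file in the file system, and the in-memory inode is created only when the file is accessed. Like the superblock, the inode object provides a number of operations for the VFS to use, including:
- create(): create a new inode (that is, a new file)
- link(): create a hard link
- symlink(): create a symbolic link
- mkdir(): create a new directory
The usual file operations all have a corresponding method in the inode.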

Source of the inode structure (this listing comes from an older kernel than the super_block listing above):
```
struct inode {
	/* global hash table */
	struct hlist_node	i_hash;
	/* depending on its state, the inode sits on one of several lists (inode_unused/inode_in_use/super_block->dirty) */
	struct list_head	i_list;
	/* node in the super_block->s_inodes list */
	struct list_head	i_sb_list;
	/* list of dentries for this inode; several dentries may refer to the same file */
	struct list_head	i_dentry;
	/* inode number */
	unsigned long		i_ino;
	/* reference count (number of users of this inode) */
	atomic_t		i_count;
	/* number of hard links to the inode */
	unsigned int		i_nlink;
	uid_t			i_uid;
	gid_t			i_gid;
	/* device number, when the inode represents a device file */
	dev_t			i_rdev;
	u64			i_version;
	/* file size in bytes */
	loff_t			i_size;
#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif
	/* time of last access */
	struct timespec		i_atime;
	/* time of last modification of the file data */
	struct timespec		i_mtime;
	/* time of last change to the inode itself */
	struct timespec		i_ctime;
	/* file size in blocks */
	blkcnt_t		i_blocks;
	unsigned int		i_blkbits;
	unsigned short          i_bytes;
	/* file mode: the low 12 bits are the access permissions (same meaning as the chmod argument); the remaining bits give the file type, e.g. regular file, directory, socket, device file */
	umode_t			i_mode;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	struct mutex		i_mutex;
	struct rw_semaphore	i_alloc_sem;
	/* inode operations */
	const struct inode_operations	*i_op;
	/* file operations */
	const struct file_operations	*i_fop;
	/* super_block this inode belongs to */
	struct super_block	*i_sb;
	struct file_lock	*i_flock;
	/* address space mapping of the inode */
	struct address_space	*i_mapping;
	struct address_space	i_data;
#ifdef CONFIG_QUOTA
	struct dquot		*i_dquot[MAXQUOTAS];
#endif
	struct list_head	i_devices; /* for a device file inode: node in the device's list of open files */
	union {
		struct pipe_inode_info	*i_pipe;
		struct block_device	*i_bdev; /* for a block device inode: points to the device instance */
		struct cdev		*i_cdev; /* for a character device inode: points to the device instance */
	};
 
	__u32			i_generation;
 
#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct hlist_head	i_fsnotify_mark_entries; /* fsnotify mark entries */
#endif
 
#ifdef CONFIG_INOTIFY
	struct list_head	inotify_watches; /* watches on this inode */
	struct mutex		inotify_mutex;	/* protects the watches list */
#endif
 
	unsigned long		i_state;
	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
	unsigned int		i_flags; /* inode flags, e.g. noatime */
 
	atomic_t		i_writecount;
#ifdef CONFIG_SECURITY
	void			*i_security;
#endif
#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif
	void			*i_private; /* fs or device private pointer */
};
```
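Dentry

As mentioned above, the VFS treats directories as files. In /usr/bin/vim, for example, usr, bin, and vim are all files: vim is a regular file, while usr and bin are directory files, and each is identified by an inode object.

The VFS performs directory-related operations very often, such as changing into a directory or looking up a path name. To make this efficient, the VFS introduces the concept of the dentry. Every component of a path, whether a directory or a regular file, is a dentry object: /, usr, bin, and vim each correspond to one. Dentry objects have no corresponding on-disk data structure; the VFS builds them one by one as it walks the path.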
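
To make "each component of a path is a dentry" concrete, here is a small user-space sketch that splits an absolute path into the names a path walk would visit. It shows only the decomposition, not how the kernel resolves paths. `split_path`, `MAX_COMPONENTS`, and `NAME_MAX_SKETCH` are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPONENTS 16
#define NAME_MAX_SKETCH 32

/* Split an absolute path into its components; the root "/" is stored as
 * the first one. Returns the number of components, or -1 if the path is
 * not absolute or does not fit into the fixed-size buffers. */
int split_path(const char *path, char names[][NAME_MAX_SKETCH], int max)
{
    assert(path != NULL && names != NULL);
    if (path[0] != '/' || max < 1)
        return -1;
    strcpy(names[0], "/");
    int n = 1;
    const char *p = path;
    while (*p != '\0') {
        while (*p == '/')               /* skip one or more separators */
            p++;
        if (*p == '\0')
            break;
        size_t len = strcspn(p, "/");   /* length of this component */
        if (n >= max || len >= NAME_MAX_SKETCH)
            return -1;
        memcpy(names[n], p, len);
        names[n][len] = '\0';
        n++;
        p += len;
    }
    return n;
}
```

For /usr/bin/vim this yields the four names /, usr, bin, and vim, one per dentry in the description above.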

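A directory entry is represented by the dentry structure, defined in <linux/dcache.h>, which mainly contains:
- The address of the parent dentry
- The list of child dentries
- The inode object associated with the entry
- A pointer to the dentry operations
- And so on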
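A dentry is in one of three states:
- In use: the dentry points to a valid inode and has one or more users; it cannot be discarded.
- Unused: the dentry also points to a valid inode, but the VFS is not currently using it, and it is kept in the cache. Unused dentries can be dropped when memory has to be reclaimed.
- Negative: the dentry has no valid inode, because the inode was deleted or the path was wrong, but the dentry itself is still kept.
Resolving a directory structure into dentries is expensive, so the kernel caches dentries to reduce the cost of working with them.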
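
The three states and the caching idea can be modeled with a tiny fixed-size cache. This is a deliberately naive sketch and does not reflect the kernel's dentry cache. The `dentry_sketch` structure, the use of inode number 0 to mark a negative entry, and the functions below are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 8

/* Hypothetical, simplified dentry: a name plus the inode number it
 * resolves to. ino == 0 marks a negative entry. */
struct dentry_sketch {
    char name[32];
    unsigned long ino;
    int refcount;   /* > 0: in use; 0: unused but still cached */
    int valid;      /* slot currently holds an entry */
};

static struct dentry_sketch dcache[CACHE_SLOTS];

/* Return the cached entry for `name`, or NULL on a cache miss. */
struct dentry_sketch *dcache_lookup(const char *name)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (dcache[i].valid && strcmp(dcache[i].name, name) == 0)
            return &dcache[i];
    return NULL;
}

/* Insert an entry, preferring a free slot and otherwise evicting an
 * entry nobody is using. In-use entries are never evicted. */
struct dentry_sketch *dcache_add(const char *name, unsigned long ino)
{
    assert(name != NULL && strlen(name) < sizeof(dcache[0].name));
    struct dentry_sketch *victim = NULL;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!dcache[i].valid) {
            victim = &dcache[i];        /* free slot wins */
            break;
        }
        if (dcache[i].refcount == 0 && victim == NULL)
            victim = &dcache[i];        /* first reclaimable entry */
    }
    if (victim == NULL)
        return NULL;                    /* everything is in use */
    strcpy(victim->name, name);
    victim->ino = ino;
    victim->refcount = 0;
    victim->valid = 1;
    return victim;
}

/* Name the state of an entry, following the three states in the text. */
const char *dentry_state(const struct dentry_sketch *d)
{
    if (d->ino == 0)
        return "negative";
    return d->refcount > 0 ? "in use" : "unused";
}
```

Keeping negative entries pays off because a repeated lookup of a missing name can be answered from the cache without going back to the disk.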

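File

A file object is the in-memory representation of a file opened by a process. A user program opens a file with the open() system call and closes it with close(). Because several processes can open and operate on the same file at the same time, one file may have several file objects in memory, whereas the corresponding inode and dentry are unique.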

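A file object is represented by the file structure, defined in <linux/fs.h>, which mainly contains:
- File operation methods
- The reference count of the file object
- The current file offset
- The read/write flags given when the file was opened
Like a dentry, a file object has no on-disk data; it comes into existence in memory only when a process opens the file.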
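
The difference between "one inode" and "several file objects" can be observed from user space. The sketch below opens the same temporary file twice and compares inode numbers and file offsets. It uses only standard POSIX calls; the helper name `open_twice` and the temporary path template are choices made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Open one temporary file through two descriptors and report what they
 * share. Returns 0 on success, -1 if any system call fails. */
int open_twice(ino_t *ino_a, ino_t *ino_b, off_t *off_a, off_t *off_b)
{
    assert(ino_a && ino_b && off_a && off_b);
    char path[] = "/tmp/vfs_sketch_XXXXXX";
    int a = mkstemp(path);              /* creates and opens the file */
    if (a < 0)
        return -1;
    int b = open(path, O_RDONLY);       /* second, independent open */
    if (b < 0) {
        close(a);
        unlink(path);
        return -1;
    }
    struct stat sa, sb;
    int rc = -1;
    if (write(a, "hello", 5) == 5 &&
        fstat(a, &sa) == 0 && fstat(b, &sb) == 0) {
        *ino_a = sa.st_ino;
        *ino_b = sb.st_ino;
        *off_a = lseek(a, 0, SEEK_CUR); /* advanced by the write */
        *off_b = lseek(b, 0, SEEK_CUR); /* untouched by it */
        rc = 0;
    }
    close(a);
    close(b);
    unlink(path);
    return rc;
}
```

Both descriptors report the same inode number, while the offsets differ, because each open() produced its own file object with its own position.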

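Each process has its own set of open files, represented by the files_struct structure, which the files field of the process descriptor points to. It mainly contains:
- fdt
- fd_array[NR_OPEN_DEFAULT]
- A reference count
The fd_array array holds pointers to the open file objects. If a process opens more than NR_OPEN_DEFAULT files, the kernel allocates a new, larger array and points fdt at it.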
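
The switch from the embedded array to a larger allocated one can be sketched as follows. The structure layout loosely follows the description above, but everything here is a simplified user-space model. The `_sketch` names, the deliberately tiny `NR_OPEN_DEFAULT_SKETCH`, and the doubling growth policy are assumptions of this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NR_OPEN_DEFAULT_SKETCH 4  /* kept tiny so growth is easy to trigger */

struct file;                      /* opaque in this sketch */

struct fdtable_sketch {
    unsigned int max_fds;
    struct file **fd;             /* current array of file pointers */
};

struct files_struct_sketch {
    struct fdtable_sketch *fdt;   /* table currently in use */
    struct fdtable_sketch fdtab;  /* embedded initial table */
    struct file *fd_array[NR_OPEN_DEFAULT_SKETCH];
};

void files_init(struct files_struct_sketch *fs)
{
    memset(fs, 0, sizeof(*fs));
    fs->fdtab.max_fds = NR_OPEN_DEFAULT_SKETCH;
    fs->fdtab.fd = fs->fd_array;  /* start with the embedded array */
    fs->fdt = &fs->fdtab;
}

/* Allocate a table twice as large and copy the existing pointers. */
static int expand(struct files_struct_sketch *fs)
{
    struct fdtable_sketch *old = fs->fdt;
    struct fdtable_sketch *bigger = malloc(sizeof(*bigger));
    if (bigger == NULL)
        return -1;
    bigger->max_fds = old->max_fds * 2;
    bigger->fd = calloc(bigger->max_fds, sizeof(*bigger->fd));
    if (bigger->fd == NULL) {
        free(bigger);
        return -1;
    }
    memcpy(bigger->fd, old->fd, old->max_fds * sizeof(*old->fd));
    if (old != &fs->fdtab) {      /* the embedded table is never freed */
        free(old->fd);
        free(old);
    }
    fs->fdt = bigger;
    return 0;
}

/* Store f at the lowest free descriptor; returns the fd, or -1. */
int fd_install_sketch(struct files_struct_sketch *fs, struct file *f)
{
    assert(fs != NULL && f != NULL);
    for (;;) {
        for (unsigned int i = 0; i < fs->fdt->max_fds; i++) {
            if (fs->fdt->fd[i] == NULL) {
                fs->fdt->fd[i] = f;
                return (int)i;
            }
        }
        if (expand(fs) != 0)
            return -1;
    }
}
```

Embedding a small array keeps the common case (few open files) free of extra allocations.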

In addition, the kernel maintains a file table for all open files, which includes:
- File status flags
- File offset
4. The virtual file system in practice

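We now have a basic picture of the VFS architecture, but it is still too vague for a deeper understanding. So let us work through a file operation in pseudocode to describe how the Linux kernel reads and writes files, taking a write as the example and walking through the whole flow.

Flow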

Goal: starting from scratch, write hello world to the file /testmount/testdir/testfile1.txt.

The basic sequence of system calls is: 1. mkdir 2. creat 3. open 4. write
The function calls behind mkdir are as follows:

rootInode = sb->s_root->d_inode;
testDirDentry = dentry("testdir");
testDirInode = rootInode->i_op->mkdir(rootInode, testDirDentry, 0777);

The function calls behind creat are as follows:

testFileDentry = dentry("testfile1.txt");
testFileInode = testDirInode->i_op->create(testDirInode, testFileDentry, 0777);

The open system call proceeds as follows:

testFileInode->i_fop->open(testFileInode, testfile);

The write system call proceeds as follows:

testfile->f_op->write(testfile, "hello world", len, &pos);
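
Seen from user space, the same sequence takes only a few standard POSIX calls. The sketch below is one possible way to write it. The helper `write_hello` and its `base` parameter are inventions of this example, added so the code can run under any writable directory rather than the hypothetical /testmount, and creat and open are folded into a single open() with O_CREAT.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create <base>/testdir/testfile1.txt and write msg into it.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_hello(const char *base, const char *msg)
{
    assert(base != NULL && msg != NULL);
    char dir[256], path[512];
    int n = snprintf(dir, sizeof dir, "%s/testdir", base);
    if (n < 0 || (size_t)n >= sizeof dir)
        return -1;                                  /* path too long */
    n = snprintf(path, sizeof path, "%s/testfile1.txt", dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    if (mkdir(dir, 0777) != 0 && errno != EEXIST)   /* 1. mkdir */
        return -1;
    /* 2 and 3. creat and open, combined through O_CREAT */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, msg, strlen(msg));  /* 4. write */
    if (written >= 0 && fsync(fd) != 0)             /* force data to storage */
        written = -1;
    if (close(fd) != 0)
        written = -1;
    return written;
}
```

With a file system mounted at /testmount, calling write_hello("/testmount", "hello world") would correspond to the goal stated above.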

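Detailed flow
1. Suppose we have a block disk device /dev/sda and format it with the ext2 file system. How a block device is formatted will be covered in the chapter on device management.
2. We mount the disk at the /testmount directory, so the kernel registers the corresponding superblock through the mount module. How mounting works will be covered in a later part.
3. To write the file /testmount/testdir/testfile1.txt, the kernel first has to look up the dentries along the full path and create the missing inodes.
  3.1 From the full path, find the superblock of the matching mount point; the most precise match here is the superblock mounted at /testmount.
  3.2 From the superblock, find its root dentry and the inode of that dentry, then read the inode's content from disk through its address_space. For a directory, this content is the list of all its child entries, from which the children of the root dentry are built. No entry named testdir is found, so a "directory does not exist" error is reported. The user then creates the directory, and the new information is written back to the device behind the inode. In the same way, testfile1.txt has to be created under /testdir and written back to the device behind the inode of /testdir.
4. Once the inode is found, the file has to be opened with the open system call. The process obtains a file descriptor through next_fd in its files_struct and then calls inode->i_fop->open(inode, file), which produces a file object. The address_space information from the inode is passed on to the file, and the user-space fd is associated with the file object.
5. After the file is opened, all subsequent reads and writes go through that fd; inside the kernel, this means operating on the file through the corresponding file structure. To write hello world, for example, the kernel calls file->f_op->write.
  The write method in f_op actually copies the bytes into memory that belongs to the address_space, and the address_space writes them back to disk at a suitable later time. This is the cache we usually talk about; the fsync system call can be used to force the data to be synchronized to storage. The buffer arguments of the f_op functions carry the __user annotation, which indicates that the data comes from a user-space address. The data ultimately has to be copied into the pages of the kernel's address_space cache, which is the kernel copy people often refer to. This later led to the well-known zero-copy sendfile, which copies data directly between two fds and works only on page data inside the kernel, without a round trip through the user address space.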
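
The "write into cached pages now, write back to disk later" behavior can be modeled in a few lines. This is a toy model under strong assumptions: pages are 16 bytes, the "disk" is a byte array that starts out zeroed like the cache, and `cached_write` and `writeback` are invented names with no relation to the kernel's actual functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 16   /* tiny "pages" keep the example readable */
#define NPAGES 4

struct page_sketch {
    char data[PAGE_SIZE_SKETCH];
    int dirty;                /* modified in memory, not yet on "disk" */
};

struct address_space_sketch {
    struct page_sketch pages[NPAGES];      /* cached pages */
    char disk[NPAGES * PAGE_SIZE_SKETCH];  /* stand-in for the device */
};

/* Copy data into the cached pages and mark them dirty. The "disk" is
 * not touched. Returns the bytes copied (short at the end of the file). */
size_t cached_write(struct address_space_sketch *as, size_t pos,
                    const char *buf, size_t len)
{
    assert(as != NULL && buf != NULL);
    size_t done = 0;
    while (done < len && pos < sizeof(as->disk)) {
        size_t idx = pos / PAGE_SIZE_SKETCH;   /* which page */
        size_t off = pos % PAGE_SIZE_SKETCH;   /* offset inside it */
        size_t chunk = PAGE_SIZE_SKETCH - off;
        if (chunk > len - done)
            chunk = len - done;
        memcpy(as->pages[idx].data + off, buf + done, chunk);
        as->pages[idx].dirty = 1;
        pos += chunk;
        done += chunk;
    }
    return done;
}

/* Write every dirty page back, as an fsync-style flush would force.
 * Returns the number of pages flushed. */
int writeback(struct address_space_sketch *as)
{
    assert(as != NULL);
    int flushed = 0;
    for (size_t i = 0; i < NPAGES; i++) {
        if (!as->pages[i].dirty)
            continue;
        memcpy(as->disk + i * PAGE_SIZE_SKETCH, as->pages[i].data,
               PAGE_SIZE_SKETCH);
        as->pages[i].dirty = 0;
        flushed++;
    }
    return flushed;
}
```

Between the two calls the data exists only in memory, which is why an explicit flush matters for data that must survive a crash.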
Original article: https://www.cnblogs.com/Courage129/p/14305705.html
