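HDFS Data Protection: Snapshot Examples
Author: Yin Zhengjie
Copyright notice: This is an original work. Reproduction is not permitted! Violators will be held legally liable.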
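I. HDFS solutions for protecting data
HDFS provides two very useful features that help protect against users accidentally deleting files and directories: trash and snapshots.
HDFS trash:
Deleted files and directories are kept in a dedicated trash directory for a set period of time and only then removed permanently.
HDFS snapshots:
You can create a read-only, point-in-time copy of an HDFS file or directory and, if needed, restore the file or directory from it.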
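II. Using HDFS trash to prevent accidental data deletion
1>. Things to note when using trash
When a file is deleted from HDFS, the blocks associated with it are freed. If trash is enabled on your cluster (it is disabled by default), however, this does not happen immediately, because HDFS does not remove the deleted file right away. Instead, it moves the file into a trash directory.
Note that HDFS trash is a user-level feature: only files deleted with the "hdfs dfs" file system commands are stored in the trash. Files deleted programmatically are removed permanently and immediately!
To guard against accidental or mistaken deletions in a program, create a "Trash" instance and call "moveToTrash()" with the path of the file to delete. If trash is not enabled, moveToTrash() returns false.
2>. Configuring trash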
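[root@hadoop101.yinzhengjie.com ~]# vim ${HADOOP_HOME}/etc/hadoop/core-site.xml        # To enable trash, you only need to set the following two properties in this file.
......
<property>
    <name>fs.trash.interval</name>
    <value>4320</value>
    <description>Number of minutes after which a trash checkpoint is deleted. If 0 (the official default), the trash feature is disabled. Here it is set to 4320 minutes (3 days), so Hadoop permanently removes a file from HDFS storage 72 hours after it is deleted.</description>
</property>
<property>
    <name>fs.trash.checkpoint.interval</name>
    <value>30</value>
    <description>Number of minutes between trash checkpoints. It should be less than or equal to fs.trash.interval. The default is 0, in which case the value of fs.trash.interval is used. Here the checkpoint interval is set to 30 minutes.</description>
</property>
......
[root@hadoop101.yinzhengjie.com ~]#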
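Tips:
Setting fs.trash.interval only on the NameNode may not be enough. Also set it on every client node from which HDFS is accessed. Otherwise, you may find that deleted files are removed immediately!
The trash interval is counted from the moment the deleted file is moved into the trash.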
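The relationship between the two properties can be checked with a small script. The following is a minimal Python sketch, not part of Hadoop: the function name read_trash_settings and the sample string are hypothetical, the sample assumes the properties sit inside a <configuration> root element, and the ValueError is this sketch's own sanity check based on the "less than or equal" rule above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the two properties configured above.
SAMPLE_CORE_SITE = """
<configuration>
  <property>
    <name>fs.trash.interval</name>
    <value>4320</value>
  </property>
  <property>
    <name>fs.trash.checkpoint.interval</name>
    <value>30</value>
  </property>
</configuration>
"""


def read_trash_settings(xml_text):
    """Return (enabled, interval_minutes, checkpoint_minutes)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()

    # Both properties default to 0 when absent.
    interval = int(props.get("fs.trash.interval", "0"))
    checkpoint = int(props.get("fs.trash.checkpoint.interval", "0"))

    # A checkpoint interval of 0 means "use fs.trash.interval".
    if checkpoint == 0:
        checkpoint = interval
    # Sketch-only check: the checkpoint interval should not exceed the interval.
    if checkpoint > interval:
        raise ValueError("fs.trash.checkpoint.interval should be <= fs.trash.interval")

    # Trash is enabled only when the interval is greater than 0.
    return interval > 0, interval, checkpoint


enabled, interval, checkpoint = read_trash_settings(SAMPLE_CORE_SITE)
print(enabled, interval, checkpoint)          # True 4320 30
print(interval / 60 / 24, "days")             # 3.0 days
```

Running the same function against the core-site.xml of each client node is one way to spot a node where trash is still disabled.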
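3>. Bypassing trash
Sometimes we want deleted data to be gone for good. With trash enabled, a deleted file keeps occupying the blocks associated with it, so no space is freed. In that case you need to bypass the trash facility.
To perform a regular delete and save HDFS space, specify the "-skipTrash" option when deleting the file.
Tips:
Because "-skipTrash" bypasses the trash facility, the deleted file is purged immediately, the space it occupied is freed, and the NameNode namespace is updated.
The "-skipTrash" option is useful when deleting files from a user directory that has exceeded its space quota.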
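The command line for such a delete can be sketched in Python. This is only an illustration: the helper name build_rm_command is hypothetical, it only assembles the argument list for "hdfs dfs -rm" and does not run anything, and the sample path is taken from the examples later in this post.

```python
import shlex


def build_rm_command(path, skip_trash=False, recursive=False):
    """Assemble the argv for an `hdfs dfs -rm` call without executing it."""
    argv = ["hdfs", "dfs", "-rm"]
    if recursive:
        argv.append("-r")          # needed for directories
    if skip_trash:
        argv.append("-skipTrash")  # bypass the trash, free the space at once
    argv.append(path)
    return argv


# Permanent delete that bypasses the trash.
print(shlex.join(build_rm_command("/yinzhengjie/wc.txt.gz", skip_trash=True)))
# hdfs dfs -rm -skipTrash /yinzhengjie/wc.txt.gz
```

Making skip_trash default to False keeps the safer behavior (move to trash) unless the caller explicitly asks for a permanent delete.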
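III. Using HDFS snapshots to protect important data
1>. Things to note when using snapshots
HDFS snapshots can be used to protect a cluster and to recover from disasters. You can snapshot the entire file system or a directory subtree.
DataNodes and the block management module do not interact with snapshots; all snapshot metadata is stored in the NameNode.
By default, HDFS directories are not enabled for snapshots, but once a directory has been explicitly set to allow snapshots, you can create snapshots of it. You can make a specific directory or the entire file system snapshottable.
Creating a snapshot does not copy any blocks. A snapshot only records the block list and the file size. Any HDFS directory can be made snapshottable. A snapshottable directory can be deleted only when it contains no snapshots; in other words, if a snapshot-enabled directory still has snapshots, it cannot be deleted.
Tips:
With snapshots in place, you can query earlier versions of the data. Access to current data is not slowed down, but access to snapshot data incurs some delay.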
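The points above can be illustrated with a toy model. The following Python sketch is not how HDFS is implemented; the class ToyNamespace, its methods, and the block id "blk_1" are all hypothetical. It only shows the idea that a snapshot records metadata (block list and file size) instead of copying blocks, and that a directory with snapshots refuses deletion.

```python
class ToyNamespace:
    """Toy stand-in for the metadata of one directory; not real HDFS code."""

    def __init__(self):
        self.files = {}             # file name -> (tuple of block ids, size in bytes)
        self.snapshots = {}         # snapshot name -> frozen copy of the metadata
        self.snapshottable = False  # directories are not snapshottable by default

    def allow_snapshot(self):
        self.snapshottable = True

    def create_snapshot(self, name):
        if not self.snapshottable:
            raise PermissionError("directory is not snapshottable")
        # Only metadata is copied: block ids and sizes, never block contents.
        self.snapshots[name] = dict(self.files)

    def delete_file(self, file_name):
        del self.files[file_name]

    def delete_directory(self):
        if self.snapshots:
            raise PermissionError("directory still has snapshots")
        self.files.clear()


ns = ToyNamespace()
ns.files["wc.txt.gz"] = (("blk_1",), 69)
ns.allow_snapshot()
ns.create_snapshot("mySnapshot001")
ns.delete_file("wc.txt.gz")

# The file is gone from the live view but still described by the snapshot.
print("wc.txt.gz" in ns.files)                      # False
print(ns.snapshots["mySnapshot001"]["wc.txt.gz"])   # (('blk_1',), 69)
```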
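2>. Enabling/disabling snapshots on an HDFS directory
Before snapshots can be used, snapshot creation must be enabled on the directory. Use the dfsadmin tool to enable snapshots on an HDFS directory, as shown below.
The "-allowSnapshot" option enables snapshots on a directory, and the "-disallowSnapshot" option disables them.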
[root@hadoop101.yinzhengjie.com ~]# hdfs dfsadmin -help allowSnapshot
-allowSnapshot <snapshotDir>:
    Allow snapshots to be taken on a directory.
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfsadmin -help disallowSnapshot
-disallowSnapshot <snapshotDir>:
    Do not allow snapshots to be taken on a directory any more.
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfsadmin -allowSnapshot /yinzhengjie/
Allowing snaphot on /yinzhengjie/ succeeded
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfsadmin -disallowSnapshot /yinzhengjie/
Disallowing snaphot on /yinzhengjie/ succeeded
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
3>. Creating a snapshot
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -help createSnapshot
-createSnapshot <snapshotDir> [<snapshotName>] :
  Create a snapshot on a directory
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -createSnapshot /yinzhengjie/ mySnapshot001        # Fails: snapshots must be enabled on the directory before one can be created
createSnapshot: Directory is not a snapshottable directory: /yinzhengjie
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfsadmin -allowSnapshot /yinzhengjie/        # Enable snapshots
Allowing snaphot on /yinzhengjie/ succeeded
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -createSnapshot /yinzhengjie/ mySnapshot001        # Create a snapshot named "mySnapshot001"; a matching directory is generated for it
Created snapshot /yinzhengjie/.snapshot/mySnapshot001
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot
Found 1 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:26 /yinzhengjie/.snapshot/mySnapshot001
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot/mySnapshot001
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/.snapshot/mySnapshot001/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/.snapshot/mySnapshot001/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
4>. Listing snapshot-enabled directories
[root@hadoop101.yinzhengjie.com ~]# hdfs lsSnapshottableDir -help
Usage:
hdfs lsSnapshottableDir:
    Get the list of snapshottable directories that are owned by the current user.
    Return all the snapshottable directories if the current user is a super user.
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /
Found 4 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:40 /bigdata
drwxr-xr-x   - root admingroup          0 2020-08-20 19:26 /system
drwx------   - root admingroup          0 2020-08-14 19:19 /user
drwxr-xr-x   - root admingroup          0 2020-08-21 16:26 /yinzhengjie
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs lsSnapshottableDir        # List the directories that have snapshots enabled
drwxr-xr-x 0 root admingroup 0 2020-08-21 16:40 0 65536 /bigdata
drwx------ 0 root admingroup 0 2020-08-14 19:19 0 65536 /user
drwxr-xr-x 0 root admingroup 0 2020-08-21 16:26 1 65536 /yinzhengjie
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfsadmin -disallowSnapshot /bigdata        # Now disable snapshots on one of the directories
Disallowing snaphot on /bigdata succeeded
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs lsSnapshottableDir        # The directory with snapshots disabled is no longer listed
drwx------ 0 root admingroup 0 2020-08-14 19:19 0 65536 /user
drwxr-xr-x 0 root admingroup 0 2020-08-21 16:26 1 65536 /yinzhengjie
[root@hadoop101.yinzhengjie.com ~]#
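The listing above can be turned into structured records with a few lines of Python. This is a sketch under assumptions: the names SnapshottableDir and parse_ls_snapshottable are hypothetical, and the column meanings (the eighth field as the number of existing snapshots, the ninth as the snapshot quota) are inferred from the sample output in this post, where /yinzhengjie shows 1 after one snapshot was created.

```python
from typing import NamedTuple


class SnapshottableDir(NamedTuple):
    path: str
    snapshot_count: int   # assumed: number of snapshots currently held
    snapshot_quota: int   # assumed: maximum number of snapshots allowed


def parse_ls_snapshottable(output):
    """Parse `hdfs lsSnapshottableDir` output into a list of records."""
    dirs = []
    for line in output.splitlines():
        # Split into at most 10 fields so a path containing spaces stays whole.
        fields = line.strip().split(None, 9)
        if len(fields) < 10 or not fields[0].startswith("d"):
            continue  # skip prompts, blank lines, and anything unexpected
        dirs.append(SnapshottableDir(fields[9], int(fields[7]), int(fields[8])))
    return dirs


# Sample copied from the second listing above.
SAMPLE_LISTING = """\
drwx------ 0 root admingroup 0 2020-08-14 19:19 0 65536 /user
drwxr-xr-x 0 root admingroup 0 2020-08-21 16:26 1 65536 /yinzhengjie
"""

records = parse_ls_snapshottable(SAMPLE_LISTING)
# Directories that currently hold at least one snapshot.
print([d.path for d in records if d.snapshot_count > 0])   # ['/yinzhengjie']
```

A list like this tells you which directories would refuse deletion until their snapshots are removed.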
5>. Getting a snapshot difference report
[root@hadoop101.yinzhengjie.com ~]# hdfs snapshotDiff -help
Usage:
hdfs snapshotDiff <snapshotDir> <from> <to>:
    Get the difference between two snapshots,
    or between a snapshot and the current tree of a directory.
    For <from>/<to>, users can use "." to present the current status,
    and use ".snapshot/snapshot_name" to present a snapshot,
    where ".snapshot/" can be omitted
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -put /etc/hosts /yinzhengjie/
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 3 items
-rw-r--r--   3 root admingroup        371 2020-08-21 16:45 /yinzhengjie/hosts
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -createSnapshot /yinzhengjie/ mySnapshot002        # Create a new snapshot named "mySnapshot002"
Created snapshot /yinzhengjie/.snapshot/mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot        # List the snapshots that exist under "/yinzhengjie"
Found 2 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:26 /yinzhengjie/.snapshot/mySnapshot001
drwxr-xr-x   - root admingroup          0 2020-08-21 16:46 /yinzhengjie/.snapshot/mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs snapshotDiff /yinzhengjie/ mySnapshot001 mySnapshot002        # Compare the two snapshots under "/yinzhengjie"
Difference between snapshot mySnapshot001 and snapshot mySnapshot002 under directory /yinzhengjie:
M       .
+       ./hosts
[root@hadoop101.yinzhengjie.com ~]#
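A diff report like the one above is easy to post-process. The following Python sketch is illustrative only: the names DIFF_KINDS and parse_snapshot_diff are hypothetical, and the sample text is copied from the report above. It maps the one-character markers to words ("+" created, "-" deleted, "M" modified, "R" renamed) and skips the header line.

```python
# Marker characters used in the first column of a snapshotDiff report.
DIFF_KINDS = {"+": "created", "-": "deleted", "M": "modified", "R": "renamed"}


def parse_snapshot_diff(output):
    """Return a list of (kind, path) tuples from `hdfs snapshotDiff` output."""
    entries = []
    for line in output.splitlines():
        parts = line.split(None, 1)   # marker, then the rest of the line
        if len(parts) == 2 and parts[0] in DIFF_KINDS:
            entries.append((DIFF_KINDS[parts[0]], parts[1].strip()))
    return entries


# Sample copied from the report above (the columns are whitespace separated).
SAMPLE_DIFF = (
    "Difference between snapshot mySnapshot001 and snapshot mySnapshot002 "
    "under directory /yinzhengjie:\n"
    "M\t.\n"
    "+\t./hosts\n"
)

for kind, path in parse_snapshot_diff(SAMPLE_DIFF):
    print(kind, path)
# modified .
# created ./hosts
```

Here the report says that the directory itself was modified and that ./hosts was created between the two snapshots, which matches the "hdfs dfs -put" step shown earlier.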
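6>. Snapshot contents are read-only (in other words, an existing snapshot cannot be modified!)
Users can take their own snapshots, and administrators manage snapshots by specifying where users are allowed to take them. Files and directories inside a snapshot directory are immutable: nothing can be added to or removed from it! When I tried to delete a file inside a snapshot, the command failed with the exception: ".snapshot" is a reserved name.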
7>. Restoring a deleted file from a snapshot
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 3 items
-rw-r--r--   3 root admingroup        371 2020-08-21 16:45 /yinzhengjie/hosts
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -rm -skipTrash /yinzhengjie/wc.txt.gz
Deleted /yinzhengjie/wc.txt.gz
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 2 items
-rw-r--r--   3 root admingroup        371 2020-08-21 16:45 /yinzhengjie/hosts
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot/
Found 2 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:26 /yinzhengjie/.snapshot/mySnapshot001
drwxr-xr-x   - root admingroup          0 2020-08-21 16:46 /yinzhengjie/.snapshot/mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot/mySnapshot001
Found 2 items
-rw-r--r--   2 root admingroup         69 2020-08-14 23:22 /yinzhengjie/.snapshot/mySnapshot001/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/.snapshot/mySnapshot001/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -cp -ptopax /yinzhengjie/.snapshot/mySnapshot001/wc.txt.gz /yinzhengjie/        # "-ptopax" preserves timestamps, ownership, permissions, ACLs, and XAttrs on the copy restored from the snapshot.
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/
Found 3 items
-rw-r--r--   3 root admingroup        371 2020-08-21 16:45 /yinzhengjie/hosts
-rw-r--r--   3 root admingroup         69 2020-08-14 23:22 /yinzhengjie/wc.txt.gz
drwxr-xr-x   - root admingroup          0 2020-08-14 23:13 /yinzhengjie/yum.repos.d
[root@hadoop101.yinzhengjie.com ~]#
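The restore step follows a fixed pattern: the source is <snapshotDir>/.snapshot/<snapshotName>/<relative path>, and the destination is the matching location in the live directory. A minimal Python sketch of that pattern follows; the helper names snapshot_path and build_restore_command are hypothetical, and the code only assembles the "hdfs dfs -cp -ptopax" argument list without running it.

```python
import posixpath
import shlex


def snapshot_path(snapshot_dir, snapshot_name, relative_path):
    """Path of a file as seen inside a given snapshot."""
    base = snapshot_dir.rstrip("/") or "/"
    return posixpath.join(base, ".snapshot", snapshot_name, relative_path.lstrip("/"))


def build_restore_command(snapshot_dir, snapshot_name, relative_path):
    """Assemble the argv that copies a file from a snapshot back into place."""
    base = snapshot_dir.rstrip("/") or "/"
    src = snapshot_path(snapshot_dir, snapshot_name, relative_path)
    # Copy into the live directory that originally contained the file.
    dest = posixpath.join(base, posixpath.dirname(relative_path.lstrip("/")))
    return ["hdfs", "dfs", "-cp", "-ptopax", src, dest]


cmd = build_restore_command("/yinzhengjie/", "mySnapshot001", "wc.txt.gz")
print(shlex.join(cmd))
# hdfs dfs -cp -ptopax /yinzhengjie/.snapshot/mySnapshot001/wc.txt.gz /yinzhengjie/
```

The printed command is the same one used in the session above.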
8>. Deleting a snapshot
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -help deleteSnapshot
-deleteSnapshot <snapshotDir> <snapshotName> :
  Delete a snapshot from a directory
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot/
Found 2 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:26 /yinzhengjie/.snapshot/mySnapshot001
drwxr-xr-x   - root admingroup          0 2020-08-21 16:46 /yinzhengjie/.snapshot/mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -deleteSnapshot /yinzhengjie/ mySnapshot001        # Delete the snapshot named "mySnapshot001"
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot/
Found 1 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:46 /yinzhengjie/.snapshot/mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#
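9>. Deleting a snapshottable directory
As shown below, Hadoop refuses to delete a directory that still has snapshots. In other words, neither the HDFS superuser nor the directory owner can delete it: a snapshottable directory cannot be deleted or renamed until all of its snapshots have been deleted.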
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /
Found 4 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:40 /bigdata
drwxr-xr-x   - root admingroup          0 2020-08-20 19:26 /system
drwx------   - root admingroup          0 2020-08-14 19:19 /user
drwxr-xr-x   - root admingroup          0 2020-08-21 17:16 /yinzhengjie
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs lsSnapshottableDir        # List the directories that have snapshots enabled
drwx------ 0 root admingroup 0 2020-08-14 19:19 0 65536 /user
drwxr-xr-x 0 root admingroup 0 2020-08-21 17:16 1 65536 /yinzhengjie
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot/        # List the existing snapshots under "/yinzhengjie"
Found 1 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:46 /yinzhengjie/.snapshot/mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -rm -r /yinzhengjie/        # Fails because "/yinzhengjie" still has a snapshot that has not been deleted
rm: Failed to move to trash: hdfs://hadoop101.yinzhengjie.com:9000/yinzhengjie: The directory /yinzhengjie cannot be deleted since /yinzhengjie is snapshottable and already has snapshots
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -deleteSnapshot /yinzhengjie/ mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot/
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -rm -r /yinzhengjie/        # Succeeds now that the snapshots under "/yinzhengjie" have been deleted. Because "-skipTrash" was not specified, the directory is moved to the trash.
20/08/21 17:28:52 INFO fs.TrashPolicyDefault: Moved: 'hdfs://hadoop101.yinzhengjie.com:9000/yinzhengjie' to trash at: hdfs://hadoop101.yinzhengjie.com:9000/user/root/.Trash/Current/yinzhengjie1598002132957
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /
Found 3 items
drwxr-xr-x   - root admingroup          0 2020-08-21 16:40 /bigdata
drwxr-xr-x   - root admingroup          0 2020-08-20 19:26 /system
drwx------   - root admingroup          0 2020-08-14 19:19 /user
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs lsSnapshottableDir        # List the directories that have snapshots enabled
drwx------ 0 root admingroup 0 2020-08-14 19:19 0 65536 /user
drwxr-xr-x 0 root admingroup 0 2020-08-21 17:16 0 65536 /user/root/.Trash/Current/yinzhengjie1598002132957
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -mv /user/root/.Trash/Current/yinzhengjie1598002132957 /yinzhengjie        # To restore the data, simply move it from the trash back to its original path.
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs lsSnapshottableDir        # List the snapshot-enabled directories again
drwx------ 0 root admingroup 0 2020-08-14 19:19 0 65536 /user
drwxr-xr-x 0 root admingroup 0 2020-08-21 17:16 0 65536 /yinzhengjie
[root@hadoop101.yinzhengjie.com ~]#
10>. Renaming an existing snapshot
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot
Found 2 items
drwxr-xr-x   - root admingroup          0 2020-08-21 18:42 /yinzhengjie/.snapshot/mySnapshot001
drwxr-xr-x   - root admingroup          0 2020-09-02 01:31 /yinzhengjie/.snapshot/mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -renameSnapshot /yinzhengjie mySnapshot001 myNewSnapshot        # Rename mySnapshot001 under "/yinzhengjie" to "myNewSnapshot"
[root@hadoop101.yinzhengjie.com ~]#
[root@hadoop101.yinzhengjie.com ~]# hdfs dfs -ls /yinzhengjie/.snapshot
Found 2 items
drwxr-xr-x   - root admingroup          0 2020-08-21 18:42 /yinzhengjie/.snapshot/myNewSnapshot
drwxr-xr-x   - root admingroup          0 2020-09-02 01:31 /yinzhengjie/.snapshot/mySnapshot002
[root@hadoop101.yinzhengjie.com ~]#