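A Walkthrough of Archiving in Apache Hadoop 2.9.2
Author: Yin Zhengjie (尹正杰)
Copyright notice: this is an original work. Reproduction is not permitted, and violations will be pursued legally.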
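If you have found your way to this article, you probably already have a solid understanding of how the NameNode works. Every file is stored as blocks, and the metadata of every block is kept in the NameNode's memory, so Hadoop is very inefficient at storing small files: a large number of small files can use up most of the NameNode's memory. Note, however, that storing small files does not take more disk space than the raw contents of those files require. For example, a 2 MB file stored with a 128 MB block size uses 2 MB of disk space per replica, not 128 MB.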
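The arithmetic behind this can be sketched in a few lines of shell. The figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is a commonly quoted rule of thumb and an assumption here, not a number from this article; the function names are hypothetical.

```shell
# Rough estimate of what small files cost the NameNode versus the disks.
# ASSUMPTION: ~150 bytes of NameNode heap per namespace object
# (one object per file plus one per block); a rule of thumb only.
BYTES_PER_OBJECT=150

# namenode_bytes <file_count> <blocks_per_file>
namenode_bytes() {
  local files="$1" blocks_per_file="$2"
  # one object for the file itself plus one for each of its blocks
  echo $(( files * (1 + blocks_per_file) * BYTES_PER_OBJECT ))
}

# disk_bytes <file_size_bytes> <replication>
# A file occupies only its real size per replica, whatever the block size is.
disk_bytes() {
  local size="$1" replication="$2"
  echo $(( size * replication ))
}

namenode_bytes 10000000 1               # ten million single-block files -> 3000000000
disk_bytes $(( 2 * 1024 * 1024 )) 1     # one replica of a 2 MB file     -> 2097152
```

Under that assumption, ten million small files cost about 3 GB of NameNode heap, while each 2 MB file still takes only 2 MB of disk per replica.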
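I. Hadoop archives
A Hadoop archive, or HAR file, is a more efficient file archiving facility: it packs files into HDFS blocks, which reduces NameNode memory usage while still allowing transparent access to the files. In particular, a Hadoop archive can be used as input to MapReduce.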
II. Creating an archive
1>. Upload test files to the HDFS cluster
[root@node101.yinzhengjie.org.cn ~]# ll -R
.:
total 20
-rw-r--r--. 1 root root 3124 Apr 12 13:31 edits.xml
-rw-r--r--. 1 root root 1264 Apr 12 12:49 fsimage.xml
drwxr-xr-x 2 root root 4096 Apr 16 18:05 krb5.conf.d
-rw-r--r--. 1 root root 3 Apr 12 15:16 seen_txid
drwxr-xr-x 4 root root 4096 Apr 16 18:05 yum.repos.d

./krb5.conf.d:
total 4
-rw-r--r-- 1 root root 641 Apr 16 18:05 krb5.conf

./yum.repos.d:
total 20
drwxr-xr-x 2 root root 4096 Apr 16 18:05 back
-rw-r--r-- 1 root root 2523 Apr 16 18:05 CentOS-Base.repo
drwxr-xr-x 2 root root 4096 Apr 16 18:05 default
-rw-r--r-- 1 root root 951 Apr 16 18:05 epel.repo
-rw-r--r-- 1 root root 1050 Apr 16 18:05 epel-testing.repo

./yum.repos.d/back:
total 4
-rw-r--r-- 1 root root 2523 Apr 16 18:05 CentOS-Base.repo

./yum.repos.d/default:
total 32
-rw-r--r-- 1 root root 1664 Apr 16 18:05 CentOS-Base.repo
-rw-r--r-- 1 root root 1309 Apr 16 18:05 CentOS-CR.repo
-rw-r--r-- 1 root root 649 Apr 16 18:05 CentOS-Debuginfo.repo
-rw-r--r-- 1 root root 314 Apr 16 18:05 CentOS-fasttrack.repo
-rw-r--r-- 1 root root 630 Apr 16 18:05 CentOS-Media.repo
-rw-r--r-- 1 root root 1331 Apr 16 18:05 CentOS-Sources.repo
-rw-r--r-- 1 root root 5701 Apr 16 18:05 CentOS-Vault.repo
[root@node101.yinzhengjie.org.cn ~]#
[root@node101.yinzhengjie.org.cn ~]# ll
total 20
-rw-r--r--. 1 root root 3124 Apr 12 13:31 edits.xml
-rw-r--r--. 1 root root 1264 Apr 12 12:49 fsimage.xml
drwxr-xr-x 2 root root 4096 Apr 16 18:05 krb5.conf.d
-rw-r--r--. 1 root root 3 Apr 12 15:16 seen_txid
drwxr-xr-x 4 root root 4096 Apr 16 18:05 yum.repos.d
[root@node101.yinzhengjie.org.cn ~]# hadoop fs -mkdir /yinzhengjie
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -put ./* /yinzhengjie
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
Found 5 items
-rw-r--r-- 2 root supergroup 3124 2019-04-16 18:10 /yinzhengjie/edits.xml
-rw-r--r-- 2 root supergroup 1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
-rw-r--r-- 2 root supergroup 3 2019-04-16 18:10 /yinzhengjie/seen_txid
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 /yinzhengjie/yum.repos.d
[root@node101.yinzhengjie.org.cn ~]#
2>. Start the YARN daemons (archiving runs as a MapReduce job, so YARN is needed for resource scheduling)
[root@node101.yinzhengjie.org.cn ~]# ansible all -m shell -a 'jps'
node110.yinzhengjie.org.cn | SUCCESS | rc=0 >>
6900 Jps
4441 DataNode

node101.yinzhengjie.org.cn | SUCCESS | rc=0 >>
16692 FsShell
12389 NameNode
12553 DataNode
12857 SecondaryNameNode
17646 Jps

node103.yinzhengjie.org.cn | SUCCESS | rc=0 >>
1560 DataNode
1149 Jps

[root@node101.yinzhengjie.org.cn ~]#
[root@node101.yinzhengjie.org.cn ~]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /yinzhengjie/softwares/hadoop-2.9.2/logs/yarn-root-resourcemanager-node101.yinzhengjie.org.cn.out
node103.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/hadoop-2.9.2/logs/yarn-root-nodemanager-node103.yinzhengjie.org.cn.out
node101.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/hadoop-2.9.2/logs/yarn-root-nodemanager-node101.yinzhengjie.org.cn.out
node110.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/hadoop-2.9.2/logs/yarn-root-nodemanager-node110.yinzhengjie.org.cn.out
node102.yinzhengjie.org.cn: ssh: connect to host node102.yinzhengjie.org.cn port 22: No route to host
[root@node101.yinzhengjie.org.cn ~]#
[root@node101.yinzhengjie.org.cn ~]# ansible all -m shell -a 'jps'
node110.yinzhengjie.org.cn | SUCCESS | rc=0 >>
4441 DataNode
6939 NodeManager
7084 Jps

node103.yinzhengjie.org.cn | SUCCESS | rc=0 >>
1332 NodeManager
1560 DataNode
1576 Jps

node101.yinzhengjie.org.cn | SUCCESS | rc=0 >>
17969 NodeManager
16692 FsShell
12389 NameNode
18440 Jps
12553 DataNode
12857 SecondaryNameNode
17855 ResourceManager

[root@node101.yinzhengjie.org.cn ~]#
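Checking the jps output by eye works for three nodes, but the same check can be scripted. The following is a minimal sketch, not part of the original procedure; `missing_daemons` is a hypothetical helper name. It reads jps output on standard input and prints every expected daemon that is absent.

```shell
# missing_daemons <daemon>...
# Reads `jps` output ("<pid> <name>" per line) on stdin and prints each
# expected daemon name that does not appear in it.
missing_daemons() {
  local listing daemon
  listing=$(cat)
  for daemon in "$@"; do
    # compare against the second field so that e.g. "NameNode" does not
    # accidentally match "SecondaryNameNode"
    if ! printf '%s\n' "$listing" |
         awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}

# Sample taken from node110 after start-yarn.sh: it runs a NodeManager
# but no ResourceManager, so only the latter is reported.
printf '4441 DataNode\n6939 NodeManager\n7084 Jps\n' |
  missing_daemons ResourceManager NodeManager
```

On a live node this would be run as `jps | missing_daemons ResourceManager NodeManager`; empty output means every listed daemon is present on that node.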
3>. Archive multiple directories
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
Found 5 items
-rw-r--r-- 2 root supergroup 3124 2019-04-16 18:10 /yinzhengjie/edits.xml
-rw-r--r-- 2 root supergroup 1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
-rw-r--r-- 2 root supergroup 3 2019-04-16 18:10 /yinzhengjie/seen_txid
drwxr-xr-x - root supergroup 0 2019-04-16 18:19 /yinzhengjie/yum.repos.d
[root@node101.yinzhengjie.org.cn ~]#
[root@node101.yinzhengjie.org.cn ~]# hadoop archive -archiveName yinzhengjie-test.har -p /yinzhengjie/yum.repos.d /yinzhengjie/output
19/04/16 18:42:58 INFO client.RMProxy: Connecting to ResourceManager at node101.yinzhengjie.org.cn/172.30.1.101:8032
19/04/16 18:42:58 INFO client.RMProxy: Connecting to ResourceManager at node101.yinzhengjie.org.cn/172.30.1.101:8032
19/04/16 18:42:58 INFO client.RMProxy: Connecting to ResourceManager at node101.yinzhengjie.org.cn/172.30.1.101:8032
19/04/16 18:42:59 INFO mapreduce.JobSubmitter: number of splits:1
19/04/16 18:42:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/04/16 18:42:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555408586551_0006
19/04/16 18:42:59 INFO impl.YarnClientImpl: Submitted application application_1555408586551_0006
19/04/16 18:43:00 INFO mapreduce.Job: The url to track the job: http://node101.yinzhengjie.org.cn:8088/proxy/application_1555408586551_0006/
19/04/16 18:43:00 INFO mapreduce.Job: Running job: job_1555408586551_0006
19/04/16 18:43:05 INFO mapreduce.Job: Job job_1555408586551_0006 running in uber mode : false
19/04/16 18:43:05 INFO mapreduce.Job: map 0% reduce 0%
19/04/16 18:43:11 INFO mapreduce.Job: map 100% reduce 0%
19/04/16 18:43:16 INFO mapreduce.Job: map 100% reduce 100%
19/04/16 18:43:16 INFO mapreduce.Job: Job job_1555408586551_0006 completed successfully
19/04/16 18:43:16 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=1379
        FILE: Number of bytes written=403843
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=19909
        HDFS: Number of bytes written=19956
        HDFS: Number of read operations=34
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=11
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3567
        Total time spent by all reduces in occupied slots (ms)=2841
        Total time spent by all map tasks (ms)=3567
        Total time spent by all reduce tasks (ms)=2841
        Total vcore-milliseconds taken by all map tasks=3567
        Total vcore-milliseconds taken by all reduce tasks=2841
        Total megabyte-milliseconds taken by all map tasks=3652608
        Total megabyte-milliseconds taken by all reduce tasks=2909184
    Map-Reduce Framework
        Map input records=14
        Map output records=14
        Map output bytes=1344
        Map output materialized bytes=1379
        Input split bytes=116
        Combine input records=0
        Combine output records=0
        Reduce input groups=14
        Reduce shuffle bytes=1379
        Reduce input records=14
        Reduce output records=0
        Spilled Records=28
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=177
        CPU time spent (ms)=1090
        Physical memory (bytes) snapshot=317157376
        Virtual memory (bytes) snapshot=4319100928
        Total committed heap usage (bytes)=137498624
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1148
    File Output Format Counters
        Bytes Written=0
[root@node101.yinzhengjie.org.cn ~]#
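The general shape of the command used above is `hadoop archive -archiveName <name>.har -p <parent> [<src>...] <dest>`; when no source is given, as here, the whole parent directory is archived. The wrapper below is a minimal sketch under that reading. `har_create` and the `DRY_RUN` variable are hypothetical names introduced for illustration. With `DRY_RUN=1` it only prints the command, which makes it easy to review before a MapReduce job is actually submitted.

```shell
# har_create <archive-name> <parent-dir> <dest-dir> [src...]
# Builds the `hadoop archive` command line. With DRY_RUN=1 the command is
# printed instead of executed. (Helper and variable names are hypothetical.)
har_create() {
  if [ "$#" -lt 3 ]; then
    echo "usage: har_create <name.har> <parent> <dest> [src...]" >&2
    return 2
  fi
  local name="$1" parent="$2" dest="$3"
  shift 3
  # the archive tool expects the archive name to end in .har
  case "$name" in
    *.har) ;;
    *) echo "archive name must end in .har: $name" >&2; return 1 ;;
  esac
  # remaining arguments (if any) are sources relative to the parent
  set -- hadoop archive -archiveName "$name" -p "$parent" "$@" "$dest"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Reproduces the command from this article without running it:
DRY_RUN=1 har_create yinzhengjie-test.har /yinzhengjie/yum.repos.d /yinzhengjie/output
```

Passing extra arguments, for example `har_create logs.har /yinzhengjie /yinzhengjie/output krb5.conf.d yum.repos.d`, would archive only those subdirectories of the parent; that invocation is an illustration, not something run in this article.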
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
Found 6 items
-rw-r--r-- 2 root supergroup 3124 2019-04-16 18:10 /yinzhengjie/edits.xml
-rw-r--r-- 2 root supergroup 1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
drwxr-xr-x - root supergroup 0 2019-04-16 18:25 /yinzhengjie/output    # this is the directory that holds the archive, as specified in the previous step; list it to see the archive name
-rw-r--r-- 2 root supergroup 3 2019-04-16 18:10 /yinzhengjie/seen_txid
drwxr-xr-x - root supergroup 0 2019-04-16 18:19 /yinzhengjie/yum.repos.d
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie/output
Found 1 items
drwxr-xr-x - root supergroup 0 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har    # note the name: it is the one passed with -archiveName when archiving
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie/output/yinzhengjie-test.har
Found 4 items
-rw-r--r-- 2 root supergroup 0 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har/_SUCCESS
-rw-r--r-- 3 root supergroup 123 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har/_index
-rw-r--r-- 3 root supergroup 22 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har/_masterindex
-rw-r--r-- 3 root supergroup 641 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har/part-0
[root@node101.yinzhengjie.org.cn ~]#
4>. Inspect the archive
[root@node101.yinzhengjie.org.cn ~]# hadoop fs -ls -R /yinzhengjie/output/yinzhengjie-test.har
-rw-r--r-- 2 root supergroup 0 2019-04-16 18:43 /yinzhengjie/output/yinzhengjie-test.har/_SUCCESS
-rw-r--r-- 3 root supergroup 1287 2019-04-16 18:43 /yinzhengjie/output/yinzhengjie-test.har/_index
-rw-r--r-- 3 root supergroup 24 2019-04-16 18:43 /yinzhengjie/output/yinzhengjie-test.har/_masterindex
-rw-r--r-- 3 root supergroup 18645 2019-04-16 18:43 /yinzhengjie/output/yinzhengjie-test.har/part-0
[root@node101.yinzhengjie.org.cn ~]# hadoop fs -ls -R har:///yinzhengjie/output/yinzhengjie-test.har
-rw-r--r-- 3 root supergroup 2523 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/CentOS-Base.repo
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/back
-rw-r--r-- 3 root supergroup 2523 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/back/CentOS-Base.repo
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default
-rw-r--r-- 3 root supergroup 1664 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Base.repo
-rw-r--r-- 3 root supergroup 1309 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-CR.repo
-rw-r--r-- 3 root supergroup 649 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Debuginfo.repo
-rw-r--r-- 3 root supergroup 630 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Media.repo
-rw-r--r-- 3 root supergroup 1331 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Sources.repo
-rw-r--r-- 3 root supergroup 5701 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Vault.repo
-rw-r--r-- 3 root supergroup 314 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-fasttrack.repo
-rw-r--r-- 3 root supergroup 1050 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/epel-testing.repo
-rw-r--r-- 3 root supergroup 951 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/epel.repo
[root@node101.yinzhengjie.org.cn ~]#
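The two listings differ only in the URI scheme: the plain path shows the archive's internal files (`_index`, `_masterindex`, `part-0`), while the `har:///` form shows the archived tree. A small helper can build that URI from an archive path so the two forms are not mixed up. This is a sketch; `har_uri` is a hypothetical name, and it covers only the `har:///` form that relies on the default filesystem.

```shell
# har_uri <absolute-archive-path> [path-inside-archive]
# Turns an HDFS path of a .har archive into the har:/// URI that exposes
# the archived files (default-filesystem form only; helper name is hypothetical).
har_uri() {
  local archive="$1" inner="${2:-}"
  case "$archive" in
    /*.har) ;;
    *) echo "expected an absolute path ending in .har: $archive" >&2; return 1 ;;
  esac
  if [ -n "$inner" ]; then
    # strip a leading slash so the result never contains "//" mid-path
    echo "har://${archive}/${inner#/}"
  else
    echo "har://${archive}"
  fi
}

har_uri /yinzhengjie/output/yinzhengjie-test.har
har_uri /yinzhengjie/output/yinzhengjie-test.har default/CentOS-Base.repo
```

The result can be passed to the same commands used in this section, for example `hadoop fs -ls -R "$(har_uri /yinzhengjie/output/yinzhengjie-test.har)"`.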
III. Unarchiving
1>. List the directory before unarchiving
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
Found 6 items
-rw-r--r-- 2 root supergroup 3124 2019-04-16 18:10 /yinzhengjie/edits.xml
-rw-r--r-- 2 root supergroup 1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
drwxr-xr-x - root supergroup 0 2019-04-16 18:43 /yinzhengjie/output
-rw-r--r-- 2 root supergroup 3 2019-04-16 18:10 /yinzhengjie/seen_txid
drwxr-xr-x - root supergroup 0 2019-04-16 18:19 /yinzhengjie/yum.repos.d
[root@node101.yinzhengjie.org.cn ~]#
2>. Unarchive
[root@node101.yinzhengjie.org.cn ~]# hadoop fs -cp har:///yinzhengjie/output/yinzhengjie-test.har /yinzhengjie/output2019
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
Found 7 items
-rw-r--r-- 2 root supergroup 3124 2019-04-16 18:10 /yinzhengjie/edits.xml
-rw-r--r-- 2 root supergroup 1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
drwxr-xr-x - root supergroup 0 2019-04-16 18:43 /yinzhengjie/output
drwxr-xr-x - root supergroup 0 2019-04-16 18:49 /yinzhengjie/output2019
-rw-r--r-- 2 root supergroup 3 2019-04-16 18:10 /yinzhengjie/seen_txid
drwxr-xr-x - root supergroup 0 2019-04-16 18:19 /yinzhengjie/yum.repos.d
[root@node101.yinzhengjie.org.cn ~]#
3>. Check that the extracted data matches the data before archiving
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie/output2019
Found 5 items
-rw-r--r-- 2 root supergroup 2523 2019-04-16 18:49 /yinzhengjie/output2019/CentOS-Base.repo
drwxr-xr-x - root supergroup 0 2019-04-16 18:49 /yinzhengjie/output2019/back
drwxr-xr-x - root supergroup 0 2019-04-16 18:49 /yinzhengjie/output2019/default
-rw-r--r-- 2 root supergroup 1050 2019-04-16 18:49 /yinzhengjie/output2019/epel-testing.repo
-rw-r--r-- 2 root supergroup 951 2019-04-16 18:49 /yinzhengjie/output2019/epel.repo
[root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie/yum.repos.d
Found 5 items
-rw-r--r-- 2 root supergroup 2523 2019-04-16 18:10 /yinzhengjie/yum.repos.d/CentOS-Base.repo
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 /yinzhengjie/yum.repos.d/back
drwxr-xr-x - root supergroup 0 2019-04-16 18:10 /yinzhengjie/yum.repos.d/default
-rw-r--r-- 2 root supergroup 1050 2019-04-16 18:10 /yinzhengjie/yum.repos.d/epel-testing.repo
-rw-r--r-- 2 root supergroup 951 2019-04-16 18:10 /yinzhengjie/yum.repos.d/epel.repo
[root@node101.yinzhengjie.org.cn ~]#
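The two listings above were compared by eye. For larger trees, the comparison can be scripted by reducing each listing to entry type, size, and relative path, then diffing the results. This is a minimal sketch, with `relative_listing` as a hypothetical helper name; it assumes the eight-column `-ls` format shown above and paths without spaces. It compares names and sizes only, not file contents.

```shell
# relative_listing <root-prefix>
# Reads `hdfs dfs -ls [-R]` output on stdin and prints "<dir|file> <size> <relative path>"
# for each entry, sorted, so that two trees under different roots can be diffed.
# ASSUMPTIONS: 8-column listing format, no spaces in paths.
relative_listing() {
  local prefix="$1"
  awk -v p="$prefix" '
    NF >= 8 {                      # skips the "Found N items" header
      path = $8
      # drop the root prefix so both trees use the same relative names
      if (index(path, p) == 1) path = substr(path, length(p) + 1)
      kind = (substr($1, 1, 1) == "d") ? "dir" : "file"
      print kind, $5, path
    }' | sort
}

# Two entries from the listings in this section, reduced to comparable form:
printf '%s\n' \
  'Found 2 items' \
  '-rw-r--r-- 2 root supergroup 951 2019-04-16 18:49 /yinzhengjie/output2019/epel.repo' \
  'drwxr-xr-x - root supergroup 0 2019-04-16 18:49 /yinzhengjie/output2019/back' |
  relative_listing /yinzhengjie/output2019
```

In bash, the full check could then be written as `diff <(hdfs dfs -ls -R /yinzhengjie/yum.repos.d | relative_listing /yinzhengjie/yum.repos.d) <(hdfs dfs -ls -R /yinzhengjie/output2019 | relative_listing /yinzhengjie/output2019)`; no output from diff means the names and sizes agree.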