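• A Case Study of Archiving in Apache Hadoop 2.9.2

                                            Author: 尹正杰 (Yin Zhengjie)

    Copyright notice: this is an original work. Reproduction is not permitted; violations will be pursued under the law.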

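      If you are reading this article, you probably already have a solid understanding of how the NameNode works. Every file is stored as blocks, and the metadata for each block is kept in the NameNode's memory, so HDFS handles small files very inefficiently: a large number of small files can use up most of the NameNode's memory. Note, however, that small files do not take more disk space than their actual contents require. For example, a 2 MB file stored with a 128 MB block size occupies 2 MB of disk space per replica, not 128 MB.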
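
      To put rough numbers on the paragraph above, here is a minimal sketch that estimates NameNode heap use and disk use for a set of small files. The figure of about 150 bytes of heap per namespace object (one per file, one per block) is a commonly quoted rule of thumb and is an assumption here, not a value from this article; the function names are hypothetical as well.

```shell
#!/usr/bin/env bash
# Rough estimate only. BYTES_PER_OBJECT is an assumed rule of thumb
# (about 150 bytes of NameNode heap per file or block object).
BYTES_PER_OBJECT=150

# estimate_nn_heap FILES BLOCKS_PER_FILE -> approximate heap bytes
estimate_nn_heap() {
    local files=$1 blocks_per_file=$2
    # one namespace object per file plus one per block
    echo $(( files * (1 + blocks_per_file) * BYTES_PER_OBJECT ))
}

# disk_usage FILE_BYTES REPLICATION -> bytes on disk across all replicas
# A block only occupies as much disk as the data it holds, so the
# configured block size does not appear in the formula.
disk_usage() {
    local file_bytes=$1 replication=$2
    echo $(( file_bytes * replication ))
}

estimate_nn_heap 10000000 1             # ten million single-block files -> 3000000000
disk_usage $(( 2 * 1024 * 1024 )) 1     # a 2 MB file, one replica       -> 2097152
```

      Under these assumptions, ten million small files cost about 3 GB of NameNode heap regardless of how little data they hold, which is the problem archiving is meant to ease.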

     

    I. Hadoop Archives

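      A Hadoop archive, or HAR file, is a file archiving facility that packs files into HDFS blocks more efficiently. It reduces NameNode memory usage while still allowing transparent access to the archived files. In particular, a Hadoop archive can be used as input to MapReduce.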

    II. Creating an Archive

    1>. Upload test files to the HDFS cluster

    [root@node101.yinzhengjie.org.cn ~]# ll -R
    .:
    total 20
    -rw-r--r--. 1 root root 3124 Apr 12 13:31 edits.xml
    -rw-r--r--. 1 root root 1264 Apr 12 12:49 fsimage.xml
    drwxr-xr-x  2 root root 4096 Apr 16 18:05 krb5.conf.d
    -rw-r--r--. 1 root root    3 Apr 12 15:16 seen_txid
    drwxr-xr-x  4 root root 4096 Apr 16 18:05 yum.repos.d
    
    ./krb5.conf.d:
    total 4
    -rw-r--r-- 1 root root 641 Apr 16 18:05 krb5.conf
    
    ./yum.repos.d:
    total 20
    drwxr-xr-x 2 root root 4096 Apr 16 18:05 back
    -rw-r--r-- 1 root root 2523 Apr 16 18:05 CentOS-Base.repo
    drwxr-xr-x 2 root root 4096 Apr 16 18:05 default
    -rw-r--r-- 1 root root  951 Apr 16 18:05 epel.repo
    -rw-r--r-- 1 root root 1050 Apr 16 18:05 epel-testing.repo
    
    ./yum.repos.d/back:
    total 4
    -rw-r--r-- 1 root root 2523 Apr 16 18:05 CentOS-Base.repo
    
    ./yum.repos.d/default:
    total 32
    -rw-r--r-- 1 root root 1664 Apr 16 18:05 CentOS-Base.repo
    -rw-r--r-- 1 root root 1309 Apr 16 18:05 CentOS-CR.repo
    -rw-r--r-- 1 root root  649 Apr 16 18:05 CentOS-Debuginfo.repo
    -rw-r--r-- 1 root root  314 Apr 16 18:05 CentOS-fasttrack.repo
    -rw-r--r-- 1 root root  630 Apr 16 18:05 CentOS-Media.repo
    -rw-r--r-- 1 root root 1331 Apr 16 18:05 CentOS-Sources.repo
    -rw-r--r-- 1 root root 5701 Apr 16 18:05 CentOS-Vault.repo
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# ll
    total 20
    -rw-r--r--. 1 root root 3124 Apr 12 13:31 edits.xml
    -rw-r--r--. 1 root root 1264 Apr 12 12:49 fsimage.xml
    drwxr-xr-x  2 root root 4096 Apr 16 18:05 krb5.conf.d
    -rw-r--r--. 1 root root    3 Apr 12 15:16 seen_txid
    drwxr-xr-x  4 root root 4096 Apr 16 18:05 yum.repos.d
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hadoop fs -mkdir /yinzhengjie
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -put ./*  /yinzhengjie
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
    Found 5 items
    -rw-r--r--   2 root supergroup       3124 2019-04-16 18:10 /yinzhengjie/edits.xml
    -rw-r--r--   2 root supergroup       1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
    -rw-r--r--   2 root supergroup          3 2019-04-16 18:10 /yinzhengjie/seen_txid
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/yum.repos.d
    [root@node101.yinzhengjie.org.cn ~]# 

    2>. Start the YARN daemons (archiving runs as a MapReduce job, so YARN is needed to schedule its resources)

    [root@node101.yinzhengjie.org.cn ~]# ansible all -m shell -a 'jps'                # processes running before YARN is started
    node110.yinzhengjie.org.cn | SUCCESS | rc=0 >>
    6900 Jps
    4441 DataNode
    
    node101.yinzhengjie.org.cn | SUCCESS | rc=0 >>
    16692 FsShell
    12389 NameNode
    12553 DataNode
    12857 SecondaryNameNode
    17646 Jps
    
    node103.yinzhengjie.org.cn | SUCCESS | rc=0 >>
    1560 DataNode
    1149 Jps
    
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# start-yarn.sh 
    starting yarn daemons
    starting resourcemanager, logging to /yinzhengjie/softwares/hadoop-2.9.2/logs/yarn-root-resourcemanager-node101.yinzhengjie.org.cn.out
    node103.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/hadoop-2.9.2/logs/yarn-root-nodemanager-node103.yinzhengjie.org.cn.out
    node101.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/hadoop-2.9.2/logs/yarn-root-nodemanager-node101.yinzhengjie.org.cn.out
    node110.yinzhengjie.org.cn: starting nodemanager, logging to /yinzhengjie/softwares/hadoop-2.9.2/logs/yarn-root-nodemanager-node110.yinzhengjie.org.cn.out
    node102.yinzhengjie.org.cn: ssh: connect to host node102.yinzhengjie.org.cn port 22: No route to host
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# ansible all -m shell -a 'jps'                # after starting YARN, check which daemons came up
    node110.yinzhengjie.org.cn | SUCCESS | rc=0 >>
    4441 DataNode
    6939 NodeManager
    7084 Jps
    
    node103.yinzhengjie.org.cn | SUCCESS | rc=0 >>
    1332 NodeManager
    1560 DataNode
    1576 Jps
    
    node101.yinzhengjie.org.cn | SUCCESS | rc=0 >>
    17969 NodeManager
    16692 FsShell
    12389 NameNode
    18440 Jps
    12553 DataNode
    12857 SecondaryNameNode
    17855 ResourceManager
    
    [root@node101.yinzhengjie.org.cn ~]# 
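
      Reading jps output by eye works for three nodes; for a scripted check, something like the sketch below could be used. The helper name has_daemon is hypothetical, and the sample input is copied from node110's output above so the snippet runs without a cluster.

```shell
#!/usr/bin/env bash
# has_daemon NAME: read `jps` output on stdin and succeed if a JVM
# whose name is exactly NAME is listed.
has_daemon() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Sample input copied from node110's jps output after start-yarn.sh.
sample='4441 DataNode
6939 NodeManager
7084 Jps'

for d in NodeManager ResourceManager; do
    if printf '%s\n' "$sample" | has_daemon "$d"; then
        echo "$d: running"
    else
        echo "$d: not found"
    fi
done
```

      On a live node the sample would be replaced by `jps | has_daemon ResourceManager`. A worker node such as node110 is expected to report NodeManager only, since the ResourceManager runs on node101 in this cluster.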

    3>. Archive multiple directories

    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie                    # directory layout before archiving
    Found 5 items
    -rw-r--r--   2 root supergroup       3124 2019-04-16 18:10 /yinzhengjie/edits.xml
    -rw-r--r--   2 root supergroup       1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
    -rw-r--r--   2 root supergroup          3 2019-04-16 18:10 /yinzhengjie/seen_txid
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:19 /yinzhengjie/yum.repos.d
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hadoop archive -archiveName yinzhengjie-test.har  -p /yinzhengjie/yum.repos.d /yinzhengjie/output
    19/04/16 18:42:58 INFO client.RMProxy: Connecting to ResourceManager at node101.yinzhengjie.org.cn/172.30.1.101:8032
    19/04/16 18:42:58 INFO client.RMProxy: Connecting to ResourceManager at node101.yinzhengjie.org.cn/172.30.1.101:8032
    19/04/16 18:42:58 INFO client.RMProxy: Connecting to ResourceManager at node101.yinzhengjie.org.cn/172.30.1.101:8032
    19/04/16 18:42:59 INFO mapreduce.JobSubmitter: number of splits:1
    19/04/16 18:42:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
    19/04/16 18:42:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1555408586551_0006
    19/04/16 18:42:59 INFO impl.YarnClientImpl: Submitted application application_1555408586551_0006
    19/04/16 18:43:00 INFO mapreduce.Job: The url to track the job: http://node101.yinzhengjie.org.cn:8088/proxy/application_1555408586551_0006/
    19/04/16 18:43:00 INFO mapreduce.Job: Running job: job_1555408586551_0006
    19/04/16 18:43:05 INFO mapreduce.Job: Job job_1555408586551_0006 running in uber mode : false
    19/04/16 18:43:05 INFO mapreduce.Job:  map 0% reduce 0%
    19/04/16 18:43:11 INFO mapreduce.Job:  map 100% reduce 0%
    19/04/16 18:43:16 INFO mapreduce.Job:  map 100% reduce 100%
    19/04/16 18:43:16 INFO mapreduce.Job: Job job_1555408586551_0006 completed successfully
    19/04/16 18:43:16 INFO mapreduce.Job: Counters: 49
            File System Counters
                    FILE: Number of bytes read=1379
                    FILE: Number of bytes written=403843
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=19909
                    HDFS: Number of bytes written=19956
                    HDFS: Number of read operations=34
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=11
            Job Counters 
                    Launched map tasks=1
                    Launched reduce tasks=1
                    Other local map tasks=1
                    Total time spent by all maps in occupied slots (ms)=3567
                    Total time spent by all reduces in occupied slots (ms)=2841
                    Total time spent by all map tasks (ms)=3567
                    Total time spent by all reduce tasks (ms)=2841
                    Total vcore-milliseconds taken by all map tasks=3567
                    Total vcore-milliseconds taken by all reduce tasks=2841
                    Total megabyte-milliseconds taken by all map tasks=3652608
                    Total megabyte-milliseconds taken by all reduce tasks=2909184
            Map-Reduce Framework
                    Map input records=14
                    Map output records=14
                    Map output bytes=1344
                    Map output materialized bytes=1379
                    Input split bytes=116
                    Combine input records=0
                    Combine output records=0
                    Reduce input groups=14
                    Reduce shuffle bytes=1379
                    Reduce input records=14
                    Reduce output records=0
                    Spilled Records=28
                    Shuffled Maps =1
                    Failed Shuffles=0
                    Merged Map outputs=1
                    GC time elapsed (ms)=177
                    CPU time spent (ms)=1090
                    Physical memory (bytes) snapshot=317157376
                    Virtual memory (bytes) snapshot=4319100928
                    Total committed heap usage (bytes)=137498624
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters 
                    Bytes Read=1148
            File Output Format Counters 
                    Bytes Written=0
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
    Found 6 items
    -rw-r--r--   2 root supergroup       3124 2019-04-16 18:10 /yinzhengjie/edits.xml
    -rw-r--r--   2 root supergroup       1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:25 /yinzhengjie/output                  # the destination directory we named in the previous step; list it to see the archive it holds
    -rw-r--r--   2 root supergroup          3 2019-04-16 18:10 /yinzhengjie/seen_txid
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:19 /yinzhengjie/yum.repos.d
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie/output
    Found 1 items
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har      # this is the name we passed with -archiveName when creating the archive
    
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie/output/yinzhengjie-test.har
    Found 4 items
    -rw-r--r--   2 root supergroup          0 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har/_SUCCESS
    -rw-r--r--   3 root supergroup        123 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har/_index
    -rw-r--r--   3 root supergroup         22 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har/_masterindex
    -rw-r--r--   3 root supergroup        641 2019-04-16 18:25 /yinzhengjie/output/yinzhengjie-test.har/part-0
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]#

    4>. Inspect the archive

    [root@node101.yinzhengjie.org.cn ~]# hadoop fs -ls -R /yinzhengjie/output/yinzhengjie-test.har      
    -rw-r--r--   2 root supergroup          0 2019-04-16 18:43 /yinzhengjie/output/yinzhengjie-test.har/_SUCCESS
    -rw-r--r--   3 root supergroup       1287 2019-04-16 18:43 /yinzhengjie/output/yinzhengjie-test.har/_index
    -rw-r--r--   3 root supergroup         24 2019-04-16 18:43 /yinzhengjie/output/yinzhengjie-test.har/_masterindex
    -rw-r--r--   3 root supergroup      18645 2019-04-16 18:43 /yinzhengjie/output/yinzhengjie-test.har/part-0
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hadoop fs -ls -R har:///yinzhengjie/output/yinzhengjie-test.har
    -rw-r--r--   3 root supergroup       2523 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/CentOS-Base.repo
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/back
    -rw-r--r--   3 root supergroup       2523 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/back/CentOS-Base.repo
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default
    -rw-r--r--   3 root supergroup       1664 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Base.repo
    -rw-r--r--   3 root supergroup       1309 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-CR.repo
    -rw-r--r--   3 root supergroup        649 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Debuginfo.repo
    -rw-r--r--   3 root supergroup        630 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Media.repo
    -rw-r--r--   3 root supergroup       1331 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Sources.repo
    -rw-r--r--   3 root supergroup       5701 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Vault.repo
    -rw-r--r--   3 root supergroup        314 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-fasttrack.repo
    -rw-r--r--   3 root supergroup       1050 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/epel-testing.repo
    -rw-r--r--   3 root supergroup        951 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/epel.repo
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# 
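
      The two listings above hint at how the archive is laid out: the file contents are concatenated into part-0, while _index and _masterindex record where each file sits inside it. As a quick sanity check, the sketch below sums the file sizes reported through the har:// view and compares the total with the size of part-0 (18645 bytes in the raw listing). The function name sum_file_sizes is hypothetical, and the rows are copied from the output above so the snippet runs without a cluster.

```shell
#!/usr/bin/env bash
# sum_file_sizes: read `hadoop fs -ls -R` output on stdin and print the
# total size in bytes of regular files (lines starting with "-").
sum_file_sizes() {
    awk '/^-/ { total += $5 } END { print total + 0 }'
}

# Rows copied from the har:// listing above.
listing=$(cat <<'EOF'
-rw-r--r--   3 root supergroup       2523 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/CentOS-Base.repo
drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/back
-rw-r--r--   3 root supergroup       2523 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/back/CentOS-Base.repo
drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default
-rw-r--r--   3 root supergroup       1664 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Base.repo
-rw-r--r--   3 root supergroup       1309 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-CR.repo
-rw-r--r--   3 root supergroup        649 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Debuginfo.repo
-rw-r--r--   3 root supergroup        630 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Media.repo
-rw-r--r--   3 root supergroup       1331 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Sources.repo
-rw-r--r--   3 root supergroup       5701 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-Vault.repo
-rw-r--r--   3 root supergroup        314 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/default/CentOS-fasttrack.repo
-rw-r--r--   3 root supergroup       1050 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/epel-testing.repo
-rw-r--r--   3 root supergroup        951 2019-04-16 18:10 har:///yinzhengjie/output/yinzhengjie-test.har/epel.repo
EOF
)

part0_size=18645    # size of part-0 in the raw (non-har://) listing above
total=$(printf '%s\n' "$listing" | sum_file_sizes)
if [ "$total" -eq "$part0_size" ]; then
    echo "part-0 holds exactly the $total bytes of archived file data"
else
    echo "size mismatch: files=$total part-0=$part0_size"
fi
```

      On a live cluster the heredoc would be replaced by piping `hadoop fs -ls -R har:///yinzhengjie/output/yinzhengjie-test.har` into sum_file_sizes.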

    III. Extracting an Archive

    1>. Check the directory before extraction

    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
    Found 6 items
    -rw-r--r--   2 root supergroup       3124 2019-04-16 18:10 /yinzhengjie/edits.xml
    -rw-r--r--   2 root supergroup       1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:43 /yinzhengjie/output
    -rw-r--r--   2 root supergroup          3 2019-04-16 18:10 /yinzhengjie/seen_txid
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:19 /yinzhengjie/yum.repos.d
    [root@node101.yinzhengjie.org.cn ~]#  

    2>. Extract the archive

    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hadoop fs -cp har:///yinzhengjie/output/yinzhengjie-test.har /yinzhengjie/output2019
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie
    Found 7 items
    -rw-r--r--   2 root supergroup       3124 2019-04-16 18:10 /yinzhengjie/edits.xml
    -rw-r--r--   2 root supergroup       1264 2019-04-16 18:10 /yinzhengjie/fsimage.xml
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/krb5.conf.d
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:43 /yinzhengjie/output
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:49 /yinzhengjie/output2019
    -rw-r--r--   2 root supergroup          3 2019-04-16 18:10 /yinzhengjie/seen_txid
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:19 /yinzhengjie/yum.repos.d
    [root@node101.yinzhengjie.org.cn ~]# 

    3>. Check that the extracted data matches the data before archiving

    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie/output2019
    Found 5 items
    -rw-r--r--   2 root supergroup       2523 2019-04-16 18:49 /yinzhengjie/output2019/CentOS-Base.repo
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:49 /yinzhengjie/output2019/back
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:49 /yinzhengjie/output2019/default
    -rw-r--r--   2 root supergroup       1050 2019-04-16 18:49 /yinzhengjie/output2019/epel-testing.repo
    -rw-r--r--   2 root supergroup        951 2019-04-16 18:49 /yinzhengjie/output2019/epel.repo
    [root@node101.yinzhengjie.org.cn ~]# 
    [root@node101.yinzhengjie.org.cn ~]# hdfs dfs -ls /yinzhengjie/yum.repos.d
    Found 5 items
    -rw-r--r--   2 root supergroup       2523 2019-04-16 18:10 /yinzhengjie/yum.repos.d/CentOS-Base.repo
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/yum.repos.d/back
    drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/yum.repos.d/default
    -rw-r--r--   2 root supergroup       1050 2019-04-16 18:10 /yinzhengjie/yum.repos.d/epel-testing.repo
    -rw-r--r--   2 root supergroup        951 2019-04-16 18:10 /yinzhengjie/yum.repos.d/epel.repo
    [root@node101.yinzhengjie.org.cn ~]# 
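
      Comparing the two listings by eye is fine for five entries. For larger trees the comparison could be scripted along the lines of the sketch below, which keeps only the entry type, size, and path relative to each root, then checks that the two normalized listings are identical. The function name normalize is hypothetical, the rows are copied from the two listings above, and paths are assumed to contain no spaces.

```shell
#!/usr/bin/env bash
# normalize PREFIX: read `hdfs dfs -ls` output on stdin and print
# "<type> <size> <path relative to PREFIX>" for each entry, sorted.
# Timestamps are dropped because they legitimately differ after a copy.
normalize() {
    awk -v prefix="$1" 'NF >= 8 {
        path = $8
        # strip "PREFIX/" when the path starts with it (plain string match)
        if (index(path, prefix "/") == 1) path = substr(path, length(prefix) + 2)
        print substr($1, 1, 1), $5, path
    }' | sort
}

# Rows copied from the listing of the original directory.
before=$(normalize /yinzhengjie/yum.repos.d <<'EOF'
Found 5 items
-rw-r--r--   2 root supergroup       2523 2019-04-16 18:10 /yinzhengjie/yum.repos.d/CentOS-Base.repo
drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/yum.repos.d/back
drwxr-xr-x   - root supergroup          0 2019-04-16 18:10 /yinzhengjie/yum.repos.d/default
-rw-r--r--   2 root supergroup       1050 2019-04-16 18:10 /yinzhengjie/yum.repos.d/epel-testing.repo
-rw-r--r--   2 root supergroup        951 2019-04-16 18:10 /yinzhengjie/yum.repos.d/epel.repo
EOF
)

# Rows copied from the listing of the extracted directory.
after=$(normalize /yinzhengjie/output2019 <<'EOF'
Found 5 items
-rw-r--r--   2 root supergroup       2523 2019-04-16 18:49 /yinzhengjie/output2019/CentOS-Base.repo
drwxr-xr-x   - root supergroup          0 2019-04-16 18:49 /yinzhengjie/output2019/back
drwxr-xr-x   - root supergroup          0 2019-04-16 18:49 /yinzhengjie/output2019/default
-rw-r--r--   2 root supergroup       1050 2019-04-16 18:49 /yinzhengjie/output2019/epel-testing.repo
-rw-r--r--   2 root supergroup        951 2019-04-16 18:49 /yinzhengjie/output2019/epel.repo
EOF
)

if [ "$before" = "$after" ]; then
    echo "listings match"
else
    echo "listings differ"
fi
```

      This only compares names and sizes, not file contents, and the listings above are not recursive. Feeding it `hdfs dfs -ls -R` output for both roots would cover the back and default subdirectories as well.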

  • Original article: https://www.cnblogs.com/yinzhengjie/p/10708591.html