2. Hadoop Basic Configuration, Local Mode, and Pseudo-Distributed Setup


    2. Hadoop's Three Deployment Modes

    1. The three modes

    1. Local (standalone) mode

          hdfs dfs -ls /
          No daemons need to be started.
      
    2. Pseudo-distributed mode

          All daemons run on a single machine.
      
    3. Fully distributed mode

          Different daemons run on different machines.
      
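    Once Hadoop is installed (see the sections below), you can tell which mode a node is in by inspecting fs.defaultFS: file:/// means local mode, while an hdfs:// URI means (pseudo-)distributed. A minimal check:

    ```
    hdfs getconf -confKey fs.defaultFS
    # file:///           -> local mode
    # hdfs://localhost/  -> pseudo-distributed
    ```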

    2. Basic Server Configuration

    2.1 Server Specs and OS Version

    1. CPU: 2 cores
    2. Memory: 4 GB
    3. OS version: CentOS 7 (1511)

    2.2 Server IPs and Hostnames

    1. Number of servers: five machines

      | Hostname | Public IP      | Internal IP  |
      | -------- | -------------- | ------------ |
      | hadoop-1 | 192.168.10.145 | 172.16.1.207 |
      | hadoop-2 | 192.168.10.149 | 172.16.1.206 |
      | hadoop-3 | 192.168.10.152 | 172.16.1.204 |
      | hadoop-4 | 192.168.10.153 | 172.16.1.208 |
      | hadoop-5 | 192.168.10.156 | 172.16.1.205 |
    2. Update the hosts file and hostnames according to the table above (a loop that automates this follows the per-host commands below)

      Edit the hosts file
      #vim /etc/hosts
      192.168.10.145  hadoop-1
      192.168.10.149  hadoop-2
      192.168.10.152  hadoop-3
      192.168.10.153  hadoop-4
      192.168.10.156  hadoop-5
      
      #scp /etc/hosts hadoop-2:/etc/
      #scp /etc/hosts hadoop-3:/etc/
      #scp /etc/hosts hadoop-4:/etc/
      #scp /etc/hosts hadoop-5:/etc/
      
      Set the hostname (run the matching command on its own server)
      #hostnamectl set-hostname hadoop-1
      #hostnamectl set-hostname hadoop-2
      #hostnamectl set-hostname hadoop-3
      #hostnamectl set-hostname hadoop-4
      #hostnamectl set-hostname hadoop-5
      
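      As referenced above, the distribution and hostname steps can be driven from hadoop-1 in one loop. A sketch, which will prompt for passwords until SSH keys are set up in the next step:

      ```
      hostnamectl set-hostname hadoop-1            # hadoop-1 itself
      for i in 2 3 4 5; do
          scp /etc/hosts hadoop-$i:/etc/
          ssh hadoop-$i "hostnamectl set-hostname hadoop-$i"
      done
      ```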
    3. SSH key authentication

      Run on hadoop-1:
      #ssh-keygen -t rsa -P ''
      #ssh-copy-id 192.168.10.145
      #scp -r .ssh 192.168.10.149:/root/
      #scp -r .ssh 192.168.10.152:/root/
      #scp -r .ssh 192.168.10.153:/root/
      #scp -r .ssh 192.168.10.156:/root/
      
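      To confirm that passwordless login now works from hadoop-1 to every node, a quick sketch:

      ```
      for i in 1 2 3 4 5; do
          ssh -o BatchMode=yes hadoop-$i hostname
      done
      # should print hadoop-1 through hadoop-5 with no password prompts
      ```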
    4. Create /soft to hold the JDK and Hadoop

      #ssh hadoop-1 'mkdir /soft'
      #ssh hadoop-2 'mkdir /soft'
      #ssh hadoop-3 'mkdir /soft'
      #ssh hadoop-4 'mkdir /soft'
      #ssh hadoop-5 'mkdir /soft'
      
    5. Install the JDK

      #cd /root/
      # make sure the JDK tarball has already been downloaded
      #scp jdk-8u131-linux-x64.tar.gz hadoop-1:/soft/
      #scp jdk-8u131-linux-x64.tar.gz hadoop-2:/soft/
      #scp jdk-8u131-linux-x64.tar.gz hadoop-3:/soft/
      #scp jdk-8u131-linux-x64.tar.gz hadoop-4:/soft/
      #scp jdk-8u131-linux-x64.tar.gz hadoop-5:/soft/
      
      On all servers:
      #tar xf /soft/jdk-8u131-linux-x64.tar.gz -C /soft
      #ln -s /soft/jdk1.8.0_131 /soft/jdk   # create a symlink
      Configure the environment variables:
      #vim /etc/profile
      export JAVA_HOME=/soft/jdk
      export PATH=$PATH:$JAVA_HOME/bin
      #source /etc/profile
      
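      A quick sanity check that every node picks up the new JDK, as a sketch:

      ```
      for i in 1 2 3 4 5; do
          ssh hadoop-$i 'source /etc/profile; java -version'
      done
      ```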
    6. Install and Configure Hadoop

      #scp hadoop-2.7.3.tar.gz hadoop-1:/soft/
      #scp hadoop-2.7.3.tar.gz hadoop-2:/soft/
      #scp hadoop-2.7.3.tar.gz hadoop-3:/soft/
      #scp hadoop-2.7.3.tar.gz hadoop-4:/soft/
      #scp hadoop-2.7.3.tar.gz hadoop-5:/soft/
      On all servers:
      #tar xf /soft/hadoop-2.7.3.tar.gz -C /soft
      #ln -s /soft/hadoop-2.7.3 /soft/hadoop   # create a symlink
      Update the environment variables:
      #vim /etc/profile
      export HADOOP_HOME=/soft/hadoop
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
      #source /etc/profile
      Set Hadoop's JAVA_HOME in hadoop-env.sh:
      # vim /soft/hadoop/etc/hadoop/hadoop-env.sh
      export JAVA_HOME=/soft/jdk
      
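      Verify the installation on each node, as a sketch:

      ```
      for i in 1 2 3 4 5; do
          ssh hadoop-$i 'source /etc/profile; hadoop version | head -1'
      done
      # each line should read: Hadoop 2.7.3
      ```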

    3. Local Mode

    1. Local mode is standalone mode; a fresh Hadoop install runs in this mode by default
    2. HDFS defaults to the local filesystem
    3. hdfs dfs -ls / lists the local filesystem, just like ls / on Linux
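
    Because paths resolve against the local filesystem in this mode, anything created through the hdfs CLI shows up on the local disk. A small demonstration, as a sketch:

    ```
    hdfs dfs -mkdir /tmp/localmode-test
    ls -ld /tmp/localmode-test    # the directory exists on the local disk
    ```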

    3.1 Word Count Test

    #hdfs dfs -mkdir /input
    #cd /input/
    #echo "hello world" > file1.txt
    #echo "hello hadoop" > file2.txt
    #echo "hello mapreduce" >> file2.txt
    #cd /soft/hadoop/
    #hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
    #hdfs dfs  -ls /output/
    Found 2 items
    -rw-r--r--   1 root root          0 2017-06-21 11:56 /output/_SUCCESS
    -rw-r--r--   1 root root         48 2017-06-21 11:56 /output/part-r-00000
    
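    Since this ran in local mode, /output is an ordinary local directory. Given the three input lines above, the expected counts, as a sketch:

    ```
    cat /output/part-r-00000
    # hadoop     1
    # hello      3
    # mapreduce  1
    # world      1
    ```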

    4. Pseudo-Distributed Mode

    4.1 Overview

    Hadoop can run on a single node in pseudo-distributed mode: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.
    Hadoop's configuration files live in /soft/hadoop/etc/hadoop/. Pseudo-distributed mode requires editing two of them, core-site.xml and hdfs-site.xml. The files are in XML format; each setting is declared as a property with a name and a value.

    4.2 Pseudo-Distributed Setup

    1. Prepare the configuration directory (see the mode-switching sketch after these commands)

      #mkdir -p /data/hadoop
      #cd /soft/hadoop/etc/
      #mv hadoop local
      #cp -r local pseudo
      #ln -s pseudo hadoop
      #cd hadoop
      
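      Keeping one configuration set per mode and pointing the hadoop symlink at the active one makes switching modes a one-liner. A sketch:

      ```
      cd /soft/hadoop/etc
      ln -sfn local hadoop    # switch back to local mode
      ln -sfn pseudo hadoop   # switch to pseudo-distributed
      ```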
      
    2. Edit core-site.xml

      #vim core-site.xml 
      [core-site.xml is as follows]
      
      <?xml version="1.0"?>
      <configuration>
          <property>
              <name>hadoop.tmp.dir</name>
              <value>file:/data/hadoop/tmp</value>
              <description>A base for other temporary directories.</description>
          </property>
          <property>
              <name>fs.defaultFS</name>
              <value>hdfs://localhost/</value>
          </property>
      </configuration>
      
    3. Edit hdfs-site.xml

      #vim hdfs-site.xml 
      [hdfs-site.xml is as follows]
      <?xml version="1.0"?>
      <configuration>
          <property>
              <name>dfs.replication</name>
              <value>1</value>
          </property>
          <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:/data/hadoop/tmp/dfs/name</value>
          </property>
          <property>
              <name>dfs.datanode.data.dir</name>
              <value>file:/data/hadoop/tmp/dfs/data</value>
          </property>
      </configuration>
      
      Hadoop's run mode is determined by its configuration files (they are read each time Hadoop runs),
      so to switch from pseudo-distributed back to standalone mode you need to remove the settings from core-site.xml.
      Officially, pseudo-distributed mode only needs fs.defaultFS and dfs.replication to run,
      but if hadoop.tmp.dir is left unset, the default temporary directory is /tmp/hadoop-${user.name}, which may be wiped by the system on reboot, forcing you to re-run the format step. We therefore set it explicitly, and also pin dfs.namenode.name.dir and dfs.datanode.data.dir, to avoid errors in the steps that follow.
      YARN was split out of MapReduce and handles resource management and job scheduling; MapReduce jobs now run on top of YARN, gaining high availability and scalability. YARN is not covered further here; see the official documentation if interested.
      
    4. Edit mapred-site.xml

      #cp mapred-site.xml.template mapred-site.xml
      #vim mapred-site.xml
      [mapred-site.xml is as follows]
      <?xml version="1.0"?>
      <configuration>
          <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
      </configuration>
      
    5. Edit yarn-site.xml

      #vim yarn-site.xml 
      [yarn-site.xml is as follows]
      <?xml version="1.0"?>
      <configuration>
          <property>
              <name>yarn.resourcemanager.hostname</name>
              <value>localhost</value>
          </property>
          <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
          </property>
      </configuration>
      
    6. Edit the slaves file

      #vim slaves
      [slaves is as follows]
        localhost
      
    7. Format the HDFS filesystem

      #hadoop namenode -format
      [root@hadoop-1 hadoop]# hadoop namenode -format
      
      -------- output omitted --------
      17/05/15 09:29:01 INFO util.ExitUtil: Exiting with status 0
      17/05/15 09:29:01 INFO namenode.NameNode: SHUTDOWN_MSG: 
      /************************************************************
      SHUTDOWN_MSG: Shutting down NameNode at hadoop-1/172.16.1.207
      ************************************************************/
      
      
    8. Start the Hadoop services

      #start-all.sh
      [root@hadoop-1 hadoop]# start-all.sh
      This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
      Starting namenodes on [localhost]
      The authenticity of host 'localhost (::1)' can't be established.
      ECDSA key fingerprint is da:38:db:62:7e:97:52:6e:11:1b:81:93:1b:a4:b4:e6.
      Are you sure you want to continue connecting (yes/no)? yes
      localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
      localhost: starting namenode, logging to /soft/hadoop-2.7.3/logs/hadoop-root-namenode-hadoop-1.out
      localhost: starting datanode, logging to /soft/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop-1.out
      Starting secondary namenodes [0.0.0.0]
      The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
      ECDSA key fingerprint is da:38:db:62:7e:97:52:6e:11:1b:81:93:1b:a4:b4:e6.
      Are you sure you want to continue connecting (yes/no)? yes
      0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
      0.0.0.0: starting secondarynamenode, logging to /soft/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-hadoop-1.out
      starting yarn daemons
      starting resourcemanager, logging to /soft/hadoop-2.7.3/logs/yarn-root-resourcemanager-hadoop-1.out
      localhost: starting nodemanager, logging to /soft/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop-1.out
      
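      As the warning in the output says, start-all.sh is deprecated in Hadoop 2.x; the equivalent two-step start, as a sketch:

      ```
      start-dfs.sh    # NameNode, DataNode, SecondaryNameNode
      start-yarn.sh   # ResourceManager, NodeManager
      ```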
    
    9. **Check that startup succeeded**
    
        After startup, run the jps command to see whether everything came up. On success it lists the following processes:
    
        ```
        [root@hadoop-1 hadoop]# jps
        14784 NameNode
        15060 SecondaryNameNode
        14904 DataNode
        15211 ResourceManager
        15628 Jps
        15374 NodeManager
        ```
        If SecondaryNameNode did not start, run sbin/stop-dfs.sh to stop the daemons, then try starting them again. If NameNode or DataNode is missing, the configuration did not take: recheck the earlier steps, or dig through the startup logs to find the cause.
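        A scriptable version of the same check, as a sketch:
    
        ```
        for p in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
            jps | grep -q "$p" && echo "$p OK" || echo "$p MISSING"
        done
        ```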
    10. **Check the web UI**
    
        Open http://192.168.10.145:50070
        ![](http://img.liuyao.me/14981417208845.png)
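        If no browser is at hand, the NameNode UI (port 50070) and the ResourceManager UI (port 8088) can also be probed from the shell; a minimal sketch:
    
        ```
        curl -s -o /dev/null -w "%{http_code}\n" http://192.168.10.145:50070   # expect 200
        curl -s -o /dev/null -w "%{http_code}\n" http://192.168.10.145:8088    # expect 200
        ```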
    
    
    ## 4.3 Pseudo-Distributed Word Count
    
    1. **Create a local directory and the log files to analyze**
    
        ```
        #mkdir  /input
        #cd /input
        #echo "hello world" > file1.log
        #echo "hello world" > file2.log
        #echo "hello hadoop" > file3.log
        #echo "hello hadoop" > file4.log
        #echo "map" > file5.log
        ```
    2. **Create the HDFS directory and upload the local logs**
        
        ```
        #hdfs dfs -mkdir -p /input/
        #hdfs dfs -ls /
        Found 1 items
        drwxr-xr-x   - root supergroup          0 2017-06-22 10:29 /input
        #hdfs dfs -put file* /input/
        # hdfs dfs -ls /input/
        Found 5 items
        -rw-r--r--   1 root supergroup         12 2017-06-22 10:32 /input/file1.log
        -rw-r--r--   1 root supergroup         12 2017-06-22 10:32 /input/file2.log
        -rw-r--r--   1 root supergroup         13 2017-06-22 10:32 /input/file3.log
        -rw-r--r--   1 root supergroup         13 2017-06-22 10:32 /input/file4.log
        -rw-r--r--   1 root supergroup          4 2017-06-22 10:32 /input/file5.log
        ```
    
    3. **Run word count with the bundled example jar**
    
        ```
        # hadoop jar /soft/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
        17/05/15 09:48:38 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
        17/05/15 09:48:39 INFO input.FileInputFormat: Total input paths to process : 5
        17/05/15 09:48:40 INFO mapreduce.JobSubmitter: number of splits:5
        17/05/15 09:48:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494855010567_0001
        17/05/15 09:48:40 INFO impl.YarnClientImpl: Submitted application application_1494855010567_0001
        17/05/15 09:48:40 INFO mapreduce.Job: The url to track the job: http://hadoop-1:8088/proxy/application_1494855010567_0001/
        17/05/15 09:48:40 INFO mapreduce.Job: Running job: job_1494855010567_0001
        17/05/15 09:48:48 INFO mapreduce.Job: Job job_1494855010567_0001 running in uber mode : false
        17/05/15 09:48:48 INFO mapreduce.Job:  map 0% reduce 0%
        17/05/15 09:48:59 INFO mapreduce.Job:  map 20% reduce 0%
        17/05/15 09:49:00 INFO mapreduce.Job:  map 80% reduce 0%
        17/05/15 09:49:01 INFO mapreduce.Job:  map 100% reduce 0%
        17/05/15 09:49:06 INFO mapreduce.Job:  map 100% reduce 100%
        17/05/15 09:49:06 INFO mapreduce.Job: Job job_1494855010567_0001 completed successfully
        17/05/15 09:49:06 INFO mapreduce.Job: Counters: 50
                    File System Counters
                    FILE: Number of bytes read=114
                    FILE: Number of bytes written=711875
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=539
                    HDFS: Number of bytes written=31
                    HDFS: Number of read operations=18
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
            Job Counters 
                    Killed map tasks=1
                    Launched map tasks=5
                    Launched reduce tasks=1
                    Data-local map tasks=5
                    Total time spent by all maps in occupied slots (ms)=48562
                    Total time spent by all reduces in occupied slots (ms)=4413
                    Total time spent by all map tasks (ms)=48562
                    Total time spent by all reduce tasks (ms)=4413
                    Total vcore-milliseconds taken by all map tasks=48562
                    Total vcore-milliseconds taken by all reduce tasks=4413
                    Total megabyte-milliseconds taken by all map tasks=49727488
                    Total megabyte-milliseconds taken by all reduce tasks=4518912
            Map-Reduce Framework
                    Map input records=5
                    Map output records=9
                    Map output bytes=90
                    Map output materialized bytes=138
                    Input split bytes=485
                    Combine input records=9
                    Combine output records=9
                    Reduce input groups=4
                    Reduce shuffle bytes=138
                    Reduce input records=9
                    Reduce output records=4
                    Spilled Records=18
                    Shuffled Maps =5
                    Failed Shuffles=0
                    Merged Map outputs=5
                    GC time elapsed (ms)=1662
                    CPU time spent (ms)=2740
                    Physical memory (bytes) snapshot=1523605504
                    Virtual memory (bytes) snapshot=12609187840
                    Total committed heap usage (bytes)=1084227584
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters 
                    Bytes Read=54
            File Output Format Counters 
                    Bytes Written=31
        ```
     
    4. **View the results**
    
        ```
        [root@hadoop-1 ~]# hdfs dfs -cat /output/*
        hadoop  2
        hello   4
        map     1
        world   2
        [root@hadoop-1 ~]# hdfs dfs -get /output/*  .
        [root@hadoop-1 ~]# ls
        file1.log  file2.log  file3.log  file4.log  file5.log  part-r-00000  _SUCCESS
        [root@hadoop-1 ~]# cat part-r-00000 
        hadoop  2
        hello   4
        map     1
        world   2
        ```
    
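        One caveat when re-running the job: MapReduce refuses to start if the output directory already exists, so remove it first. A sketch:
    
        ```
        hdfs dfs -rm -r /output
        hadoop jar /soft/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
        ```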