• Installing and configuring a Hadoop cluster on virtual machines


    I originally assumed some expert had already written this up clearly enough that I would not need to, but I still ran into problems during my own installation, so I decided to document the process my own way. Reference: https://blog.csdn.net/hliq5399/article/details/78193113

    Fully distributed installation

    Hadoop's local and pseudo-distributed modes see little use in real work, so they are omitted here.

    Download the latest Hadoop release

    https://hadoop.apache.org/releases.html

    Server role planning

    I already set up three virtual machines in my earlier post on configuring a VirtualBox Host-Only network; their roles are planned as follows:

    master          slave1           slave2
    NameNode        ResourceManager
    DataNode        DataNode         DataNode
    NodeManager     NodeManager      NodeManager
    HistoryServer                    SecondaryNameNode


    Configure the hostname

    I entered the hostnames while installing the VMs, so no extra configuration was needed in my case.

    To set it manually (master shown here; on the other machines use the corresponding hostname, slave1 or slave2):

    [root@master hadoop]# vi /etc/sysconfig/network

    NETWORKING=yes
    HOSTNAME=master

    [root@master hadoop]# service network restart

    Configure hosts

    [root@master hadoop]# vi /etc/hosts

    192.168.102.3 master
    192.168.102.4 slave1
    192.168.102.5 slave2

    Again master is shown; configure the two slave machines the same way.
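
    The hosts setup can be scripted as a small sketch. A temp file stands in for /etc/hosts here so the snippet is safe to run anywhere; on the real machines you would append to /etc/hosts itself.

```shell
# Append the cluster entries (temp file standing in for /etc/hosts).
hosts_file=$(mktemp)
cat >> "$hosts_file" <<'EOF'
192.168.102.3 master
192.168.102.4 slave1
192.168.102.5 slave2
EOF
# Verify each hostname appears exactly once; each grep prints 1.
for h in master slave1 slave2; do
  grep -c " $h$" "$hosts_file"
done
```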

    Set up passwordless SSH login

    The machines in a Hadoop cluster access one another over SSH, and typing a password on every connection is impractical, so passwordless SSH must be configured between all of them.

    1. Generate a key pair on master

    [root@master hadoop]# ssh-keygen -t rsa

    Press Enter at every prompt to accept the defaults. A public key file (id_rsa.pub) and private key file (id_rsa) are then created in the .ssh directory under the current user's home. (I worked directly as root, so I skipped creating a dedicated hadoop user and group.)

    [root@master hadoop]# cd /root/.ssh/
    [root@master .ssh]# ls
    authorized_keys  id_rsa           id_rsa.pub       known_hosts

    2. Distribute the public key (to the local machine as well)

    [root@master hadoop]# ssh-copy-id master

    [root@master hadoop]# ssh-copy-id slave1

    [root@master hadoop]# ssh-copy-id slave2

    3. In the same way, set up passwordless login from slave1 and slave2 to the other machines.
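
    The distribution step can be wrapped in a loop. This is a sketch assuming the three hostnames from this guide; DRY_RUN only prints the commands so the snippet runs anywhere, while on the real machines you would unset it so ssh-copy-id actually appends id_rsa.pub to each target's authorized_keys.

```shell
# Print (or, with DRY_RUN unset, run) ssh-copy-id against every node.
copy_keys() {
  for h in master slave1 slave2; do
    if [ -n "$DRY_RUN" ]; then
      echo "ssh-copy-id $h"
    else
      ssh-copy-id "$h"
    fi
  done
}
DRY_RUN=1
copy_keys
```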

    Install Hadoop on the master machine

    The approach: extract and configure Hadoop on master first, then distribute it to the other two machines.

    Extract the Hadoop archive:

    [root@master hadoop]# tar xzvf hadoop-2.9.2.tar.gz

    Change the directory's owner:

    [root@master hadoop]# chown -R root:root hadoop-2.9.2

    Configure the JDK path: set JAVA_HOME in hadoop-env.sh, mapred-env.sh, and yarn-env.sh:

    [root@master hadoop]# cd /opt/hadoop/hadoop-2.9.2/etc/hadoop
    [root@master hadoop]# vi hadoop-env.sh
    [root@master hadoop]# vi mapred-env.sh
    [root@master hadoop]# vi yarn-env.sh
    export JAVA_HOME="/opt/java/jdk1.8.0_191"
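
    Editing JAVA_HOME by hand in three files is easy to get wrong, so here is a sed sketch. It operates on mock copies in a temp directory so it can run anywhere; on the real machine you would point conf_dir at /opt/hadoop/hadoop-2.9.2/etc/hadoop.

```shell
# Mock copies of the three env files (the real ones live under etc/hadoop).
conf_dir=$(mktemp -d)
echo 'export JAVA_HOME=${JAVA_HOME}' > "$conf_dir/hadoop-env.sh"
echo '# export JAVA_HOME=' > "$conf_dir/mapred-env.sh"
echo '# export JAVA_HOME=' > "$conf_dir/yarn-env.sh"
# Replace any existing (possibly commented-out) JAVA_HOME line in each file.
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
  sed -i 's|^#* *export JAVA_HOME=.*|export JAVA_HOME="/opt/java/jdk1.8.0_191"|' "$conf_dir/$f"
done
grep -h '^export JAVA_HOME' "$conf_dir"/*-env.sh
```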

    Configure core-site.xml:

    [root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/core-site.xml
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/tmp</value>
      </property>
    </configuration>

    fs.defaultFS is the NameNode address.

    hadoop.tmp.dir is Hadoop's temporary directory; by default both the NameNode and DataNode data files are stored in subdirectories under it. Make sure this directory exists, and create it first if it does not.
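
    A small sketch of reading a property back out of the site file, handy for double-checking a value like hadoop.tmp.dir before creating the directory. It recreates core-site.xml in a temp file so the snippet is self-contained; the grep/sed trick assumes the hand-written layout above where each tag sits on its own line.

```shell
# Self-contained copy of the core-site.xml written above.
site=$(mktemp)
cat > "$site" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/tmp</value>
  </property>
</configuration>
EOF
# Print the <value> on the line after a given <name>.
get_prop() {
  grep -A1 "<name>$1</name>" "$site" | sed -n 's|.*<value>\(.*\)</value>.*|\1|p'
}
get_prop fs.defaultFS       # prints hdfs://master:8020
get_prop hadoop.tmp.dir     # prints /opt/tmp; mkdir -p this path on the real machine
```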

    Configure hdfs-site.xml:

    [root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
    <configuration>
     <property>
       <name>dfs.namenode.secondary.http-address</name>
       <value>slave2:50090</value>
     </property>
    </configuration>

    dfs.namenode.secondary.http-address sets the HTTP address and port of the SecondaryNameNode. Since the plan places the SecondaryNameNode on slave2, it is set to slave2:50090.

    Configure slaves:

    [root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/slaves
    master
    slave1
    slave2

    The slaves file lists the hosts that run a DataNode in the HDFS cluster.

    Configure yarn-site.xml:

    [root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/yarn-site.xml
    <configuration>
    
    <!-- Site specific YARN configuration properties -->
       <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>slave1</value>
        </property>
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>106800</value>
        </property>
    </configuration>

    Per the plan, yarn.resourcemanager.hostname points the ResourceManager at slave1.

    yarn.log-aggregation-enable turns log aggregation on or off.

    yarn.log-aggregation.retain-seconds sets how long, at most, aggregated logs are kept on HDFS.

    Configure mapred-site.xml:

    Copy mapred-site.xml.template to mapred-site.xml:

    [root@master hadoop]# cp /opt/hadoop/hadoop-2.9.2/etc/hadoop/mapred-site.xml.template /opt/hadoop/hadoop-2.9.2/etc/hadoop/mapred-site.xml
    [root@master hadoop]# vi /opt/hadoop/hadoop-2.9.2/etc/hadoop/mapred-site.xml
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>master:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>master:19888</value>
        </property>
    </configuration>

    mapreduce.framework.name makes MapReduce jobs run on YARN.

    mapreduce.jobhistory.address places the MapReduce history server on the master machine.

    mapreduce.jobhistory.webapp.address sets the history server's web UI address and port.

    Distribute the Hadoop files

    1. First create a directory to hold Hadoop on the other two machines

    [root@slave1 opt]# mkdir /opt/hadoop/
    [root@slave2 opt]# mkdir /opt/hadoop/

    2. Distribute via scp

    The share/doc directory under the Hadoop root holds the documentation and is large (about 1.6 GB). Deleting it before distributing saves disk space and speeds up the copy.

    [root@master hadoop]# scp -r /opt/hadoop/hadoop-2.9.2/ slave1:/opt/hadoop/
    [root@master hadoop]# scp -r /opt/hadoop/hadoop-2.9.2/ slave2:/opt/hadoop/
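
    After the scp finishes, a quick completeness check is to compare file counts on both sides. Sketched here with two local directories so it runs anywhere; on the real cluster the second count would run over ssh, e.g. `ssh slave1 'find /opt/hadoop/hadoop-2.9.2 -type f | wc -l'`.

```shell
# Simulate source and destination trees, copy, and compare file counts.
src=$(mktemp -d); dst=$(mktemp -d)
touch "$src/f1" "$src/f2"
mkdir "$src/sub"; touch "$src/sub/f3"
cp -r "$src/." "$dst/"
n_src=$(find "$src" -type f | wc -l)
n_dst=$(find "$dst" -type f | wc -l)
[ "$n_src" -eq "$n_dst" ] && echo "copy looks complete"
```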

    Configure the Hadoop environment variables:

    Run the following on all machines.

    Note carefully:

    1. If you installed as root, edit /etc/profile (system-wide variables).

    2. If you installed as a regular user, edit ~/.bashrc (per-user variables).

    [root@master hadoop]# vi /etc/profile

    Append at the end of the file:

    export HADOOP_HOME=/opt/hadoop/hadoop-2.9.2
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

    Afterwards, reload the environment with source /etc/profile or source ~/.bashrc.
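
    What the PATH edit buys you: every executable under $HADOOP_HOME/bin and $HADOOP_HOME/sbin becomes runnable by name. A sketch with a stub hdfs script in a temp directory standing in for the real installation:

```shell
# A fake HADOOP_HOME with one stub executable, just to show the mechanism.
HADOOP_HOME=$(mktemp -d)
mkdir -p "$HADOOP_HOME/bin" "$HADOOP_HOME/sbin"
printf '#!/bin/sh\necho stub-hdfs\n' > "$HADOOP_HOME/bin/hdfs"
chmod +x "$HADOOP_HOME/bin/hdfs"
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
hdfs    # now found via PATH; prints stub-hdfs
```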

    Format the NameNode

    Run the format on the master (NameNode) machine:

    [root@master bin]# hdfs namenode -format

    Start the cluster

    1. Start HDFS

    [root@master current]# start-dfs.sh
    18/12/20 14:13:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Starting namenodes on [master]
    master: starting namenode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-namenode-master.out
    slave1: starting datanode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-slave1.out
    slave2: starting datanode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-slave2.out
    master: starting datanode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-master.out
    Starting secondary namenodes [slave2]
    slave2: starting secondarynamenode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-secondarynamenode-slave2.out
    18/12/20 14:13:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    2. Start YARN

    Start the YARN services on slave1 (the ResourceManager machine):

    [root@slave1 opt]# start-yarn.sh
    starting yarn daemons
    starting resourcemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-resourcemanager-slave1.out
    slave2: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-nodemanager-slave2.out
    master: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-nodemanager-master.out
    slave1: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-nodemanager-slave1.out

    3. Start the history server

    Since the plan runs the MapReduce history server on master, start it there:

    [root@master sbin]# mr-jobhistory-daemon.sh start historyserver
    starting historyserver, logging to /opt/hadoop/hadoop-2.9.2/logs/mapred-root-historyserver-master.out

    4. Check the daemon processes on the three machines

    [root@master logs]# jps
    6180 Jps
    4790 NodeManager
    5694 NameNode
    6079 JobHistoryServer
    [root@slave1 opt]# jps
    4592 DataNode
    4048 ResourceManager
    4155 NodeManager
    4686 Jps
    [root@slave2 hadoop]# jps
    3744 SecondaryNameNode
    3673 DataNode
    3852 Jps
    3373 NodeManager

    5. View the HDFS web UI

    http://192.168.102.3:50070/

    6. View the YARN web UI

    http://192.168.102.4:8088/cluster

    Test a job

    We use the wordcount example that ships with Hadoop to run a test MapReduce job on the cluster.

    1. Prepare the MapReduce input file wc.input

    [root@master logs]# cat /opt/data/wc.input
    hadoop mapreduce hive
    hbase spark storm
    sqoop hadoop hive
    spark hadoop
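
    Before running the job, a plain-shell word count over the same input predicts what the wordcount example should output:

```shell
# Recreate wc.input and count words with tr/sort/uniq.
input=$(mktemp)
cat > "$input" <<'EOF'
hadoop mapreduce hive
hbase spark storm
sqoop hadoop hive
spark hadoop
EOF
tr -s ' ' '\n' < "$input" | sort | uniq -c | awk '{print $2, $1}'
# Expected: hadoop 3, hbase 1, hive 2, mapreduce 1, spark 2, sqoop 1, storm 1
```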

    2. Create the input directory on HDFS

    [root@master logs]# hdfs dfs -mkdir /input

    3. Upload wc.input to HDFS

    [root@master logs]# hdfs dfs -put /opt/data/wc.input /input/wc.input

    4. Run the bundled MapReduce example

    [root@master logs]# yarn jar /opt/hadoop/hadoop-2.9.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /input/wc.input /output/
    18/12/20 14:43:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    18/12/20 14:43:29 INFO client.RMProxy: Connecting to ResourceManager at slave1/192.168.102.4:8032
    18/12/20 14:43:30 INFO input.FileInputFormat: Total input files to process : 1
    18/12/20 14:43:31 INFO mapreduce.JobSubmitter: number of splits:1
    18/12/20 14:43:31 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
    18/12/20 14:43:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545286577071_0001
    18/12/20 14:43:31 INFO impl.YarnClientImpl: Submitted application application_1545286577071_0001
    18/12/20 14:43:31 INFO mapreduce.Job: The url to track the job: http://slave1:8088/proxy/application_1545286577071_0001/
    18/12/20 14:43:31 INFO mapreduce.Job: Running job: job_1545286577071_0001
    18/12/20 14:43:41 INFO mapreduce.Job: Job job_1545286577071_0001 running in uber mode : false
    18/12/20 14:43:41 INFO mapreduce.Job:  map 0% reduce 0%
    18/12/20 14:43:47 INFO mapreduce.Job:  map 100% reduce 0%
    18/12/20 14:43:54 INFO mapreduce.Job:  map 100% reduce 100%
    18/12/20 14:43:55 INFO mapreduce.Job: Job job_1545286577071_0001 completed successfully
    18/12/20 14:43:55 INFO mapreduce.Job: Counters: 49
            File System Counters
                    FILE: Number of bytes read=94
                    FILE: Number of bytes written=396815
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=169
                    HDFS: Number of bytes written=60
                    HDFS: Number of read operations=6
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
            Job Counters
                    Launched map tasks=1
                    Launched reduce tasks=1
                    Data-local map tasks=1
                    Total time spent by all maps in occupied slots (ms)=3250
                    Total time spent by all reduces in occupied slots (ms)=4671
                    Total time spent by all map tasks (ms)=3250
                    Total time spent by all reduce tasks (ms)=4671
                    Total vcore-milliseconds taken by all map tasks=3250
                    Total vcore-milliseconds taken by all reduce tasks=4671
                    Total megabyte-milliseconds taken by all map tasks=3328000
                    Total megabyte-milliseconds taken by all reduce tasks=4783104
            Map-Reduce Framework
                    Map input records=4
                    Map output records=11
                    Map output bytes=115
                    Map output materialized bytes=94
                    Input split bytes=98
                    Combine input records=11
                    Combine output records=7
                    Reduce input groups=7
                    Reduce shuffle bytes=94
                    Reduce input records=7
                    Reduce output records=7
                    Spilled Records=14
                    Shuffled Maps =1
                    Failed Shuffles=0
                    Merged Map outputs=1
                    GC time elapsed (ms)=136
                    CPU time spent (ms)=1160
                    Physical memory (bytes) snapshot=368492544
                    Virtual memory (bytes) snapshot=4127322112
                    Total committed heap usage (bytes)=165810176
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters
                    Bytes Read=71
            File Output Format Counters
                    Bytes Written=60

    5. Check the output files

    [root@master logs]# hdfs dfs -ls /output
    18/12/20 14:45:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 2 items
    -rw-r--r--   3 root supergroup          0 2018-12-20 14:43 /output/_SUCCESS
    -rw-r--r--   3 root supergroup         60 2018-12-20 14:43 /output/part-r-00000
    [root@master logs]# hdfs dfs -cat /output/part-r-00000
    18/12/20 14:45:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    hadoop  3
    hbase   1
    hive    2
    mapreduce       1
    spark   2
    sqoop   1
    storm   1
  • Original article: https://www.cnblogs.com/scarecrow-blog/p/10144273.html