• Getting Started with Hadoop


    Learning resources: the Hadoop wiki, the official website, and the book 《Hadoop海量数据处理技术详解与项目实战》.

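    Apache Hadoop releases fall into two generations, referred to here as Hadoop 1.0 and Hadoop 2.0. The first generation comprises three major release lines: 0.20.x, 0.21.x, and 0.22.x. Of these, 0.20.x eventually evolved into 1.0.x and became the stable release. The second generation comprises two release lines, 0.23.x and 2.x. Both are built on an entirely new architecture that differs completely from Hadoop 1.0, and both include HDFS Federation and YARN. Compared with 0.23.x, 2.x adds two major features: NameNode HA and wire compatibility.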

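    Hadoop is really a framework made up of a set of software libraries, which can also be called modules. The most important are Common (remote procedure calls (RPC) and the serialization mechanism), HDFS (data storage), and MapReduce (data computation).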

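    The open-source world is endlessly creative, and more and more software keeps appearing around Hadoop. The name "Hadoop" is therefore sometimes also used to refer to the whole Hadoop ecosystem.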

    Hadoop 2.7.2 pseudo-distributed installation

    Environment

    • Virtual machine: Ubuntu, 64-bit
    • JDK: 1.7.0_79, 64-bit
    • Hadoop: 2.7.2

    Disable the firewall and SELinux while installing Hadoop; otherwise errors may occur.

    Software requirements:

    1. Java

    $ java -version
    java version "1.7.0_79"
    Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
    Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
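
    The version check above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the helper names `java_version_string` and `java_is_7_or_newer` are hypothetical and not part of Java or Hadoop. It only parses the text that `java -version` prints and tests for Java 7 or newer, the version line used in this walkthrough.

```shell
#!/bin/sh
# Hypothetical helper: pull the quoted version string out of `java -version` output.
# $1: the text printed by `java -version` (it goes to stderr, so capture it with 2>&1)
java_version_string() {
  printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: succeed if the reported version is Java 7 or newer.
java_is_7_or_newer() {
  v=$(java_version_string "$1")
  major=${v%%.*}                 # "1" for 1.7.0_79, "11" for 11.0.2
  if [ "$major" = "1" ]; then    # legacy 1.x scheme: the real major is the second field
    rest=${v#1.}
    major=${rest%%.*}
  fi
  [ -n "$major" ] && [ "$major" -ge 7 ] 2>/dev/null
}

# Usage: only runs the check when a java binary is on the PATH.
if command -v java >/dev/null 2>&1; then
  if java_is_7_or_newer "$(java -version 2>&1)"; then
    echo "Java 7 or newer detected"
  fi
fi
```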

    2. ssh: sshd must be running so that the Hadoop scripts can manage the Hadoop nodes.

    $ sudo apt-get install ssh
    $ sudo apt-get install rsync

    Pseudo-distributed configuration

    1. Edit the file etc/hadoop/hadoop-env.sh:

    # set to the root of your Java installation
    export JAVA_HOME=/usr/java/latest
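
    The path /usr/java/latest is only a placeholder and may not exist on an Ubuntu machine. One way to find a candidate value is to resolve the `java` binary on the PATH and strip the trailing directories. This is a sketch under assumptions: the helper name `java_home_from_binary` is hypothetical, `readlink -f` is assumed to be the GNU coreutils version, and the result should be checked by hand before it goes into hadoop-env.sh.

```shell
#!/bin/sh
# Hypothetical helper: turn the resolved path of the java binary into a JAVA_HOME candidate.
# $1: absolute path of the real java executable
java_home_from_binary() {
  home=${1%/bin/java}    # drop the trailing /bin/java
  home=${home%/jre}      # JDK 7/8 layouts may keep the binary under <jdk>/jre/bin
  printf '%s\n' "$home"
}

# Usage: follow symlinks such as /usr/bin/java -> /etc/alternatives/java -> real path,
# then print a line that could be pasted into etc/hadoop/hadoop-env.sh.
if command -v java >/dev/null 2>&1; then
  echo "export JAVA_HOME=$(java_home_from_binary "$(readlink -f "$(command -v java)")")"
fi
```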

    2. etc/hadoop/core-site.xml:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
    

    3. etc/hadoop/hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
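
    Both site files hold a single property, so they can also be generated from a script, which helps when the same setup is repeated on several virtual machines. The sketch below is one possible approach; `write_site_xml` is a hypothetical helper, not a Hadoop tool, and it overwrites the target file.

```shell
#!/bin/sh
# Hypothetical helper: write a Hadoop site file containing exactly one property.
# $1: target file, $2: property name, $3: property value
write_site_xml() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
    <property>
        <name>$2</name>
        <value>$3</value>
    </property>
</configuration>
EOF
}

# Usage, run from the Hadoop installation directory (this overwrites both files):
#   write_site_xml etc/hadoop/core-site.xml fs.defaultFS hdfs://localhost:9000
#   write_site_xml etc/hadoop/hdfs-site.xml dfs.replication 1
# The value Hadoop actually picks up can then be checked with:
#   bin/hdfs getconf -confKey fs.defaultFS
```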
    
    Set up passphraseless ssh
    
    Now check that you can ssh to the localhost without a passphrase:
    
      $ ssh localhost
    
    If you cannot ssh to localhost without a passphrase, execute the following commands:
    
      $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      $ chmod 0600 ~/.ssh/authorized_keys
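
    Running the `cat ... >> ~/.ssh/authorized_keys` step more than once appends the same key again each time. A small guard avoids that. This is a sketch only: `authorize_key_once` is a hypothetical helper name, and the public key path passed to it is whichever .pub file ssh-keygen produced above.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file only if it is absent.
# $1: public key file, $2: authorized_keys file
authorize_key_once() {
  touch "$2"
  # -x whole-line match, -F fixed string, -f read the pattern from the key file
  if ! grep -qxF -f "$1" "$2"; then
    cat "$1" >> "$2"
  fi
  chmod 0600 "$2"
}

# Usage (pubkey is the .pub file generated by ssh-keygen):
#   authorize_key_once "$pubkey" ~/.ssh/authorized_keys
# A non-interactive check that fails instead of prompting for a password:
#   ssh -o BatchMode=yes localhost true && echo "passphraseless ssh works"
```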

    Execution
    
    The following instructions run a MapReduce job locally. To execute a job on YARN, see "YARN on a Single Node" below.
    
        Format the filesystem:
    
          $ bin/hdfs namenode -format
    
        Start NameNode daemon and DataNode daemon:
    
          $ sbin/start-dfs.sh
    
        The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
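
        Besides reading the logs, the JDK tool `jps` lists the running Java processes, which gives a quick check that the daemons actually started. The sketch below assumes `jps` prints lines of the form "<pid> <name>"; `missing_daemons` is a hypothetical helper that reports which expected names are absent.

```shell
#!/bin/sh
# Hypothetical helper: print the expected daemons that do not appear in jps output.
# $1: text printed by `jps`; remaining arguments: daemon names to look for
missing_daemons() {
  jps_output=$1
  shift
  for name in "$@"; do
    # Compare the second column exactly so NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$jps_output" |
        awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage after sbin/start-dfs.sh (prints nothing when all three are running):
#   missing_daemons "$(jps)" NameNode DataNode SecondaryNameNode
```

        The same helper could be reused after sbin/start-yarn.sh with ResourceManager and NodeManager as the expected names.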
    
        Browse the web interface for the NameNode; by default it is available at:
            NameNode - http://localhost:50070/
    
        Make the HDFS directories required to execute MapReduce jobs:
    
          $ bin/hdfs dfs -mkdir /user
          $ bin/hdfs dfs -mkdir /user/<username>
    
        Copy the input files into the distributed filesystem:
    
          $ bin/hdfs dfs -put etc/hadoop input
    
        Run some of the examples provided:
    
          $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
    
        Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them:
    
          $ bin/hdfs dfs -get output output
          $ cat output/*
    
        or
    
        View the output files on the distributed filesystem:
    
          $ bin/hdfs dfs -cat output/*
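
        To sanity-check the job output, roughly the same result can be computed on the local copy of the input with standard tools: the grep example counts every match of the regular expression and lists the counts in descending order. The sketch below is a local approximation, not the Hadoop implementation; `local_grep_count` is a hypothetical helper, and it assumes the pattern means the same thing in grep's extended syntax as in Java's, which holds for 'dfs[a-z.]+'.

```shell
#!/bin/sh
# Hypothetical helper: count regex matches in local files, most frequent first.
# $1: extended regular expression; remaining arguments: input files
local_grep_count() {
  pattern=$1
  shift
  # -o print each match on its own line, -h omit file names, -E extended regex
  grep -ohE "$pattern" "$@" | sort | uniq -c | sort -rn |
    awk '{ print $1 "\t" $2 }'    # "<count><TAB><match>" per line
}

# Usage, run from the Hadoop installation directory:
#   local_grep_count 'dfs[a-z.]+' etc/hadoop/*
```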
    
        When you’re done, stop the daemons with:
    
          $ sbin/stop-dfs.sh

    YARN

    YARN on a Single Node
    
    You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and running ResourceManager daemon and NodeManager daemon in addition.
    
    The following instructions assume that the first four steps of the Execution section above (from formatting the filesystem through creating the HDFS directories) have already been executed.
    
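        Configure parameters as follows:

        etc/hadoop/mapred-site.xml (in Hadoop 2.7.2 this file may first need to be created by copying etc/hadoop/mapred-site.xml.template):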
    
        <configuration>
            <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
            </property>
        </configuration>
    
        etc/hadoop/yarn-site.xml:
    
        <configuration>
            <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
            </property>
        </configuration>
    
        Start ResourceManager daemon and NodeManager daemon:
    
          $ sbin/start-yarn.sh
    
        Browse the web interface for the ResourceManager; by default it is available at:
            ResourceManager - http://localhost:8088/
    
        Run a MapReduce job.
    
        When you’re done, stop the daemons with:
    
          $ sbin/stop-yarn.sh

    The pseudo-distributed environment is now up and running!





  • Original article: https://www.cnblogs.com/flyingbee6/p/5204363.html