• Installing Spark on YARN under Ubuntu 14.10


    1 Server layout

    Server          Role
    192.168.1.100   NameNode
    192.168.1.101   DataNode
    192.168.1.102   DataNode

    2 Software environment

      2.1 Install the JDK and add it to the environment variables
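      A minimal sketch, assuming the OpenJDK 7 package from the Ubuntu repositories and a per-user ~/.bashrc:

    sudo apt-get install openjdk-7-jdk

    # append to ~/.bashrc (the path below is the default for the Ubuntu amd64 package)
    export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
    export PATH=$PATH:$JAVA_HOME/bin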

      2.2 Install Scala and add it to the environment variables
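      A sketch, assuming the scala-2.11.4 tarball is unpacked under the directory used later in this guide:

    tar -zxvf scala-2.11.4.tgz -C /home/hadoop/software/spark/

    # append to ~/.bashrc
    export SCALA_HOME=/home/hadoop/software/spark/scala-2.11.4
    export PATH=$PATH:$SCALA_HOME/bin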

      2.3 Set up passwordless SSH login (A to A, and A to B); see http://blog.csdn.net/codepeak/article/details/14447627 for reference

    ssh-keygen -t rsa -P ''
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    scp ~/.ssh/id_rsa.pub username@ipaddress:/location
    cat id_rsa.pub >> authorized_keys
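      To confirm the passwordless login works, a quick check (cloud002 is one of the slave hostnames defined in the next step):

    # should print the remote hostname without prompting for a password
    ssh cloud002 hostname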

      2.4 Hostname setup

    sudo nano /etc/hosts
    
    192.168.1.100 cloud001
    192.168.1.101 cloud002
    192.168.1.102 cloud003
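      Name resolution can then be verified from the master, for example:

    ping -c 1 cloud002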

    3 Hadoop cluster configuration (identical on every machine)

      3.1 Install Hadoop and configure the environment variables

    export HADOOP_HOME=/home/hadoop/hadoop-2.2.0
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    
    export SCALA_HOME=/home/hadoop/software/spark/scala-2.11.4
    export SPARK_EXAMPLES_JAR=/home/hadoop/software/spark/spark-1.0.0/examples/target/scala-2.11.4/spar$
    export SPARK_HOME=/home/hadoop/software/spark/spark-1.0.0
    export IDEA_HOME=/home/hadoop/software/dev/idea-IU-139.1117.1
    
    export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin:$IDEA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$M2_HOME/bin
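      After editing, reload the profile and confirm the tools are on the PATH (assuming the variables above were added to ~/.bashrc):

    source ~/.bashrc
    hadoop version
    scala -version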

      3.2 core-site.xml configuration

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://cloud001:9000</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/home/hadoop/hadoop-2.2.0/tmp</value>
        </property>
    <!--    <property>
            <name>io.file.buffer.size</name>
            <value>131072</value>
        </property>
        <property>
            <name>hadoop.proxyuser.hadoop.hosts</name>
            <value>*</value>
        </property>
        <property>
            <name>hadoop.proxyuser.hadoop.groups</name>
            <value>*</value>
        </property>-->
    </configuration>
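      It is safest to create the hadoop.tmp.dir directory up front on every node; a sketch, assuming the path above:

    mkdir -p /home/hadoop/hadoop-2.2.0/tmp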

      3.3 hdfs-site.xml configuration

    <configuration>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>cloud001:9001</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/home/hadoop/hadoop-2.2.0/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/home/hadoop/hadoop-2.2.0/dfs/data</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
        <property>
            <name>dfs.webhdfs.enabled</name>
            <value>true</value>
        </property>
    </configuration>
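      Likewise, the name and data directories referenced above can be created in advance on the corresponding nodes:

    mkdir -p /home/hadoop/hadoop-2.2.0/dfs/name   # on the NameNode
    mkdir -p /home/hadoop/hadoop-2.2.0/dfs/data   # on each DataNode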

      3.4 mapred-site.xml configuration

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    <!--    <property>
            <name>mapreduce.jobhistory.address</name>
            <value>hadoopmaster:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>hadoopmaster:19888</value>
        </property>-->
    </configuration>
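      Note that Hadoop 2.2.0 ships only a template for this file; if mapred-site.xml does not yet exist, it can be created from the template before adding the block above:

    cd $HADOOP_CONF_DIR
    cp mapred-site.xml.template mapred-site.xml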

      3.5 yarn-site.xml configuration

    <configuration>
    
    <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    <!--    <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>-->
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>cloud001:8032</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>cloud001:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>cloud001:8031</value>
        </property>
    <!--    <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>hadoopmaster:8033</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>hadoopmaster:8088</value>
        </property> -->
    </configuration>

      3.6 Configure hadoop-env.sh, mapred-env.sh, and yarn-env.sh by adding at the top:

    export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64

      3.7 DataNode configuration (the slaves file)

    nano $HADOOP_CONF_DIR/slaves
    cloud002
    cloud003
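      Once the master is configured, the whole Hadoop directory can be copied to the slaves so all machines share the same configuration; a sketch, assuming the same hadoop user and install path everywhere:

    scp -r /home/hadoop/hadoop-2.2.0 hadoop@cloud002:/home/hadoop/
    scp -r /home/hadoop/hadoop-2.2.0 hadoop@cloud003:/home/hadoop/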

    4 Spark cluster configuration (identical on every machine)

      4.1 Spark installation and deployment

      Download the Spark binary package and configure the environment variables:

    export SCALA_HOME=/home/hadoop/software/spark/scala-2.11.4
    export SPARK_EXAMPLES_JAR=/home/hadoop/software/spark/spark-1.0.0/examples/target/scala-2.11.4/spar$
    export SPARK_HOME=/home/hadoop/software/spark/spark-1.0.0

      Configure spark-env.sh (under $SPARK_HOME/conf, created from spark-env.sh.template if needed) and add the following:

    export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
    export SCALA_HOME=/home/hadoop/software/spark/scala-2.11.4
    export HADOOP_HOME=/home/hadoop/hadoop-2.2.0

      Configure the slaves file (also under $SPARK_HOME/conf):

    cloud002
    cloud003
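      As with Hadoop, the configured Spark directory can be pushed to the slaves so every node has identical settings; a sketch under the same path assumptions:

    scp -r /home/hadoop/software/spark hadoop@cloud002:/home/hadoop/software/
    scp -r /home/hadoop/software/spark hadoop@cloud003:/home/hadoop/software/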

    5 Starting the cluster

      5.1 Format the NameNode (only needed once, on cloud001)

    hdfs namenode -format

      5.2 Start Hadoop (from $HADOOP_HOME)

    sbin/start-all.sh
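      A quick sanity check with jps on each machine:

    jps
    # on cloud001 (master): NameNode, SecondaryNameNode, ResourceManager
    # on cloud002/cloud003 (slaves): DataNode, NodeManager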

      5.3 Start Spark (run Spark's own sbin/start-all.sh from $SPARK_HOME; note that Hadoop ships a script of the same name)

    sbin/start-all.sh
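      After this, jps should additionally show the Spark standalone daemons, and the master web UI is served at http://cloud001:8080 by default:

    jps
    # cloud001 should now also list: Master
    # cloud002/cloud003 should now also list: Worker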

    6 Testing

      6.1 Local test

    # bin/run-example org.apache.spark.examples.SparkPi local

      6.2 Standalone cluster test

    # bin/run-example org.apache.spark.examples.SparkPi spark://cloud001:7077
    # bin/run-example org.apache.spark.examples.SparkLR spark://cloud001:7077
    # bin/run-example org.apache.spark.examples.SparkKMeans spark://cloud001:7077 file:/usr/local/spark/data/kmeans_data.txt 2 1

      6.3 Cluster mode with HDFS

    # hadoop fs -put README.md .
    # MASTER=spark://cloud001:7077 bin/spark-shell
    scala> val file = sc.textFile("hdfs://cloud001:9000/user/root/README.md")
    scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    
    scala> count.collect()
    
    scala> :quit

      6.4 YARN mode

    SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar \
    bin/spark-class org.apache.spark.deploy.yarn.Client \
      --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar \
      --class org.apache.spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g \
      --worker-cores 1
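      The submitted application can then be tracked from the ResourceManager, either through its web UI or on the command line:

    # lists running YARN applications, including the SparkPi job submitted above
    yarn application -list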