• Building Spark 2 from source


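    0. Operating system and Hadoop version

    CentOS: 6.4
    Hadoop: 2.5.0-cdh5.3.6

    1. Why build Spark from source?

    The first step in learning Spark should be building it from source. A working source build lets you modify and debug the code later, and extend or integrate feature modules.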

    2. What are the three ways to build Spark from source?

    a. Maven build
    # export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
    # ${SPARK_HOME_SRC}/./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

    b. SBT build
    # ${SPARK_HOME_SRC}/./build/sbt -Pyarn -Phadoop-2.3 package

    c. Distribution build (compile and package)
    # ${SPARK_HOME_SRC}/./dev/make-distribution.sh --tgz -Psparkr -Dhadoop.version=2.5.0-cdh5.3.6 -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn

    3. Version requirements:

    Maven 3.3.9

    JDK 1.8+ (1.8.0_12)
    Scala 2.11.8
    Note: Starting version 2.0, Spark is built with Scala 2.11 by default.
    R (3.2.0)
    wget http://mirrors.tuna.tsinghua.edu.cn/CRAN/src/base/R-3/R-3.2.0.tar.gz
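
    The version requirements above can be checked before starting a long build. The following is a minimal sketch, not part of Spark or Maven: version_ge is a hypothetical helper, it relies on GNU sort -V, and it assumes the first line of mvn -v has the form "Apache Maven 3.3.9 (...)".

```shell
# Hypothetical helper (not part of Spark or Maven):
# version_ge A B succeeds when version A is at least version B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum Maven version taken from the list above.
required_maven="3.3.9"

# Only run the check when mvn is on PATH.
if command -v mvn >/dev/null 2>&1; then
  # Assumption: the first line of `mvn -v` looks like "Apache Maven 3.3.9 (...)".
  found_maven="$(mvn -v | awk 'NR==1 {print $3}')"
  if version_ge "$found_maven" "$required_maven"; then
    echo "Maven $found_maven is new enough"
  else
    echo "Maven $found_maven is older than $required_maven"
  fi
fi
```

    The same helper can compare any dotted version pair, for example version_ge 2.11.8 2.11 for Scala.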

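    4. Overview of the build steps:

    0. Build as the root user, with a working network connection
    1. Set up the JDK
    2. Set up Maven
    3. Set up the R (3.2.0) environment
    4. Run the build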

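    5. The JDK and Maven are both installed from tarballs

    Procedure: upload the tarball, extract it, set the environment variables, then source the profile file so the changes take effect.
    NOTE:
    Check that Maven picks up the installed Java environment (mvn -v prints the Java version and Java home it is using).
    Configure the Aliyun mirror for Maven:
    Edit ${MAVEN_HOME}/conf/settings.xml
    Add the mirror inside the <mirrors> element:
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>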
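
    A quick way to confirm the mirror entry was saved is to search settings.xml for its id. This is only a sketch: has_mirror is a hypothetical helper, the fallback Maven path is an assumption, and a plain text search would also match an id that sits inside an XML comment.

```shell
# Hypothetical helper: succeed when the file ($1) contains a mirror id ($2).
has_mirror() {
  grep -q "<id>$2</id>" "$1"
}

# Assumption: MAVEN_HOME is set; the fallback path is only a placeholder.
settings="${MAVEN_HOME:-/opt/modules/maven}/conf/settings.xml"

if [ -f "$settings" ]; then
  if has_mirror "$settings" alimaven; then
    echo "alimaven mirror found in $settings"
  else
    echo "alimaven mirror missing from $settings"
  fi
fi
```

    For a check that reflects what Maven actually loads, mvn help:effective-settings prints the merged settings.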

    Setting up R
    Download the source
    # cd ${R_HOME}
    # yum install gcc-gfortran readline-devel libXt-devel

    Each package prevents a specific configure error:
    # yum install gcc-gfortran   # otherwise: "configure: error: No F77 compiler found"

    # yum install gcc gcc-c++   # otherwise: "configure: error: C++ preprocessor "/lib/cpp" fails sanity check"

    # yum install readline-devel   # otherwise: "--with-readline=yes (default) and headers/libs are not available"

    # yum install libXt-devel   # otherwise: "configure: error: --with-x=yes (default) and X11 headers/libs are not available"

    # ./configure --enable-R-shlib

    # make && make install
    # vi ~/.bashrc   (set the environment variables)
    export R_HOME=/opt/modules/R-3.2.0
    export PATH=$R_HOME/bin:$PATH
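
    After editing ~/.bashrc, it is worth confirming that the new PATH entry is active in the current shell. The sketch below assumes the R_HOME value from the export above; path_contains is a hypothetical helper, not a standard command.

```shell
# Hypothetical helper: succeed when the directory ($1) is an entry in $PATH.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# The default mirrors the export above; adjust it if R lives elsewhere.
r_home="${R_HOME:-/opt/modules/R-3.2.0}"

if path_contains "$r_home/bin"; then
  echo "$r_home/bin is on PATH"
else
  echo "$r_home/bin is not on PATH; run: source ~/.bashrc"
fi

# Print the R version only when R is actually installed.
if command -v R >/dev/null 2>&1; then
  R --version | head -n 1
fi
```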

    6. Running the build

    Upload the source tarball and extract it
    # cd ${SPARK_HOME_SRC}
    # ${SPARK_HOME_SRC}/./dev/make-distribution.sh --tgz -Psparkr -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Phive -Phive-thriftserver -Pyarn
    a. Enable SparkR with -Psparkr

    b. Set the Hadoop version with -Dhadoop.version=2.5.0-cdh5.3.6

    c. Extract the Scala tarball into ${SPARK_HOME_SRC}/build/

    d. Hardcode the matching versions in dev/make-distribution.sh
    Original:
    VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)
    SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
        | grep -v "INFO"\
        | tail -n 1)
    SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
        | grep -v "INFO"\
        | tail -n 1)
    SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
        | grep -v "INFO"\
        | fgrep --count "<id>hive</id>";
        # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing
        # because we use "set -o pipefail"
        echo -n)
    Replace these with the following fixed values:
    VERSION=2.1.0
    SCALA_VERSION=2.11
    SPARK_HADOOP_VERSION=2.5.0-cdh5.3.6
    SPARK_HIVE=1
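
    Each original assignment runs Maven and keeps the last output line that does not contain INFO, which is why a fixed value can stand in for it. The sketch below reproduces that filter on made-up output. The sample text is only an illustration, not real Maven output, and last_non_info_line is a hypothetical name.

```shell
# Illustrative stand-in for Maven output; the real text differs by version.
sample_output='[INFO] Scanning for projects...
[INFO] 
2.1.0'

# Same filter as the original script: drop INFO lines, keep the last line left.
last_non_info_line() {
  grep -v "INFO" | tail -n 1
}

printf '%s\n' "$sample_output" | last_non_info_line   # prints 2.1.0
```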

    e. Add the CDH repository to Spark's pom.xml
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>

    Without this repository, the build fails with the following error:
    Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.1.0: Could not find artifact org.apache.hadoop:hadoop-client:jar:2.5.0-cdh5.3.6

    [ERROR] After correcting the problems, you can resume the build with the command
    [ERROR] mvn <goals> -rf :spark-launcher_2.11

    After adding the repository, resume from the failed module by appending -rf :spark-launcher_2.11:
    # ${SPARK_HOME_SRC}/./dev/make-distribution.sh --tgz -Psparkr -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Phive -Phive-thriftserver -Pyarn -rf :spark-launcher_2.11

    The same build without the R module:
    # ${SPARK_HOME_SRC}/./dev/make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Phive -Phive-thriftserver -Pyarn
    ===============================================================================

    The distribution build writes its package to ${SPARK_HOME_SRC}/spark-2.1.0-bin-2.5.0-cdh5.3.6.tgz
    The name follows the pattern spark-<SPARK_VERSION>-bin-<HADOOP_VERSION>.tgz
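
    The naming pattern can be sketched as a small function, which is handy for locating the package from a script. dist_name is a hypothetical helper, and the listing step assumes SPARK_HOME_SRC points at the source directory used above.

```shell
# Hypothetical helper: compose the package name from the Spark version ($1)
# and the Hadoop version ($2).
dist_name() {
  printf 'spark-%s-bin-%s.tgz\n' "$1" "$2"
}

package="$(dist_name 2.1.0 2.5.0-cdh5.3.6)"
echo "$package"   # prints spark-2.1.0-bin-2.5.0-cdh5.3.6.tgz

# List the first few entries only when the package exists in the source directory.
if [ -n "${SPARK_HOME_SRC:-}" ] && [ -f "$SPARK_HOME_SRC/$package" ]; then
  tar -tzf "$SPARK_HOME_SRC/$package" | head -n 5
fi
```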

    NOTE:
    Keep a copy of the built Spark package. Later study of Spark SQL and Spark Streaming will need the jar files it contains.

    =====================================================================================

    To actually run R on Spark, initialize R once the build above has finished:
    # cd ${SPARK_HOME_SRC}/R/
    # ./install-dev.sh
    Reference: https://github.com/apache/spark/tree/master/R

  • Original article: https://www.cnblogs.com/feiyumo/p/7482465.html