Original link: http://spark.apache.org/docs/1.5.0/building-spark.html
· Building a Runnable Distribution
· Setting up Maven’s Memory Usage
· Specifying the Hadoop Version
· Building With Hive and JDBC Support
· Building Spark with IntelliJ IDEA or Eclipse
· Building for PySpark on YARN
· Packaging without Hadoop Dependencies for YARN
· Speeding up Compilation with Zinc
Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+. The Spark build can supply a suitable Maven binary; see below.
Building with build/mvn
Spark now comes packaged with a self-contained Maven installation, located under the build/ directory, to ease building and deploying Spark from source. The build/mvn script automatically downloads and sets up all necessary build requirements (Maven, Scala, and Zinc) locally within the build/ directory itself. It honors any mvn binary that is already present, but it pulls down its own copies of Scala and Zinc regardless, to ensure that the proper version requirements are met. build/mvn passes its arguments through to the mvn call, which allows an easy transition from previous build methods. As an example, one can build a version of Spark as follows:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Other build examples can be found below.
Note: When building on an encrypted filesystem (if your home directory is encrypted, for example), the Spark build might fail with a “Filename too long” error. As a workaround, add the following to the configuration args of the scala-maven-plugin in the project pom.xml:
<arg>-Xmax-classfile-name</arg>
<arg>128</arg>
and in project/SparkBuild.scala add:
scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
to the sharedSettings val. See also this PR if you are unsure of where to add these lines.
Building a Runnable Distribution
To create a Spark distribution like those distributed by the Spark Downloads page, laid out so as to be runnable, use make-distribution.sh in the project root directory. It can be configured with Maven profile settings and so on, just like the direct Maven build. Example:
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn
For more information on usage, run ./make-distribution.sh --help
Setting up Maven’s Memory Usage
You’ll need to configure Maven to use more memory than usual by setting MAVEN_OPTS. We recommend the following settings:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
If you don’t run this, you may see errors like the following:
[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.10/classes...
[ERROR] PermGen space -> [Help 1]
[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.10/classes...
[ERROR] Java heap space -> [Help 1]
You can fix this by setting the MAVEN_OPTS variable as discussed before.
Note:
· For Java 8 and above this step is not required.
· If using build/mvn with no MAVEN_OPTS set, the script will automate this for you.
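A minimal sketch of applying the recommended settings from a shell profile or wrapper script without overwriting a value that is already exported. The := default expansion is standard POSIX shell. Placing it in a wrapper is only one possible convention, not something the Spark build requires.

```shell
# Use the recommended settings only when MAVEN_OPTS is unset or empty,
# so a value exported elsewhere (for example in ~/.profile) wins.
: "${MAVEN_OPTS:=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m}"
export MAVEN_OPTS

# Show what Maven will actually receive.
echo "MAVEN_OPTS=$MAVEN_OPTS"
```

Running a plain mvn command in the same shell afterwards picks up the exported value.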
Specifying the Hadoop Version
Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you’ll need to build Spark against the specific HDFS version in your environment. You can do this through the hadoop.version property. If unset, Spark will build against Hadoop 2.2.0 by default. Note that certain build profiles are required for particular Hadoop versions:
Hadoop version | Profile required
1.x to 2.1.x | hadoop-1
2.2.x | hadoop-2.2
2.3.x | hadoop-2.3
2.4.x | hadoop-2.4
2.6.x and later 2.x | hadoop-2.6
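The table can be turned into a small lookup, sketched here as a shell function. hadoop_profile is a hypothetical helper name, not part of the Spark build. The table does not list 2.5.x, so mapping it to hadoop-2.4 is an assumption taken from the "2.4.X or 2.5.X" example later in this section.

```shell
# Map a Hadoop version string to the Maven profile from the table.
# hadoop_profile is a hypothetical helper, not part of the Spark build.
hadoop_profile() {
  case "$1" in
    1.*|2.0.*|2.1.*) echo "hadoop-1" ;;
    2.2.*)           echo "hadoop-2.2" ;;
    2.3.*)           echo "hadoop-2.3" ;;
    2.4.*)           echo "hadoop-2.4" ;;
    2.5.*)           echo "hadoop-2.4" ;;  # assumption: not in the table
    2.*)             echo "hadoop-2.6" ;;  # 2.6.x and later 2.x
    *) echo "no known profile for Hadoop version: $1" >&2
       return 1 ;;
  esac
}

hadoop_profile 2.4.0                # prints hadoop-2.4
hadoop_profile 2.0.0-mr1-cdh4.2.0   # prints hadoop-1
```

The result can be passed to Maven as -P"$(hadoop_profile "$version")" together with -Dhadoop.version="$version".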
For Apache Hadoop versions 1.x, Cloudera CDH “mr1” distributions, and other Hadoop versions without YARN, use:
# Apache Hadoop 1.2.1
mvn -Dhadoop.version=1.2.1 -Phadoop-1 -DskipTests clean package
# Cloudera CDH 4.2.0 with MapReduce v1
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phadoop-1 -DskipTests clean package
You can enable the yarn profile and optionally set the yarn.version property if it is different from hadoop.version. Spark only supports YARN versions 2.2.0 and later.
Examples:
# Apache Hadoop 2.2.X
mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
# Apache Hadoop 2.3.X
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
# Apache Hadoop 2.4.X or 2.5.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package
Versions of Hadoop after 2.5.X may or may not work with the -Phadoop-2.4 profile (they were
released after this version of Spark).
# Different versions of HDFS and YARN.
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=2.2.0 -DskipTests clean package
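As a rough sketch of the rule above, the function below prints a YARN build command and adds -Dyarn.version only when the YARN version differs from the Hadoop version. yarn_build_cmd is a hypothetical helper written for illustration. It only prints the command, so the result can be inspected before it is run.

```shell
# Print the mvn command for a YARN build.
# Arguments: <profile> <hadoop version> [yarn version]
# yarn_build_cmd is a hypothetical helper, not part of the Spark build.
yarn_build_cmd() {
  profile=$1
  hadoop_version=$2
  yarn_version=${3:-}
  cmd="mvn -Pyarn -P$profile -Dhadoop.version=$hadoop_version"
  # yarn.version is only needed when it differs from hadoop.version.
  if [ -n "$yarn_version" ] && [ "$yarn_version" != "$hadoop_version" ]; then
    cmd="$cmd -Dyarn.version=$yarn_version"
  fi
  echo "$cmd -DskipTests clean package"
}

yarn_build_cmd hadoop-2.3 2.3.0 2.2.0
```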
Building With Hive and JDBC Support
To enable Hive integration for Spark SQL along with its JDBC server and CLI, add the -Phive and -Phive-thriftserver profiles to your existing build options. By default Spark will build with Hive 0.13.1 bindings.
# Apache Hadoop 2.4.X with Hive 13 support
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Building for Scala 2.11
To produce a Spark package compiled with Scala 2.11, first switch the build to Scala 2.11 with dev/change-scala-version.sh, then pass the -Dscala-2.11 property:
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
Spark does not yet support its JDBC component for Scala 2.11.
Spark Tests in Maven
Tests are run by default via the ScalaTest Maven plugin.
Some of the tests require Spark to be packaged first, so always run mvn package with -DskipTests the first time. The following is an example of a correct (build, test) sequence:
mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive -Phive-thriftserver clean package
mvn -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
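One way to keep the two steps in sync is to share the profile list and chain the commands, so the tests only start if packaging succeeded. This is a sketch: build_then_test, MVN, and PROFILES are illustrative names, not part of the Spark build.

```shell
# Maven command to use; set MVN=build/mvn to use the bundled wrapper.
MVN=${MVN:-mvn}
# The same profiles must be passed to both the package and the test step.
PROFILES="-Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver"

# build_then_test is a hypothetical helper, not part of the Spark build.
build_then_test() {
  # $PROFILES is intentionally unquoted so it splits into separate flags.
  "$MVN" $PROFILES -DskipTests clean package && "$MVN" $PROFILES test
}

# Usage: build_then_test
```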
The ScalaTest plugin also supports running only a specific test suite as follows:
mvn -Dhadoop.version=... -DwildcardSuites=org.apache.spark.repl.ReplSuite test
Continuous Compilation
We use the scala-maven-plugin, which supports incremental and continuous compilation. For example,
mvn scala:cc
should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively. A couple of gotchas to note:
· it only scans the paths src/main and src/test (see docs), so it will only work from within certain submodules that have that structure.
· you’ll typically need to run mvn install from the project root for compilation within specific submodules to work; this is because submodules that depend on other submodules do so via the spark-parent module.
Thus, the full flow for running continuous compilation of the core submodule may look more like:
$ mvn install
$ cd core
$ mvn scala:cc
Building Spark with IntelliJ IDEA or Eclipse
For help in setting up IntelliJ IDEA or Eclipse for Spark development, and troubleshooting, refer to the wiki page for IDE setup.
Running Java 8 Test Suites
To run only the Java 8 tests and nothing else:
mvn install -DskipTests -Pjava8-tests
Java 8 tests are run when the -Pjava8-tests profile is enabled; they will run in spite of -DskipTests. For these tests to run, your system must have a JDK 8 installation. If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests.
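If JDK 8 is not the system default, JAVA_HOME can be set for just the one Maven invocation instead of the whole session. A minimal sketch, where run_java8_tests and MVN are illustrative names and the JDK path is a placeholder for wherever JDK 8 lives on your machine:

```shell
# Maven command to use; set MVN=build/mvn to use the bundled wrapper.
MVN=${MVN:-mvn}

# run_java8_tests is a hypothetical helper, not part of the Spark build.
# Argument: the JDK 8 installation directory.
run_java8_tests() {
  jdk8_home=$1
  # The assignment prefix applies JAVA_HOME to this one command only.
  JAVA_HOME="$jdk8_home" "$MVN" install -DskipTests -Pjava8-tests
}

# Usage (placeholder path):
# run_java8_tests /path/to/jdk8
```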
Building for PySpark on YARN
PySpark on YARN is only supported if the jar is built with Maven. Further, there is a known problem with building this assembly jar on Red Hat based operating systems (see SPARK-1753). If you wish to run PySpark on a YARN cluster with Red Hat installed, we recommend that you build the jar elsewhere, then ship it over to the cluster. We are investigating the exact cause for this.
Packaging without Hadoop Dependencies for YARN
The assembly jar produced by mvn package will, by default, include all of Spark’s dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The hadoop-provided profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
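A sketch of what such a build could look like, on the assumption that hadoop-provided is enabled with -P like the other profiles. The Hadoop 2.4 settings are copied from the earlier examples and are only illustrative. build_without_hadoop and MVN are hypothetical names.

```shell
# Maven command to use; set MVN=build/mvn to use the bundled wrapper.
MVN=${MVN:-mvn}

# build_without_hadoop is a hypothetical helper, not part of the Spark build.
# It packages Spark for YARN while leaving Hadoop-ecosystem jars out of the
# assembly, so the cluster's own copies are used at runtime.
build_without_hadoop() {
  "$MVN" -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phadoop-provided \
    -DskipTests clean package
}

# Usage: build_without_hadoop
```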
Building with SBT
Maven is the official build tool recommended for packaging Spark, and is the build of reference. But SBT is supported for day-to-day development since it can provide much faster iterative compilation. More advanced developers may wish to use SBT.
The SBT build is derived from the Maven POM files, and so the same Maven profiles and variables can be set to control the SBT build. For example:
build/sbt -Pyarn -Phadoop-2.3 assembly
Testing with SBT
Some of the tests require Spark to be packaged first, so always run build/sbt assembly the first time. The following is an example of a correct (build, test) sequence:
build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver assembly
build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
To run only a specific test suite:
build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver "test-only org.apache.spark.repl.ReplSuite"
To run the test suites of a specific subproject:
build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver core/test
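When iterating on one suite, the profile flags can be kept in one place. A minimal sketch, where sbt_test_only, SBT, and PROFILES are illustrative names, not part of the Spark build:

```shell
# SBT launcher to use; defaults to the wrapper shipped with Spark.
SBT=${SBT:-build/sbt}
PROFILES="-Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver"

# sbt_test_only is a hypothetical helper, not part of the Spark build.
# Argument: the fully qualified name of the suite to run.
sbt_test_only() {
  # "test-only <suite>" must reach sbt as a single argument, hence the quotes.
  # $PROFILES is intentionally unquoted so it splits into separate flags.
  "$SBT" $PROFILES "test-only $1"
}

# Usage: sbt_test_only org.apache.spark.repl.ReplSuite
```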
Speeding up Compilation with Zinc
Zinc is a long-running server version of SBT’s incremental compiler. When run locally as a background process, it speeds up builds of Scala-based projects like Spark. Developers who regularly recompile Spark with Maven will be the most interested in Zinc. The project site gives instructions for building and running zinc; OS X users can install it using brew install zinc.
If you use build/mvn, zinc will automatically be downloaded and leveraged for all builds. The zinc process will auto-start the first time build/mvn is called and bind to port 3030 unless the ZINC_PORT environment variable is set. The zinc process can subsequently be shut down at any time by running build/zinc-<version>/bin/zinc -shutdown and will automatically restart whenever build/mvn is called.
My build steps (building Spark 1.5.0 from source)
I chose to build Spark with make-distribution.sh. I modified the make-distribution.sh script, commenting out the lines shown below and setting the version information by hand:
#VERSION=$("$MVN" help:evaluate -Dexpression=project.version 2>/dev/null | grep -v "INFO" | tail -n 1)
#SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null
# | grep -v "INFO"
# | tail -n 1)
#SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null
# | grep -v "INFO"
# | fgrep --count "<id>hive</id>";
# Reset exit status to 0, otherwise the script stops here if the last grep finds nothing
# because we use "set -o pipefail"
# echo -n)
VERSION=1.3.0
SCALA_VERSION=2.10
SPARK_HADOOP_VERSION=2.5.0-cdh5.3.6
SPARK_HIVE=1
./make-distribution.sh [--name] [--tgz] [--mvn <mvn-command>] [--with-tachyon] <maven build options>
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive-0.13.1 -Phive-thriftserver