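An earlier post on this blog, "Spark Quick Start Guide (Quick Start Spark)", briefly showed how to use the Spark API interactively through the Spark shell. This post shows how to use the same API to develop standalone applications. Spark supports development in three languages: Scala (built with SBT), Java (built with Maven), and Python. Below I implement the same program in Scala, Java, and Python.
I. Scala version
The program is as follows: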
    /**
     * 过往记忆 blog: a technical blog focused on Hadoop, Hive, Spark, Shark, and Flume
     * WeChat public account: iteblog_hadoop
     */
    package scala
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    object Test {
      def main(args: Array[String]) {
        // Placeholder path (the original value is missing from this listing); point it at README.md
        val logFile = "README.md"
        val conf = new SparkConf().setAppName("Spark Application in Scala")
        val sc = new SparkContext(conf)
        // Cache the file because it is scanned twice below
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }
To compile this file, create an xxx.sbt build file, which plays a role similar to Maven's pom.xml. Here we create a file named scala.sbt with the following content:
    name := "Spark application in Scala"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
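A note on the `%%` operator in the dependency line: sbt appends the Scala binary version to the artifact name, which is why the Maven build later in this post names `spark-core_2.10` explicitly. The Python sketch below mimics that naming rule for illustration only. `cross_versioned_name` is a hypothetical helper, not part of sbt, and it assumes a Scala 2.10 or later release version whose binary version is its first two components.

```python
def cross_versioned_name(artifact, scala_version):
    """Return the artifact name that sbt's %% resolves to (assumes Scala 2.10+ releases)."""
    # The binary version is the first two components, e.g. "2.10.4" -> "2.10".
    major, minor = scala_version.split(".")[:2]
    return "%s_%s.%s" % (artifact, major, minor)


print(cross_versioned_name("spark-core", "2.10.4"))  # spark-core_2.10
```

With the settings from scala.sbt, the resolved name matches the artifactId used in the Java project's pom.xml.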
Compile:
    [success] Total time: 270 s, completed Jun 11, 2014 1:05:54 AM
II. Java version
    import org.apache.spark.api.java.*;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function;

    public class SimpleApp {
        public static void main(String[] args) {
            // Placeholder path (the original value is missing from this listing); point it at README.md
            String logFile = "README.md";
            SparkConf conf = new SparkConf().setAppName("Spark Application in Java");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> logData = sc.textFile(logFile).cache();

            // Count the lines that contain "a"
            long numAs = logData.filter(new Function<String, Boolean>() {
                public Boolean call(String s) { return s.contains("a"); }
            }).count();

            // Count the lines that contain "b"
            long numBs = logData.filter(new Function<String, Boolean>() {
                public Boolean call(String s) { return s.contains("b"); }
            }).count();

            System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        }
    }
This program counts the lines of README.md that contain the letter a and the lines that contain the letter b. The project's pom.xml is as follows:
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>spark</groupId>
        <artifactId>spark</artifactId>
        <version>1.0</version>
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.10</artifactId>
                <version>1.0.0</version>
            </dependency>
        </dependencies>
    </project>
Build the project with Maven:
    [INFO] ------------------------------------------------------------------------
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 5.815s
    [INFO] Finished at: Wed Jun 11 00:01:57 CST 2014
    [INFO] Final Memory: 13M/32M
    [INFO] ------------------------------------------------------------------------
III. Python version
    from pyspark import SparkContext

    # Placeholder path (the original value is missing from this listing); point it at README.md
    logFile = "README.md"
    sc = SparkContext("local", "Spark Application in Python")
    logData = sc.textFile(logFile).cache()

    numAs = logData.filter(lambda s: 'a' in s).count()
    numBs = logData.filter(lambda s: 'b' in s).count()

    print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
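All three programs compute the same two numbers: how many lines contain "a" and how many contain "b". As a rough sanity check, the same logic can be sketched in plain Python without Spark. The helper `count_lines_containing` and the sample lines are made up for illustration; they are not part of the Spark API or of README.md.

```python
def count_lines_containing(lines, needle):
    """Count the lines that contain `needle`, mirroring filter(...).count() on an RDD."""
    return sum(1 for line in lines if needle in line)


# Made-up sample standing in for the contents of README.md.
sample = ["Apache Spark", "a fast engine", "big data", "runs on clusters"]

num_as = count_lines_containing(sample, "a")
num_bs = count_lines_containing(sample, "b")
print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))
# prints: Lines with a: 3, lines with b: 1
```

Note that a line containing both letters ("big data" here) is counted once in each total, so the two counts can add up to more than the number of lines.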
IV. Test run
The programs were tested on Spark 1.0.0 in single-machine mode. The results are as follows:
1. Testing the Scala program
    # bin/spark-submit --class "scala.Test" \
        target/scala-2.10/simple-project_2.10-1.0.jar
    14/06/11 01:07:53 INFO spark.SparkContext: Job finished: count at Test.scala:18, took 0.019705 s
    Lines with a: 62, Lines with b: 35
2. Testing the Java program
    # bin/spark-submit --class "SimpleApp" \
        target/spark-1.0-SNAPSHOT.jar
    14/06/11 00:49:14 INFO spark.SparkContext: Job finished: count at SimpleApp.java:22, took 0.019374 s
    Lines with a: 62, lines with b: 35
3. Testing the Python program
    # bin/spark-submit --master local[4]
    Lines with a: 62, lines with b: 35
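The three runs report the same counts, which is the expected outcome since they read the same file. One possible way to check this automatically is to parse the final result line from each program's output. The sketch below assumes the result line has the shape shown above; `parse_counts` is a hypothetical helper, and the strings are copied from the test output rather than captured from live runs.

```python
import re

# Matches "Lines with a: 62, Lines with b: 35" regardless of letter case or spacing after the comma.
RESULT = re.compile(r"lines with a:\s*(\d+),\s*lines with b:\s*(\d+)", re.IGNORECASE)


def parse_counts(output):
    """Extract (numAs, numBs) from a program's output, or raise if no result line is found."""
    match = RESULT.search(output)
    if match is None:
        raise ValueError("no result line found")
    return int(match.group(1)), int(match.group(2))


# Result lines as reported in the test runs of this post.
outputs = {
    "scala": "Lines with a: 62, Lines with b: 35",
    "java": "Lines with a: 62, lines with b: 35",
    "python": "Lines with a: 62, lines with b: 35",
}
counts = {name: parse_counts(text) for name, text in outputs.items()}
print(len(set(counts.values())) == 1)  # True when all three implementations agree
```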
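Original article: "Spark Standalone模式应用程序开发" (Developing Standalone Spark Applications): http://www.iteblog.com/archives/1041, from the blog 过往记忆, which publishes original technical posts on Hadoop, Spark, and related topics. Unless otherwise stated, all articles on this blog are original.
Please respect original work. When reposting, credit the source: 过往记忆 (http://www.iteblog.com/)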
E-mail:wyphao.2007@163.com