一、环境搭建
1. 编译spark 1.3.0
1)安装apache-maven-3.0.5
2)下载并解压 spark-1.3.0.tgz
3)修改make-distribution.sh
VERSION=1.3.0 SCALA_VERSION=2.10 SPARK_HADOOP_VERSION=2.5.0-cdh5.3.6 SPARK_HIVE=1 #VERSION=$("$MVN" help:evaluate -Dexpression=project.version 2>/dev/null | grep -v "INFO" | tail -n 1) #SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null # | grep -v "INFO" # | tail -n 1) #SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null # | grep -v "INFO" # | fgrep --count "<id>hive</id>"; # # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing # # because we use "set -o pipefail" # echo -n)
4)替换maven仓库jar包
5)打包编译
(1)MAVEN编译
build/mvn clean package -DskipTests -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive-0.13.1 -Phive-thriftserver
(2)使用CDH5.3.6版本的hadoop
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive-0.13.1 -Phive-thriftserver
(3)使用Apache版本的hadoop
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0 -Pyarn -Phive-0.13.1 -Phive-thriftserver
二、测试程序
1. 准备
bin/spark-shell
val textFile = sc.textFile("README.md")
textFile.count()
textFile.count 方法没有参数时,括号可以省略
textFile.first
textFile.take(10)
可以将函数A作为参数传递给函数B,此时这个函数B叫做高阶函数
textFile.filter((line: String) =>line.contains("Spark"))
textFile.filter(line =>line.contains("Spark"))
textFile.filter(_.contains("Spark"))
scala中_标示任意元素
匿名函数
(line: String) =>line.contains("Spark")
def func01(line : String){
line.contains("Spark")
}
def func01(line: String) => line.contains("Spark")
sc.parallelize(Array(1,2,3,4,5))
三、Scala集合操作
Method on Seq[T]
map(f: T => U): Seq[U] flatMap(f: T=> Req[U]): Seq[U] filter(f: T => Boolean): Seq[T] exists(f: T => Boolean): Boolean forall(f: T => Boolean): Boolean reduce(f: (T, T) => T): T groupBy(f: T => K): Map[K, List[T]] sortBy(f: T => K): Seq[T]
(line: String) =>line.contains("Spark")
T: (line: String)
Boolean: line.contains("Spark")
三、 wordcount
val linesRdd = sc.textFile("hdfs://beifeng-hadoop-02:9000/user/beifeng/mapreduce/input01/wc_input")
val wordsRdd = linesRdd.map(line => line.split(" "))
val wordsRdd = linesRdd.flatMap(line => line.split(" "))
val keyvalRdd = wordsRdd.map(word => (word, 1))
val countRdd = keyvalRdd.reduceByKey((a, b) =>(a + b))
countRdd.collect()
countRdd.cache
变成一行
sc.textFile("hdfs://beifeng-hadoop-02:9000/user/beifeng/mapreduce/input01/wc_input").flatMap(line => line.split(" ")).map( (_, 1)).reduceByKey(_ + _).collect