《OD大数据实战》Spark Getting-Started Examples


    一、Environment Setup

    1. Compiling Spark 1.3.0

    1) Install apache-maven-3.0.5

    2) Download and extract spark-1.3.0.tgz

    3) Edit make-distribution.sh: hard-code the version variables and comment out the original Maven evaluations (each commented-out lookup below spawns a slow Maven run):

    VERSION=1.3.0
    SCALA_VERSION=2.10
    SPARK_HADOOP_VERSION=2.5.0-cdh5.3.6
    SPARK_HIVE=1
    #VERSION=$("$MVN" help:evaluate -Dexpression=project.version 2>/dev/null | grep -v "INFO" | tail -n 1)
    #SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null
    #    | grep -v "INFO"
    #    | tail -n 1)
    #SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null
    #    | grep -v "INFO"
    #    | fgrep --count "<id>hive</id>";
    #    # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing
    #    # because we use "set -o pipefail"
    #    echo -n)

    4) Replace the jar packages in the local Maven repository

    5) Build and package

    (1) Build with Maven

    build/mvn clean package -DskipTests -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive-0.13.1 -Phive-thriftserver 

    (2) Package against the CDH 5.3.6 build of Hadoop

    ./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive-0.13.1 -Phive-thriftserver 

    (3) Package against the Apache build of Hadoop

    ./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0 -Pyarn -Phive-0.13.1 -Phive-thriftserver 

    二、Test Programs

    1. Preparation

    bin/spark-shell

    The shell starts with a preconfigured SparkContext available as sc:

    val textFile = sc.textFile("README.md")

    textFile.count()

    textFile.count: when a method takes no arguments, the parentheses may be omitted.

    textFile.first

    textFile.take(10)

    A function can be passed as an argument to another function; a function that takes a function as a parameter is called a higher-order function. The three equivalent filter calls below illustrate this; a small sketch of a user-defined higher-order function follows them.

    textFile.filter((line: String) => line.contains("Spark"))

    textFile.filter(line => line.contains("Spark"))

    textFile.filter(_.contains("Spark"))

    In Scala, _ is a placeholder that stands for the element being passed in.
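    A minimal sketch of a user-defined higher-order function (applyFilter and startsWithS are hypothetical names for this sketch, not part of the Spark API):

    // applyFilter is higher-order: it takes a predicate p as a parameter
    def applyFilter(lines: Seq[String], p: String => Boolean): Seq[String] =
      lines.filter(p)

    def startsWithS(line: String): Boolean = line.startsWith("S")

    applyFilter(Seq("Spark", "Hadoop", "Scala"), startsWithS)      // List(Spark, Scala)
    applyFilter(Seq("Spark", "Hadoop", "Scala"), _.contains("oo")) // List(Hadoop)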

    Anonymous functions

     (line: String) => line.contains("Spark")

    The same predicate as a named function. The procedure syntax def f(...) { ... } returns Unit and silently discards the result, so declare the Boolean return type with =:

    def func01(line: String): Boolean = line.contains("Spark")

    The => form is a function literal and is bound with val rather than def (def func01(line: String) => ... is not valid Scala):

    val func02 = (line: String) => line.contains("Spark")
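    Either form can then be passed to filter in the same shell session:

    textFile.filter(func01)
    textFile.filter(func02)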

    sc.parallelize(Array(1,2,3,4,5))
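    parallelize turns a local collection into an RDD, so the same operations apply to it; a quick sketch in the shell:

    val nums = sc.parallelize(Array(1, 2, 3, 4, 5))
    nums.map(_ * 2).collect()   // Array(2, 4, 6, 8, 10)
    nums.reduce(_ + _)          // 15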

    三、Scala Collection Operations

    Methods on Seq[T]

    map(f: T => U): Seq[U]
    
    flatMap(f: T => Seq[U]): Seq[U]
    
    filter(f: T => Boolean): Seq[T]
    
    exists(f: T => Boolean): Boolean
    
    forall(f: T => Boolean): Boolean
    
    reduce(f: (T, T) => T): T
    
    groupBy(f: T => K): Map[K, List[T]]
    
    sortBy(f: T => K): Seq[T]

    In the filter example (line: String) => line.contains("Spark"), the parameter line: String plays the role of T, and line.contains("Spark") produces the Boolean result.
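    A short sketch exercising several of the methods above on a plain Scala Seq (no Spark involved):

    val words = Seq("spark", "scala", "hadoop", "hive")
    words.map(_.length)              // List(5, 5, 6, 4)
    words.filter(_.startsWith("h"))  // List(hadoop, hive)
    words.exists(_.contains("oo"))   // true
    words.forall(_.length > 3)       // true
    words.reduce(_ + " " + _)        // "spark scala hadoop hive"
    words.groupBy(_.head)            // Map(h -> List(hadoop, hive), s -> List(spark, scala))
    words.sortBy(_.length)           // List(hive, spark, scala, hadoop)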

    四、WordCount

    val linesRdd = sc.textFile("hdfs://beifeng-hadoop-02:9000/user/beifeng/mapreduce/input01/wc_input")

    val wordsRdd = linesRdd.map(line => line.split(" "))

    map yields an RDD[Array[String]], one array per line; use flatMap instead to flatten the arrays into an RDD[String] of individual words:

    val wordsRdd = linesRdd.flatMap(line => line.split(" "))
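    The difference is visible in the shell (a sketch; the exact output depends on the input file):

    linesRdd.map(_.split(" ")).first      // Array[String]: the words of the first line
    linesRdd.flatMap(_.split(" ")).first  // String: the first word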

    val keyvalRdd = wordsRdd.map(word => (word, 1))

    val countRdd = keyvalRdd.reduceByKey((a, b) => (a + b))
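    reduceByKey merges the values of each key with the given function; a self-contained sketch:

    sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1))).reduceByKey(_ + _).collect()
    // Array((a,2), (b,1)) -- order may vary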

    countRdd.collect()

    countRdd.cache

    cache is lazy: it only marks the RDD for in-memory persistence, and the data is materialized by the next action.

    As a single line:

    sc.textFile("hdfs://beifeng-hadoop-02:9000/user/beifeng/mapreduce/input01/wc_input").flatMap(line => line.split(" ")).map( (_, 1)).reduceByKey(_ + _).collect

      

Original article: https://www.cnblogs.com/yeahwell/p/5822223.html