- Further coverage of Transformation operations
- Further coverage of Action operations
- Key concept: Function
- Key concept: Shuffle
Transformation
map
map(func)
converts each element of the source RDD into a single element of the result RDD by applying a function.
In other words, each element of the RDD is transformed into exactly one element of the result. The transformation can map, for example, a String to an Int:
val textFile = sc.textFile("data.txt")
val lineLengths = textFile.map(s => s.length)
It can also map an Int to a pair (Int, Int):
val a = sc.parallelize(1 to 9, 3)
def mapDoubleFunc(a: Int): (Int, Int) = {
  (a, a * 2)
}
val mapResult = a.map(mapDoubleFunc)
In the end, the output has N elements, the same as the input.
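Collecting the result of the example above makes the one-to-one mapping concrete (9 elements in, 9 pairs out):
mapResult.collect()
// Array((1,2), (2,4), (3,6), (4,8), (5,10), (6,12), (7,14), (8,16), (9,18))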
flatMap
flatMap(func)
works on a single element (as map) and produces multiple elements of the result (as mapPartitions).
flatMap transforms an RDD of length N into another RDD of length M.
For example: the input is a sequence of Strings, and the output may be any number of Strings, so the length can change.
// count each word's appearance in a file
val textFile = sc.textFile("data.txt")
val wordCounts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
wordCounts.collect()
https://data-flair.training/blogs/apache-spark-map-vs-flatmap/
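A minimal side-by-side sketch of the dimension difference, using a small hand-built RDD assumed for illustration:
val lines = sc.parallelize(Seq("hello world", "hi"))
lines.map(_.split(" ")).collect()     // Array(Array(hello, world), Array(hi)) -- still length 2
lines.flatMap(_.split(" ")).collect() // Array(hello, world, hi) -- length 3, flattened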
mapPartitions
mapPartitions(func)
converts each partition of the source RDD into multiple elements of the result (possibly none).
val a = sc.parallelize(1 to 9, 3)
def doubleFunc(iter: Iterator[Int]): Iterator[(Int, Int)] = {
  var res = List[(Int, Int)]()
  while (iter.hasNext) {
    val cur = iter.next()
    res ::= (cur, cur * 2) // prepend, so order within a partition ends up reversed
  }
  res.iterator
}
val result = a.mapPartitions(doubleFunc)
result.collect()
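For reference, a more idiomatic sketch of the same transformation that stays lazy and keeps element order (no intermediate List):
val result2 = a.mapPartitions(iter => iter.map(cur => (cur, cur * 2)))
result2.collect() // Array((1,2), (2,4), ..., (9,18))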
mapPartitions vs. map
- What they operate on: the former works on a whole partition at a time, the latter on each element; the former generally involves fewer function calls and is faster.
- What they take: the former needs a func of type Iterator<T> => Iterator<U>; the latter needs a func of type T => U.
- Result size: the former may produce any number of elements; the latter always produces exactly N, the same as its input.
https://blog.csdn.net/lsshlsw/article/details/48627737
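A common case where mapPartitions pays off: a per-partition setup cost is paid once instead of once per element. createConnection and lookup below are hypothetical placeholders, not real APIs:
val enriched = a.mapPartitions { iter =>
  val conn = createConnection()  // hypothetical: expensive setup, done once per partition
  iter.map(x => lookup(conn, x)) // hypothetical lookup, connection reused for every element
}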
Aggregation
- rdd1.union(rdd2): combines the two RDDs (duplicates are kept)
- rdd1.intersection(rdd2): keeps only elements present in both, like an inner join
- rdd1.subtract(rdd2): keeps elements that only the left RDD has (set difference), like a left anti join
- rdd1.cartesian(rdd2): produces every pair, like a double for loop
- rdd1.zip(rdd2): pairs elements one-to-one; the two RDDs must have the same number of partitions and the same number of elements per partition
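A quick sketch of all five on small hand-built RDDs (the output order of shuffled results may vary):
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4))
val rdd2 = sc.parallelize(Seq(3, 4, 5))
rdd1.union(rdd2).collect()        // Array(1, 2, 3, 4, 3, 4, 5) -- duplicates kept
rdd1.intersection(rdd2).collect() // Array(3, 4)
rdd1.subtract(rdd2).collect()     // Array(1, 2)
rdd1.cartesian(rdd2).collect()    // 12 pairs: (1,3), (1,4), (1,5), (2,3), ...
val left  = sc.parallelize(Seq(1, 2, 3), 2)
val right = sc.parallelize(Seq("a", "b", "c"), 2)
left.zip(right).collect()         // Array((1,a), (2,b), (3,c))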
Action
Understanding Function / Lambda
Spark code makes heavy use of functional programming; functions are generally passed in one of two forms:
- Anonymous function syntax (Lambda)
val lineLengths = lines.map(s => s.length)
This style has two advantages:
- It is concise.
- It keeps variables local rather than global, which reduces overhead and makes parallel execution less error-prone.
- Static methods in a global singleton object
object MyFunctions {
def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)
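One caution from the Spark programming guide: referencing a method or field of a class instance inside a closure forces the whole instance to be serialized and shipped to the executors. A minimal sketch of the recommended workaround:
import org.apache.spark.rdd.RDD

class MyClass {
  val field = "prefix-"
  def doStuff(rdd: RDD[String]): RDD[String] = {
    val localField = field       // copy the field to a local val first...
    rdd.map(x => localField + x) // ...so only the String is captured, not `this`
  }
}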
Understanding Shuffle
A Spark computation runs transformations (e.g. map) first, then actions (e.g. reduce). An action typically gathers each node's results for aggregation, such as a sum. Some operations, however, need to see every key on every node, such as sorting; this is where the shuffle comes in.
It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
Operations that trigger a shuffle fall into these categories:
- repartition operations: repartition and coalesce
- ByKey operations (except for counting): groupByKey and reduceByKey
- join operations: cogroup and join

Note that the shuffle is a very expensive operation, since it involves disk I/O, data serialization, and network I/O.
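A sketch of why the choice of ByKey operation matters: both trigger a shuffle, but reduceByKey combines values on the map side first, so far less data crosses the network:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)
pairs.groupByKey().mapValues(_.sum).collect() // ships every (key, value) pair across the network
pairs.reduceByKey(_ + _).collect()            // ships one partial sum per key per partition
// both yield Array((a,2), (b,1)) (order may vary)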