Spark Operator Summary
Operator Categories
-
Transformation
| Transformation | Description |
| --- | --- |
| map(func) | Returns a new RDD formed by passing each element of the source RDD through the function func |
| filter(func) | Returns a new RDD formed by selecting the elements on which func returns true |
| flatMap(func) | Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element) |
| mapPartitions(func) | Similar to map, but runs independently on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U] |
| mapPartitionsWithIndex(func) | Similar to mapPartitions, but func also receives an integer representing the partition index, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U] |
| union(otherDataset) | Returns a new RDD that is the union of the source RDD and the argument RDD |
| intersection(otherDataset) | Returns a new RDD that is the intersection of the source RDD and the argument RDD |
| groupByKey([numTasks]) | When called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs |
| reduceByKey(func, [numTasks]) | When called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function; like groupByKey, the number of reduce tasks is configurable through an optional second argument |
| sortByKey([ascending], [numTasks]) | When called on an RDD of (K, V) pairs where K implements Ordered, returns an RDD of (K, V) pairs sorted by key |
| sortBy(func, [ascending], [numTasks]) | Similar to sortByKey, but more flexible |
| join(otherDataset, [numTasks]) | When called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each matching key |
| cogroup(otherDataset, [numTasks]) | When called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (Iterable[V], Iterable[W])) tuples |
| coalesce(numPartitions) | Decreases the number of partitions in the RDD to the given value |
| repartition(numPartitions) | Reshuffles the data in the RDD to produce the given number of partitions |
| repartitionAndSortWithinPartitions(partitioner) | Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys |
Action
| Action | Description |
| --- | --- |
| reduce(func) | Aggregates the elements of the RDD with func: the first two elements are passed to func to produce a result, that result is combined with the next element, and so on until a single value remains |
| collect() | Returns all elements of the dataset to the driver program as an array |
| count() | Returns the number of elements in the RDD |
| first() | Returns the first element of the RDD (similar to take(1)) |
| take(n) | Returns an array with the first n elements of the dataset |
| takeOrdered(n, [ordering]) | Returns the first n elements of the RDD using either their natural order or a custom ordering |
| saveAsTextFile(path) | Writes the elements of the dataset as a text file to HDFS or another supported file system; Spark calls toString on each element to convert it to a line of text |
| saveAsSequenceFile(path) | Writes the elements of the dataset as a Hadoop SequenceFile to the given path on HDFS or another Hadoop-supported file system |
| saveAsObjectFile(path) | Writes the elements of the dataset to the given path using Java serialization |
| countByKey() | For an RDD of (K, V) pairs, returns a (K, Long) map with the count of elements for each key |
| foreach(func) | Runs the function func on each element of the dataset |
| foreachPartition(func) | Runs the function func on each partition of the dataset |
Manipulate (control)
| Operator | Description |
| --- | --- |
| cache | Caches an RDD that is reused repeatedly; the data is kept in memory only (equivalent to persist with MEMORY_ONLY), which improves performance |
| persist | Persists the RDD at the given StorageLevel to improve performance |
| checkpoint | Provides fault tolerance: if a machine fails during computation, the data can be recovered from the checkpoint directory. checkpoint persists the RDD to disk and also truncates the RDD's lineage (its dependency chain) |
Transformation
-
map(func)
-
源码
/** * Return a new RDD by applying a function to all elements of this RDD. * 通过对一个RDD的所有元素应用一个函数来返回一个新的RDD */ def map[U: ClassTag](f: T => U): RDD[U] = withScope { // 清理函数闭包以确保其是可序列化的, 并发送至tasks val cleanF = sc.clean(f) new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF)) }
-
案例
```scala
package com.ronnie.scala.WordCount

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("WC")
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile("./resources/words.txt")

    val word: RDD[String] = lines.flatMap(lines => {
      lines.split(" ")
    })
    val pairs: RDD[(String, Int)] = word.map(x => (x, 1))
    val result = pairs.reduceByKey((a, b) => {a + b})
    result.sortBy(_._2, false).foreach(println)

    // 简化写法
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).foreach(println)
  }
}
```
-
-
filter(func)
-
源码
/** * Return a new RDD containing only the elements that satisfy a predicate. * 返回一个仅含有满足断言条件的元素的新RDD */ def filter(f: T => Boolean): RDD[T] = withScope { // 清理函数闭包以确保其是可序列化的, 并发送至tasks val cleanF = sc.clean(f) new MapPartitionsRDD[T, T]( this, // => Iterator.filter (context, pid, iter) => iter.filter(cleanF), preservesPartitioning = true) }
/** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`. 返回当前迭代器中所有满足断言 `p` 的元素的一个迭代器 * The order of the elements is preserved. * * @param p the predicate used to test values. p: 测试值的断言条件 * @return an iterator which produces those values of this iterator which satisfy the predicate `p`. 返回 满足断言 `p` 的元素的一个迭代器 * @note Reuse: $consumesAndProducesIterator 重用消费生产迭代器 */ def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] { // TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p // 创建一个 完全有效的实现方式来逆转 p 的感知 private var hd: A = _ private var hdDefined: Boolean = false // 反正是个标签, 为啥叫hdDefined我是没辙 def hasNext: Boolean = hdDefined || { do { // 没有下一个元素就返回 false if (!self.hasNext) return false // 迭代下一个元素 hd = self.next() // 对该元素进行断言, 只要p(hd) 不为 false, 就将 hdDefined 标签设置为 true, 并 返回true } while (!p(hd)) hdDefined = true true } // 每次迭代下一个元素前判断一下是否有下一个元素, 有就将hdDefined 标签重置为 false, 返回下一个元素, 没有就抛出 NoSuchElementException 异常 // (empty.next() => Iterator.empty ) def next() = if (hasNext) { hdDefined = false; hd } else empty.next() }
/** The iterator which produces no values. 没有值的迭代器*/ val empty: Iterator[Nothing] = new AbstractIterator[Nothing] { def hasNext: Boolean = false def next(): Nothing = throw new NoSuchElementException("next on empty iterator") }
-
案例
```scala
package com.ronnie.scala

import java.io.StringReader

import au.com.bytecode.opencsv.CSVReader
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object DataExtraction {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("filter")
    val sc = new SparkContext(conf)

    val input: RDD[String] = sc.textFile("./resources/gpu.csv")
    val result: RDD[Array[String]] = input.map { line =>
      val reader = new CSVReader(new StringReader(line))
      reader.readNext()
    }
    // keep only the tokens that equal "RX580"
    result.foreach(row => {
      row.flatMap(_.split(" "))
        .filter(_.equals("RX580"))
        .foreach(println)
    })
  }
}
```
-
-
flatMap(func)
-
源码
/** * Return a new RDD by first applying a function to all elements of this * RDD, and then flattening the results. * 先对该 RDD 中的每一个元素应用一个函数, 并对结果进行扁平化处理, 再返回一个新的 RDD */ def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope { val cleanF = sc.clean(f) // => Iterator.flatMap new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF)) }
/** Creates a new iterator by applying a function to all values produced by this iterator and concatenating the results. 对该迭代器所提供的值 应用一个函数 并合并这些结果 来创建一个迭代器, * * @param f the function to apply on each element. f: 应用函数 * @return the iterator resulting from applying the given iterator-valued function `f` to each value produced by this iterator and concatenating the results. * 返回 对该迭代器所提供的值 应用一个函数 并合并这些结果 所产生的迭代器 * @note Reuse: $consumesAndProducesIterator 重用消费生产迭代器 */ def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = new AbstractIterator[B] { // 当前迭代器 private var cur: Iterator[B] = empty // 将当前迭代器 替换为下一个迭代器 private def nextCur() { cur = f(self.next()).toIterator } def hasNext: Boolean = { // Equivalent to cur.hasNext || self.hasNext && { nextCur(); hasNext } // 等价于 (当前迭代器有下一个值) || (迭代器本身有下一个值 && 下一个迭代器有下一个值) // but slightly shorter bytecode (better JVM inlining!) // 但是下面这种写法有 编译成的字节码会更短, 利于JVM内联 while (!cur.hasNext) { if (!self.hasNext) return false nextCur() } true } // 如果有下一个就返回当前迭代器, 否则就返回空迭代器, 并继续迭代 def next(): B = (if (hasNext) cur else empty).next() }
-
案例
- See the map example (WordCount) above; a standalone sketch is also shown below.
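A minimal standalone flatMap sketch (local mode, the in-memory sample data and the object name are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Operator_flatMap {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("flatMap")
    val sc = new SparkContext(conf)
    val lines = sc.parallelize(List("hello spark", "hello scala"))

    // map produces one output element per line: an RDD[Array[String]]
    lines.map(_.split(" ")).foreach(arr => println(arr.mkString("[", ",", "]")))

    // flatMap flattens the per-line arrays into a single RDD[String]
    lines.flatMap(_.split(" ")).foreach(println)

    sc.stop()
  }
}
```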
-
-
mapPartitions(func)
-
源码
/** * Return a new RDD by applying a function to each partition of this RDD. * 通过对当前 RDD 的每个分区应用一个函数来创建一个新的 RDD * `preservesPartitioning` indicates whether the input function preserves the partitioner, which should be `false` unless this is a pair RDD and the input function doesn't modify the keys. * `preservesPartitioning(保存分区)` 标签 表明了 输入函数是否保存分区器, 该标签默认为 false, 除非这是个 成对的RDD 并且 传入函数并没有修改 key */ def mapPartitions[U: ClassTag]( f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = withScope { val cleanedF = sc.clean(f) new MapPartitionsRDD( this, (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter), preservesPartitioning) }
/** * [performance] Spark's internal mapPartitions method that skips closure cleaning. * Spark 内部的 mapPartitions 方法, 跳过了 清理闭包的步骤 */ private[spark] def mapPartitionsInternal[U: ClassTag]( f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = withScope { new MapPartitionsRDD( this, (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter), preservesPartitioning) }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object Operator_mapPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("mapPartitions")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)

    val rdd01 = rdd.mapPartitions(iter => {
      val list = new ListBuffer[(Int, Int)]()
      while (iter.hasNext) {
        val next = iter.next()
        list += Tuple2(next, next * 2)
      }
      list.iterator
    }, false)

    rdd01.foreach(x => print(x + " "))
  }
}
```
-
-
mapPartitionsWithIndex(func)
-
源码
/** * Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. * 通过对每个 RDD 的分区应用一个函数来创建一个新的 RDD, 在此过程中追踪 初始 分区的 参数(下标) * `preservesPartitioning` indicates whether the input function preserves the partitioner, which should be `false` unless this is a pair RDD and the input function doesn't modify the keys. * `preservesPartitioning(保存分区)` 标签 表明了 输入函数是否保存分区器, 该标签默认为 false, 除非这是个 成对的RDD 并且 传入函数并没有修改 key */ def mapPartitionsWithIndex[U: ClassTag]( f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = withScope { val cleanedF = sc.clean(f) new MapPartitionsRDD( this, (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter), preservesPartitioning) }
/** * [performance] Spark's internal mapPartitionsWithIndex method that skips closure cleaning. 跳过了闭包清理的 Spark 的内部 mapPartitionsWithIndex 方法 * It is a performance API to be used carefully only if we are sure that the RDD elements are serializable and don't require closure cleaning. 这是一个性能 应当被小心使用的 API, 只有我们确认 该 RDD 中的元素是可序列化的 并且不需要 闭包清理 时才使用。 * * @param preservesPartitioning indicates whether the input function preserves the partitioner, which should be `false` unless this is a pair RDD and the input function doesn't modify the keys. *`preservesPartitioning(保存分区)` 标签 表明了 输入函数是否保存分区器, 该标签默认为 false, 除非这是个 成对的RDD 并且 传入函数并没有修改 key */ private[spark] def mapPartitionsWithIndexInternal[U: ClassTag]( f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = withScope { new MapPartitionsRDD( this, (context: TaskContext, index: Int, iter: Iterator[T]) => f(index, iter), preservesPartitioning) }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object Operator_mapPartitionsWithIndex {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("mapPartitionsWithIndex")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List("a", "b", "c"), 2)

    rdd.mapPartitionsWithIndex((index, iter) => {
      val list = new ListBuffer[String]()
      while (iter.hasNext) {
        val v = iter.next()
        list += v
        println("index = " + index + ", value = " + v)
      }
      list.iterator
    }, false).foreach(println)

    sc.stop()
  }
}
```
-
-
union(otherDataset)
-
源码
/** * Return the union of this RDD and another one. Any identical elements will appear multiple times (use `.distinct()` to eliminate them). * 返回当前 RDD 和 另一个 RDD 的并集, 所有标识的元素都会出现多次(可以使用 distinct() 算子去重) */ def union(other: RDD[T]): RDD[T] = withScope { sc.union(this, other) // => SparkContext.union }
/** * union 的简略形式: ++ * Return the union of this RDD and another one. Any identical elements will appear multiple times (use `.distinct()` to eliminate them). */ def ++(other: RDD[T]): RDD[T] = withScope { this.union(other) }
/** Build the union of a list of RDDs passed as variable-length arguments. */ def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T] = withScope { union(Seq(first) ++ rest) // ++ => TraversableLike(可遍历的Trait).++ }
def ++[B >: A, That](that: GenTraversableOnce[B])(implicit bf: CanBuildFrom[Repr, B, That]): That = { val b = bf(repr) // 判断类型是否一致 if (that.isInstanceOf[IndexedSeqLike[_, _]]) b.sizeHint(this, that.seq.size) b ++= thisCollection b ++= that.seq b.result }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_union {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("union")
    val sc = new SparkContext(conf)
    val nvidiaGpu: RDD[(String, Int)] = sc.parallelize(List(("rtx2080Ti", 1), ("rtx2080S", 2), ("rtx2070S", 3), ("rtx2060S", 4)))
    val amdGpu: RDD[(String, Int)] = sc.parallelize(List(("radeon7", 1), ("5800XT", 2), ("5700XT", 3), ("RX590", 4)))

    nvidiaGpu.union(amdGpu).foreach(println)
  }
}
```
-
-
intersection(otherDataset)
-
源码
/** * Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. * 返回 该 RDD 和 另一个 RDD的 交集。即便输入的 RDD 有重复的元素, 输出的RDD不会包含任何重复的元素 * * @note This method performs a shuffle internally.该方法内部执行了shuffle * * @param partitioner Partitioner to use for the resulting RDD 应用于结果 RDD 的 分区器 */ def intersection( other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope { // cogroup 根据相同key分组 this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner) // 过滤, 只留下key 和 value 都不为空的 .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty } .keys }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_intersection {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("intersection")
    val sc = new SparkContext(conf)
    val rdd01: RDD[String] = sc.parallelize(List("rtx2080Ti", "rtx2070S", "rx5700XT", "rx580"))
    val rdd02: RDD[String] = sc.parallelize(List("rx5700XT", "rx580", "radeon7"))

    // the intersection of the two RDDs, with duplicates removed
    rdd01.intersection(rdd02).foreach(println)
    sc.stop()
  }
}
```
-
-
groupByKey([numTasks]) (use sparingly; poor performance)
-
源码(位于 PairRDDFunctions 中)
/** * Group the values for each key in the RDD into a single sequence. Allows controlling the partitioning of the resulting key-value pair RDD by passing a Partitioner. * 在 RDD 中对每个 key 的 values 进行分组 并放入一个简单的序列。 * 通过 传递一个 分区器 来 控制 结果的键值对 RDD 的分区 * The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated. * 每个组中的 元素顺序是无法保证的,甚至 每次 结果的RDD 被评估是 都不一定相同 * * @note This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`or `PairRDDFunctions.reduceByKey` will provide much better performance. *该操作可能会非常昂贵。如果你是为了对每个key进行聚合操作而使用grouping, 使用 PairRDDFunctions 类中的 aggregateByKey(根据key聚合) 或者 reduceByKey(根据key规约) 会提供更好的性能。 * * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. * 就目前的应用而言, groupByKey 必须能容纳 所有在内存中的 key 的 键值对 * If a key has too many values, it can result in an `OutOfMemoryError`. * 如果一个 key 有太多value, 它可能会导致 OOM(内存用尽) */ def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope { // groupByKey shouldn't use map side combine because map side combine does not reduce the amount of data shuffled and requires all map side data be inserted into a hash table, leading to more objects in the old gen. // groupByKey 不应该使用 map式 的合并, 因为 map式的合并 不能降低 数据 shuffle 的数量 且 要求所有map式的数据都被插入到一个 哈希表中, 会导致 jvm 内存中 的老年区 会有更多的对象(可能会触发 full gc)。 // 创建合并器 val createCombiner = (v: V) => CompactBuffer(v) // 合并值 val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v // 合并合并器 val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2 // 缓存 => combineByKeyWithClassTag val bufs = combineByKeyWithClassTag[CompactBuffer[V]]( createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false) // 缓存必须是 k-v形式的 RDD, 且 v 需为可迭代的集合 bufs.asInstanceOf[RDD[(K, Iterable[V])]] }
/** * :: Experimental :: 实验性的 * Generic function to combine the elements for each key using a custom set of aggregation functions. * 对每个 key 应用 一个定制的 聚合函数 集合 来合并元素的 通用函数 * Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C * RDD[(K, V)] => RDD[(K, C)], C为合并后的类型 * * Users provide three functions: 使用者提供 以下三种函数 * * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list) * createCombiner(创建合并器), 将V类转化为C类(比如创建一个 单元素的列表) * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list) * mergeValue(合并值), 将 V 类合并入C(比如 将 元素添加到 列表尾) * - `mergeCombiners`, to combine two C's into a single one. * mergeCombiners(合并合并器), 将两个 C类 合并为一个 * * In addition, users can control the partitioning of the output RDD, and whether to perform map-side aggregation (if a mapper can produce multiple items with the same key). * 此外, 用户可以控制输出 RDD 的分区 以及 是否 执行 map式的聚合 (对于同一个 key ) * * @note V and C can be different -- for example, one might group an RDD of type * (Int, Int) into an RDD of type (Int, Seq[Int]). */ @Experimental def combineByKeyWithClassTag[C]( createCombiner: V => C,//创建combiner mergeValue: (C, V) => C,//map端聚合 mergeCombiners: (C, C) => C,//reduce端聚合 partitioner: Partitioner, // 分区器 mapSideCombine: Boolean = true, // 是否使用map式的合并 serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope { require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0 必须是 spark 0.9.0 及其以上的版本 if (keyClass.isArray) { if (mapSideCombine) { throw new SparkException("Cannot use map-side combining with array keys.") } // 如果分区器是哈希分区器 if (partitioner.isInstanceOf[HashPartitioner]) { throw new SparkException("HashPartitioner cannot partition array keys.") } } val aggregator = new Aggregator[K, V, C]( self.context.clean(createCombiner), self.context.clean(mergeValue), self.context.clean(mergeCombiners)) /** * 判断分区器是否相同,如果两个连着的shuffle类算子分区器都是相同的,那么不会产生shuffle *不相同会产生shuffleRDD */ if (self.partitioner == Some(partitioner)) { self.mapPartitions(iter => { val context = TaskContext.get() new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context)) }, preservesPartitioning = true) } else { /** * 下面设置的aggregator 就是以上三个combiner的函数 * 在ShuffledRDD中有个方法是 getDependencies 获取宽窄依赖的关系 */ new ShuffledRDD[K, V, C](self, partitioner) .setSerializer(serializer) .setAggregator(aggregator) .setMapSideCombine(mapSideCombine) } }
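To make the three combiner functions described above concrete, here is a minimal combineByKey sketch that computes a per-key average without groupByKey (the sample data and names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Operator_combineByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("combineByKey")
    val sc = new SparkContext(conf)
    val scores = sc.parallelize(List(("amd", 90), ("amd", 70), ("nvidia", 80), ("nvidia", 100)))

    val avg = scores.combineByKey(
      (v: Int) => (v, 1),                                                // createCombiner: V => C, start a (sum, count) pair
      (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),                   // mergeValue: fold one more V into the partition-local C
      (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2) // mergeCombiners: merge Cs across partitions
    ).mapValues { case (sum, count) => sum.toDouble / count }

    avg.foreach(println)
    sc.stop()
  }
}
```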
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_groupByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("groupByKey")
    val sc = new SparkContext(conf)
    val rdd: RDD[(String, Int, Int, Int)] = sc.parallelize(Array(("amd", 12, 21, 37), ("nvidia", 33, 22, 31), ("intel", 27, 32, 42), ("amd", 29, 34, 25), ("nvidia", 47, 37, 36), ("intel", 23, 43, 38)))

    val result: RDD[(Int, Int, Int)] = rdd.map(e => (e._1, (e._2, e._3, e._4))).groupByKey().flatMap(line => {
      line._2.toArray.sortBy(_._3)(Ordering[Int].reverse).take(2)
    })
    result.foreach(println)
  }
}
```
-
-
reduceByKey(func, [numTasks])
-
源码
/** * Merge the values for each key using an associative and commutative reduce function. * 通过 关联 和 交换的 规约 函数 来合并每个 key的值 * * This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. * 这同样会 在发送结果到 reducer 之前 向 在每个 mapper 上 执行本地合并, 就像 MapReduce 中的 combiner */ def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope { combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner) }
/** * Merge the values for each key using an associative and commutative reduce function. This will * also perform the merging locally on each mapper before sending results to a reducer, similarly * to a "combiner" in MapReduce. * * Output will be hash-partitioned with numPartitions partitions. * 输出会根据分区数量 进行哈希分区 */ def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope { reduceByKey(new HashPartitioner(numPartitions), func) }
/** * Merge the values for each key using an associative and commutative reduce function. This will * also perform the merging locally on each mapper before sending results to a reducer, similarly * to a "combiner" in MapReduce. * Output will be hash-partitioned with the existing partitioner/parallelism level. * 输出会根据存在的分区数量/并行级别 进行哈希分区 */ def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope { reduceByKey(defaultPartitioner(self), func) }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.{SparkConf, SparkContext}

object Operator_reduceByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("reduceByKey")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("data/word.txt")

    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).foreach(println)
    sc.stop()
  }
}
```
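As a follow-up, a minimal sketch of the overload that takes an explicit partition count (the sample data is illustrative); the result is hash-partitioned into the requested number of partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Operator_reduceByKeyNumPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("reduceByKeyNumPartitions")
    val sc = new SparkContext(conf)
    val pairs = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1), ("c", 1)), 4)

    // reduceByKey(func, numPartitions): the result is hash-partitioned into 2 partitions
    val counts = pairs.reduceByKey(_ + _, 2)
    println("partitions: " + counts.getNumPartitions)
    counts.foreach(println)

    sc.stop()
  }
}
```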
-
-
sortByKey([ascending], [numTasks]) (defined in OrderedRDDFunctions)
-
源码
/** * Sort the RDD by key, so that each partition contains a sorted range of the elements. * 根据 key 对 RDD进行排序, 从而使每个分区都包含一个排序过的元素序列 * Calling `collect` or `save` on the resulting RDD will return or output an ordered list of records (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in order of the keys). * 对结果 RDD 请求 收集 或 保存,会返回 或者 输出 一个有序的记录 列表 (如果 是 save 请求, 它们会以多个 分区文件的形式被写入到文件系统, 并且按key排序) */ // TODO: this currently doesn't work on P other than Tuple2! // 目前只对 Tuple2起作用 def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length) : RDD[(K, V)] = self.withScope { val part = new RangePartitioner(numPartitions, self, ascending) new ShuffledRDD[K, V, V](self, part) .setKeyOrdering(if (ascending) ordering else ordering.reverse) }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.{SparkConf, SparkContext}

object Operator_sortByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("sortByKey")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("./resource/word.txt")

    // swap to (count, word), sort by key (the count) in descending order, then swap back
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      .map(f => (f._2, f._1)).sortByKey(false).map(f => (f._2, f._1)).foreach(println)

    sc.stop()
  }
}
```
-
-
sortBy(func,[ascending], [numTasks])
-
源码
/** * Return this RDD sorted by the given key function. * 根据 key 函数 对 RDD进行排序, 排序其实是隐式调用了 keyBy[K](f).sortByKey(true(升序, 分区数量)).values */ def sortBy[K]( f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length) (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope { this.keyBy[K](f) .sortByKey(ascending, numPartitions) .values }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.{SparkConf, SparkContext}

object Operator_sortBy {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("sortBy")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("./resources/word.txt")

    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).foreach(println)
    sc.stop()
  }
}
```
-
-
join(otherDataset, [numTasks]) (defined in PairRDDFunctions)
-
源码
/** * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. * 返回一个含有 所有 'this' RDD 和'other' RDD 中 key匹配的 元素对的 RDD * Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and (k, v2) is in `other`. * 每一对元素都会以 (k, (v1, v2)) 的元组形式被返回, (k,v1) 来自 this RDD, (k,v2) 来自 other RDD * Uses the given Partitioner to partition the output RDD. * 使用 已得 的分区器 去 对输出的RDD进行分区 */ def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope { this.cogroup(other, partitioner).flatMapValues( pair => for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w) ) }
/** * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to partition the output RDD. * 对 this 和 other 进行 左外联操作, 基于 this 中的所有k-v(w)元素, 如果元素w存在于other, 结果 RDD 为包含 k-v形式, v为(原来的 v, Option.Some(w)); 如不存在v就为(原来的 v, Option.None)形式(option 可以预防空指针) * 使用已得的分区器对 RDD进行分区 */ def leftOuterJoin[W]( other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope { this.cogroup(other, partitioner).flatMapValues { pair => if (pair._2.isEmpty) { pair._1.iterator.map(v => (v, None)) } else { // yield函数作用是: 记住每次迭代中的有关值,并逐一存入到一个数组中。 for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w)) } } }
/** * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the * pair (k, (None, w)) if no elements in `this` have key k. * 对 this 和 other 进行 右外联操作, 基于 other 中的所有k-v(w)元素, 如果元素w存在于this, 结果 RDD 为包含 k-v形式, v为(原来的 v, Option.Some(w)); 如不存在v就为(原来的 v, Option.None)形式 * Uses the given Partitioner to partition the output RDD. */ def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner) : RDD[(K, (Option[V], W))] = self.withScope { this.cogroup(other, partitioner).flatMapValues { pair => if (pair._1.isEmpty) { pair._2.iterator.map(w => (None, w)) } else { // yield函数作用是: 记住每次迭代中的有关值,并逐一存入到一个数组中。 for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w) } } }
/** * Perform a full outer join of `this` and `other`. For each element (k, v) in `this`, the * resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for w in `other`, or * the pair (k, (Some(v), None)) if no elements in `other` have key k. Similarly, for each * element (k, w) in `other`, the resulting RDD will either contain all pairs * (k, (Some(v), Some(w))) for v in `this`, or the pair (k, (None, Some(w))) if no elements * in `this` have key k. * 对 this 和 other 进行 完全外联操作, 结果RDD为key-value形式, 其中value为(Option.x, Option.y), 若 this 中有该值, 则x为Some(该值), 否则为None, 同理若 other中有该值, 则y为Some(该值), 否则为None * Uses the given Partitioner to partition the output RDD. * 使用已得的分区器 对 输出的RDD进行分区 */ def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner) : RDD[(K, (Option[V], Option[W]))] = self.withScope { this.cogroup(other, partitioner).flatMapValues { case (vs, Seq()) => vs.iterator.map(v => (Some(v), None)) case (Seq(), ws) => ws.iterator.map(w => (None, Some(w))) case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), Some(w)) } }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_join {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("join")
    val sc = new SparkContext(conf)
    val rdd01: RDD[(String, Int)] = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3)))
    val rdd02: RDD[(String, Int)] = sc.parallelize(Array(("a", 1), ("d", 2), ("e", 3)))

    val result: RDD[(String, (Int, Int))] = rdd01.join(rdd02)
    result.foreach(println)

    rdd01.rightOuterJoin(rdd02).foreach(println)
    rdd01.leftOuterJoin(rdd02).foreach(println)
    rdd01.fullOuterJoin(rdd02).foreach(println)
  }
}
```
-
-
cogroup(otherDataset, [numTasks]) (defined in PairRDDFunctions)
-
源码
/** * For each key k in `this` or `other1` or `other2` or `other3`, * return a resulting RDD that contains a tuple with the list of values * for that key in `this`, `other1`, `other2` and `other3`(这四个都是迭代器). * 对每个在 'this' 或 'other1' 或 'other2' 或 'other3' 中的 key, 返回一个含有它们对应的 value 的值 的元组 */ def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], partitioner: Partitioner) : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope { // 如果分区器是哈希分区器 且 key的类是数组, 就抛出以下异常 if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) { throw new SparkException("HashPartitioner cannot partition array keys.") } val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner) cg.mapValues { case Array(vs, w1s, w2s, w3s) => // 如果是 数组就判断 迭代器对应类型是否一致 (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W1]], w2s.asInstanceOf[Iterable[W2]], w3s.asInstanceOf[Iterable[W3]]) } } /** * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the list of values for that key in `this` as well as `other`. * 对于 this 或 other 迭代器中的每一个 key, 返回一个 包含了 key 所对应的的 value 的元组 的 结果RDD */ def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner) : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope { if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) { throw new SparkException("HashPartitioner cannot partition array keys.") } val cg = new CoGroupedRDD[K](Seq(self, other), partitioner) cg.mapValues { case Array(vs, w1s) => (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]]) } } /** * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a * tuple with the list of values for that key in `this`, `other1` and `other2`. * 对于 this 或 other 或 other2 迭代器中的每一个 key, 返回一个 包含了 key 所对应的的 value 的元组 的 结果RDD */ def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner) : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope { if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) { throw new SparkException("HashPartitioner cannot partition array keys.") } val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner) cg.mapValues { case Array(vs, w1s, w2s) => (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W1]], w2s.asInstanceOf[Iterable[W2]]) } } /** * For each key k in `this` or `other1` or `other2` or `other3`, * return a resulting RDD that contains a tuple with the list of values * for that key in `this`, `other1`, `other2` and `other3`. * 对于 this 或 other 或 other2 或 other3 迭代器中的每一个 key, 返回一个 包含了 key 所对应的的 value 的元组 的 结果RDD */ def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]) : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope { cogroup(other1, other2, other3, defaultPartitioner(self, other1, other2, other3)) } /** * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the * list of values for that key in `this` as well as `other`. * 对于 this 或 other 迭代器中的每一个 key, 返回一个 包含了 key 所对应的的 value 的元组 的 结果RDD */ def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope { cogroup(other, defaultPartitioner(self, other)) } /** * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a * tuple with the list of values for that key in `this`, `other1` and `other2`. 
* 对于 this 或 other 或 other2 迭代器中的每一个 key, 返回一个 包含了 key 所对应的的 value 的元组 的 结果RDD */ def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]) : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope { cogroup(other1, other2, defaultPartitioner(self, other1, other2)) } /** * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the * list of values for that key in `this` as well as `other`. * 对于 this 或 other 迭代器中的每一个 key, 返回一个 包含了 key 所对应的的 value 的元组 的 结果RDD */ def cogroup[W]( other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope { cogroup(other, new HashPartitioner(numPartitions)) } /** * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a * tuple with the list of values for that key in `this`, `other1` and `other2`. *对于 this 或 other 或 other2 迭代器中的每一个 key, 返回一个 包含了 key 所对应的的 value 的元组 的 结果RDD */ def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int) : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope { cogroup(other1, other2, new HashPartitioner(numPartitions)) } /** * For each key k in `this` or `other1` or `other2` or `other3`, * return a resulting RDD that contains a tuple with the list of values * for that key in `this`, `other1`, `other2` and `other3`. *对于 this 或 other 或 other2 或 other3 迭代器中的每一个 key, 返回一个 包含了 key 所对应的的 value 的元组 的 结果RDD */ def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)], numPartitions: Int) : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope { cogroup(other1, other2, other3, new HashPartitioner(numPartitions)) }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_cogroup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("cogroup")
    val sc = new SparkContext(conf)
    val a: RDD[Int] = sc.parallelize(List(1, 2, 3, 4))
    val b: RDD[String] = sc.parallelize(List("a", "b", "c", "d"))
    val c: RDD[(Int, String)] = a.zip(b)

    c.cogroup(c).collect().foreach(println)
  }
}
```
-
-
coalesce(numPartitions)
-
源码
/** * Return a new RDD that is reduced into `numPartitions` partitions. * 返回一个已经规约入如分区数量的分区的RDD * * This results in a narrow dependency, e.g. if you go from 1000 partitions * to 100 partitions, there will not be a shuffle, instead each of the 100 * new partitions will claim 10 of the current partitions. * 这些结果的父分区和子分区都是窄依赖的, 如果你把1000个分区规约成100个分区, 那就不会有shuffle产生, 反之, 100 个分区的每一个分区都会声明10个当前分区 * If a larger number of partitions is requested, it will stay at the current number of partitions. * 如果大量分区被请求, 它会维持当前的分区数 * * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, * this may result in your computation taking place on fewer nodes than * you like (e.g. one node in the case of numPartitions = 1). * 如果你要执行一个急剧的 合并, 比如分区数合并为1, 这可能会导致你的计算 发生在比你预想的更少的节点上。 * To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is). * 你可以通过传递 shuffle = true 参数来避免以上情况的发生。这样会加入一个shuffle的过程, 但是当前的上游分区会被并行执行(无论当前分区时哪个) * @note With shuffle = true, you can actually coalesce to a larger number * of partitions. * 使用 shuffle 时, 你可以将 将 RDD 合并后的分区数调大 * This is useful if you have a small number of partitions,say 100, potentially with a few partitions being abnormally large. * 当你只有一小部分分区的时候, 若有一部分分区异常的大(数据倾斜), 以上操作可能会有用。 * Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner. * 调用 合并时传入1000(分区数) 和 使用shuffle, 会产生 1000 个分区 以及 由 哈希分区器 分区的 数据 * The optional partition coalescer passed in must be serializable. * 传入的可选的 分区合并器 必须是可序列化的 */ def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null) : RDD[T] = withScope { // 分区数需要大于0 require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // 如果使用shuffle if (shuffle) { /** Distributes elements evenly across output partitions, starting from a random partition. * 从一个随机分区开始, 元素均匀地分布在输出分区上 */ val distributePartition = (index: Int, items: Iterator[T]) => { var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions) items.map { t => // Note that the hash code of the key will just be the key itself. // 该 key 的哈希码 只能是它自己 //The HashPartitioner will mod it with the number of total partitions. // 该哈希分区器会随着总分区的数改变而修改 position = position + 1 (position, t) } } : Iterator[(Int, T)] // include a shuffle step so that our upstream tasks are still distributed // 包含一个shuffle阶段从而使得我们的上游任务仍旧是分布式的 new CoalescedRDD( new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition), new HashPartitioner(numPartitions)), numPartitions, partitionCoalescer).values } else { new CoalescedRDD(this, numPartitions, partitionCoalescer) } }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object Operator_coalesce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("coalesce")
    val sc: SparkContext = new SparkContext(conf)
    val rdd01: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4, 5, 6), 4)

    val rdd02: RDD[String] = rdd01.mapPartitionsWithIndex((partitionIndex, iter) => {
      val list = new ListBuffer[String]()
      while (iter.hasNext) {
        list += "rdd01 partitionIndex: " + partitionIndex + ", value: " + iter.next()
      }
      list.iterator
    })
    rdd02.foreach(println)

    val rdd03: RDD[String] = rdd02.coalesce(5, false)
    println("rdd03 number: " + rdd03.getNumPartitions)

    val rdd04: RDD[String] = rdd03.mapPartitionsWithIndex((partitionIndex, iter) => {
      val list = new ListBuffer[String]()
      while (iter.hasNext) {
        list += "coalesce partitionIndex: " + partitionIndex + ", value: " + iter.next()
      }
      list.iterator
    })
    rdd04.foreach(println)

    sc.stop()
  }
}
```
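The example above keeps shuffle = false, so asking for more partitions than the current count has no effect (rdd03 stays at 4 partitions). A minimal sketch of the shuffle = true branch described in the source (the sample data is illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_coalesceShuffle {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("coalesceShuffle")
    val sc = new SparkContext(conf)
    val rdd: RDD[Int] = sc.parallelize(1 to 12, 4)

    // with shuffle = true a ShuffledRDD is inserted, so the partition count may increase
    val more = rdd.coalesce(6, shuffle = true)
    println("with shuffle: " + more.getNumPartitions)    // 6

    // without shuffle, requesting more partitions keeps the current count
    val same = rdd.coalesce(6, shuffle = false)
    println("without shuffle: " + same.getNumPartitions) // 4

    sc.stop()
  }
}
```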
-
-
repartition(numPartitions)
-
源码
/** * Return a new RDD that has exactly numPartitions partitions. * 返回由相同分区数分区的新的RDD * Can increase or decrease the level of parallelism in this RDD. * 可以升高或者降低该RDD的并行级别 * Internally, this uses a shuffle to redistribute data. * 内部使用了shuffle 来将数据重新分布 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle. * 如果你要降低该RDD的分区数, 考虑使用 coalesce, 可以避免使用 shuffle * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207. * 要修正 shuffle +重分区 可能导致的数据丢失, 请参考 SPARK-23207 */ def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope { coalesce(numPartitions, shuffle = true) }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object Operator_repartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("repartition")
    val sc = new SparkContext(conf)
    val rdd01: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7), 3)

    val rdd02: RDD[String] = rdd01.mapPartitionsWithIndex((partitionIndex, iter) => {
      val list = new ListBuffer[String]()
      while (iter.hasNext) {
        list += "rdd01partitionIndex : " + partitionIndex + ",value :" + iter.next()
      }
      list.iterator
    })
    rdd02.foreach(println)

    val rdd03 = rdd02.repartition(4)
    val result = rdd03.mapPartitionsWithIndex((partitionIndex, iter) => {
      val list = ListBuffer[String]()
      while (iter.hasNext) {
        list += ("repartitionIndex : " + partitionIndex + ",value :" + iter.next())
      }
      list.iterator
    })
    result.foreach(println)

    sc.stop()
  }
}
```
-
-
repartitionAndSortWithinPartitions(partitioner) (defined in OrderedRDDFunctions)
-
源码
/** * Repartition the RDD according to the given partitioner and, within each resulting partition,sort records by their keys. * 根据已得的 RDD分区器 对 RDD进行重新分区, 并且在每个结果分区中, 对每条记录根据他们的 key进行排序 * This is more efficient than calling `repartition` and then sorting within each partition because it can push the sorting down into the shuffle machinery. * 这是一个比调用 repartition 再 对每个分区进行排序更高效的 算子, 因为它可以将排序后的结果一次推入 shuffle 体系 */ def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope { new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering) }
-
案例
```scala
package com.ronnie.scala.core.transform_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object Operator_repartitionAndSortWithinPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("repartitionAndSortWithinPartitions")
    val sc = new SparkContext(conf)
    val rdd01: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8))

    rdd01.zipWithIndex().repartitionAndSortWithinPartitions(new HashPartitioner(1)).foreach(println)
  }
}
```
-
Action
-
reduce(func)
-
源码
/** * Reduces the elements of this RDD using the specified commutative and * associative binary operator. * 根据特定的交换和相关的二进制操作器 来规约 RDD 的元素 */ def reduce(f: (T, T) => T): T = withScope { val cleanF = sc.clean(f) val reducePartition: Iterator[T] => Option[T] = iter => { if (iter.hasNext) { Some(iter.reduceLeft(cleanF)) } else { None } } var jobResult: Option[T] = None val mergeResult = (index: Int, taskResult: Option[T]) => { // isDefined <=> !isEmpty if (taskResult.isDefined) { jobResult = jobResult match { case Some(value) => Some(f(value, taskResult.get)) case None => taskResult } } } sc.runJob(this, reducePartition, mergeResult) // Get the final result out of our Option, or throw an exception if the RDD was empty // 获取最终的Option结果, 如果 RDD为空则抛出异常 jobResult.getOrElse(throw new UnsupportedOperationException("empty collection")) }
/** * Reduces the elements of this RDD in a multi-level tree pattern. * 按照多层级的树状模型来规约RDD中的元素 * treeReduce可以对任何RDD使用,相当于是reduce操作的泛化 * * @param depth suggested depth of the tree (default: 2) * param: 建议的树的深度(默认为2) * @see [[org.apache.spark.rdd.RDD#reduce]] */ def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope { // 树的深度必须要 >= 1 require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.") val cleanF = context.clean(f) val reducePartition: Iterator[T] => Option[T] = iter => { if (iter.hasNext) { Some(iter.reduceLeft(cleanF)) } else { None } } val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it))) val op: (Option[T], Option[T]) => Option[T] = (c, x) => { // isDefined <=> !isEmpty if (c.isDefined && x.isDefined) { Some(cleanF(c.get, x.get)) } else if (c.isDefined) { c } else if (x.isDefined) { x } else { None } } partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth) .getOrElse(throw new UnsupportedOperationException("empty collection")) }
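treeReduce has no dedicated example in this note; a minimal sketch (the sample data is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Operator_treeReduce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("treeReduce")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 100, 8)

    // partial results are merged in a tree of the given depth instead of all at once on the driver,
    // which helps when there are many partitions
    println(rdd.treeReduce(_ + _, 2)) // 5050

    sc.stop()
  }
}
```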
-
案例
- The WordCount example uses reduceByKey rather than reduce; a minimal reduce sketch is shown below.
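A minimal reduce sketch (the sample data is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Operator_reduce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("reduce")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(1, 2, 3, 4, 5))

    // elements are combined pairwise until a single value remains: 1+2=3, 3+3=6, ...
    val sum = rdd.reduce(_ + _)
    println(sum) // 15

    sc.stop()
  }
}
```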
-
-
collect()
-
源码
/** * Return an array that contains all of the elements in this RDD. * 返回包含所有RDD中元素的一个数组 * @note This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory. * 该方法仅在期望的结构数组比较小时才适用, 因为所有的数据都会加载到 driver 的内存中 */ def collect(): Array[T] = withScope { val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray) Array.concat(results: _*) }
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_collect {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("collect")
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile("./resources/word.txt")

    // collect pulls every element back to the driver as an array; only suitable for small results
    val result: Array[String] = lines.collect()
    result.foreach(println)

    sc.stop()
  }
}
```
-
-
count()
-
源码
```scala
/**
 * Return the number of elements in the RDD.
 * 返回 RDD 中的元素个数
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
```
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_count {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("count")
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile("./resources/word.txt")

    val result: Long = lines.count()
    println(result)
    sc.stop()
  }
}
```
-
-
first()
-
源码
/** * Return the first element in this RDD. * 返回该RDD中的第一个元素 <=> take(1) */ def first(): T = withScope { take(1) match { case Array(t) => t case _ => throw new UnsupportedOperationException("empty collection") } }
-
案例
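A minimal first() sketch (the sample data is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Operator_first {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("first")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List(5, 1, 4, 2))

    // first() is equivalent to take(1) and throws UnsupportedOperationException on an empty RDD
    println(rdd.first()) // 5

    sc.stop()
  }
}
```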
-
-
take(n)
-
源码
/** * Take the first num elements of the RDD. * 获取第一个RDD中对应传入数字的第一个元素 * It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit. * 它先通过扫描一个分区, 并使用分区的结果来预估 所需的额外的分区来满足限制 * * @note This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory. * 该方法仅在期望的结构数组比较小时才适用, 因为所有的数据都会加载到 driver 的内存中 * @note Due to complications in the internal implementation, this method will raise an exception if called on an RDD of `Nothing` or `Null`. * 由于内部实现的复杂, 该方法在调用的RDD为 Nothing或空时会抛出异常 */ def take(num: Int): Array[T] = withScope { val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2) if (num == 0) { new Array[T](0) } else { val buf = new ArrayBuffer[T] val totalParts = this.partitions.length var partsScanned = 0 while (buf.size < num && partsScanned < totalParts) { // The number of partitions to try in this iteration. It is ok for this number to be greater than totalParts because we actually cap it at totalParts in runJob. // 进行此次迭代的分区数, 该参数可以比总的分区数大, 我们可以在runJob的总分区数中以cap(Consistency, Avaliability, Partition Tolerance) var numPartsToTry = 1L val left = num - buf.size if (partsScanned > 0) { // If we didn't find any rows after the previous iteration, quadruple and retry. // 如果在上次迭代之后 没有找到任何行, 就将迭代分区数变为它原来的四倍 并重试 // Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%. We also cap the estimation in the end. // 如果找到了任何行, J就在它内部插入我们想测试的分区数, 并且提升百分之50的预估值, 最终我们也以cap来评估它 // 如果缓存为空 if (buf.isEmpty) { // 就将尝试的分区数修改为 扫描的分区数 * 上升的因子 numPartsToTry = partsScanned * scaleUpFactor } else { // As left > 0, numPartsToTry is always >= 1 // 如果 num左侧的数量大于 0, 要尝试的分区数也会 >= 1 // ceil 向上取整 numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toInt numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor) } } val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt) val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p) // ++= 将一批元素添加至末尾 res.foreach(buf ++= _.take(num - buf.size)) partsScanned += p.size } buf.toArray } }
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_take {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("take")
    val sc = new SparkContext(conf)
    val rdd: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5, 6))

    val r: Array[Int] = rdd.take(3)
    r.foreach(println)
  }
}
```
-
-
takeOrdered(n, [ordering])
-
源码
/** * Returns the first k (smallest) elements from this RDD as defined by the specified implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]]. * 说白了就是获取有序的前几个最小的元素, 排序有隐式函数 Ordering[T] 维护, 效果与 top算子相反 * For example: * {{{ * sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1) * // returns Array(2) * * sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2) * // returns Array(2, 3) * }}} * * @note This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory. * 该方法仅在期望的结构数组比较小时才适用, 因为所有的数据都会加载到 driver 的内存中 * * @param num k, the number of elements to return * num k, 返回的元素数量 * @param ord the implicit ordering for T * ord, 隐式地对数组T进行排序 * @return an array of top elements */ def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope { if (num == 0) { Array.empty } else { val mapRDDs = mapPartitions { items => // Priority keeps the largest elements, so let's reverse the ordering. // 默认优先保留最大的元素, 所以我们需要逆转排序 val queue = new BoundedPriorityQueue[T](num)(ord.reverse) queue ++= collectionUtils.takeOrdered(items, num)(ord) Iterator.single(queue) } // 如果mapRDD的分区长度为0 if (mapRDDs.partitions.length == 0) { // 就清空数组 Array.empty } else { // 否则就把第二个队列按顺序合并到第一个对列末尾, 然后转成数组再排序 mapRDDs.reduce { (queue1, queue2) => queue1 ++= queue2 queue1 }.toArray.sorted(ord) } } }
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.{SparkConf, SparkContext}

object Operator_takeOrdered {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("take")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List(1, -21, 30, 45, 6, 98.1, 0.97))

    rdd.takeOrdered(9).foreach(println)
  }
}
```
-
-
saveAsTextFile(path)
-
源码
/** * Save this RDD as a text file, using string representations of elements. * 将该RDD中储存的元素保存为text文件, 使用字符串来代替元素 */ def saveAsTextFile(path: String): Unit = withScope { // https://issues.apache.org/jira/browse/SPARK-2075 // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it and will use the default `null`. // 空可写入(如key, null) 在hadoop 1.x 版本中是可比较的, 所以编译器并不能找到隐式的排序函数, 并且使用默认的 null // However, it's a `Comparable[NullWritable]`in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an Ordering for `NullWritable`. // 然而在hadoop2.x版本中, 它变成了Comparable[NullWritable], 所以编译器能隐式调用排序方法来对空可写入进行排序 // That's why the compiler will generate different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. // 这也是为什么编译器会对 Hadoop1.x 和 Hadoop2.x 版本的 saveAsTextFile 生成不同的匿名类 // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate same bytecodes for `saveAsTextFile`. // 从而, 我们就在此提供一个显示的空排序来确保 编译器能够对saveAsTextFile生成相同的字节码文件 val nullWritableClassTag = implicitly[ClassTag[NullWritable]] val textClassTag = implicitly[ClassTag[Text]] val r = this.mapPartitions { iter => val text = new Text() iter.map { x => text.set(x.toString) (NullWritable.get(), text) } } RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null) .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path) }
/** * Save this RDD as a compressed text file, using string representations of elements. * 将此 RDD 保存为一个压缩的 text 文件, 用 字符串来代替元素。(多个压缩方式参数avro, parquet ...) */ def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = withScope { // https://issues.apache.org/jira/browse/SPARK-2075 val nullWritableClassTag = implicitly[ClassTag[NullWritable]] val textClassTag = implicitly[ClassTag[Text]] val r = this.mapPartitions { iter => val text = new Text() iter.map { x => text.set(x.toString) (NullWritable.get(), text) } } RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null) .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec) }
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_saveAsTextFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("saveAsTextFile")
    val sc = new SparkContext(conf)
    val rdd: RDD[(String, Int)] = sc.makeRDD(List(("Life is just a journey", 1), ("Whatever make you confused, please follow your mind", 2), ("A life so changed", 3)))

    rdd.saveAsTextFile("./resource/testTextFile")
  }
}
```
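The second overload shown in the source takes a compression codec; a minimal sketch assuming Hadoop's GzipCodec is on the classpath and that the output directory does not exist yet:

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

object Operator_saveAsTextFileCompressed {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("saveAsTextFileCompressed")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(List("Life is just a journey", "A life so changed"))

    // each part-XXXXX file is written as gzip-compressed text
    rdd.saveAsTextFile("./resource/testTextFileGz", classOf[GzipCodec])

    sc.stop()
  }
}
```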
-
-
saveAsSequenceFile(path) (defined in SequenceFileRDDFunctions)
-
源码
/** * Output the RDD as a Hadoop SequenceFile using the Writable types we infer from the RDD's key and value types. * 通过使用写入类型来将RDD输出为一个hadoop序列文件, 我们通过 RDD 的 key 和 value类型来推断 可写入的类型 * If the key or value are Writable, then we use their classes directly; * 如果key或者value 是可写入的, 我们则可以直接使用他们的类 * otherwise we map primitive types such as Int and Double to IntWritable, DoubleWritable, etc, byte arrays to BytesWritable, and Strings to Text. * 否则, 我们就会将 key 和 value 去和原始类型进行匹配, 比如IntWritable, DoubleWritable (byte 数组 => ByteWritbale, Strings => IntWritable) * The `path` can be on any Hadoop-supported file system. * 路径可以在任意hadoop支持的文件系统上 */ def saveAsSequenceFile( path: String, codec: Option[Class[_ <: CompressionCodec]] = None): Unit = self.withScope { def anyToWritable[U <% Writable](u: U): Writable = u // TODO We cannot force the return type of `anyToWritable` be same as keyWritableClass and valueWritableClass at the compile time. // 我们不能在编译时强迫 anyToWritable 的返回类型 和 keyWritableClass // To implement that, we need to add type parameters to SequenceFileRDDFunctions. // 为了实现以上, 我们需要将参数类型添加到 SequenceFileRDDFunctions 类中 // however, SequenceFileRDDFunctions is a public class so it will be a breaking change. // 然而, SequenceFileRDDFunctions 是一个共有类, 所以以上操作会导致爆炸性的改变 val convertKey = self.keyClass != _keyWritableClass val convertValue = self.valueClass != _valueWritableClass logInfo("Saving as sequence file of type " + s"(${_keyWritableClass.getSimpleName},${_valueWritableClass.getSimpleName})" ) val format = classOf[SequenceFileOutputFormat[Writable, Writable]] val jobConf = new JobConf(self.context.hadoopConfiguration) // 如果不能转化key, 也不能转化value(类型不匹配导致) if (!convertKey && !convertValue) { self.saveAsHadoopFile(path, _keyWritableClass, _valueWritableClass, format, jobConf, codec) // 如果不能转化 key, 能转化 value } else if (!convertKey && convertValue) { self.map(x => (x._1, anyToWritable(x._2))).saveAsHadoopFile( path, _keyWritableClass, _valueWritableClass, format, jobConf, codec) // 如果能转化key, 不能转化value } else if (convertKey && !convertValue) { self.map(x => (anyToWritable(x._1), x._2)).saveAsHadoopFile( path, _keyWritableClass, _valueWritableClass, format, jobConf, codec) // 如果能转化key 且 能转化 value } else if (convertKey && convertValue) { self.map(x => (anyToWritable(x._1), anyToWritable(x._2))).saveAsHadoopFile( path, _keyWritableClass, _valueWritableClass, format, jobConf, codec) } }
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_saveAsSequenceFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("saveAsSequenceFile")
    val sc = new SparkContext(conf)
    val rdd: RDD[(String, Int)] = sc.makeRDD(List(("Life is just a journey", 1), ("Whatever make you confused, please follow your mind", 2), ("A life so changed", 3)))

    rdd.saveAsSequenceFile("./resource/testSequenceFile")
  }
}
```
-
-
saveAsObjectFile(path)
-
源码
/** * Save this RDD as a SequenceFile of serialized objects. * 保存RDD问一个序列化过的序列文件 */ def saveAsObjectFile(path: String): Unit = withScope { this.mapPartitions(iter => iter.grouped(10).map(_.toArray)) .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x)))) .saveAsSequenceFile(path) }
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Operation_saveAsObjectFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("saveAsObjectFile")
    val sc = new SparkContext(conf)
    val rdd: RDD[(String, Int)] = sc.makeRDD(List(("Life is just a journey", 1), ("Whatever make you confused, please follow your mind", 2), ("A life so changed", 3)))

    rdd.saveAsObjectFile("./resource/testObjectFile")
  }
}
```
-
-
countByKey()
-
源码
```scala
/**
 * Count the number of elements for each key, collecting the results to a local Map.
 * 统计每个 key 对应的元素个数, 并以本地 Map 的形式返回
 *
 * @note This method should only be used if the resulting map is expected to be small,
 * as the whole thing is loaded into the driver's memory.
 * 该方法仅在期望的结果 map 比较小时才适用, 因为所有的数据都会加载到 driver 的内存中
 * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _),
 * which returns an RDD instead of a map.
 * 处理超大的结果集时, 考虑使用 rdd.mapValues(_ => 1L).reduceByKey(_ + _) 来替代, 它返回的是一个 RDD 而不是 map
 */
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

// RDD.countByValue 先将每个元素映射为 (value, null), 再调用上面的 countByKey
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
  map(value => (value, null)).countByKey()
}
```
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operator_countByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("countByKey")
    val sc = new SparkContext(conf)
    val rdd: RDD[(String, Int)] = sc.makeRDD(List(("a", 100), ("b", 200), ("c", 300), ("d", 400)))
    val rdd01 = sc.parallelize(List(("a", 100), ("b", 200), ("a", 300), ("c", 400)))

    val result: collection.Map[String, Long] = rdd01.countByKey()
    result.foreach(println)
    sc.stop()
  }
}
```
-
-
foreach(func)
-
源码
/** * Applies a function f to all elements of this RDD. * 对RDD中的所有元素应用一个函数 */ def foreach(f: T => Unit): Unit = withScope { val cleanF = sc.clean(f) sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF)) }
-
案例
- See the WordCount example (the final foreach(println)).
-
-
foreachPartition
-
源码
/** * Applies a function f to each partition of this RDD. * 对RDD中的每一个分区应用一个函数 */ def foreachPartition(f: Iterator[T] => Unit): Unit = withScope { val cleanF = sc.clean(f) sc.runJob(this, (iter: Iterator[T]) => cleanF(iter)) }
-
案例
```scala
package com.ronnie.scala.core.action_operator

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Operation_foreachPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("foreachPartition")
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile("./resources/word.txt")

    lines.foreachPartition(x => {
      println("连接数据库......")
      while (x.hasNext) {
        println(x.next())
      }
      println("关闭数据库......")
    })
  }
}
```
-
Manipulate
-
cache
-
源码
/** * Persist this RDD with the default storage level (`MEMORY_ONLY`). * 等于只使用内存的persist, 其实就是对 下面的persist方法包装了一下 */ def cache(): this.type = persist()
/** * Persist this RDD with the default storage level (`MEMORY_ONLY`). */ def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
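A minimal cache sketch (the file path is illustrative): the second action reuses the in-memory data instead of re-reading the file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Operator_cache {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("cache")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("./resources/word.txt").cache() // MEMORY_ONLY

    // the first action triggers the read and populates the cache
    println(lines.count())
    // the second action is served from memory, without re-reading the file
    println(lines.count())

    sc.stop()
  }
}
```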
-
-
persist
- 源码
/**- Mark this RDD for persisting using the specified level. - 使用特殊的存储级别来标记该RDD的持久化 * - @param newLevel the target storage level 新的目标存储级别 - @param allowOverride whether to override any existing level with the new one - 是否允许覆盖已经存在的存储级别 */ private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = { // TODO: Handle changes of StorageLevel 处理存储界别的变化 // 如果 当前存储级别 不为NONE 且 新存储级别 不为旧存储级别 且 不允许覆盖, 就抛出不支持的操作异常 if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) { throw new UnsupportedOperationException( "Cannot change storage level of an RDD after it was already assigned a level") } // If this is the first time this RDD is marked for persisting, register it // with the SparkContext for cleanups and accounting. Do this only once. // 如果 这是这个RDD第一次被标记要持久化, 那么只执行这一次 if (storageLevel == StorageLevel.NONE) { sc.cleaner.foreach(_.registerRDDForCleanup(this)) sc.persistRDD(this) } storageLevel = newLevel this }
/** * Set this RDD's storage level to persist its values across operations after the first time it is computed. * 在该RDD第一次被计算后, 根据它的存储级别在操作间隙持久化它 * This can only be used to assign a new storage level if the RDD does not * have a storage level set yet. Local checkpointing is an exception. * 除本地检查点外这只能对新生的RDD进行操作 */ def persist(newLevel: StorageLevel): this.type = { // 如果是本地设置过检查点 if (isLocallyCheckpointed) { // This means the user previously called localCheckpoint(), which should have already marked this RDD for persisting. // 这意味着用户之前调用过 localCheckpoint方法, 所以已经标记改RDD为已持久化操作过 // Here we should override the old storage level with one that is explicitly requested by the user (after adapting it to use disk). // 在这里, 我们应该用用户显示请求的存储级别 来 覆盖旧的存储级别 persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true) } else { persist(newLevel, allowOverride = false) } }
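A minimal persist sketch with an explicit StorageLevel (MEMORY_AND_DISK and the file path are chosen here purely for illustration):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object Operator_persist {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("persist")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("./resources/word.txt")

    // partitions that do not fit in memory spill to disk instead of being recomputed
    lines.persist(StorageLevel.MEMORY_AND_DISK)
    println(lines.count())

    // release the storage when the data is no longer needed
    lines.unpersist()
    sc.stop()
  }
}
```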
-
checkPoint
-
源码
/** * Performs the checkpointing of this RDD by saving this. * 执行 checkpointing 方法 来保存 RDD * It is called after a job using this RDD has completed (therefore the RDD has been materialized and potentially stored in memory). * 当一个使用该RDD的任务完成后, 该方法会被调用(因为RDD可能会被实例化 并 可能被存储到内存中) * doCheckpoint() is called recursively on the parent RDDs. * doCheckpoint 方法会对父RDD递归执行 */ private[spark] def doCheckpoint(): Unit = { RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) { // 如果 doCheckpoint没有被调用过 if (!doCheckpointCalled) { // 就需要调用 doCheckpointCalled = true // 如果检查点数据已经被定义过了 if (checkpointData.isDefined) { // 如果 从此检查点开始的所有祖先都被标记为checkpointing(需要添加保存点) if (checkpointAllMarkedAncestors) { // TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint them in parallel. // 我们可以收集所有需要添加检查点的RDD, 并且并行的添加检查点 // Checkpoint parents first because our lineage will be truncated after we checkpoint ourselves // 优先对保存点的父保存点执行操作, 因为 RDD 的lineage(血缘) 会在添加保存点后被截断 dependencies.foreach(_.rdd.doCheckpoint()) } checkpointData.get.checkpoint() } else { dependencies.foreach(_.rdd.doCheckpoint()) } } } }
// Whether to checkpoint all ancestor RDDs that are marked for checkpointing. // 是否对所有添加了checkpointing标签的 RDD的祖先添加保存点 // By default, we stop as soon as we find the first such RDD, an optimization that allows us to write less data but is not safe for all workloads. // 默认情况下, 我们只要找到一个这样的RDD就会停止查询, 这是一个能使我们写入更少的数据的优化, 但是, 这可能对所有的工作量来说是不安全的。 // E.g. in streaming we may checkpoint both an RDD and its parent in every batch, in which case the parent may never be checkpointed and its lineage never truncated, leading to OOMs in the long run (SPARK-6847). // 比如在流中, 我们可能会在每一批中对一个RDD和它的父RDD都添加保存点, 这可能会导致 父 RDD 的 保存点操作 一直失败 但 它的 lineage(血缘) 缺一直没被切断, 这可能会在一次长运行中导致内存用尽 private val checkpointAllMarkedAncestors = Option(sc.getLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS)).exists(_.toBoolean)
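The code above is Spark's internal doCheckpoint; from user code, checkpointing is requested as in this minimal sketch (the checkpoint directory and file path are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Operator_checkpoint {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("checkpoint")
    val sc = new SparkContext(conf)
    // in production the checkpoint directory would normally be an HDFS path
    sc.setCheckpointDir("./checkpoint")

    val rdd = sc.textFile("./resources/word.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // cache first so the RDD is not recomputed by the separate checkpoint job
    rdd.cache()
    rdd.checkpoint()

    // the first action triggers the job and the checkpoint write; the lineage is then truncated
    rdd.count()
    println(rdd.isCheckpointed)

    sc.stop()
  }
}
```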
-
For details on the available storage levels, see StorageLevel.