1:glom
def glom(): RDD[Array[T]]
Collects the elements of each partition into an array, producing an RDD whose elements are arrays (one array per partition).
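A minimal sketch of glom in use, assuming a local Spark setup. The app name, the `local[2]` master and the choice of 3 partitions are illustrative assumptions, not part of the original notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("glom-sketch"))

// Request 3 partitions explicitly so the layout is predictable.
val rdd = sc.parallelize(1 to 6, 3)

// glom yields one array per partition.
val parts: Array[Array[Int]] = rdd.glom().collect()
parts.foreach(p => println(p.mkString("[", ", ", "]")))

// Per-partition work becomes a plain map, e.g. the largest element of each partition.
val maxPerPartition: Array[Int] = rdd.glom().map(_.max).collect()
```

The number of arrays returned by glom equals `rdd.getNumPartitions`, which makes it a quick way to inspect how data is distributed.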
2:getNumPartitions
final def getNumPartitions: Int
Returns the number of partitions of the RDD.
3:groupBy
def groupBy[K](f: (T) ⇒ K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]
Groups the elements by the key that the given function returns. For example:
scala> rdd1.collect
res61: Array[Int] = Array(1, 2, 3, 4, 5)
scala> rdd1.groupBy(x=>if(x%2==0) 0 else 1).collect
res62: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(4, 2)), (1,CompactBuffer(1, 3, 5)))
4:randomSplit
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
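Randomly splits an RDD into several RDDs according to the weights array and returns them as an array. The weights are normalized if they do not sum to 1, and the resulting sizes only approximate the requested proportions.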
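A small sketch of an 80/20 split, assuming a local Spark setup. The names `train` and `test`, the seed value and the data are hypothetical choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("randomSplit-sketch"))

val data = sc.parallelize(1 to 100)

// Fixing the seed makes the split reproducible across runs.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// The sizes are only roughly 80 and 20, but together they cover the input.
val trainCount = train.count()
val testCount = test.count()
println(s"train=$trainCount test=$testCount")
```

Every element ends up in exactly one of the returned RDDs, so the two counts always add up to the size of the input here.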
5:countByValue
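def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long]
Returns the number of occurrences of each distinct element as a local map, which makes word count easy to implement.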
scala> sc.parallelize(Array(1,2,1,2,1,2,3,4,5)).countByValue
res73: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 3, 2 -> 3, 3 -> 1, 4 -> 1)
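The word-count use mentioned above might look like the following sketch, assuming a local Spark setup. The input lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("countByValue-sketch"))

// Hypothetical input: two short lines of text.
val lines = sc.parallelize(Seq("a b a", "b c"))

// Split each line into words, then count each distinct word.
val counts: scala.collection.Map[String, Long] =
  lines.flatMap(_.split(" ")).countByValue()

counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word -> $n") }
```

Because the whole result map is brought back to the driver, this fits cases with a modest number of distinct values. For large vocabularies, `map(w => (w, 1L)).reduceByKey(_ + _)` keeps the counts distributed.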
6:countByValueApprox
def countByValueApprox(timeout: Long, confidence: Double = 0.95)(implicit ord: Ordering[T] = null): PartialResult[Map[T, BoundedDouble]]
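Approximate version of countByValue: waits at most timeout milliseconds and returns a result that may be incomplete, with each count given as a bounded estimate at the requested confidence level.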
7:++
def ++(other: RDD[T]): RDD[T]
Returns the union of this RDD and other. Duplicate elements are kept; this is equivalent to union.
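A short sketch showing that ++ keeps duplicates, assuming a local Spark setup. The two input RDDs are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("union-sketch"))

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4))

// The shared element 3 appears twice in the result.
val all: Array[Int] = (a ++ b).collect()

// Apply distinct() when set semantics are wanted.
val unique: Array[Int] = (a ++ b).distinct().collect().sorted
```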
8:fold
def fold(zeroValue: T)(op: (T, T) ⇒ T): T
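Aggregates the elements of each partition, and then the per-partition results, using the given associative function and a neutral zeroValue. For example:
scala> rdd1.collect
res90: Array[Int] = Array(1, 2, 3, 4, 5)
scala> rdd1.fold(0)(_+_)
res91: Int = 15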
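One point worth illustrating is that zeroValue is applied once per partition and once more when the partition results are merged, so it should be the identity of op. The sketch below assumes a local Spark setup and an explicit partition count of 2.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one (assumed setup).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("fold-sketch"))

// Two partitions, requested explicitly so the arithmetic below is predictable.
val rdd = sc.parallelize(1 to 5, 2)

// With the identity of +, fold gives the plain sum.
val sum = rdd.fold(0)(_ + _)

// A non-neutral zeroValue is added once per partition (2 times) and once
// in the final merge: 15 + 10 * (2 + 1).
val skewed = rdd.fold(10)(_ + _)

println(s"sum=$sum skewed=$skewed")
```

Since the skewed result changes with the number of partitions, a non-neutral zeroValue makes the outcome depend on how the data happens to be laid out.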