• zip和zipPartitions


    zip函数用于将两个RDD组合成Key/Value形式的RDD,这里默认两个RDD的partition数量以及元素数量都相同,否则会抛出异常。

    scala> val aa=sc.makeRDD(1 to 10)

    aa: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[86] at makeRDD at <console>:26

    scala> val cc=sc.makeRDD(21 to 30)

    cc: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[88] at makeRDD at <console>:26


    scala> aa.zip(bb).collect
    res62: Array[(Int, Int)] = Array((1,21), (2,22), (3,23), (4,24), (5,25), (6,26), (7,27), (8,28), (9,29), (10,30))

    zipPartitions函数将多个RDD按照partition组合成为新的RDD,该函数需要组合的RDD具有相同的分区数,但对于每个分区内的元素数量没有要求。

    def zipPartitions[B, C, D, V](rdd2: org.apache.spark.rdd.RDD[B],rdd3: org.apache.spark.rdd.RDD[C],rdd4: org.apache.spark.rdd.RDD[D],preservesPartitioning: Boolean)(f: (Iterator[Int], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit evidence$21: scala.reflect.ClassTag[B],implicit evidence$22: scala.reflect.ClassTag[C],implicit evidence$23: scala.reflect.ClassTag[D],implicit evidence$24: scala.reflect.ClassTag[V]): org.apache.spark.rdd.RDD[V]
    def zipPartitions[B, C, V](rdd2: org.apache.spark.rdd.RDD[B],rdd3: org.apache.spark.rdd.RDD[C],preservesPartitioning: Boolean)(f: (Iterator[Int], Iterator[B], Iterator[C]) => Iterator[V])(implicit evidence$15: scala.reflect.ClassTag[B],implicit evidence$16: scala.reflect.ClassTag[C],implicit evidence$17: scala.reflect.ClassTag[V]): org.apache.spark.rdd.RDD[V]
    def zipPartitions[B, C, D, V](rdd2: org.apache.spark.rdd.RDD[B],rdd3: org.apache.spark.rdd.RDD[C],rdd4: org.apache.spark.rdd.RDD[D])(f: (Iterator[Int], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit evidence$25: scala.reflect.ClassTag[B],implicit evidence$26: scala.reflect.ClassTag[C],implicit evidence$27: scala.reflect.ClassTag[D],implicit evidence$28: scala.reflect.ClassTag[V]): org.apache.spark.rdd.RDD[V]
    def zipPartitions[B, V](rdd2: org.apache.spark.rdd.RDD[B],preservesPartitioning: Boolean)(f: (Iterator[Int], Iterator[B]) => Iterator[V])(implicit evidence$11: scala.reflect.ClassTag[B],implicit evidence$12: scala.reflect.ClassTag[V]): org.apache.spark.rdd.RDD[V]
    def zipPartitions[B, V](rdd2: org.apache.spark.rdd.RDD[B])(f: (Iterator[Int], Iterator[B]) => Iterator[V])(implicit evidence$13: scala.reflect.ClassTag[B],implicit evidence$14: scala.reflect.ClassTag[V]): org.apache.spark.rdd.RDD[V]
    def zipPartitions[B, C, V](rdd2: org.apache.spark.rdd.RDD[B],rdd3: org.apache.spark.rdd.RDD[C])(f: (Iterator[Int], Iterator[B], Iterator[C]) => Iterator[V])(implicit evidence$18: scala.reflect.ClassTag[B],implicit evidence$19: scala.reflect.ClassTag[C],implicit evidence$20: scala.reflect.ClassTag[V]): org.apache.spark.rdd.RDD[V]

    举例如下:

    scala> val aa=sc.makeRDD(1 to 10)

    aa: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[86] at makeRDD at <console>:26

    scala> val bb=sc.makeRDD(11 to 15)

    bb: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[87] at makeRDD at <console>:26

    scala> val cc=sc.makeRDD(21 to 35)

    cc: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[88] at makeRDD at <console>:26


    scala> aa.zipPartitions(aa,bb){(aaiter,bbiter,cciter)=>{var result=List[(Int,Int,Int)]();while(aaiter.hasNext&&bbiter.hasNext&&cciter.hasNext){result::=(aaiter.next,bbiter.next,cciter.next)};result.toIterator}}.collect
    res56: Array[(Int, Int, Int)] = Array((1,1,11), (3,3,12), (6,6,13), (9,9,15), (8,8,14))

    ----------------------

      def zipWithIndex(): org.apache.spark.rdd.RDD[(Int, Long)]

    zipWithIndex将RDD中的元素和这个元素在RDD中的ID(索引号)组合成键/值对


    scala> bb.zipWithIndex().collect
    res64: Array[(Int, Long)] = Array((21,0), (22,1), (23,2), (24,3), (25,4), (26,5), (27,6), (28,7), (29,8), (30,9))

    --------------------------------

     def zipWithUniqueId(): org.apache.spark.rdd.RDD[(Int, Long)]

    zipWithUniqueId将RDD中的元素和一个唯一的ID组成键/值对

    这个唯一ID生成算法如下:

    每个分区中第一个元素的唯一ID值为:该分区索引号;

    每个分区中第N个元素的唯一ID值为:(前一个元素的唯一ID值) + (该RDD总的分区数)

    scala> bb.zipWithUniqueId().collect
    res67: Array[(Int, Long)] = Array((21,0), (22,4), (23,1), (24,5), (25,9), (26,2), (27,6), (28,3), (29,7), (30,11))

  • 相关阅读:
    ajax返回乱码的解决方案
    Javascript里使用Dom操作Xml
    ASP.NET 网站路径
    远程连接SQL Server
    缘 in English
    简单C#验证类
    js事件列表
    ArrayList用法
    下拉菜单遮挡层的解决方案
    正则表达式过滤HTML危险脚本
  • 原文地址:https://www.cnblogs.com/playforever/p/9475032.html
Copyright © 2020-2023  润新知