• Spark RDD: reduceByKey vs groupByKey


    Spark has two similar APIs, reduceByKey and groupByKey. They do similar things, but their underlying implementations differ somewhat. Why were they designed this way? Let's analyze it from the source code.

    First, the call chain of each (both use the default Partitioner, i.e. defaultPartitioner).

    Spark version used: 2.1.0

    First, reduceByKey

    Step1

      def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
        reduceByKey(defaultPartitioner(self), func)
      }
    

    Step2

      def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
        combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
      }
    

    Step3

    def combineByKeyWithClassTag[C](
          createCombiner: V => C,
          mergeValue: (C, V) => C,
          mergeCombiners: (C, C) => C,
          partitioner: Partitioner,
          mapSideCombine: Boolean = true,
          serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
        require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
        if (keyClass.isArray) {
          if (mapSideCombine) {
            throw new SparkException("Cannot use map-side combining with array keys.")
          }
          if (partitioner.isInstanceOf[HashPartitioner]) {
            throw new SparkException("HashPartitioner cannot partition array keys.")
          }
        }
        val aggregator = new Aggregator[K, V, C](
          self.context.clean(createCombiner),
          self.context.clean(mergeValue),
          self.context.clean(mergeCombiners))
        if (self.partitioner == Some(partitioner)) {
          self.mapPartitions(iter => {
            val context = TaskContext.get()
            new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
          }, preservesPartitioning = true)
        } else {
          new ShuffledRDD[K, V, C](self, partitioner)
            .setSerializer(serializer)
            .setAggregator(aggregator)
            .setMapSideCombine(mapSideCombine)
        }
      }
    

    Setting the details of this method aside for now, the key point is that reduceByKey ultimately calls combineByKeyWithClassTag. (Note in passing the two branches at the end: if the RDD is already partitioned by the same partitioner, the values are combined in place with mapPartitions and no shuffle happens; otherwise a ShuffledRDD is built that carries the aggregator and the mapSideCombine flag.) Two of its parameters deserve a closer look:

    def combineByKeyWithClassTag[C](
          createCombiner: V => C,
          mergeValue: (C, V) => C,
          mergeCombiners: (C, C) => C,
          partitioner: Partitioner,
          mapSideCombine: Boolean = true,
          serializer: Serializer = null)
    

    First, the partitioner parameter, which determines how the resulting RDD is partitioned. Besides the defaultPartitioner used here, Spark also ships RangePartitioner and HashPartitioner, and users can define their own. From the source you can see that array keys are a special case: map-side combining is not allowed with array keys, and a HashPartitioner cannot partition array keys, so either combination throws a SparkException.
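
    To make those two checks concrete, here is a minimal sketch, assuming an existing SparkContext named sc (the name arrayKeyed is illustrative only). reduceByKey runs with mapSideCombine = true and hits the first branch; groupByKey with the default HashPartitioner hits the second:

      val arrayKeyed = sc.parallelize(Seq((Array(1, 2), 1), (Array(1, 2), 2)))

      // reduceByKey uses mapSideCombine = true, so with array keys it fails the first check:
      //   SparkException: "Cannot use map-side combining with array keys."
      // arrayKeyed.reduceByKey(_ + _)

      // groupByKey turns map-side combine off, but defaultPartitioner picks a HashPartitioner,
      // so it fails the second check:
      //   SparkException: "HashPartitioner cannot partition array keys."
      // arrayKeyed.groupByKey()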

    Next is the mapSideCombine parameter, which is exactly where reduceByKey and groupByKey differ the most: it determines whether a combine step is performed on each node (the map side) before the shuffle. A more concrete example follows further below.

    Now for groupByKey

    Step1

      def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
        groupByKey(defaultPartitioner(self))
      }
    

    Step2

      def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
        // groupByKey shouldn't use map side combine because map side combine does not
        // reduce the amount of data shuffled and requires all map side data be inserted
        // into a hash table, leading to more objects in the old gen.
        val createCombiner = (v: V) => CompactBuffer(v)
        val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
        val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
        val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
          createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
        bufs.asInstanceOf[RDD[(K, Iterable[V])]]
      }
    

    Step3

    def combineByKeyWithClassTag[C](
          createCombiner: V => C,
          mergeValue: (C, V) => C,
          mergeCombiners: (C, C) => C,
          partitioner: Partitioner,
          mapSideCombine: Boolean = true,
          serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
        require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
        if (keyClass.isArray) {
          if (mapSideCombine) {
            throw new SparkException("Cannot use map-side combining with array keys.")
          }
          if (partitioner.isInstanceOf[HashPartitioner]) {
            throw new SparkException("HashPartitioner cannot partition array keys.")
          }
        }
        val aggregator = new Aggregator[K, V, C](
          self.context.clean(createCombiner),
          self.context.clean(mergeValue),
          self.context.clean(mergeCombiners))
        if (self.partitioner == Some(partitioner)) {
          self.mapPartitions(iter => {
            val context = TaskContext.get()
            new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
          }, preservesPartitioning = true)
        } else {
          new ShuffledRDD[K, V, C](self, partitioner)
            .setSerializer(serializer)
            .setAggregator(aggregator)
            .setMapSideCombine(mapSideCombine)
        }
      }
    

    Comparing this with the reduceByKey call chain above, you can see that both ultimately call the same combineByKeyWithClassTag method, just with different arguments.
    The reduceByKey call

    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
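
    As a rough sketch of what that call means in practice (combineByKeyWithClassTag is itself a public method on pair RDDs; pairs below is a hypothetical RDD[(String, Int)]), reduceByKey(_ + _) boils down to approximately:

      val summed = pairs.combineByKeyWithClassTag[Int](
        (v: Int) => v,                  // createCombiner: the first value seen for a key is kept as-is
        (c: Int, v: Int) => c + v,      // mergeValue: fold further values in, within one partition
        (c1: Int, c2: Int) => c1 + c2)  // mergeCombiners: merge partial sums across partitions

    Note that reduceByKey passes the same func as both mergeValue and mergeCombiners.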
    

    The groupByKey call

    combineByKeyWithClassTag[CompactBuffer[V]](
          createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    

    It is precisely these two different invocations that account for the differences between the methods. Let's look at them one by one:

    • reduceByKey's type parameter is simply [V], while groupByKey's is [CompactBuffer[V]]. This directly leads to their different return types: the former returns RDD[(K, V)], the latter RDD[(K, Iterable[V])].

    • Then there is mapSideCombine = false; this parameter defaults to true. What does it do? As mentioned above, it controls whether a preliminary combine is performed on the map side. See the concrete sketch below.

    Functionally speaking, reduceByKey first performs a merge on each node (within each map-side partition), whereas groupByKey does not.
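
    Here is a concrete sketch of that difference, assuming Spark 2.x on a local master (names such as ReduceVsGroup and pairs are illustrative only). Both jobs produce the same word counts, but reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every single (word, 1) record across the network:

      import org.apache.spark.{SparkConf, SparkContext}

      object ReduceVsGroup {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("reduce-vs-group").setMaster("local[2]"))
          val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2).map(w => (w, 1))

          // reduceByKey: map-side combine first, e.g. (a,1),(a,1) -> (a,2) inside a partition,
          // so at most one record per key per partition is written to the shuffle
          val reduced = pairs.reduceByKey(_ + _)

          // groupByKey: no map-side combine, every (word, 1) record is shuffled,
          // then the values are collected into an Iterable per key and summed afterwards
          val grouped = pairs.groupByKey().mapValues(_.sum)

          println(reduced.collect().toMap)  // Map(a -> 3, b -> 2, c -> 1)
          println(grouped.collect().toMap)  // same result, more data shuffled

          sc.stop()
        }
      }

    On a large or skewed dataset this map-side pre-aggregation is exactly why the reduceByKey variant shuffles far less data.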

    Seen this way, reduceByKey performs much better than groupByKey, because part of the work has already been done on each node. So why does groupByKey exist, and what are its use cases? I'm not sure myself; if you know, please share it in the comments. Many thanks!
