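Spark Source Code Series: Comparing DataFrame repartition and coalesce

In Spark development, repartitioning data can make a job run faster, especially when joins are involved. The gain often far outweighs the cost of the repartition itself, so it is usually worth doing.
Spark SQL offers two main methods for repartitioning data, repartition and coalesce. This post compares the two.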

repartition

repartition has three overloads:

def repartition(numPartitions: Int): DataFrame

/**
 * Returns a new [[DataFrame]] that has exactly `numPartitions` partitions.
 * @group dfops
 * @since 1.3.0
 */
def repartition(numPartitions: Int): DataFrame = withPlan {
  Repartition(numPartitions, shuffle = true, logicalPlan)
}

This method returns a new DataFrame that has exactly numPartitions partitions. Because it builds the plan with shuffle = true, it always shuffles the data.

def repartition(partitionExprs: Column*): DataFrame

/**
 * Returns a new [[DataFrame]] partitioned by the given partitioning expressions preserving
 * the existing number of partitions. The resulting DataFrame is hash partitioned.
 *
 * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
 *
 * @group dfops
 * @since 1.6.0
 */
@scala.annotation.varargs
def repartition(partitionExprs: Column*): DataFrame = withPlan {
  RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, numPartitions = None)
}

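This method returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned. The Scaladoc says the existing number of partitions is preserved, but the method passes numPartitions = None, and the RepartitionByExpression documentation quoted later in this post says that in this case the partition count comes from spark.sql.shuffle.partitions.

This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).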

def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame

/**
 * Returns a new [[DataFrame]] partitioned by the given partitioning expressions into
 * `numPartitions`. The resulting DataFrame is hash partitioned.
 *
 * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
 *
 * @group dfops
 * @since 1.6.0
 */
@scala.annotation.varargs
def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame = withPlan {
  RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, Some(numPartitions))
}

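This method returns a new DataFrame partitioned by the given partitioning expressions into numPartitions partitions. The resulting DataFrame is hash partitioned.

This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).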
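
To make "hash partitioned" concrete, here is a minimal plain-Scala sketch of the idea: a row's partition index is derived from the hash of its key, so rows with equal keys always land in the same partition. This is an illustration only. The object and method names are hypothetical, and it uses the JVM hashCode as a stand-in; Spark's own hash function is different, so the actual indices in a real job will not match.

```scala
object HashPartitionSketch {
  // Non-negative modulo, so negative hash codes still map into [0, numPartitions).
  // hashCode is a stand-in here; Spark uses its own hash function.
  def partitionIndex(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be greater than 0.")
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Group rows by the partition index derived from each row's key.
  def partitionBy[K, V](rows: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    rows.groupBy { case (k, _) => partitionIndex(k, numPartitions) }

  def main(args: Array[String]): Unit = {
    val rows = Seq("alice" -> 1, "bob" -> 2, "alice" -> 3, "carol" -> 4, "bob" -> 5)
    // Every row for a given user ends up in exactly one partition.
    partitionBy(rows, 4).toSeq.sortBy(_._1).foreach { case (idx, part) =>
      println(s"partition $idx: ${part.mkString(", ")}")
    }
  }
}
```

This is also why partitioning by a column can help joins and why it can cause skew: all rows for a very frequent key are forced into a single partition.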

coalesce

coalesce(numPartitions: Int): DataFrame

/**
 * Returns a new [[DataFrame]] that has exactly `numPartitions` partitions.
 * Similar to coalesce defined on an [[RDD]], this operation results in a narrow dependency, e.g.
 * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
 * the 100 new partitions will claim 10 of the current partitions.
 * @group rdd
 * @since 1.4.0
 */
def coalesce(numPartitions: Int): DataFrame = withPlan {
  Repartition(numPartitions, shuffle = false, logicalPlan)
}

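This method returns a new DataFrame that has exactly numPartitions partitions. Like coalesce on an RDD, the operation results in a narrow dependency. For example:

If you go from 1000 partitions to 100, there is no shuffle; each of the 100 new partitions claims 10 of the current partitions.

Going the other way, from 100 partitions to 1000, does not shuffle either. Because the plan is built with shuffle = false, coalesce cannot increase the partition count, and the result keeps its 100 partitions. To increase the number of partitions, use repartition.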
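
The narrow dependency described above can be sketched in plain Scala as a mapping from each new partition to the contiguous range of parent partitions it claims. This is a simplified model under stated assumptions, not Spark's implementation: the names are hypothetical, and Spark's real coalescer may also take data locality into account when it groups parents.

```scala
object CoalesceSketch {
  // For each new partition, list the indices of the parent partitions it claims.
  // With shuffle = false the count can only stay the same or shrink,
  // so the effective target is min(current, requested).
  def coalesceGroups(current: Int, requested: Int): Seq[Seq[Int]] = {
    require(current > 0 && requested > 0, "partition counts must be positive")
    val target = math.min(current, requested)
    (0 until target).map { i =>
      val start = (i.toLong * current / target).toInt
      val end = ((i + 1).toLong * current / target).toInt
      start until end
    }
  }

  def main(args: Array[String]): Unit = {
    val shrink = coalesceGroups(1000, 100)
    println(s"${shrink.length} groups, ${shrink.head.length} parents each") // 100 groups, 10 parents each
    val grow = coalesceGroups(100, 1000)
    println(s"${grow.length} groups, ${grow.head.length} parent each") // 100 groups, 1 parent each
  }
}
```

No row has to be compared against a key or moved to a different group here, which is why the operation needs no shuffle.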

Note: coalesce(numPartitions: Int): DataFrame and repartition(numPartitions: Int): DataFrame both build the same logical plan node, case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan). They differ only in the value of the shuffle flag.

/**
 * Returns a new RDD that has exactly `numPartitions` partitions. Differs from
 * [[RepartitionByExpression]] as this method is called directly by DataFrame's, because the user
 * asked for `coalesce` or `repartition`. [[RepartitionByExpression]] is used when the consumer
 * of the output requires some specific ordering or distribution of the data.
 */
case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
  extends UnaryNode {
  override def output: Seq[Attribute] = child.output
}

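Returns a new RDD that has exactly numPartitions partitions. Unlike RepartitionByExpression, this node is created directly by DataFrame because the user asked for coalesce or repartition.

RepartitionByExpression is used when the consumer of the output requires a specific ordering or distribution of the data. (The source comment says RDD while the methods return a DataFrame; as far as I can tell this makes no practical difference.)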

repartition(partitionExprs: Column*): DataFrame and repartition(numPartitions: Int, partitionExprs: Column*): DataFrame, on the other hand, build

case class RepartitionByExpression(partitionExpressions: Seq[Expression], child: LogicalPlan, numPartitions: Option[Int] = None) extends RedistributeData

/**
 * This method repartitions data using [[Expression]]s into `numPartitions`, and receives
 * information about the number of partitions during execution. Used when a specific ordering or
 * distribution is expected by the consumer of the query result. Use [[Repartition]] for RDD-like
 * `coalesce` and `repartition`.
 * If `numPartitions` is not specified, the number of partitions will be the number set by
 * `spark.sql.shuffle.partitions`.
 */
case class RepartitionByExpression(
    partitionExpressions: Seq[Expression],
    child: LogicalPlan,
    numPartitions: Option[Int] = None) extends RedistributeData {
  numPartitions match {
    case Some(n) => require(n > 0, "numPartitions must be greater than 0.")
    case None => // Ok
  }
}

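This node repartitions data into numPartitions partitions using Expressions, and receives information about the number of partitions during execution. It is used when the consumer of the query result expects a specific ordering or distribution. Use Repartition for RDD-like coalesce and repartition.

If numPartitions is not specified, the number of partitions is the value of spark.sql.shuffle.partitions.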
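
The fallback rule can be modeled in a few lines of plain Scala. This is a sketch of the rule as documented above, not Spark's planner code: the object name, the resolve function, and the Map standing in for the Spark configuration are all hypothetical. The value 200 is Spark's documented default for spark.sql.shuffle.partitions.

```scala
object ShufflePartitionDefault {
  // Documented default of spark.sql.shuffle.partitions.
  val DefaultShufflePartitions = 200

  // An explicit numPartitions wins; otherwise fall back to the configured value,
  // and to the default when the key is not set. `conf` is a stand-in for Spark's config.
  def resolve(numPartitions: Option[Int], conf: Map[String, String]): Int = {
    numPartitions.foreach(n => require(n > 0, "numPartitions must be greater than 0."))
    numPartitions.getOrElse(
      conf.get("spark.sql.shuffle.partitions").map(_.toInt).getOrElse(DefaultShufflePartitions))
  }

  def main(args: Array[String]): Unit = {
    val conf = Map("spark.sql.shuffle.partitions" -> "50")
    println(resolve(Some(10), conf)) // 10: repartition(10, $"user")
    println(resolve(None, conf))     // 50: repartition($"user") with the setting above
    println(resolve(None, Map.empty)) // 200: nothing configured
  }
}
```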

Usage examples

def repartition(numPartitions: Int): DataFrame

// Get a test DataFrame that contains a user column
val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
// Get a DataFrame with 10 partitions
testDataFrame.repartition(10)

def repartition(partitionExprs: Column*): DataFrame

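import sqlContext.implicits._ // needed for the $"..." column syntax

// Get a test DataFrame that contains a user column
val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
// Partition by the user column; the partition count is set by spark.sql.shuffle.partitions
testDataFrame.repartition($"user")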

def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame

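import sqlContext.implicits._ // needed for the $"..." column syntax

// Get a test DataFrame that contains a user column
val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
// Partition by the user column into 10 partitions. This can speed up joins considerably,
// but watch out for data skew.
testDataFrame.repartition(10, $"user")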

coalesce(numPartitions: Int): DataFrame

val testDataFrame1: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
val testDataFrame2 = testDataFrame1.repartition(10)
// No shuffle: the 10 partitions are merged into 5
testDataFrame2.coalesce(5)
// No shuffle either: coalesce cannot increase the partition count,
// so the result still has 10 partitions
testDataFrame2.coalesce(100)
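
One way to check these claims is to print the partition count after each call. The sketch below is not the setup used in this post: it assumes Spark 2.0 or later (SparkSession instead of sqlContext), a local master, and a small in-memory DataFrame in place of the MySQL table. The object name and sample rows are made up for illustration. Adaptive query execution is switched off so that the shuffle partition setting is applied as-is.

```scala
import org.apache.spark.sql.SparkSession

object PartitionCountCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-count-check")
      .config("spark.sql.shuffle.partitions", "6")
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the MySQL-backed DataFrame; it only needs a user column.
    val df = Seq(("alice", 1), ("bob", 2), ("carol", 3)).toDF("user", "n")

    val ten = df.repartition(10)
    println(ten.rdd.getNumPartitions)                        // expected: 10
    println(ten.coalesce(5).rdd.getNumPartitions)            // expected: 5
    println(ten.coalesce(100).rdd.getNumPartitions)          // expected: 10, coalesce cannot grow
    println(df.repartition($"user").rdd.getNumPartitions)    // expected: 6, from spark.sql.shuffle.partitions
    println(df.repartition(8, $"user").rdd.getNumPartitions) // expected: 8

    spark.stop()
  }
}
```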

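As for how many partitions to use, that depends on your actual workload: too many wastes resources, and too few hurts performance.

This is only a first look. I plan to study the underlying implementation later.

These are notes from my own work and study. Please credit the source when reposting.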

Original post: https://www.cnblogs.com/lillcol/p/9885080.html