Improving Spark execution efficiency by setting the spark.default.parallelism parameter appropriately

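Spark has the concept of a partition (the same thing as a slice, as the official documentation has stated since Spark 1.2), and in general each partition corresponds to one task. In my tests, when spark.default.parallelism was not set, the number of partitions Spark computed was enormous and badly mismatched with my core count. On two machines (8 cores and 6 GB of memory each), Spark produced about 28,000 partitions, and therefore a similar number of tasks (about 29,000 by my count). Each task finished in a few milliseconds or even a fraction of a millisecond, yet the job as a whole ran very slowly, presumably because scheduling overhead dominated such short tasks. After I set spark.default.parallelism, the number of tasks dropped to 10, and the time for one computation pass fell from minutes to about 20 seconds.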

The parameter can be set in the configuration file $SPARK_HOME/conf/spark-defaults.conf.

For example:

 spark.master                       spark://master:7077
 spark.default.parallelism          10
 spark.driver.memory                2g
 spark.serializer                   org.apache.spark.serializer.KryoSerializer
 spark.sql.shuffle.partitions       50
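
The same two parallelism properties can also be set in code rather than in the configuration file. The following is only a minimal PySpark sketch: the language choice, the helper name `build_conf`, and the application name are my own assumptions, not part of the original setup.

```python
# Parallelism settings mirrored from the configuration example above.
PARALLELISM_SETTINGS = {
    "spark.default.parallelism": "10",
    "spark.sql.shuffle.partitions": "50",
}


def build_conf(settings, app_name="parallelism-demo"):
    """Return a SparkConf carrying `settings`.

    `app_name` is a hypothetical application name chosen for this sketch.
    """
    # Imported here so that this module can be loaded without PySpark installed.
    from pyspark import SparkConf

    conf = SparkConf().setAppName(app_name)
    for key, value in settings.items():
        conf.set(key, value)
    return conf
```

Passing the result to `SparkContext(conf=build_conf(PARALLELISM_SETTINGS))` before any RDD is created should make `sc.defaultParallelism` report 10. Properties set directly on a SparkConf take precedence over those read from the configuration file.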

Property Name	Default	Meaning
`spark.default.parallelism`	For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like `parallelize` with no parent RDDs, it depends on the cluster manager: Local mode: number of cores on the local machine; Mesos fine grained mode: 8; Others: total number of cores on all executor nodes or 2, whichever is larger.	Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by user.
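
To make the default rule for RDDs with no parent easier to read, it can be written out as a small function. This is only an illustration of the table row above, not Spark's actual code; the function name and the `mode` labels are invented for this sketch.

```python
def default_parallelism(mode, total_cores):
    """Documented default partition count for RDDs with no parent.

    `mode` is a label invented for this sketch: "local", "mesos-fine",
    or anything else for the remaining cluster managers.
    """
    if mode == "local":
        return total_cores  # number of cores on the local machine
    if mode == "mesos-fine":
        return 8  # fixed value for Mesos fine grained mode
    return max(total_cores, 2)  # total executor cores or 2, whichever is larger


# Two 8-core machines under a standalone master.
print(default_parallelism("standalone", 16))  # prints 16
```

For shuffle operations such as `reduceByKey` and `join`, the rule is different: the default is the largest number of partitions in a parent RDD, which is why the task count can end up far above the core count.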

From: http://spark.apache.org/docs/latest/tuning.html

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.
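
The recommendation of 2-3 tasks per core, and the option of passing the level of parallelism as a second argument, might look like this in PySpark. This is a sketch under assumptions: the function names are made up, `sc` is taken to be an existing SparkContext, and `path` is a hypothetical input file.

```python
from operator import add


def suggested_partitions(total_cores, tasks_per_core=2):
    """Partition count that gives `tasks_per_core` tasks per CPU core."""
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be positive")
    return total_cores * tasks_per_core


def word_counts(sc, path, num_partitions):
    """Count words in a text file using an explicit level of parallelism.

    `sc` is an existing pyspark.SparkContext; `path` is a hypothetical file.
    """
    # minPartitions is the optional parameter that controls the "map" side.
    lines = sc.textFile(path, minPartitions=num_partitions)
    pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
    # The second argument of reduceByKey sets the number of "reduce" partitions.
    return pairs.reduceByKey(add, num_partitions)
```

For the two 8-core machines described earlier, `suggested_partitions(16, 2)` gives 32 and `suggested_partitions(16, 3)` gives 48, so the guide's range is somewhat higher than the value of 10 that I used.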

Original article: https://www.cnblogs.com/wrencai/p/4231966.html