Spark Source Code Execution Logic Analysis [Based on the SparkPi Example]


    一. The SparkPi Example Code

      package scala

      import org.apache.spark.sql.SparkSession

      import scala.math.random

      /** Computes an approximation to pi */
      object SparkPi {
        def main(args: Array[String]) {
          val spark = SparkSession
            .builder
            .appName("Spark Pi")
            .master("local[2]")
            .getOrCreate()
          val slices = if (args.length > 0) args(0).toInt else 2
          val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
          val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
            val x = random * 2 - 1
            val y = random * 2 - 1
            if (x*x + y*y <= 1) 1 else 0
          }.reduce(_ + _)
          println(s"Pi is roughly ${4.0 * count / (n - 1)}")
          spark.stop()
        }
      }
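
      Why this estimates pi: each sampled point (x, y) is uniform in the square [-1, 1] x [-1, 1], whose area is 4, while the inscribed unit circle has area π. A point therefore lands inside the circle with probability π / 4, so over the n - 1 samples count / (n - 1) ≈ π / 4, and the final println reports 4.0 * count / (n - 1) ≈ π.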

    二. Execution Result

      

    三. Log Analysis

      1. When the example is submitted with ./run-example SparkPi 10, the warning in the log shows that, because the job runs in local mode, Spark first checks the local IP address.

      

       2. Next, Spark checks whether a local Hadoop installation and related configuration such as log4j are present. A user-specified Hadoop configuration is loaded first; if none is configured, Spark falls back to its bundled default Hadoop.

      

       3. After these basic checks, Spark starts the job: it registers the application, configures the view/modify acls (access-control lists), and starts the sparkDriver service.

      

       4. Spark then registers the task scheduler and the resource manager.

      

       5. Local temporary directories are created and data is cached according to the configured caching mode.

      

       6. The SparkUI starts successfully.

      

       7. Spark's bundled Netty web server is started.

      

       8. The computation is executed.

      

       9. After the job finishes successfully, the SparkUI, the task scheduler, and the resource manager are shut down.

      

    四. Source Code Analysis

      1. Creating the SparkSession, the program's entry point

        val spark = SparkSession.builder.appName("Spark Pi").master("local[2]").getOrCreate()

        The program first calls the SparkSession builder, specifying the application name, the run mode [cluster or local], and configuration such as memory size and number of cores. During this process Spark checks the IP address [local mode only] and the Hadoop configuration, which corresponds to steps 1, 2, and 3 of the log analysis above.
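
        As a small illustration of the extra settings mentioned above, the builder forwards arbitrary config(key, value) pairs to SparkConf. The sketch below is not part of the SparkPi example; the keys are standard Spark properties and the values are placeholders only:

      import org.apache.spark.sql.SparkSession

      // Sketch only: the same builder call as in SparkPi, extended with two
      // illustrative config options (values are placeholders, mainly relevant on a cluster).
      val spark = SparkSession
        .builder
        .appName("Spark Pi")
        .master("local[2]")
        .config("spark.executor.memory", "1g") // executor heap size
        .config("spark.executor.cores", "2")   // cores per executor
        .getOrCreate()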

        The SparkSession source code is as follows:

      object SparkSession extends Logging {

        /**
         * Builder for [[SparkSession]].
         */
        @InterfaceStability.Stable
        class Builder extends Logging {

          private[this] val options = new scala.collection.mutable.HashMap[String, String]

          private[this] val extensions = new SparkSessionExtensions

          private[this] var userSuppliedContext: Option[SparkContext] = None

          private[spark] def sparkContext(sparkContext: SparkContext): Builder = synchronized {
            userSuppliedContext = Option(sparkContext)
            this
          }

          /**
           * Sets a name for the application, which will be shown in the Spark web UI.
           * If no application name is set, a randomly generated name will be used.
           *
           * @since 2.0.0
           */
          def appName(name: String): Builder = config("spark.app.name", name)

          /**
           * Sets a config option. Options set using this method are automatically propagated to
           * both `SparkConf` and SparkSession's own configuration.
           *
           * @since 2.0.0
           */
          def config(key: String, value: String): Builder = synchronized {
            options += key -> value
            this
          }

          /**
           * Sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to
           * run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
           *
           * @since 2.0.0
           */
          def master(master: String): Builder = config("spark.master", master)

          /**
           * Enables Hive support, including connectivity to a persistent Hive metastore, support for
           * Hive serdes, and Hive user-defined functions.
           *
           * @since 2.0.0
           */
          def enableHiveSupport(): Builder = synchronized {
            if (hiveClassesArePresent) {
              config(CATALOG_IMPLEMENTATION.key, "hive")
            } else {
              throw new IllegalArgumentException(
                "Unable to instantiate SparkSession with Hive support because " +
                  "Hive classes are not found.")
            }
          }

          /**
           * Gets an existing [[SparkSession]] or, if there is no existing one, creates a new
           * one based on the options set in this builder.
           *
           * This method first checks whether there is a valid thread-local SparkSession,
           * and if yes, return that one. It then checks whether there is a valid global
           * default SparkSession, and if yes, return that one. If no valid global default
           * SparkSession exists, the method creates a new SparkSession and assigns the
           * newly created SparkSession as the global default.
           *
           * In case an existing SparkSession is returned, the config options specified in
           * this builder will be applied to the existing SparkSession.
           *
           * @since 2.0.0
           */
          def getOrCreate(): SparkSession = synchronized {
            assertOnDriver()
            // Get the session from current thread's active session.
            var session = activeThreadSession.get()
            if ((session ne null) && !session.sparkContext.isStopped) {
              options.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
              if (options.nonEmpty) {
                logWarning("Using an existing SparkSession; some configuration may not take effect.")
              }
              return session
            }

            // Global synchronization so we will only set the default session once.
            SparkSession.synchronized {
              // If the current thread does not have an active session, get it from the global session.
              session = defaultSession.get()
              if ((session ne null) && !session.sparkContext.isStopped) {
                options.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
                if (options.nonEmpty) {
                  logWarning("Using an existing SparkSession; some configuration may not take effect.")
                }
                return session
              }

              // No active nor global default session. Create a new one.
              val sparkContext = userSuppliedContext.getOrElse {
                val sparkConf = new SparkConf()
                options.foreach { case (k, v) => sparkConf.set(k, v) }

                // set a random app name if not given.
                if (!sparkConf.contains("spark.app.name")) {
                  sparkConf.setAppName(java.util.UUID.randomUUID().toString)
                }

                SparkContext.getOrCreate(sparkConf)
                // Do not update `SparkConf` for existing `SparkContext`, as it's shared by all sessions.
              }

              // Initialize extensions if the user has defined a configurator class.
              val extensionConfOption = sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
              if (extensionConfOption.isDefined) {
                val extensionConfClassName = extensionConfOption.get
                try {
                  val extensionConfClass = Utils.classForName(extensionConfClassName)
                  val extensionConf = extensionConfClass.newInstance()
                    .asInstanceOf[SparkSessionExtensions => Unit]
                  extensionConf(extensions)
                } catch {
                  // Ignore the error if we cannot find the class or when the class has the wrong type.
                  case e @ (_: ClassCastException |
                            _: ClassNotFoundException |
                            _: NoClassDefFoundError) =>
                    logWarning(s"Cannot use $extensionConfClassName to configure session extensions.", e)
                }
              }

              session = new SparkSession(sparkContext, None, None, extensions)
              options.foreach { case (k, v) => session.initialSessionOptions.put(k, v) }
              setDefaultSession(session)
              setActiveSession(session)

              // Register a successfully instantiated context to the singleton. This should be at the
              // end of the class definition so that the singleton is updated only if there is no
              // exception in the construction of the instance.
              sparkContext.addSparkListener(new SparkListener {
                override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
                  defaultSession.set(null)
                }
              })
            }

            return session
          }
        }
      }
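
        To make the getOrCreate logic above concrete, the following sketch (assumed to run in a single JVM, in the same spirit as the SparkPi example) shows the reuse behaviour described in the scaladoc: a second builder call does not create a second session but returns the existing one, applying any new options to it and logging a warning.

      import org.apache.spark.sql.SparkSession

      // Sketch only: demonstrates that getOrCreate returns the existing session.
      val first = SparkSession.builder.appName("Spark Pi").master("local[2]").getOrCreate()

      // Found via the active/default session checks in getOrCreate; the extra option
      // is applied to the existing session and a warning is logged.
      val second = SparkSession.builder.config("spark.sql.shuffle.partitions", "4").getOrCreate()

      assert(first eq second) // both references point to the same SparkSession instance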

      2. Executing the computation logic

      val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
        val x = random * 2 - 1
        val y = random * 2 - 1
        if (x*x + y*y <= 1) 1 else 0
      }.reduce(_ + _)
     First, the program calls the parallelize function of the SparkContext object, which turns the data into an RDD on which the computation runs. This corresponds to step 8 of the log analysis.
     The source code is as follows:
      /** Distribute a local Scala collection to form an RDD.
       *
       * @note Parallelize acts lazily. If `seq` is a mutable collection and is altered after the call
       * to parallelize and before the first action on the RDD, the resultant RDD will reflect the
       * modified collection. Pass a copy of the argument to avoid this.
       * @note avoid using `parallelize(Seq())` to create an empty `RDD`. Consider `emptyRDD` for an
       * RDD with no partitions, or `parallelize(Seq[T]())` for an RDD of `T` with empty partitions.
       * @param seq Scala collection to distribute
       * @param numSlices number of partitions to divide the collection into
       * @return RDD representing distributed collection
       */
      def parallelize[T: ClassTag](
          seq: Seq[T],
          numSlices: Int = defaultParallelism): RDD[T] = withScope {
        assertNotStopped()
        new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
      }
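
      A short usage sketch of parallelize (here `spark` stands for the SparkSession created earlier; the numbers are arbitrary): the second argument numSlices controls the number of partitions, and falls back to defaultParallelism when omitted.

      // Sketch only: partitioning behaviour of parallelize.
      val sc = spark.sparkContext

      val explicit = sc.parallelize(1 until 10, 2) // request 2 partitions (numSlices = 2)
      println(explicit.getNumPartitions)           // prints 2

      val defaulted = sc.parallelize(1 until 10)   // numSlices defaults to defaultParallelism
      println(defaulted.getNumPartitions == sc.defaultParallelism) // true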

      The key call here is withScope: it executes the function body passed to it so that all RDDs created inside that body share the same scope.

      The source code is as follows:

      /**
       * Execute the given body such that all RDDs created in this body will have the same scope.
       * The name of the scope will be the first method name in the stack trace that is not the
       * same as this method's.
       *
       * Note: Return statements are NOT allowed in body.
       */
      private[spark] def withScope[T](
          sc: SparkContext,
          allowNesting: Boolean = false)(body: => T): T = {
        val ourMethodName = "withScope"
        val callerMethodName = Thread.currentThread.getStackTrace()
          .dropWhile(_.getMethodName != ourMethodName)
          .find(_.getMethodName != ourMethodName)
          .map(_.getMethodName)
          .getOrElse {
            // Log a warning just in case, but this should almost certainly never happen
            logWarning("No valid method name for this RDD operation scope!")
            "N/A"
          }
        withScope[T](sc, callerMethodName, allowNesting, ignoreParent = false)(body)
      }

      /**
       * Execute the given body such that all RDDs created in this body will have the same scope.
       *
       * If nesting is allowed, any subsequent calls to this method in the given body will instantiate
       * child scopes that are nested within our scope. Otherwise, these calls will take no effect.
       *
       * Additionally, the caller of this method may optionally ignore the configurations and scopes
       * set by the higher level caller. In this case, this method will ignore the parent caller's
       * intention to disallow nesting, and the new scope instantiated will not have a parent. This
       * is useful for scoping physical operations in Spark SQL, for instance.
       *
       * Note: Return statements are NOT allowed in body.
       */
      private[spark] def withScope[T](
          sc: SparkContext,
          name: String,
          allowNesting: Boolean,
          ignoreParent: Boolean)(body: => T): T = {
        // Save the old scope to restore it later
        val scopeKey = SparkContext.RDD_SCOPE_KEY
        val noOverrideKey = SparkContext.RDD_SCOPE_NO_OVERRIDE_KEY
        val oldScopeJson = sc.getLocalProperty(scopeKey)
        val oldScope = Option(oldScopeJson).map(RDDOperationScope.fromJson)
        val oldNoOverride = sc.getLocalProperty(noOverrideKey)
        try {
          if (ignoreParent) {
            // Ignore all parent settings and scopes and start afresh with our own root scope
            sc.setLocalProperty(scopeKey, new RDDOperationScope(name).toJson)
          } else if (sc.getLocalProperty(noOverrideKey) == null) {
            // Otherwise, set the scope only if the higher level caller allows us to do so
            sc.setLocalProperty(scopeKey, new RDDOperationScope(name, oldScope).toJson)
          }
          // Optionally disallow the child body to override our scope
          if (!allowNesting) {
            sc.setLocalProperty(noOverrideKey, "true")
          }
          body
        } finally {
          // Remember to restore any state that was modified before exiting
          sc.setLocalProperty(scopeKey, oldScopeJson)
          sc.setLocalProperty(noOverrideKey, oldNoOverride)
        }
      }
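
      The core of the second overload is a save-and-restore pattern on SparkContext local properties: remember the old values, set the new scope for the duration of the body, and put everything back in a finally block. The helper below is hypothetical (not Spark code) and isolates that pattern using only the public getLocalProperty / setLocalProperty API:

      import org.apache.spark.SparkContext

      // Hypothetical helper, for illustration only: run `body` with a local property
      // temporarily set on the SparkContext, restoring the previous value afterwards.
      def withLocalProperty[T](sc: SparkContext, key: String, value: String)(body: => T): T = {
        val old = sc.getLocalProperty(key) // remember whatever was there before
        try {
          sc.setLocalProperty(key, value)  // install the temporary value for this body
          body                             // code running here sees the property
        } finally {
          sc.setLocalProperty(key, old)    // restore the saved value, even on exception
        }
      }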