• Source code analysis of RDD checkpoint


    When RDD#checkpoint is called, the following method runs:

      /**
       * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
       * directory set with `SparkContext#setCheckpointDir` and all references to its parent
       * RDDs will be removed. This function must be called before any job has been
       * executed on this RDD. It is strongly recommended that this RDD is persisted in
       * memory, otherwise saving it on a file will require recomputation.
       */
      def checkpoint(): Unit = RDDCheckpointData.synchronized {
        // NOTE: we use a global lock here due to complexities downstream with ensuring
        // children RDD partitions point to the correct parent partitions. In the future
        // we should revisit this consideration.
        if (context.checkpointDir.isEmpty) {
          throw new SparkException("Checkpoint directory has not been set in the SparkContext")
        } else if (checkpointData.isEmpty) {
          // Finally, create a new ReliableRDDCheckpointData. The actual checkpoint logic
          // lives mainly in ReliableRDDCheckpointData#doCheckpoint.
          checkpointData = Some(new ReliableRDDCheckpointData(this))
        }
      }

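    According to the comment, this method only marks the RDD for checkpointing. The files are saved in the directory set with SparkContext#setCheckpointDir, and all references to the RDD's parent RDDs are removed.
    The method must be called before any job has been executed on this RDD. Persisting the RDD is strongly recommended; otherwise the data has to be recomputed when it is written to the file.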
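
    As a usage illustration of the contract described above, here is a minimal sketch. The local[2] master, the application name, the temporary directory, and the squares RDD are all assumptions chosen for a local experiment; they do not come from the Spark source quoted in this post. It needs Spark on the classpath.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointUsageSketch {
  /** Returns rdd.isCheckpointed as observed before and after the first action. */
  def run(): (Boolean, Boolean) = {
    // Assumption: a throwaway local context, only for trying this out.
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // The directory must be set first, otherwise checkpoint() throws SparkException.
      sc.setCheckpointDir(Files.createTempDirectory("checkpoint-sketch").toString)

      val squares = sc.parallelize(1 to 100, 4).map(x => x * x)
      squares.persist(StorageLevel.MEMORY_ONLY) // lets the checkpoint job reuse cached data
      squares.checkpoint()                      // only marks the RDD, nothing is written yet
      val before = squares.isCheckpointed

      squares.count()                           // first action: the job runs, then the checkpoint
      (before, squares.isCheckpointed)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

    Both persist and checkpoint are called before the first action, which matches the requirement stated in the method's comment.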

    The actual writing is done in ReliableRDDCheckpointData#doCheckpoint:

      /**
       * Materialize this RDD and write its content to a reliable DFS.
       * This is called immediately after the first action invoked on this RDD has completed.
       */
      protected override def doCheckpoint(): CheckpointRDD[T] = {
        // Core logic: write the RDD's data to the checkpoint directory
        val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

        // Optionally clean our checkpoint files if the reference is out of scope
        if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
          rdd.context.cleaner.foreach { cleaner =>
            cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
          }
        }

        logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
        newRDD
      }

    It delegates to ReliableCheckpointRDD.writeRDDToCheckpointDirectory:

     1   /**
     2    * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
     3    */
     4   def writeRDDToCheckpointDirectory[T: ClassTag](
     5       originalRDD: RDD[T],
     6       checkpointDir: String,
     7       blockSize: Int = -1): ReliableCheckpointRDD[T] = {
     8 
     9     val sc = originalRDD.sparkContext
    10 
    11     // Create the output path for the checkpoint
    12     val checkpointDirPath = new Path(checkpointDir)
    13     val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
    14     if (!fs.mkdirs(checkpointDirPath)) {
    15       throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
    16     }
    17 
    18     // Save to file, and reload it as an RDD
    19     val broadcastedConf = sc.broadcast(
    20       new SerializableConfiguration(sc.hadoopConfiguration))
    21     // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
    22     // Core logic: run a job that writes every partition to the checkpoint directory
    23     sc.runJob(originalRDD,
    24       writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)
    25 
    26     if (originalRDD.partitioner.nonEmpty) {
    27       writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
    28     }
    29 
    30     val newRDD = new ReliableCheckpointRDD[T](
    31       sc, checkpointDirPath.toString, originalRDD.partitioner)
    32     if (newRDD.partitions.length != originalRDD.partitions.length) {
    33       throw new SparkException(
    34         s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
    35           s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
    36     }
    37     newRDD
    38   }

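    Line 23 uses a small currying trick. writePartitionToCheckpointFile has two parameter lists; passing only the first one, followed by a trailing underscore, turns the remaining list into a function of type (TaskContext, Iterator[T]) => Unit. The original call

        // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
        sc.runJob(originalRDD,
          writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

    can be rewritten with the intermediate function made explicit:

        // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
        val func: (TaskContext, Iterator[T]) => Unit =
          writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf)
        sc.runJob(originalRDD, func)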

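    This submits a new job, which computes the RDD a second time. If the original RDD has already been cached, most of that computation is avoided, which is why caching is strongly recommended when checkpointing.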
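
    The currying trick can be tried without Spark. The sketch below is only an illustration: FakeTaskContext and describePartition are hypothetical names invented here, standing in for TaskContext and writePartitionToCheckpointFile.

```scala
object CurrySketch {
  // Hypothetical stand-in for TaskContext: it only carries the partition id.
  final case class FakeTaskContext(partitionId: Int)

  // Two parameter lists: job-wide configuration first, per-task arguments second.
  def describePartition[T](path: String)(ctx: FakeTaskContext, iterator: Iterator[T]): String =
    s"$path/part-${ctx.partitionId}: ${iterator.size} records"

  // Supplying only the first list where a function type is expected yields
  // an ordinary two-argument function that can be handed to a scheduler.
  val func: (FakeTaskContext, Iterator[Int]) => String =
    describePartition[Int]("/checkpoints/rdd-7")

  def main(args: Array[String]): Unit =
    println(func(FakeTaskContext(3), Iterator(1, 2, 3)))
}
```

    The first parameter list is fixed once on the driver side, and the resulting function is what runs once per partition.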

    The file-writing logic:

      /**
       * Write a RDD partition's data to a checkpoint file.
       */
      def writePartitionToCheckpointFile[T: ClassTag](
          path: String,
          broadcastedConf: Broadcast[SerializableConfiguration],
          blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
        val env = SparkEnv.get
        val outputDir = new Path(path)
        val fs = outputDir.getFileSystem(broadcastedConf.value.value)
    
        val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
        val finalOutputPath = new Path(outputDir, finalOutputName)
        val tempOutputPath =
          new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")
    
        if (fs.exists(tempOutputPath)) {
          throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
        }
        val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
    
        val fileOutputStream = if (blockSize < 0) {
          fs.create(tempOutputPath, false, bufferSize)
        } else {
          // This is mainly for testing purpose
          fs.create(tempOutputPath, false, bufferSize, fs.getDefaultReplication, blockSize)
        }
        val serializer = env.serializer.newInstance()
        val serializeStream = serializer.serializeStream(fileOutputStream)
        Utils.tryWithSafeFinally {
          serializeStream.writeAll(iterator)
        } {
          serializeStream.close()
        }
    
        if (!fs.rename(tempOutputPath, finalOutputPath)) {
          if (!fs.exists(finalOutputPath)) {
            logInfo(s"Deleting tempOutputPath $tempOutputPath")
            fs.delete(tempOutputPath, false)
            throw new IOException("Checkpoint failed: failed to save output of task: " +
              s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
          } else {
            // Some other copy of this task must've finished before us and renamed it
            logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
            if (!fs.delete(tempOutputPath, false)) {
              logWarning(s"Error deleting ${tempOutputPath}")
            }
          }
        }
      }
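
    The end of the function above follows a common pattern: write to a hidden per-attempt temporary file, then rename it to the final name, and if another attempt got there first, keep its output and discard your own. The sketch below reproduces that pattern on the local filesystem. It is an assumption-laden illustration, not Spark code: java.nio.file stands in for the Hadoop FileSystem API, writePartition is a hypothetical helper, and plain text lines stand in for the serialized stream.

```scala
import java.io.IOException
import java.nio.charset.StandardCharsets
import java.nio.file.{FileAlreadyExistsException, Files, Path}

object AtomicWriteSketch {
  /** Writes `lines` to dir/part-XXXXX through a temp file and returns the final path. */
  def writePartition(dir: Path, partitionId: Int, attempt: Int, lines: Seq[String]): Path = {
    val finalName = "part-%05d".format(partitionId)
    val finalPath = dir.resolve(finalName)
    val tempPath = dir.resolve(s".$finalName-attempt-$attempt")

    // A leftover temp file for the same attempt signals a problem, so refuse to continue.
    if (Files.exists(tempPath)) {
      throw new IOException(s"Checkpoint failed: temporary path $tempPath already exists")
    }
    Files.write(tempPath, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))

    try {
      // Without REPLACE_EXISTING, move fails if the final file is already there.
      Files.move(tempPath, finalPath)
    } catch {
      case _: FileAlreadyExistsException =>
        // Another attempt of this task finished first: keep its file, drop ours.
        Files.delete(tempPath)
    }
    finalPath
  }
}
```

    Readers of the directory therefore see either no part file or a complete one, never a half-written file under the final name.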

    The core of it:

      val serializer = env.serializer.newInstance()
      val serializeStream = serializer.serializeStream(fileOutputStream)
      Utils.tryWithSafeFinally {
        serializeStream.writeAll(iterator)
      } {
        serializeStream.close()
      }

      // This writes everything the iterator produces into the given directory.
      // The file name is ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())

    Let's look at its definition:

      /**
       * Return the checkpoint file name for the given partition.
       */
      private def checkpointFileName(partitionIndex: Int): String = {
        "part-%05d".format(partitionIndex)
      }

    This file name is used again when the RDD is recovered. I will cover recovery in the next post.
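
    To make the naming scheme concrete, here is a small sketch. checkpointFileName copies the one-line format from the definition above; partitionIndexOf is a hypothetical inverse added only for illustration and is not part of Spark.

```scala
import scala.util.Try

object CheckpointFileNameSketch {
  // Same format string as in the definition above: zero-padded to at least 5 digits.
  def checkpointFileName(partitionIndex: Int): String =
    "part-%05d".format(partitionIndex)

  // Hypothetical helper: recover the partition index from a file name, if it has one.
  def partitionIndexOf(fileName: String): Option[Int] =
    if (fileName.startsWith("part-")) Try(fileName.stripPrefix("part-").toInt).toOption
    else None

  def main(args: Array[String]): Unit = {
    val names = Seq(10, 2, 0).map(checkpointFileName)
    println(names.sorted)                  // padded names sort in partition order
    println(names.map(partitionIndexOf))
  }
}
```

    Thanks to the zero padding, a plain alphabetical listing of the directory is also in partition order as long as the index has at most five digits.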

    That covers the checkpoint flow itself. So how is a checkpoint triggered?
    The answer is in the SparkContext#runJob method:

      /**
       * Run a function on a given set of partitions in an RDD and pass the results to the given
       * handler function. This is the main entry point for all actions in Spark.
       */
      def runJob[T, U: ClassTag](
          rdd: RDD[T],
          func: (TaskContext, Iterator[T]) => U,
          partitions: Seq[Int],
          resultHandler: (Int, U) => Unit): Unit = {
        if (stopped.get()) {
          throw new IllegalStateException("SparkContext has been shutdown")
        }
        val callSite = getCallSite
        val cleanedFunc = clean(func)
        logInfo("Starting job: " + callSite.shortForm)
        if (conf.getBoolean("spark.logLineage", false)) {
          logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
        }
        dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
        progressBar.foreach(_.finishAll())
        rdd.doCheckpoint()
      }

      Look at the last three lines: the job we actually asked for is submitted first, and only then is rdd.doCheckpoint() called.

      /**
       * Performs the checkpointing of this RDD by saving this. It is called after a job using this RDD
       * has completed (therefore the RDD has been materialized and potentially stored in memory).
       * doCheckpoint() is called recursively on the parent RDDs.
       */
      private[spark] def doCheckpoint(): Unit = {
        RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
          if (!doCheckpointCalled) {
            doCheckpointCalled = true
            if (checkpointData.isDefined) {
              checkpointData.get.checkpoint()
            } else {
              dependencies.foreach(_.rdd.doCheckpoint())
            }
          }
        }
      }

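    According to the comment, this function runs after a job using this RDD has completed, so the RDD has already been materialized and may be stored in memory. This is, once again, why it is best to cache the RDD. Important things are worth repeating.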

    The method that finally gets called is checkpointData.get.checkpoint(), which in turn invokes ReliableRDDCheckpointData#doCheckpoint.
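
    The traversal in RDD#doCheckpoint can be modelled without Spark. The sketch below is a toy under stated assumptions: Node is a hypothetical stand-in for an RDD, the marked flag plays the role of checkpointData.isDefined, and appending to a buffer stands in for checkpointData.get.checkpoint(). It mirrors only the control flow of the version quoted above.

```scala
import scala.collection.mutable.ArrayBuffer

object DoCheckpointSketch {
  // Hypothetical stand-in for an RDD and its lineage.
  final class Node(val name: String, val parents: Seq[Node]) {
    var marked = false                      // plays the role of checkpointData.isDefined
    private var doCheckpointCalled = false  // guards against doing the work twice

    def doCheckpoint(written: ArrayBuffer[String]): Unit = {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (marked) {
          written += name                   // stands in for checkpointData.get.checkpoint()
        } else {
          parents.foreach(_.doCheckpoint(written))
        }
      }
    }
  }

  /** Builds source <- mapped <- filtered, marks the first two, and runs the traversal twice. */
  def demo(): Seq[String] = {
    val source = new Node("source", Nil)
    val mapped = new Node("mapped", Seq(source))
    val filtered = new Node("filtered", Seq(mapped))
    source.marked = true
    mapped.marked = true

    val written = ArrayBuffer.empty[String]
    filtered.doCheckpoint(written)          // walks up until it meets a marked node
    filtered.doCheckpoint(written)          // second call is a no-op
    written.toList
  }
}
```

    In this model the walk stops at the first marked node on each path, so the marked ancestor behind it is not written by that call, and a repeated call does nothing.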

    That completes the analysis of how an RDD is checkpointed.

  • Original article: https://www.cnblogs.com/luckuan/p/5248051.html