Spark如何删除无效rdd checkpoint

spark可以使用checkpoint来作为检查点，将rdd的数据写入hdfs文件，也可以利用本地缓存子系统。
当我们使用checkpoint将rdd保存到hdfs文件时，如果任务的临时文件长时间不删除，长此以往，hdfs会出现很多没有用的文件，spark也考虑到了这一点，因此，用了一些取巧的方式来解决这种问题。

spark config:

spark.cleaner.referenceTracking.cleanCheckpoints = 默认false

也就是说默认情况下，保存的文件一直都会放在dfs中，除非人工删除
下述内容均建立在值为true的情况下

设置检查点路径

spark.sparkContext().setCheckpointDir("hdfs://nameservice1/xx/xx");

存放到hdfs文件系统的好处是自带高容错性、可用性。
那么，所有运行的任务都写这个路径会不会出现覆盖的情况呢？答案是不会

  /**
   * Set the directory under which RDDs are going to be checkpointed.
   * @param directory path to the directory where checkpoint files will be stored
   * (must be HDFS path if running in cluster)
   */
  def setCheckpointDir(directory: String) {

    // If we are running on a cluster, log a warning if the directory is local.
    // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
    // its own local file system, which is incorrect because the checkpoint files
    // are actually on the executor machines.
    if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
      logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
        s"must not be on the local filesystem. Directory '$directory' " +
        "appears to be on the local filesystem.")
    }

    checkpointDir = Option(directory).map { dir =>
	  //利用uuid生成了一个子目录，存放的rdd文件将放到子目录中	
      val path = new Path(dir, UUID.randomUUID().toString)
      val fs = path.getFileSystem(hadoopConfiguration)
      fs.mkdirs(path)
      fs.getFileStatus(path).getPath.toString
    }
  }

利用uuid的唯一性，使不同的进程间的checkpoint互不干扰，后续有checkpoint创建的请求时，将会在该目录下创建文件来保存rdd的内容

在生成checkpoint的ReliableRDDCheckpointData 方法中，

保存检查点

  /**
   * Materialize this RDD and write its content to a reliable DFS.
   * This is called immediately after the first action invoked on this RDD has completed.
   */
  protected override def doCheckpoint(): CheckpointRDD[T] = {
    //写入到可靠的文件中
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

    // Optionally clean our checkpoint files if the reference is out of scope
	//默认false，才会注册清理器
    if (rdd.conf.get(CLEANER_REFERENCE_TRACKING_CLEAN_CHECKPOINTS)) {
      rdd.context.cleaner.foreach { cleaner =>
		//注册清理事件
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }

    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }

注册事件

注册清理事件的意义是当rdd对象无其他引用依赖时，由清理线程异步清理对应的checkpoint文件

  /** Register a RDDCheckpointData for cleanup when it is garbage collected. */
  def registerRDDCheckpointDataForCleanup[T](rdd: RDD[_], parentId: Int): Unit = {
    registerForCleanup(rdd, CleanCheckpoint(parentId))
  }

  /** Register an object for cleanup. */
  private def registerForCleanup(objectForCleanup: AnyRef, task: CleanupTask): Unit = {
    referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
  }

referenceBuffer的作用是持有CleanupTaskWeakReference对象的引用，防止CleanupTaskWeakReference被提前回收，导致提前清理。

弱引用对象

CleanupTaskWeakReference继承自WeakReference，将referent(也就是rdd)，绑定到referenceQueue上，如果gc回收时，发现referent除了referenceQueue这个弱引用外，已经没有其他对象引用，就会将CleanupTaskWeakReference对应放入referenceQueue中

//引用队列，当garbage collector发现对应的可达性改变被发现时，会将引用对象推入队列中
//这是通过Reference.enqueue方法实现的 public boolean enqueue() {return this.queue.enqueue(this);}
private val referenceQueue = new ReferenceQueue[AnyRef]

/**
 * A WeakReference associated with a CleanupTask.
 *
 * When the referent object becomes only weakly reachable, the corresponding
 * CleanupTaskWeakReference is automatically added to the given reference queue.
 */
private class CleanupTaskWeakReference(
    val task: CleanupTask,
    referent: AnyRef,
    referenceQueue: ReferenceQueue[AnyRef])
  extends WeakReference(referent, referenceQueue)

回收线程

再来细致的讲回收线程
在SparkContext初始化时，会启动cleaner,代码较多，直接依次

_cleaner =
  if (_conf.get(CLEANER_REFERENCE_TRACKING)) {
    Some(new ContextCleaner(this))
  } else {
    None
  }
_cleaner.foreach(_.start())

  /** Start the cleaner. */
  def start(): Unit = {
    cleaningThread.setDaemon(true) //守护进程
    cleaningThread.setName("Spark Context Cleaner")
    cleaningThread.start()
	//这里有点银弹的意思，定时执行gc，默认半小时一次，主要是应对长时间任务问题
    periodicGCService.scheduleAtFixedRate(() => System.gc(),
      periodicGCInterval, periodicGCInterval, TimeUnit.SECONDS)
  }

private val cleaningThread = new Thread() { override def run() { keepCleaning() }}

  /** Keep cleaning RDD, shuffle, and broadcast state. */
  private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
    while (!stopped) {
      try {
		//从referenceQueue中取可以回收的弱引用对象，弱引用对象返回表示登记的rdd已经可回收了
        val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
          .map(_.asInstanceOf[CleanupTaskWeakReference])
        // Synchronize here to avoid being interrupted on stop()
        synchronized {
          reference.foreach { ref =>
            logDebug("Got cleaning task " + ref.task)
			//清除强引用
            referenceBuffer.remove(ref)
            ref.task match {
              case CleanRDD(rddId) =>
                doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
              case CleanShuffle(shuffleId) =>
                doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
              case CleanBroadcast(broadcastId) =>
                doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
              case CleanAccum(accId) =>
                doCleanupAccum(accId, blocking = blockOnCleanupTasks)
              case CleanCheckpoint(rddId) =>
                doCleanCheckpoint(rddId) //如果任务是cleancheckpoint任务
            }
          }
        }
      } catch {
        case ie: InterruptedException if stopped => // ignore
        case e: Exception => logError("Error in cleaning thread", e)
      }
    }
  }

  /**
   * Clean up checkpoint files written to a reliable storage.
   * Locally checkpointed files are cleaned up separately through RDD cleanups.
   */
  def doCleanCheckpoint(rddId: Int): Unit = {
    try {
      logDebug("Cleaning rdd checkpoint data " + rddId)
	  //删除checkpoint操作被触发
      ReliableRDDCheckpointData.cleanCheckpoint(sc, rddId)
      listeners.asScala.foreach(_.checkpointCleaned(rddId))
      logInfo("Cleaned rdd checkpoint data " + rddId)
    }
    catch {
      case e: Exception => logError("Error cleaning rdd checkpoint data " + rddId, e)
    }
  }

特殊操作的意思

为什么要定时执行System.gc()去触发full gc？

由于删除rdd checkpoint的方法利用了WeakReference,它是一个严重依赖gc的功能，如果没有gc，就不会发现对象可回收，也就不会触发回收逻辑。
极端情况可能出现长时间只有yong gc,而老年区的对象长时间无法回收，而对象早已无其他引用，利用System.gc()来尝试执行full gc,达到回收老年代的目的

总结

默认情况下，保存的文件一直都会放在dfs中，除非人工删除
及时开启spark.cleaner.referenceTracking.cleanCheckpoints,也不能意味着一定能回收，因为垃圾回收并非一定会在合适的时间执行，有可能最终也没有触发弱引用清理任务逻辑

相关阅读:
大话数据结构笔记
 zsh安装教程
 Matlab安装教程
 7-16 插入排序还是归并排序（25 分)
7-14 插入排序还是堆排序（25 分)
7-14 二叉搜索树的最近公共祖先（30 分)
7-11 笛卡尔树（25 分)
中缀转换为后缀和前缀
 7-15 水果忍者（30 分)
兔子的区间密码（思维）
原文地址：https://www.cnblogs.com/windliu/p/10983334.html