class SparkContext extends Logging with ExecutorAllocationClient
Main entry point for Spark functionality.
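For orientation, a minimal sketch of creating a SparkContext (the master URL "local[2]" and the application name are illustrative values, not from the original text):

import org.apache.spark.{SparkConf, SparkContext}

// Configure and construct the one SparkContext for this application.
val conf = new SparkConf().setAppName("SketchApp").setMaster("local[2]")
val sc = new SparkContext(conf)

// ... build and act on RDDs through sc ...

sc.stop()  // release resources when the application is done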
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
Distribute a local Scala collection to form an RDD.
Note
Parallelize acts lazily. If seq is a mutable collection and is altered after the call to parallelize and before the first action on the RDD, the resultant RDD will reflect the modified collection, so the result may be unpredictable. Pass a copy of the argument to avoid this.
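A minimal sketch of the pitfall the note describes, assuming an existing SparkContext named sc (the collection contents are illustrative):

import scala.collection.mutable.ArrayBuffer

val data = ArrayBuffer(1, 2, 3)
val rdd = sc.parallelize(data)  // lazy: no computation happens here
data += 4                       // mutation before the first action

rdd.collect()                   // may yield Array(1, 2, 3, 4), not Array(1, 2, 3)

// Passing a copy decouples the RDD from later mutations:
val safeRdd = sc.parallelize(data.toList)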
Python:
checkpoint(self)
Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
Scala:
def setCheckpointDir(directory: String): Unit
Set the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.
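A hedged sketch tying checkpoint and setCheckpointDir together, assuming an existing SparkContext named sc (the directory and data are placeholders; on a cluster the directory must be an HDFS path):

sc.setCheckpointDir("/tmp/spark-checkpoints")  // must be set before checkpoint()

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.cache()        // keep it in memory so checkpointing avoids recomputation
rdd.checkpoint()   // mark the RDD; data is written during the next job

rdd.count()        // first action: computes, caches, and writes the checkpoint
println(rdd.isCheckpointed)  // true once the checkpoint has been written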