• RDD中的cache() persist() checkpoint()


     cache只有一个默认的缓存级别MEMORY_ONLY ,而persist可以根据StorageLevel设置其它的缓存级别。 cache以及persist都不是action。

    被重复使用的(但是)不能太大的RDD需要cache

    cache 只使用 memory,checkpoint写磁盘

    rdd.persist(StorageLevel.DISK_ONLY) 与 checkpoint 的区别:

    persist将 RDD 的 partition 持久化到磁盘,但该 partition 由 blockManager 管理。一旦 driver program 执行结束,也就是 executor 所在进程 CoarseGrainedExecutorBackend stop,blockManager 也会 stop,被 cache 到磁盘上的 RDD 也会被清空,而 checkpoint 将 RDD 持久化到 HDFS 或本地文件夹,如果不被手动 remove 掉,是一直存在的,也就是说可以被下一个 driver program 使用,而 cached RDD 不能被其他 dirver program 使用。

    使用checkponint首先需要设置setCheckpointDir

    scala> bb.checkpoint
    org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext
      at org.apache.spark.rdd.RDD.checkpoint(RDD.scala:1544)
      at org.apache.spark.sql.Dataset.checkpoint(Dataset.scala:517)
      at org.apache.spark.sql.Dataset.checkpoint(Dataset.scala:502)
      ... 48 elided

    创建文件夹(可不创建)

    [root@host ~]# hdfs dfs -mkdir /tmp/checkpoint

    scala> sc.setCheckpointDir("/tmp/checkpoint")

    [root@host ~]# hdfs dfs -ls -R /tmp/checkpoint 
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    drwxr-xr-x   - root supergroup          0 2018-07-31 17:33 /tmp/checkpoint/68309a1b-6e2d-4d03-8282-60abbbc8845b

    scala> bb.checkpoint
    res25: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [sex: int, count(1): bigint]

    Dataset使用checkpoint不是lazy的,RDD使用checkpoint是lazy的

    [root@host ~]# hdfs dfs -ls -R /tmp/checkpoint
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    drwxr-xr-x   - root supergroup          0 2018-07-31 17:35 /tmp/checkpoint/68309a1b-6e2d-4d03-8282-60abbbc8845b
    drwxr-xr-x   - root supergroup          0 2018-07-31 17:35 /tmp/checkpoint/68309a1b-6e2d-4d03-8282-60abbbc8845b/rdd-438
    -rw-r--r--   1 root supergroup        163 2018-07-31 17:35 /tmp/checkpoint/68309a1b-6e2d-4d03-8282-60abbbc8845b/rdd-438/part-00000
    -rw-r--r--   1 root supergroup        163 2018-07-31 17:35 /tmp/checkpoint/68309a1b-6e2d-4d03-8282-60abbbc8845b/rdd-438/part-00001
    -rw-r--r--   1 root supergroup        163 2018-07-31 17:35 /tmp/checkpoint/68309a1b-6e2d-4d03-8282-60abbbc8845b/rdd-438/part-00002
    -rw-r--r--   1 root supergroup          4 2018-07-31 17:35 /tmp/checkpoint/68309a1b-6e2d-4d03-8282-60abbbc8845b/rdd-438/part-00003

     scala> bb.show
    +----+--------+
    | sex|count(1)|
    +----+--------+
    |null|      51|
    |   0|      19|
    |   1|      32|
    +----+--------+

    -------------------------------------------------------------------

    scala>  val weblogrdd=sc.textFile("hdfs://localhost:9000/spark/log/web.log")
    weblogrdd: org.apache.spark.rdd.RDD[String] = hdfs://localhost:9000/spark/log/web.log MapPartitionsRDD[1] at textFile at <console>:24

    scala> sc.setCheckpointDir("/tmp/checkpoint4")

    [root@host ~]# hdfs dfs -ls -R /tmp/checkpoint4
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    drwxr-xr-x   - root supergroup          0 2018-08-01 13:34 /tmp/checkpoint4/c6f50081-6c31-4a5c-a1d8-afe19dcef98c

    scala> weblogrdd.checkpoint

    [root@host ~]# hdfs dfs -ls -R /tmp/checkpoint4
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    drwxr-xr-x   - root supergroup          0 2018-08-01 13:34 /tmp/checkpoint4/c6f50081-6c31-4a5c-a1d8-afe19dcef98c

    scala> weblogrdd.count
    res2: Long = 26

    [root@host ~]# hdfs dfs -ls -R /tmp/checkpoint4
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    drwxr-xr-x   - root supergroup          0 2018-08-01 13:35 /tmp/checkpoint4/c6f50081-6c31-4a5c-a1d8-afe19dcef98c
    drwxr-xr-x   - root supergroup          0 2018-08-01 13:35 /tmp/checkpoint4/c6f50081-6c31-4a5c-a1d8-afe19dcef98c/rdd-1
    -rw-r--r--   1 root supergroup        464 2018-08-01 13:35 /tmp/checkpoint4/c6f50081-6c31-4a5c-a1d8-afe19dcef98c/rdd-1/part-00000
    -rw-r--r--   1 root supergroup        457 2018-08-01 13:35 /tmp/checkpoint4/c6f50081-6c31-4a5c-a1d8-afe19dcef98c/rdd-1/part-00001

  • 相关阅读:
    C语言实现的单链表
    单链表创建链表出现问题
    Windows10更新后出现右击文件卡死
    顺序表的错误
    XML 字符串 转 JSON
    xml to json
    excel xlsx-js 细节链接
    关于javasciprt导出excel 一
    关于javasciprt导出excel 前言
    书签8
  • 原文地址:https://www.cnblogs.com/playforever/p/9394823.html
Copyright © 2020-2023  润新知