• [Original] Uncle's Experience Sharing (39): Spark cache unpersist cascading behavior


    Problem: in Spark, suppose there are two DataFrames (or Datasets) where DataFrameA depends on DataFrameB, and both are cached. After DataFrameB is unpersisted, DataFrameA's cache is invalidated as well. The official explanation is as follows:

    When invalidating a cache, we invalidate other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified, or the table itself has been dropped, all caches that use this table should be invalidated or refreshed.

    However, in other cases, such as when the user simply wants to drop a cache to free up memory, we do not need to invalidate dependent caches since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: non-cascading cache invalidation.

    Previously the only (default) mode was the regular mode: to guarantee that cached data is up to date (not stale), unpersisting a cache cascades, i.e. all other caches that depend on it, directly or indirectly, are cleared as well.
    Starting with Spark 2.4, a new mode was introduced: the non-cascading mode, in which unpersisting a cache does not cascade to dependent caches.
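
    A minimal sketch of the scenario described above (assuming a local SparkSession; the data and column names are made up for illustration):

        import org.apache.spark.sql.SparkSession

        object CacheCascadeSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("cache-unpersist-cascade")
              .master("local[*]")
              .getOrCreate()
            import spark.implicits._

            // dfB: the base Dataset, cached and materialized
            val dfB = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name").cache()
            dfB.count()

            // dfA: derived from dfB, also cached and materialized
            val dfA = dfB.filter($"id" > 1).cache()
            dfA.count()

            // Unpersist the base Dataset.
            // Before Spark 2.4 (regular mode) this also cleared dfA's cache;
            // from Spark 2.4 on, Dataset.unpersist() is non-cascading, so dfA stays cached.
            dfB.unpersist()

            // StorageLevel.NONE means dfA's cache is gone; any other level means it is still registered.
            println(s"dfA storage level after dfB.unpersist(): ${dfA.storageLevel}")

            spark.stop()
          }
        }

    Under the cascading (pre-2.4) behavior the printed level is expected to be StorageLevel.NONE; with non-cascading invalidation it should still report the MEMORY_AND_DISK level.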

    Note that cache() on a DataFrame/Dataset defaults to the MEMORY_AND_DISK storage level; unless you explicitly persist with a memory-only level and have confirmed the data fits in memory, unpersisting earlier caches just to free memory does not look necessary.
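
    For reference, a small sketch of the default level versus an explicitly memory-only one (a throwaway example, not tied to the original post's data):

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.storage.StorageLevel

        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq(1, 2, 3).toDF("id")

        // cache() / persist() with no arguments use MEMORY_AND_DISK for DataFrame/Dataset,
        // so partitions that do not fit in memory spill to disk instead of being dropped.
        val byDefault = df.cache()
        println(byDefault.storageLevel)      // the MEMORY_AND_DISK level

        // Only when a memory-only level is chosen explicitly (and the data is known to fit)
        // does unpersisting earlier caches to free memory become a real concern.
        val memoryOnly = df.select(($"id" * 2).as("doubled")).persist(StorageLevel.MEMORY_ONLY)
        println(memoryOnly.storageLevel)     // the MEMORY_ONLY level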

    References:
    https://issues.apache.org/jira/browse/SPARK-21478
    https://issues.apache.org/jira/browse/SPARK-24596
    https://issues.apache.org/jira/browse/SPARK-21579

  • Original post: https://www.cnblogs.com/barneywill/p/10524805.html