    Spark SQL(5) CacheManager

    After Spark SQL produces the analyzed plan, the plan goes through a withCachedData step: each subtree is matched against the cached logical plans, and any subtree whose query result equals that of a cached plan is replaced with the cached node. This happens in QueryExecution.withCachedData:

    lazy val withCachedData: LogicalPlan = {
        assertAnalyzed()
        assertSupported()
        sparkSession.sharedState.cacheManager.useCachedData(analyzed)
      }
    
     def useCachedData(plan: LogicalPlan): LogicalPlan = {
        val newPlan = plan transformDown {
          // Do not lookup the cache by hint node. Hint node is special, we should ignore it when
          // canonicalizing plans, so that plans which are same except hint can hit the same cache.
          // However, we also want to keep the hint info after cache lookup. Here we skip the hint
          // node, so that the returned caching plan won't replace the hint node and drop the hint info
          // from the original plan.
          case hint: ResolvedHint => hint
    
          case currentFragment =>
            lookupCachedData(currentFragment)
              .map(_.cachedRepresentation.withOutput(currentFragment.output))
              .getOrElse(currentFragment)
        }
    
        newPlan transformAllExpressions {
          case s: SubqueryExpression => s.withNewPlan(useCachedData(s.plan))
        }
      }
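
    To see this rewrite from the outside, here is a minimal sketch (hypothetical setup; any application or spark-shell with a SparkSession should work):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("cache-hit-demo").getOrCreate()

    val base = spark.range(10)
    base.persist()   // cacheQuery registers (logicalPlan, InMemoryRelation) in cachedData
    base.count()     // an action materializes the cached data

    // A new query containing an equivalent subtree is rewritten by useCachedData:
    val q = spark.range(10).filter("id > 5")
    q.explain()      // the physical plan scans InMemoryTableScan instead of Range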

    The core of this is the CacheManager.lookupCachedData method:

     def lookupCachedData(plan: LogicalPlan): Option[CachedData] = readLock {
        cachedData.asScala.find(cd => plan.sameResult(cd.plan))
      }
    
    private val cachedData = new java.util.LinkedList[CachedData]
    
    case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation)

    As the code above shows, CacheManager keeps a linked list of CachedData entries, each pairing a LogicalPlan with an InMemoryRelation (a leaf node), so that at execution time the cached result can be substituted directly into the plan.
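
    The match in lookupCachedData relies on LogicalPlan.sameResult, which compares canonicalized plans, so two plans built independently (and therefore carrying different expression IDs) can still hit the same cache entry. A quick sketch, reusing the spark session from the snippet above:

    // Two independently built but semantically identical analyzed plans:
    val p1 = spark.range(100).filter("id % 2 = 0").queryExecution.analyzed
    val p2 = spark.range(100).filter("id % 2 = 0").queryExecution.analyzed

    // Expression IDs differ, but the canonicalized forms are equal:
    println(p1.sameResult(p2))  // true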

    This raises a question: when are entries put into this list? It happens when you call persist on a Dataset:

    def cacheQuery(
          query: Dataset[_],
          tableName: Option[String] = None,
          storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
        val planToCache = query.logicalPlan
        if (lookupCachedData(planToCache).nonEmpty) {
          logWarning("Asked to cache already cached data.")
        } else {
          val sparkSession = query.sparkSession
          val inMemoryRelation = InMemoryRelation(
            sparkSession.sessionState.conf.useCompression,
            sparkSession.sessionState.conf.columnBatchSize, storageLevel,
            sparkSession.sessionState.executePlan(AnalysisBarrier(planToCache)).executedPlan,
            tableName,
            planToCache.stats)
          cachedData.add(CachedData(planToCache, inMemoryRelation))
        }
      }
    
     def persist(newLevel: StorageLevel): this.type = {
        sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
        this
      }

    In short, useCachedData walks the plan top-down (transformDown), compares each subtree against the logical plans cached in cachedData, and replaces the whole subtree with the cached representation when a match is found.
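
    One consequence of cacheQuery checking lookupCachedData first is that persisting a second Dataset with the same query result does not add a duplicate entry; a small sketch:

    val d1 = spark.range(100).selectExpr("id * 2 AS doubled")
    d1.persist()

    // Same query result, so lookupCachedData is non-empty: cacheQuery only logs
    // "Asked to cache already cached data." and leaves cachedData unchanged.
    val d2 = spark.range(100).selectExpr("id * 2 AS doubled")
    d2.persist()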
