    Spark SQL(5) CacheManager

    After Spark SQL produces the analyzed plan, the plan goes through a withCachedData step: each subtree is matched against the cached logical plans, and any subtree whose query result equals that of a cached plan is replaced with the cached node. This happens in QueryExecution.withCachedData:

    lazy val withCachedData: LogicalPlan = {
        assertAnalyzed()
        assertSupported()
        sparkSession.sharedState.cacheManager.useCachedData(analyzed)
      }
    
     def useCachedData(plan: LogicalPlan): LogicalPlan = {
        val newPlan = plan transformDown {
          // Do not lookup the cache by hint node. Hint node is special, we should ignore it when
          // canonicalizing plans, so that plans which are same except hint can hit the same cache.
          // However, we also want to keep the hint info after cache lookup. Here we skip the hint
          // node, so that the returned caching plan won't replace the hint node and drop the hint info
          // from the original plan.
          case hint: ResolvedHint => hint
    
          case currentFragment =>
            lookupCachedData(currentFragment)
              .map(_.cachedRepresentation.withOutput(currentFragment.output))
              .getOrElse(currentFragment)
        }
    
        newPlan transformAllExpressions {
          case s: SubqueryExpression => s.withNewPlan(useCachedData(s.plan))
        }
      }
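
    To see this rewrite from the outside, here is a minimal sketch (hypothetical setup; any application or spark-shell with a SparkSession should work):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("cache-hit-demo").getOrCreate()

    val base = spark.range(10)
    base.persist()   // cacheQuery registers (logicalPlan, InMemoryRelation) in cachedData
    base.count()     // an action materializes the cached data

    // A new query containing an equivalent subtree is rewritten by useCachedData:
    val q = spark.range(10).filter("id > 5")
    q.explain()      // the physical plan scans InMemoryTableScan instead of Range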

    The core of this is the CacheManager.lookupCachedData method:

     def lookupCachedData(plan: LogicalPlan): Option[CachedData] = readLock {
        cachedData.asScala.find(cd => plan.sameResult(cd.plan))
      }
    
    private val cachedData = new java.util.LinkedList[CachedData]
    
    case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation)

    As the code above shows, CacheManager keeps a linked list of CachedData entries, each pairing a LogicalPlan with an InMemoryRelation (a leaf node), so that at execution time the cached result can be substituted directly into the plan.
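
    The match in lookupCachedData relies on LogicalPlan.sameResult, which compares canonicalized plans, so two plans built independently (and therefore carrying different expression IDs) can still hit the same cache entry. A quick sketch, reusing the spark session from the snippet above:

    // Two independently built but semantically identical analyzed plans:
    val p1 = spark.range(100).filter("id % 2 = 0").queryExecution.analyzed
    val p2 = spark.range(100).filter("id % 2 = 0").queryExecution.analyzed

    // Expression IDs differ, but the canonicalized forms are equal:
    println(p1.sameResult(p2))  // true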

    This raises a question: when are entries put into this list? It happens when you call persist on a Dataset:

    def cacheQuery(
          query: Dataset[_],
          tableName: Option[String] = None,
          storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
        val planToCache = query.logicalPlan
        if (lookupCachedData(planToCache).nonEmpty) {
          logWarning("Asked to cache already cached data.")
        } else {
          val sparkSession = query.sparkSession
          val inMemoryRelation = InMemoryRelation(
            sparkSession.sessionState.conf.useCompression,
            sparkSession.sessionState.conf.columnBatchSize, storageLevel,
            sparkSession.sessionState.executePlan(AnalysisBarrier(planToCache)).executedPlan,
            tableName,
            planToCache.stats)
          cachedData.add(CachedData(planToCache, inMemoryRelation))
        }
      }
    
     def persist(newLevel: StorageLevel): this.type = {
        sparkSession.sharedState.cacheManager.cacheQuery(this, None, newLevel)
        this
      }

    In short, useCachedData walks the plan top-down (transformDown), compares each subtree against the logical plans cached in cachedData, and replaces the whole subtree with the cached representation when a match is found.
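
    One consequence of cacheQuery checking lookupCachedData first is that persisting a second Dataset with the same query result does not add a duplicate entry; a small sketch:

    val d1 = spark.range(100).selectExpr("id * 2 AS doubled")
    d1.persist()

    // Same query result, so lookupCachedData is non-empty: cacheQuery only logs
    // "Asked to cache already cached data." and leaves cachedData unchanged.
    val d2 = spark.range(100).selectExpr("id * 2 AS doubled")
    d2.persist()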
