第三篇：Spark SQL Catalyst源码分析之Analyzer

第三篇：Spark SQL Catalyst源码分析之Analyzer
/** Spark SQL源码分析系列文章*/

前面几篇文章讲解了Spark SQL的核心执行流程和Spark SQL的Catalyst框架的Sql Parser是怎样接受用户输入sql，经过解析生成Unresolved Logical Plan的。我们记得Spark SQL的执行流程中另一个核心的组件式Analyzer，本文将会介绍Analyzer在Spark SQL里起到了什么作用。

Analyzer位于Catalyst的analysis package下，主要职责是将Sql Parser 未能Resolved的Logical Plan 给Resolved掉。

一、Analyzer构造

Analyzer会使用Catalog和FunctionRegistry将UnresolvedAttribute和UnresolvedRelation转换为catalyst里全类型的对象。

Analyzer里面有fixedPoint对象，一个Seq[Batch].
[java] view plain copy
1. class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Boolean)
2. extends RuleExecutor[LogicalPlan] with HiveTypeCoercion {
4. // TODO: pass this in as a parameter.
5. val fixedPoint = FixedPoint(100)
7. val batches: Seq[Batch] = Seq(
8. Batch("MultiInstanceRelations", Once,
9. NewRelationInstances),
10. Batch("CaseInsensitiveAttributeReferences", Once,
11. (if (caseSensitive) Nil else LowercaseAttributeReferences :: Nil) : _*),
12. Batch("Resolution", fixedPoint,
13. ResolveReferences ::
14. ResolveRelations ::
15. NewRelationInstances ::
16. ImplicitGenerate ::
17. StarExpansion ::
18. ResolveFunctions ::
19. GlobalAggregates ::
20. typeCoercionRules :_*),
21. Batch("AnalysisOperators", fixedPoint,
22. EliminateAnalysisOperators)
23. )
Analyzer里的一些对象解释：

FixedPoint：相当于迭代次数的上限。
[java] view plain copy
1. /** A strategy that runs until fix point or maxIterations times, whichever comes first. */
2. case class FixedPoint(maxIterations: Int) extends Strategy
Batch: 批次，这个对象是由一系列Rule组成的，采用一个策略（策略其实是迭代几次的别名吧，eg：Once）
[java] view plain copy
1. /** A batch of rules. */，
2. protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)
Rule：理解为一种规则，这种规则会应用到Logical Plan 从而将UnResolved 转变为Resolved
[java] view plain copy
1. abstract class Rule[TreeType <: TreeNode[_]] extends Logging {
3. /** Name for this rule, automatically inferred based on class name. */
4. val ruleName: String = {
5. val className = getClass.getName
6. if (className endsWith "$") className.dropRight(1) else className
7. }
9. def apply(plan: TreeType): TreeType
10. }
Strategy：最大的执行次数，如果执行次数在最大迭代次数之前就达到了fix point，策略就会停止，不再应用了。
[java] view plain copy
1. /**
2. * An execution strategy for rules that indicates the maximum number of executions. If the
3. * execution reaches fix point (i.e. converge) before maxIterations, it will stop.
4. */
5. abstract class Strategy { def maxIterations: Int }
Analyzer解析主要是根据这些Batch里面定义的策略和Rule来对Unresolved的逻辑计划进行解析的。

这里Analyzer类本身并没有定义执行的方法，而是要从它的父类RuleExecutor[LogicalPlan]寻找，Analyzer也实现了HiveTypeCosercion，这个类是参考Hive的类型自动兼容转换的原理。如图：

RuleExecutor：执行Rule的执行环境，它会将包含了一系列的Rule的Batch进行执行，这个过程都是串行的。

具体的执行方法定义在apply里：

可以看到这里是一个while循环，每个batch下的rules都对当前的plan进行作用，这个过程是迭代的，直到达到Fix Point或者最大迭代次数。
[java] view plain copy
1. def apply(plan: TreeType): TreeType = {
2. var curPlan = plan
4. batches.foreach { batch =>
5. val batchStartPlan = curPlan
6. var iteration = 1
7. var lastPlan = curPlan
8. var continue = true
10. // Run until fix point (or the max number of iterations as specified in the strategy.
11. while (continue) {
12. curPlan = batch.rules.foldLeft(curPlan) {
13. case (plan, rule) =>
14. val result = rule(plan) //这里将调用各个不同Rule的apply方法，将UnResolved Relations，Attrubute和Function进行Resolve
15. if (!result.fastEquals(plan)) {
16. logger.trace(
17. s"""
18. |=== Applying Rule ${rule.ruleName} ===
19. |${sideBySide(plan.treeString, result.treeString).mkString(" ")}
20. """.stripMargin)
21. }
23. result //返回作用后的result plan
24. }
25. iteration += 1
26. if (iteration > batch.strategy.maxIterations) { //如果迭代次数已经大于该策略的最大迭代次数，就停止循环
27. logger.info(s"Max iterations ($iteration) reached for batch ${batch.name}")
28. continue = false
29. }
31. if (curPlan.fastEquals(lastPlan)) { //如果在多次迭代中不再变化，因为plan有个unique id，就停止循环。
32. logger.trace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.")
33. continue = false
34. }
35. lastPlan = curPlan
36. }
38. if (!batchStartPlan.fastEquals(curPlan)) {
39. logger.debug(
40. s"""
41. |=== Result of Batch ${batch.name} ===
42. |${sideBySide(plan.treeString, curPlan.treeString).mkString(" ")}
43. """.stripMargin)
44. } else {
45. logger.trace(s"Batch ${batch.name} has no effect.")
46. }
47. }
49. curPlan //返回Resolved的Logical Plan
50. }
二、Rules介绍

目前Spark SQL 1.0.0的Rule都定义在了Analyzer.scala的内部类。

在batches里面定义了4个Batch。

MultiInstanceRelations、CaseInsensitiveAttributeReferences、Resolution、AnalysisOperators 四个。

这4个Batch是将不同的Rule进行归类，每种类别采用不同的策略来进行Resolve。

2.1、MultiInstanceRelation

如果一个实例在Logical Plan里出现了多次，则会应用NewRelationInstances这儿Rule
[java] view plain copy

Batch("MultiInstanceRelations", Once,

NewRelationInstances)
[java] view plain copy

trait MultiInstanceRelation {

  def newInstance: this.type

}
[java] view plain copy

object NewRelationInstances extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = {

    val localRelations = plan collect { case l: MultiInstanceRelation => l} //将logical plan应用partial function得到所有MultiInstanceRelation的plan的集合

    val multiAppearance = localRelations

      .groupBy(identity[MultiInstanceRelation]) //group by操作

      .filter { case (_, ls) => ls.size > 1 } //如果只取size大于1的进行后续操作

      .map(_._1)

      .toSet



    //更新plan，使得每个实例的expId是唯一的。

    plan transform {

      case l: MultiInstanceRelation if multiAppearance contains l => l.newInstance

    }

  }

}
2.2、LowercaseAttributeReferences

同样是partital function，对当前plan应用，将所有匹配的如UnresolvedRelation的别名alise转换为小写，将Subquery的别名也转换为小写。

总结：这是一个使属性名大小写不敏感的Rule，因为它将所有属性都to lower case了。
[java] view plain copy

object LowercaseAttributeReferences extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {

    case UnresolvedRelation(databaseName, name, alias) =>

      UnresolvedRelation(databaseName, name, alias.map(_.toLowerCase))

    case Subquery(alias, child) => Subquery(alias.toLowerCase, child)

    case q: LogicalPlan => q transformExpressions {

      case s: Star => s.copy(table = s.table.map(_.toLowerCase))

      case UnresolvedAttribute(name) => UnresolvedAttribute(name.toLowerCase)

      case Alias(c, name) => Alias(c, name.toLowerCase)()

      case GetField(c, name) => GetField(c, name.toLowerCase)

    }

  }

}
2.3、ResolveReferences
将Sql parser解析出来的UnresolvedAttribute全部都转为对应的实际的catalyst.expressions.AttributeReference AttributeReferences

这里调用了logical plan 的resolve方法，将属性转为NamedExepression。
[java] view plain copy

object ResolveReferences extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

    case q: LogicalPlan if q.childrenResolved =>

      logger.trace(s"Attempting to resolve ${q.simpleString}")

      q transformExpressions {

        case u @ UnresolvedAttribute(name) =>

          // Leave unchanged if resolution fails.  Hopefully will be resolved next round.

          val result = q.resolve(name).getOrElse(u)//转化为NamedExpression

          logger.debug(s"Resolving $u to $result")

          result

      }

  }

}
2.4、 ResolveRelations
这个比较好理解，还记得前面Sql parser吗，比如select * from src，这个src表parse后就是一个UnresolvedRelation节点。

这一步ResolveRelations调用了catalog这个对象。Catalog对象里面维护了一个tableName,Logical Plan的HashMap结果。

通过这个Catalog目录来寻找当前表的结构，从而从中解析出这个表的字段，如UnResolvedRelations 会得到一个tableWithQualifiers。（即表和字段）

这也解释了为什么流程图那，我会画一个catalog在上面，因为它是Analyzer工作时需要的meta data。
[java] view plain copy

object ResolveRelations extends Rule[LogicalPlan] {

    def apply(plan: LogicalPlan): LogicalPlan = plan transform {

      case UnresolvedRelation(databaseName, name, alias) =>

        catalog.lookupRelation(databaseName, name, alias)

    }

  }
2.5、ImplicitGenerate
如果在select语句里只有一个表达式，而且这个表达式是一个Generator（Generator是一个1条记录生成到N条记录的映射）

当在解析逻辑计划时，遇到Project节点的时候，就可以将它转换为Generate类（Generate类是将输入流应用一个函数，从而生成一个新的流）。
[java] view plain copy

object ImplicitGenerate extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {

    case Project(Seq(Alias(g: Generator, _)), child) =>

      Generate(g, join = false, outer = false, None, child)

  }

}
2.6 StarExpansion
在Project操作符里，如果是*符号，即select * 语句，可以将所有的references都展开，即将select * 中的*展开成实际的字段。
[java] view plain copy

  object StarExpansion extends Rule[LogicalPlan] {

    def apply(plan: LogicalPlan): LogicalPlan = plan transform {

      // Wait until children are resolved

      case p: LogicalPlan if !p.childrenResolved => p

      // If the projection list contains Stars, expand it.

      case p @ Project(projectList, child) if containsStar(projectList) =>

        Project(

          projectList.flatMap {

            case s: Star => s.expand(child.output) //展开，将输入的Attributeexpand(input: Seq[Attribute]) 转化为Seq[NamedExpression]

            case o => o :: Nil

          },

          child)

      case t: ScriptTransformation if containsStar(t.input) =>

        t.copy(

          input = t.input.flatMap {

            case s: Star => s.expand(t.child.output)

            case o => o :: Nil

          }

        )

      // If the aggregate function argument contains Stars, expand it.

      case a: Aggregate if containsStar(a.aggregateExpressions) =>

        a.copy(

          aggregateExpressions = a.aggregateExpressions.flatMap {

            case s: Star => s.expand(a.child.output)

            case o => o :: Nil

          }

        )

    }

    /**

     * Returns true if `exprs` contains a [[Star]].

     */

    protected def containsStar(exprs: Seq[Expression]): Boolean =

      exprs.collect { case _: Star => true }.nonEmpty

  }

}
2.7 ResolveFunctions
这个和ResolveReferences差不多，这里主要是对udf进行resolve。

将这些UDF都在FunctionRegistry里进行查找。
[java] view plain copy

object ResolveFunctions extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {

    case q: LogicalPlan =>

      q transformExpressions {

        case u @ UnresolvedFunction(name, children) if u.childrenResolved =>

          registry.lookupFunction(name, children) //看是否注册了当前udf

      }

  }

}
2.8 GlobalAggregates
全局的聚合，如果遇到了Project就返回一个Aggregate.
[java] view plain copy

object GlobalAggregates extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {

    case Project(projectList, child) if containsAggregates(projectList) =>

      Aggregate(Nil, projectList, child)

  }



  def containsAggregates(exprs: Seq[Expression]): Boolean = {

    exprs.foreach(_.foreach {

      case agg: AggregateExpression => return true

      case _ =>

    })

    false

  }

}
2.9 typeCoercionRules

这个是Hive里的兼容SQL语法，比如将String和Int互相转换，不需要显示的调用cast xxx as yyy了。如StringToIntegerCasts。
[java] view plain copy

val typeCoercionRules =

  PropagateTypes ::

  ConvertNaNs ::

  WidenTypes ::

  PromoteStrings ::

  BooleanComparisons ::

  BooleanCasts ::

  StringToIntegralCasts ::

  FunctionArgumentConversion ::

  CastNulls ::

  Nil
2.10 EliminateAnalysisOperators

将分析的操作符移除，这里仅支持2种，一种是Subquery需要移除，一种是LowerCaseSchema。这些节点都会从Logical Plan里移除。
[java] view plain copy

object EliminateAnalysisOperators extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {

    case Subquery(_, child) => child //遇到Subquery,不反悔本身，返回它的Child，即删除了该元素

    case LowerCaseSchema(child) => child

  }

}
三、实践

补充昨天DEBUG的一个例子，这个例子证实了如何将上面的规则应用到Unresolved Logical Plan：

当传递sql语句的时候，的确调用了ResolveReferences将mobile解析成NamedExpression。

可以对照这看执行流程，左边是Unresolved Logical Plan，右边是Resoveld Logical Plan。

先是执行了Batch Resolution，eg: 调用ResovelRalation这个RUle来使 Unresovled Relation 转化为 SparkLogicalPlan并通过Catalog找到了其对于的字段属性。

然后执行了Batch Analysis Operator。eg：调用EliminateAnalysisOperators来将SubQuery给remove掉了。

可能格式显示的不太好，可以向右边拖动下滚动轴看下结果。：）
[java] view plain copy

val exec = sqlContext.sql("select mobile as mb, sid as id, mobile*2 multi2mobile, count(1) times from (select * from temp_shengli_mobile)a where pfrom_id=0.0 group by mobile, sid,  mobile*2")

14/07/21 18:23:32 DEBUG SparkILoop$SparkILoopInterpreter: Invoking: public static java.lang.String $line47.$eval.$print()

14/07/21 18:23:33 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations

14/07/21 18:23:33 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences

14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'pfrom_id to pfrom_id#5

14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2

14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'sid to sid#1

14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2

14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2

14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'sid to sid#1

14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2

14/07/21 18:23:33 DEBUG Analyzer:

=== Result of Batch Resolution ===

!Aggregate ['mobile,'sid,('mobile * 2) AS c2#27], ['mobile AS mb#23,'sid AS id#24,('mobile * 2) AS multi2mobile#25,COUNT(1) AS times#26L]   Aggregate [mobile#2,sid#1,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS c2#27], [mobile#2 AS mb#23,sid#1 AS id#24,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS multi2mobile#25,COUNT(1) AS times#26L]

! Filter ('pfrom_id = 0.0)                                                                                                                   Filter (CAST(pfrom_id#5, DoubleType) = 0.0)

   Subquery a                                                                                                                                 Subquery a

!   Project [*]                                                                                                                                Project [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12]

!    UnresolvedRelation None, temp_shengli_mobile, None                                                                                         Subquery temp_shengli_mobile

!                                                                                                                                                SparkLogicalPlan (ExistingRdd [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174)



14/07/21 18:23:33 DEBUG Analyzer:

=== Result of Batch AnalysisOperators ===

!Aggregate ['mobile,'sid,('mobile * 2) AS c2#27], ['mobile AS mb#23,'sid AS id#24,('mobile * 2) AS multi2mobile#25,COUNT(1) AS times#26L]   Aggregate [mobile#2,sid#1,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS c2#27], [mobile#2 AS mb#23,sid#1 AS id#24,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS multi2mobile#25,COUNT(1) AS times#26L]

! Filter ('pfrom_id = 0.0)                                                                                                                   Filter (CAST(pfrom_id#5, DoubleType) = 0.0)

!  Subquery a                                                                                                                                 Project [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12]

!   Project [*]                                                                                                                                SparkLogicalPlan (ExistingRdd [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174)

!    UnresolvedRelation None, temp_shengli_mobile, None
四、总结
本文从源代码角度分析了Analyzer在对Sql Parser解析出的UnResolve Logical Plan 进行analyze的过程中，所执行的流程。

流程是实例化一个SimpleAnalyzer，定义一些Batch，然后遍历这些Batch在RuleExecutor的环境下，执行Batch里面的Rules，每个Rule会对Unresolved Logical Plan进行Resolve，有些可能不能一次解析出，需要多次迭代，直到达到max迭代次数或者达到fix point。这里Rule里比较常用的就是ResolveReferences、ResolveRelations、StarExpansion、GlobalAggregates、typeCoercionRules和EliminateAnalysisOperators。

——EOF——

原创文章，转载请注明：

转载自：OopsOutOfMemory盛利的Blog，作者： OopsOutOfMemory

本文链接地址：http://blog.csdn.net/oopsoom/article/details/38025185

注：本文基于署名-非商业性使用-禁止演绎 2.5 中国大陆(CC BY-NC-ND 2.5 CN)协议，欢迎转载、转发和评论，但是请保留本文作者署名和文章链接。如若需要用于商业目的或者与授权方面的协商，请联系我。

转自：http://blog.csdn.net/oopsoom/article/details/38025185
相关阅读:
ParksLink修改密码
 ORA-01940:无法删除当前已链接的用户
 imp导入数据的时候报错：ORA-01658: 无法为表空间 MAXDATA 中的段创建 INITIAL 区
 Linux下查看日志用到的常用命令
 大批量数据高效插入数据库表
 线程中断：Thread类中interrupt（）、interrupted（）和 isInterrupted（）方法详解
 CyclicBarrier、CountDownLatch、Callable、FutureTask、thread.join() 、wait()、notify()、Condition
Mysql全文索引
 Docker 镜像的常用操作
 Docker 入门
原文地址：https://www.cnblogs.com/sh425/p/7596393.html