• How to extend Spark Catalyst: capture Spark SQL statements, post SQL events through the listenerBus, and build a custom Spark SQL engine


    1. Spark Catalyst Extension Points

    The Catalyst extension points were introduced in SPARK-18127. They let Spark users plug custom implementations into each phase of SQL processing, which is both powerful and efficient. Catalyst Optimizer is the core component (query optimizer) of Spark SQL: it converts SQL statements into physical execution plans, and its quality determines how well SQL executes. The query optimizer is the heart of any SQL engine. Common open-source optimizers include Apache Calcite (adopted by many projects such as Hive, Phoenix, and Drill) and Orca (used in HAWQ/Greenplum).

    2. SparkSessionExtensions

    SparkSessionExtensions holds all user-defined extension rules in its member variables, and for each processing phase it exposes a dedicated registration interface.
    API documentation: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSessionExtensions.html

    SparkSessionExtensions
    class SparkSessionExtensions extends AnyRef
    Experimental, Developer API
    Holder for injection points to the SparkSession. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
    
    This currently provides the following extension points:
    
    Analyzer Rules.
    Check Analysis Rules.
    Optimizer Rules.
    Pre CBO Rules.
    Planning Strategies.
    Customized Parser.
    (External) Catalog listeners.
    Columnar Rules.
    Adaptive Query Stage Preparation Rules.
    The extensions can be used by calling withExtensions on the SparkSession.Builder, for example:
    
    SparkSession.builder()
      .master("...")
      .config("...", true)
      .withExtensions { extensions =>
        extensions.injectResolutionRule { session =>
          ...
        }
        extensions.injectParser { (session, parser) =>
          ...
        }
      }
      .getOrCreate()
    The extensions can also be used by setting the Spark SQL configuration property spark.sql.extensions. Multiple extensions can be set using a comma-separated list. For example:
    
    SparkSession.builder()
      .master("...")
      .config("spark.sql.extensions", "org.example.MyExtensions,org.example.YourExtensions")
      .getOrCreate()
    
    class MyExtensions extends Function1[SparkSessionExtensions, Unit] {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        extensions.injectResolutionRule { session =>
          ...
        }
        extensions.injectParser { (session, parser) =>
          ...
        }
      }
    }
    
    class YourExtensions extends SparkSessionExtensionsProvider {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        extensions.injectResolutionRule { session =>
          ...
        }
        extensions.injectFunction(...)
      }
    }
    Note that none of the injected builders should assume that the SparkSession is fully initialized and should not touch the session's internals (e.g. the SessionState).
    
    Annotations
    @DeveloperApi() @Experimental() @Unstable()
    Source
    SparkSessionExtensions.scala
    
    Instance Constructors
    new SparkSessionExtensions()
    Type Members
    type CheckRuleBuilder = (SparkSession) ⇒ (LogicalPlan) ⇒ Unit
    type ColumnarRuleBuilder = (SparkSession) ⇒ ColumnarRule
    type FunctionDescription = (FunctionIdentifier, ExpressionInfo, FunctionBuilder)
    type ParserBuilder = (SparkSession, ParserInterface) ⇒ ParserInterface
    type QueryStagePrepRuleBuilder = (SparkSession) ⇒ Rule[SparkPlan]
    type RuleBuilder = (SparkSession) ⇒ Rule[LogicalPlan]
    type StrategyBuilder = (SparkSession) ⇒ Strategy
    type TableFunctionDescription = (FunctionIdentifier, ExpressionInfo, TableFunctionBuilder)
    Value Members
    def injectCheckRule(builder: CheckRuleBuilder): Unit
    Inject a check analysis Rule builder into the SparkSession.
    
    def injectColumnar(builder: ColumnarRuleBuilder): Unit
    Inject a rule that can override the columnar execution of an executor.
    
    def injectFunction(functionDescription: FunctionDescription): Unit
    Injects a custom function into the org.apache.spark.sql.catalyst.analysis.FunctionRegistry at runtime for all sessions.
    
    def injectOptimizerRule(builder: RuleBuilder): Unit
    Inject an optimizer Rule builder into the SparkSession.
    
    def injectParser(builder: ParserBuilder): Unit
    Inject a custom parser into the SparkSession.
    
    def injectPlannerStrategy(builder: StrategyBuilder): Unit
    Inject a planner Strategy builder into the SparkSession.
    
    def injectPostHocResolutionRule(builder: RuleBuilder): Unit
    Inject an analyzer Rule builder into the SparkSession.
    
    def injectPreCBORule(builder: RuleBuilder): Unit
    Inject an optimizer Rule builder that rewrites logical plans into the SparkSession.
    
    def injectQueryStagePrepRule(builder: QueryStagePrepRuleBuilder): Unit
    Inject a rule that can override the query stage preparation phase of adaptive query execution.
    
    def injectResolutionRule(builder: RuleBuilder): Unit
    Inject an analyzer resolution Rule builder into the SparkSession.
    
    def injectTableFunction(functionDescription: TableFunctionDescription): Unit
    Injects a custom function into the org.apache.spark.sql.catalyst.analysis.TableFunctionRegistry at runtime for all sessions.

    Please credit the author when reposting: 张永清, 博客园: https://www.cnblogs.com/laoqing/p/16351482.html

    2.1 Adding Custom Rules

    Users add new custom rules through the inject-prefixed methods on SparkSessionExtensions. The available inject interfaces are:

    • injectOptimizerRule – add a custom optimizer rule; the optimizer is responsible for optimizing the logical plan.
    • injectParser – add a custom parser; the parser is responsible for SQL parsing.
    • injectPlannerStrategy – add a custom planner strategy; the planner generates the physical execution plan.
    • injectResolutionRule – add a custom Analyzer rule to the Resolution phase; the analyzer produces the resolved logical plan.
    • injectPostHocResolutionRule – add a custom Analyzer rule to the Post Resolution phase.
    • injectCheckRule – add a custom Analyzer check rule.

    Catalyst processes SQL in several steps: parser, analyzer, optimizer, and planner. Steps such as the analyzer and optimizer are themselves split into multiple phases. Taking the Analyzer as an example, its rules are grouped into batches, and different batches may use different execution strategies: some run only once, while others iterate until a given condition is satisfied.
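    As a concrete illustration of the injectOptimizerRule hook listed above, here is a minimal sketch of a custom optimizer rule. The class names (PlanLoggingRule, MyOptimizerExtension) are made up for this example; only the Spark types and the injectOptimizerRule call are real API. The rule is a no-op: it logs each logical plan it sees and returns it unchanged.

    ```scala
    import org.apache.spark.sql.SparkSessionExtensions
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical no-op optimizer rule: logs the plan tree and leaves it unchanged.
    // Rule[LogicalPlan] mixes in Logging, so logInfo is available.
    case class PlanLoggingRule() extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = {
        logInfo(s"optimizer saw plan:\n${plan.treeString}")
        plan
      }
    }

    // Extension class suitable for spark.sql.extensions or withExtensions.
    class MyOptimizerExtension extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        extensions.injectOptimizerRule { session => PlanLoggingRule() }
      }
    }
    ```

    Because the rule returns its input unmodified, it is safe to run in any optimizer batch, including ones that iterate to a fixed point.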

    2.2 Retrieving Custom Rules

    For each rule type, SparkSessionExtensions also provides a build-prefixed method that returns the registered rules of that type. When a Spark session is initialized, it calls these methods and hands the custom rules to the parser, analyzer, optimizer, and planner objects:

    • buildOptimizerRules
    • buildParser
    • buildPlannerStrategies
    • buildResolutionRules
    • buildPostHocResolutionRules
    • buildCheckRules

    2.3 Enabling Custom Rules

    In Spark, user-defined rules can be activated in two ways:

    • Via the withExtensions method on SparkSession.Builder. withExtensions is a higher-order function that takes a user-supplied function whose parameter is a SparkSessionExtensions; inside that function, register your rules through the inject-prefixed methods.
    • Via the Spark configuration property spark.sql.extensions. Wrap the function from option 1 in a class and pass its fully qualified class name as the property value.

    For concrete usage, see the Spark test cases in org.apache.spark.sql.SparkSessionExtensionSuite.
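    With option 2, the extension can also be enabled at submit time without touching application code. A sketch (the class name, jar names, and application entry point below are placeholders, not from the original):

    ```shell
    spark-submit \
      --class org.example.MyApp \
      --conf spark.sql.extensions=org.example.MyExtensions \
      --jars my-extensions.jar \
      my-app.jar
    ```

    This is convenient when the same extension (for example, an audit parser) should apply to every job on a cluster: the jar and the --conf can be injected by the platform rather than by each application.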

    3. Extending Spark Catalyst

    3.1 Posting a SQL Event through the listenerBus

    package org.apache.spark.sql.execution
    // The package matters: sparkContext.listenerBus is private[spark], so this
    // class must live under org.apache.spark to access it.
    import org.apache.spark.internal.Logging
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.catalyst.parser.ParserInterface
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.types.{DataType, StructType}
    
    case class SqlEvent(sqlText: String, sparkSession: SparkSession)
      extends org.apache.spark.scheduler.SparkListenerEvent with Logging
    
    // Wraps the delegate parser: posts a SqlEvent for every parsed statement,
    // then forwards all parsing work to the original parser.
    class MySqlParser(sparkSession: SparkSession, val delegate: ParserInterface)
      extends ParserInterface with Logging {
    
      override def parsePlan(sqlText: String): LogicalPlan = {
        logInfo("start to send SqlEvent by listenerBus, sqlText: " + sqlText)
        sparkSession.sparkContext.listenerBus.post(SqlEvent(sqlText, sparkSession))
        logInfo("send SqlEvent success by listenerBus, sqlText: " + sqlText)
        delegate.parsePlan(sqlText)
      }
    
      override def parseExpression(sqlText: String): Expression =
        delegate.parseExpression(sqlText)
    
      override def parseTableIdentifier(sqlText: String): TableIdentifier =
        delegate.parseTableIdentifier(sqlText)
    
      override def parseFunctionIdentifier(sqlText: String): FunctionIdentifier =
        delegate.parseFunctionIdentifier(sqlText)
    
      override def parseTableSchema(sqlText: String): StructType =
        delegate.parseTableSchema(sqlText)
    
      override def parseDataType(sqlText: String): DataType =
        delegate.parseDataType(sqlText)
    }
    
    import org.apache.spark.sql.SparkSessionExtensions
    
    
    class MySparkSessionExtension extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit = {
        extensions.injectParser { (session, parser) =>
          new MySqlParser(session, parser)
        }
      }
    }
    SparkSession.builder()
      .master("...")
      .config("spark.sql.extensions", "org.apache.spark.sql.execution.MySparkSessionExtension")
      .getOrCreate()
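    The SqlEvent posted above can be consumed by any registered SparkListener via its onOtherEvent callback. A minimal sketch (the listener class name is an assumption for this example):

    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
    import org.apache.spark.sql.execution.SqlEvent

    // Hypothetical listener that reacts to every SqlEvent posted by MySqlParser.
    // Custom events arrive through onOtherEvent, not a dedicated callback.
    class SqlEventListener extends SparkListener {
      override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
        case SqlEvent(sqlText, _) => println(s"captured sql: $sqlText")
        case _                    => // ignore other custom events
      }
    }
    ```

    Register it on the running context with spark.sparkContext.addSparkListener(new SqlEventListener), or through the spark.extraListeners configuration property.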

      

    3.2 Creating a Custom Parser

    import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
    import org.apache.spark.sql.catalyst.analysis.UnresolvedStar
    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.catalyst.parser.ParserInterface
    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
    import org.apache.spark.sql.types.{DataType, StructType}
    
    class StrictParser(parser: ParserInterface) extends ParserInterface {
      /**
       * Parse a string to a [[LogicalPlan]], rejecting `SELECT *`.
       */
      override def parsePlan(sqlText: String): LogicalPlan = {
        val logicalPlan = parser.parsePlan(sqlText)
        // transform is used here only for its side effect: walk the tree and
        // fail if any projection still contains an unresolved star.
        logicalPlan transform {
          case project @ Project(projectList, _) =>
            projectList.foreach { name =>
              if (name.isInstanceOf[UnresolvedStar]) {
                throw new RuntimeException("You must specify your project column set," +
                  " * is not allowed.")
              }
            }
            project
        }
        logicalPlan
      }
     
      /**
       * Parse a string to an [[Expression]].
       */
      override def parseExpression(sqlText: String): Expression = parser.parseExpression(sqlText)
     
      /**
       * Parse a string to a [[TableIdentifier]].
       */
      override def parseTableIdentifier(sqlText: String): TableIdentifier =
        parser.parseTableIdentifier(sqlText)
     
      /**
       * Parse a string to a [[FunctionIdentifier]].
       */
      override def parseFunctionIdentifier(sqlText: String): FunctionIdentifier =
        parser.parseFunctionIdentifier(sqlText)
     
      /**
       * Parse a string to a [[StructType]]. The passed SQL string should be a comma separated
       * list of field definitions which will preserve the correct Hive metadata.
       */
      override def parseTableSchema(sqlText: String): StructType =
        parser.parseTableSchema(sqlText)
     
      /**
       * Parse a string to a [[DataType]].
       */
      override def parseDataType(sqlText: String): DataType = parser.parseDataType(sqlText)
    }

    Creating the extension-point function

    type ParserBuilder = (SparkSession, ParserInterface) => ParserInterface
    type ExtensionsBuilder = SparkSessionExtensions => Unit
    val parserBuilder: ParserBuilder = (_, parser) => new StrictParser(parser)
    val extBuilder: ExtensionsBuilder = { e => e.injectParser(parserBuilder)}
    

    The extBuilder function is used when building the SparkSession. SparkSessionExtensions.injectParser is itself a higher-order function that takes parserBuilder as an argument: the native parser is passed into the custom StrictParser, and StrictParser is registered in SparkSessionExtensions as the custom parser.

    Enabling the custom parser in a SparkSession

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.master", "local[2]")
      .withExtensions(extBuilder)
      .getOrCreate()
    

    Spark 2.2 introduced these new extension points, allowing users to define their own parser, analyzer, optimizer, and physical planning strategy rules in a Spark session. Through two simple examples we have shown how to implement SQL checks and customized plan optimization via the extension points Spark provides. Catalyst's high extensibility makes it straightforward to tailor a SQL engine to a concrete use case, opening up many possibilities: implementing a specific SQL dialect, optimizing more deeply for special data sources, enforcing SQL coding standards, or applying environment-specific optimization strategies.
