• Spark SQL(6) OptimizedPlan


    Spark SQL(6) OptimizedPlan

    在这一步spark sql主要应用一些规则,优化生成的Resolved Plan,这一步涉及到的有Optimizer。

    之前介绍在sparksession实例化的是会实例化sessionState,进而确定QueryExecution、Analyzer,Optimizer也是在这一步确定的:

      protected def optimizer: Optimizer = {
        new SparkOptimizer(catalog, experimentalMethods) {
          override def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] =
            super.extendedOperatorOptimizationRules ++ customOperatorOptimizationRules
        }
      }
    

      Optimizer也是RuleExecutor的子类,而SparkOptimizer是Optimizer子类,在analyzed步骤知道,其实主要规则就是RuleExecutor子类定义的batchs的规则

    sparkOptimizer:

     override def batches: Seq[Batch] = (preOptimizationBatches ++ super.batches :+
        Batch("Optimize Metadata Only Query", Once, OptimizeMetadataOnlyQuery(catalog)) :+
        Batch("Extract Python UDF from Aggregate", Once, ExtractPythonUDFFromAggregate) :+
        Batch("Prune File Source Table Partitions", Once, PruneFileSourcePartitions) :+
        Batch("Push down operators to data source scan", Once, PushDownOperatorsToDataSource)) ++
        postHocOptimizationBatches :+
        Batch("User Provided Optimizers", fixedPoint, experimentalMethods.extraOptimizations: _*)
    

    Optimizer:

    def batches: Seq[Batch] = {
        val operatorOptimizationRuleSet =
          Seq(
            // Operator push down
            PushProjectionThroughUnion,
            ReorderJoin,
            EliminateOuterJoin,
            PushPredicateThroughJoin,
            PushDownPredicate,
            LimitPushDown,
            ColumnPruning,
            InferFiltersFromConstraints,
            // Operator combine
            CollapseRepartition,
            CollapseProject,
            CollapseWindow,
            CombineFilters,
            CombineLimits,
            CombineUnions,
            // Constant folding and strength reduction
            NullPropagation,
            ConstantPropagation,
            FoldablePropagation,
            OptimizeIn,
            ConstantFolding,
            ReorderAssociativeOperator,
            LikeSimplification,
            BooleanSimplification,
            SimplifyConditionals,
            RemoveDispensableExpressions,
            SimplifyBinaryComparison,
            PruneFilters,
            EliminateSorts,
            SimplifyCasts,
            SimplifyCaseConversionExpressions,
            RewriteCorrelatedScalarSubquery,
            EliminateSerialization,
            RemoveRedundantAliases,
            RemoveRedundantProject,
            SimplifyCreateStructOps,
            SimplifyCreateArrayOps,
            SimplifyCreateMapOps,
            CombineConcats) ++
            extendedOperatorOptimizationRules
    
        val operatorOptimizationBatch: Seq[Batch] = {
          val rulesWithoutInferFiltersFromConstraints =
            operatorOptimizationRuleSet.filterNot(_ == InferFiltersFromConstraints)
          Batch("Operator Optimization before Inferring Filters", fixedPoint,
            rulesWithoutInferFiltersFromConstraints: _*) ::
          Batch("Infer Filters", Once,
            InferFiltersFromConstraints) ::
          Batch("Operator Optimization after Inferring Filters", fixedPoint,
            rulesWithoutInferFiltersFromConstraints: _*) :: Nil
        }
    
        (Batch("Eliminate Distinct", Once, EliminateDistinct) ::
        // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
        // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
        // However, because we also use the analyzer to canonicalized queries (for view definition),
        // we do not eliminate subqueries or compute current time in the analyzer.
        Batch("Finish Analysis", Once,
          EliminateSubqueryAliases,
          EliminateView,
          ReplaceExpressions,
          ComputeCurrentTime,
          GetCurrentDatabase(sessionCatalog),
          RewriteDistinctAggregates,
          ReplaceDeduplicateWithAggregate) ::
        //////////////////////////////////////////////////////////////////////////////////////////
        // Optimizer rules start here
        //////////////////////////////////////////////////////////////////////////////////////////
        // - Do the first call of CombineUnions before starting the major Optimizer rules,
        //   since it can reduce the number of iteration and the other rules could add/move
        //   extra operators between two adjacent Union operators.
        // - Call CombineUnions again in Batch("Operator Optimizations"),
        //   since the other rules might make two separate Unions operators adjacent.
        Batch("Union", Once,
          CombineUnions) ::
        Batch("Pullup Correlated Expressions", Once,
          PullupCorrelatedPredicates) ::
        Batch("Subquery", Once,
          OptimizeSubqueries) ::
        Batch("Replace Operators", fixedPoint,
          ReplaceIntersectWithSemiJoin,
          ReplaceExceptWithFilter,
          ReplaceExceptWithAntiJoin,
          ReplaceDistinctWithAggregate) ::
        Batch("Aggregate", fixedPoint,
          RemoveLiteralFromGroupExpressions,
          RemoveRepetitionFromGroupExpressions) :: Nil ++
        operatorOptimizationBatch) :+
        Batch("Join Reorder", Once,
          CostBasedJoinReorder) :+
        Batch("Decimal Optimizations", fixedPoint,
          DecimalAggregates) :+
        Batch("Object Expressions Optimization", fixedPoint,
          EliminateMapObjects,
          CombineTypedFilters) :+
        Batch("LocalRelation", fixedPoint,
          ConvertToLocalRelation,
          PropagateEmptyRelation) :+
        // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
        Batch("Check Cartesian Products", Once,
          CheckCartesianProducts) :+
        Batch("RewriteSubquery", Once,
          RewritePredicateSubquery,
          ColumnPruning,
          CollapseProject,
          RemoveRedundantProject)
      }
    

      如上这便是在优化这步的所有的规则和策略例如消除子查询别名,表达式替换、算子下推、常量折叠等优化规则,经过这一步之后,就进入物理计划阶段了。

    举个栗子:

    在这里面标出来的子查询别名是在解析的时候加上的,但是在优化之后的logicalPlan中已经去掉,对应上述规则的(消除子查询别名),还有里面的算子下推,在analyzed的算子树中,Join Inner中对age字段做了条件限制,但是在优化后的LogicPlan中已经下推到离Project最近的位置;之后还有InferFiltersFromConstraints规则,对age、name增加了不为null的限定条件。

  • 相关阅读:
    hdu 4081 Qin Shi Huang's National Road System
    Finding Team Member
    hdu 5491 The Next
    Queue
    Backward Digit Sums
    HDU
    HDU
    CodeForces 500 A. New Year Transportation
    拓扑排序
    “玲珑杯”ACM比赛 Round #1 题解
  • 原文地址:https://www.cnblogs.com/ldsggv/p/13380953.html
Copyright © 2020-2023  润新知