• SparkSQL UDF使用方法与原理详解


    UDF是SQL中很常见的功能,但在Spark-1.6及之前的版本,只能创建临时UDF,不支持创建持久化的UDF,除非修改Spark源码。从Spark-2.0开始,SparkSQL终于支持持久化的UDF。本文基于当前最新的Spark-2.0.2版本,讲解SparkSQL中使用UDF和底层实现的原理。

    转载注明原文http://www.cnblogs.com/shenh062326/p/6189672.html

    1. 临时UDF

    创建和使用方法:

    create temporary function tmp_trans_array as ''com.test.spark.udf.TransArray' using jar 'spark-test-udf-1.0.0.jar';
    
    select tmp_trans_array (1, '\|' , id, position) as (id0, position0) from test_udf limit 10;

      实现原理,在org.apache.spark.sql.execution.command.CreateFunctionCommand类的run方法中,会判断创建的Function是否是临时方法,若是,则会创建一个临时Function。从下面的代码我可以看到,临时函数直接注册到functionRegistry(实现类是SimpleFunctionRegistry),即内存中。

    def createTempFunction(
        name: String,
        info: ExpressionInfo,
        funcDefinition: FunctionBuilder,
        ignoreIfExists: Boolean): Unit = {
      if (functionRegistry.lookupFunctionBuilder(name).isDefined && !ignoreIfExists) {
        throw new TempFunctionAlreadyExistsException(name)
      }
      functionRegistry.registerFunction(name, info, funcDefinition)
    }

    下面是实际的注册代码,所有需要的UDF都会加载到StringKeyHashMap。

    protected val functionBuilders =
      StringKeyHashMap[(ExpressionInfo, FunctionBuilder)](caseSensitive = false)
    
    override def registerFunction(
        name: String,
        info: ExpressionInfo,
        builder: FunctionBuilder): Unit = synchronized {
      functionBuilders.put(name, (info, builder))
    }

    2. 持久化UDF

    使用方法如下,注意jar包最好放在HDFS上,在其他机器上也能使用。

    create function trans_array as 'com.test.spark.udf.TransArray'  using jar 'hdfs://namenodeIP:9000/libs/spark-test-udf-1.0.0.jar';
    
    select trans_array (1, ' \|' , id, position) as (id0, position0) from test_spark limit 10;
    
    

    实现原理

    (1)创建永久函数时,在org.apache.spark.sql.execution.command.CreateFunctionCommand中,会调用SessionCatalog的createFunction,最终执行了HiveExternalCatalog的createFunction,这里可以看出,创建永久函数会在Hive元数据库中创建相应的函数。通过查询元数据库我们可以看到如下记录,说明函数已经创建到元数据库中。 

    mysql> select *  from FUNCS;
    | FUNC_ID    | CLASS_NAME                    | CREATE_TIME | DB_ID | FUNC_NAME     | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
    | 96         | com.test.spark.udf.TransArray |  1481459766 | 1     | trans_array   | 1         | NULL       | USER       |
    
    mysql> select *  from FUNC_RU;
    | FUNC_ID | RESOURCE_TYPE | RESOURCE_URI                                         | INTEGER_IDX |  
    |  96     | 1             | hdfs://namenodeIP:9000/libs/spark-test-udf-1.0.0.jar |  0          |

    (2)使用永久函数,在解析SQL中的UDF时,会调用SessionCatalog的lookupFunction0方法,在此方法中,首先会检查内存中是否存在,如果不存在则会加载此UDF,加载时会把RESOURCE_URI发到ClassLoader的路径中,如果把UDF注册到内存的functionRegistry中。主要代码在SessionCatalog,如下:

    def lookupFunction(
        name: FunctionIdentifier,
        children: Seq[Expression]): Expression = synchronized {
      // Note: the implementation of this function is a little bit convoluted.
      // We probably shouldn't use a single FunctionRegistry to register all three kinds of functions
      // (built-in, temp, and external).
      if (name.database.isEmpty && functionRegistry.functionExists(name.funcName)) {
        // This function has been already loaded into the function registry.
        return functionRegistry.lookupFunction(name.funcName, children)
      }
    
      // If the name itself is not qualified, add the current database to it.
      val database = name.database.orElse(Some(currentDb)).map(formatDatabaseName)
      val qualifiedName = name.copy(database = database)
    
      if (functionRegistry.functionExists(qualifiedName.unquotedString)) {
        // This function has been already loaded into the function registry.
        // Unlike the above block, we find this function by using the qualified name.
        return functionRegistry.lookupFunction(qualifiedName.unquotedString, children)
      }
    
      // The function has not been loaded to the function registry, which means
      // that the function is a permanent function (if it actually has been registered
      // in the metastore). We need to first put the function in the FunctionRegistry.
      // TODO: why not just check whether the function exists first?
      val catalogFunction = try {
        externalCatalog.getFunction(currentDb, name.funcName)
      } catch {
        case e: AnalysisException => failFunctionLookup(name.funcName)
        case e: NoSuchPermanentFunctionException => failFunctionLookup(name.funcName)
      }
      loadFunctionResources(catalogFunction.resources)
      // Please note that qualifiedName is provided by the user. However,
      // catalogFunction.identifier.unquotedString is returned by the underlying
      // catalog. So, it is possible that qualifiedName is not exactly the same as
      // catalogFunction.identifier.unquotedString (difference is on case-sensitivity).
      // At here, we preserve the input from the user.
      val info = new ExpressionInfo(catalogFunction.className, qualifiedName.unquotedString)
      val builder = makeFunctionBuilder(qualifiedName.unquotedString, catalogFunction.className)
      createTempFunction(qualifiedName.unquotedString, info, builder, ignoreIfExists = false)
      // Now, we need to create the Expression.
      functionRegistry.lookupFunction(qualifiedName.unquotedString, children)
    }
  • 相关阅读:
    常用html设置:
    Java Enum
    ajax
    Utils使用
    jdk免安装对应配置
    jdk mvn下载--操作系统
    SpringMvc 文件上传后台处理
    SpringMvc 获取ApplicationContext
    Jenkins 持续集成
    自定义 directive pagination
  • 原文地址:https://www.cnblogs.com/shenh062326/p/6189672.html
Copyright © 2020-2023  润新知