User-Defined Functions
Author: 尹正杰
Copyright notice: This is an original work. Reproduction without permission is prohibited, and violations will be pursued to the full extent of the law.
1. User-defined functions (UDF)
[root@hadoop101.yinzhengjie.org.cn ~]# spark-shell          # In the spark-shell, custom functions can be defined through the spark.udf facility.
20/07/14 01:21:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop101.yinzhengjie.org.cn:4040
Spark context available as 'sc' (master = local[*], app id = local-1594660981913).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.read.json("file:///tmp/user.json")          # Read the data and create a DataFrame
df: org.apache.spark.sql.DataFrame = [name: string, passwd: string]

scala> df.show()          # Show the data
+-----------+------+
|       name|passwd|
+-----------+------+
|yinzhengjie|  2020|
|      Jason|666666|
|     Liming|   123|
|      Jenny|   456|
|      Danny|   789|
+-----------+------+


scala> spark.udf.register("myNameFunc", (x:String)=> "Name:"+x)          # Register a custom function named "myNameFunc" that prepends "Name:" to the input string
res1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.createOrReplaceTempView("user")          # Create a temporary view named "user"

scala> spark.sql("select name, passwd from user").show()          # Query the user view normally
+-----------+------+
|       name|passwd|
+-----------+------+
|yinzhengjie|  2020|
|      Jason|666666|
|     Liming|   123|
|      Jenny|   456|
|      Danny|   789|
+-----------+------+


scala> spark.sql("select myNameFunc(name), passwd from user").show()          # Query the user view with our custom function (myNameFunc)
+--------------------+------+
|UDF:myNameFunc(name)|passwd|
+--------------------+------+
|    Name:yinzhengjie|  2020|
|          Name:Jason|666666|
|         Name:Liming|   123|
|          Name:Jenny|   456|
|          Name:Danny|   789|
+--------------------+------+


scala>
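The same UDF can also be registered from a standalone application instead of the spark-shell. The following is a minimal sketch under that assumption; the object name SparkSQL_UDF and the app name are illustrative, while the input path and the function body match the session above.

package com.yinzhengjie.bigdata.spark.sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkSQL_UDF {
  def main(args: Array[String]): Unit = {
    //Create the Spark configuration and the SparkSession
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkSQLDemo_UDF")
    val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

    //Read the data and create a DataFrame (same file as in the spark-shell session)
    val df: DataFrame = spark.read.json("file:///tmp/user.json")

    //Register the custom function: prepend "Name:" to the input string
    spark.udf.register("myNameFunc", (x: String) => "Name:" + x)

    //Create the temporary view and query it with the UDF
    df.createOrReplaceTempView("user")
    spark.sql("select myNameFunc(name), passwd from user").show()

    //Release resources
    spark.close()
  }
}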
2. Weakly-typed DataFrame user-defined aggregate function (UDAF) example
{"name":"yinzhengjie","passwd":"2020","age":18} {"name":"Jason","passwd":"666666","age":27} {"name":"Liming","passwd":"123","age":49} {"name":"Jenny","passwd":"456","age":23} {"name":"Danny","passwd":"789","age":56}
package com.yinzhengjie.bigdata.spark.sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, LongType, StructType}

object SparkSQL_UDAF {
  def main(args: Array[String]): Unit = {
    //Create the Spark configuration
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkSQLDemo3")

    //Create the Spark SQL environment object, i.e. the SparkSession
    val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

    //Create an instance of our custom aggregate function
    val udaf = new MyAgeAvgFunction

    //Register the function
    spark.udf.register("avgAge", udaf)

    //Read the JSON file and build a DataFrame
    val frame: DataFrame = spark.read.json("E:\\yinzhengjie\\bigdata\\input\\json\\user.json")

    //Use the aggregate function
    frame.createOrReplaceTempView("user")
    spark.sql("select avgAge(age) from user").show()

    //Release resources
    spark.close()
  }
}

/**
  * Both the strongly-typed Dataset and the weakly-typed DataFrame provide built-in aggregate functions such as count(), countDistinct(), avg(), max() and min(). Beyond these, users can define their own aggregate functions.
  *
  * Weakly-typed user-defined aggregate function: implemented by extending UserDefinedAggregateFunction.
  *
  * The general steps for declaring a user-defined aggregate function are:
  *   1>. Extend UserDefinedAggregateFunction
  *   2>. Implement its methods
  */
class MyAgeAvgFunction extends UserDefinedAggregateFunction {
  //Schema of the function's input
  override def inputSchema: StructType = {
    new StructType().add("age", LongType)
  }

  //Schema of the intermediate (buffer) data used during the computation
  override def bufferSchema: StructType = {
    new StructType().add("sum", LongType).add("count", LongType)
  }

  //Data type returned by the function
  override def dataType: DataType = {
    DoubleType
  }

  //Whether the function is deterministic (idempotent: the same input always produces the same result)
  override def deterministic: Boolean = {
    true
  }

  //Initialize the buffer before the computation starts
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L    //First field, i.e. the "sum" defined above, initialized to 0
    buffer(1) = 0L    //Second field, i.e. the "count" defined above, initialized to 0
  }

  //Update the buffer with each input row
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + input.getLong(0)    //Accumulate into sum
    buffer(1) = buffer.getLong(1) + 1                   //Increment count by 1
  }

  //Merge the buffers coming from different nodes
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)    //Accumulate the sums
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)    //Accumulate the counts
  }

  //Compute the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0).toDouble / buffer.getLong(1)
  }
}
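With the sample user.json above, the registered avgAge function computes (18 + 27 + 49 + 23 + 56) / 5 = 173 / 5 = 34.6, so spark.sql("select avgAge(age) from user").show() should print a single row containing 34.6 (the exact column header generated for the aggregate expression may vary with the Spark version).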
3. Strongly-typed Dataset user-defined aggregate function (UDAF) example
package com.yinzhengjie.bigdata.spark.sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql._

object SparkSQL_UDAF2 {
  def main(args: Array[String]): Unit = {
    //Create the Spark configuration
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkSQLDemo3")

    //Create the Spark SQL environment object, i.e. the SparkSession
    val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

    //Create an instance of our custom aggregate function
    val udaf = new MyAgeAvgClassFunction

    //Convert the aggregate function into a column that can be queried
    val avgColume: TypedColumn[UserBean, Double] = udaf.toColumn.name("avgAge")

    //Read the JSON file and build a DataFrame
    val frame: DataFrame = spark.read.json("E:\\yinzhengjie\\bigdata\\input\\json\\user.json")

    /**
      * Note:
      *   The implicit conversion rules must be imported before the conversion below. Here "spark" is not a package name but the name of the SparkSession object.
      */
    import spark.implicits._
    val userDS: Dataset[UserBean] = frame.as[UserBean]

    //Use the aggregate function
    userDS.select(avgColume).show()

    //Release resources
    spark.close()
  }
}

/**
  * Both the strongly-typed Dataset and the weakly-typed DataFrame provide built-in aggregate functions such as count(), countDistinct(), avg(), max() and min(). Beyond these, users can define their own aggregate functions.
  *
  * Strongly-typed user-defined aggregate function: implemented by extending Aggregator.
  *
  * The general steps for declaring a (strongly-typed) user-defined aggregate function are:
  *   1>. Extend Aggregator and set its type parameters
  *   2>. Implement its methods
  */
class MyAgeAvgClassFunction extends Aggregator[UserBean, AvgBuffer, Double] {
  //Initialize the buffer
  override def zero: AvgBuffer = {
    AvgBuffer(0, 0)
  }

  /**
    * Aggregate the data
    * @param b
    * @param a
    * @return
    */
  override def reduce(b: AvgBuffer, a: UserBean): AvgBuffer = {
    b.sum = b.sum + a.age
    b.count = b.count + 1
    b
  }

  /**
    * Merge two buffers
    * @param b1
    * @param b2
    * @return
    */
  override def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = {
    b1.sum = b1.sum + b2.sum
    b1.count = b1.count + b2.count
    b1
  }

  //Compute the final result
  override def finish(reduction: AvgBuffer): Double = {
    reduction.sum.toDouble / reduction.count
  }

  //Encoder for the buffer type; boilerplate, note that the Encoder's type parameter is the case class we defined
  override def bufferEncoder: Encoder[AvgBuffer] = {
    Encoders.product
  }

  //Encoder for the output type; boilerplate, pick the Encoders method matching the Encoder's type parameter
  override def outputEncoder: Encoder[Double] = {
    Encoders.scalaDouble
  }
}

//Case classes
case class UserBean(name: String, age: BigInt)
case class AvgBuffer(var sum: BigInt, var count: Int)
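Unlike the weakly-typed version, this Aggregator operates on typed UserBean objects and is invoked through select on the Dataset rather than through a SQL string. With the same sample data, userDS.select(avgColume).show() should again yield the 34.6 average, displayed under a column named avgAge because of the .name("avgAge") call above.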