This post briefly shows two ways to register a user-defined function (UDF) with a SQLContext or HiveContext.
The examples below use sqlContext and were run in spark-shell:
scala> sc
res5: org.apache.spark.SparkContext = org.apache.spark.SparkContext@35d4035f

scala> sqlContext
res7: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@171b0d3

scala> val df = sc.parallelize(Seq(("张三", 25), ("李四", 30), ("赵六", 27))).toDF("name", "age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.registerTempTable("emp")

1) Function defined separately, then registered:

scala> def remainWorkYears(age: Int) : Int = {
     |   60 - age
     | }
remainWorkYears: (age: Int)Int

scala> sqlContext.udf.register("remainWorkYears", remainWorkYears _)
res1: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List())

scala> sqlContext.sql("select e.*, remainWorkYears(e.age) as remainedWorkYear from emp e").show
// the same query through a HiveContext:
// hiveContext.sql("select e.*, remainWorkYears(e.age) as remainedWorkYear from emp e").show
+----+---+----------------+
|name|age|remainedWorkYear|
+----+---+----------------+
|  张三| 25|              35|
|  李四| 30|              30|
|  赵六| 27|              33|
+----+---+----------------+

2) Anonymous function:

scala> sqlContext.udf.register("remainWorkYears_anoymous", (age: Int) => {
     |   60 - age
     | })
res3: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List())

scala> sqlContext.sql("select e.*, remainWorkYears_anoymous(e.age) as remainedWorkYear from emp e").show
+----+---+----------------+
|name|age|remainedWorkYear|
+----+---+----------------+
|  张三| 25|              35|
|  李四| 30|              30|
|  赵六| 27|              33|
+----+---+----------------+
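Outside spark-shell, sc and sqlContext are not created for you. The following is a minimal sketch of the same registration in a standalone program, assuming a Spark 1.x build (the SQLContext / registerTempTable API used above) and a local master. The object name UdfRegistrationSketch and the app name are made up for illustration. It also applies the UserDefinedFunction handle returned by register directly to a column, which is what the res1 / res3 values in the transcript are.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name; not from the original post.
object UdfRegistrationSketch {

  // Same logic as the shell example: years left until age 60.
  def remainWorkYears(age: Int): Int = 60 - age

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; adjust the master for a real cluster.
    val conf = new SparkConf().setAppName("UdfRegistrationSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // needed for toDF on an RDD of tuples

    val df = sc.parallelize(Seq(("张三", 25), ("李四", 30), ("赵六", 27))).toDF("name", "age")
    df.registerTempTable("emp")

    // Way 1: register a named method (eta-expanded with `_`).
    // register also returns a UserDefinedFunction handle.
    val remain = sqlContext.udf.register("remainWorkYears", remainWorkYears _)

    // Way 2: register an anonymous function under another name.
    sqlContext.udf.register("remainWorkYears_anoymous", (age: Int) => 60 - age)

    // Call the UDFs by name from SQL.
    sqlContext.sql("select e.*, remainWorkYears(e.age) as remainedWorkYear from emp e").show()
    sqlContext.sql("select e.*, remainWorkYears_anoymous(e.age) as remainedWorkYear from emp e").show()

    // Apply the returned handle to a column without going through SQL.
    df.select(df("name"), df("age"), remain(df("age")).as("remainedWorkYear")).show()

    sc.stop()
  }
}
```

Both approaches register the function only on the context they are called on, so a UDF registered on one SQLContext instance should not be expected to resolve in queries issued through a different instance.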