• Spark DataSet 、DataFrame 一些使用示例


    以前使用过DS和DF,最近使用Spark ML跑实验,再次用到简单复习一下。

    //案例数据
    1,2,3
    4,5,6
    7,8,9
    10,11,12
    13,14,15
    1,2,3
    4,5,6
    7,8,9
    10,11,12
    13,14,15
    1,2,3
    4,5,6
    7,8,9
    10,11,12
    13,14,15

    1:DS与DF关系?

    type DataFrame = Dataset[Row]

    2:加载txt数据

      val rdd = sc.textFile("data")
    
      val df = rdd.toDF()

    这种直接生成DF,df数据结构为(查询语句:df.select("*").show(5)):

    只有一列,属性为value。

     3: df.printSchema()

    4:case class 可以直接就转成DS

    // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
    // you can use custom classes that implement the Product interface
    case class Person(name: String, age: Long)
    
    // Encoders are created for case classes
    val caseClassDS = Seq(Person("Andy", 32)).toDS()

    5:直接解析主流格式文件

    val path = "examples/src/main/resources/people.json"
    val peopleDS = spark.read.json(path).as[Person]

    6:RDD转成DataSet两种方法

    数据格式:

    xiaoming,18,iPhone
    mali,22,xiaomi
    jack,26,smartisan
    mary,16,meizu
    kali,45,huawei

    (a):使用反射推断模式

      val persons = rdd.map {
        x =>
          val fs = x.split(",")
          Person(fs(0), fs(1).toInt, fs(2))
      }
    
      persons.toDS().show(2)
      persons.toDF("newName", "newAge", "newPhone").show(2)
      persons.toDF().show(2)

    (b):编程方式指定模式

     步骤:

    import org.apache.spark.sql.types._
      //1:创建RDD
      val rddString = sc.textFile("C:\Users\Daxin\Documents\GitHub\OptimizedRF\sql_data")
      //2:创建schema
      val schemaString = "name age phone"
      val fields = schemaString.split(" ").map {
        filedName => StructField(filedName, StringType, nullable = true)
      }
      val schema = StructType(fields)
      //3:数据转成Row
      val rowRdd = rddString.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1), attributes(2)))
      //创建DF
      val personDF = spark.createDataFrame(rowRdd, schema)
      personDF.show(5)

     7:注册视图

      //全局表,生命周期多个session可以共享并且创建该视图的sparksession停止该视图也不会过期
      personDF.createGlobalTempView("GlobalTempView_Person")
      //临时表,存在的话覆盖。生命周期和sparksession相同
      personDF.createOrReplaceTempView("TempView_Person")
      //personDF.createTempView("TempView_Person") //如果视图已经存在则异常
    
      //  Global temporary view is tied to a system preserved database `global_temp`
      //全局视图存储在global_temp数据库中,如果不加数据库前缀异常,提示找不到视图
      spark.sql("select * from global_temp.GlobalTempView_Person").show(2)
      //临时表不需要添加数据库
      spark.sql("select * from TempView_Person").show(2)

     

    8:UDF 定义:

    Untyped User-Defined Aggregate Functions

    package com.daxin.sq.df
    
    import org.apache.spark.sql.expressions.MutableAggregationBuffer
    import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row
    
    /**
      * Created by Daxin on 2017/11/18.
      * url:http://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-user-defined-aggregate-functions
      */
    
    //Untyped User-Defined Aggregate Functions
    object MyAverage extends UserDefinedAggregateFunction {
    
      // Data types of input arguments of this aggregate function
      override def inputSchema: StructType = StructType(StructField("inputColumn", IntegerType) :: Nil) //2
    
    
      // Updates the given aggregation buffer `buffer` with new input data from `input`
      //TODO  第一个缓冲区是sum,第二个缓冲区是元素个数
      override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        if (!input.isNullAt(0)) {
          buffer(0) = buffer.getInt(0) + input.getInt(0) // input.getInt(0)是中inputSchema定义的第0个元素
          buffer(1) = buffer.getInt(1) + 1
          println()
        }
      }
    
    
      // Data types of values in the aggregation buffer
      //TODO  定义缓冲区的模型(也就是数据结构)
      override def bufferSchema: StructType = StructType(StructField("sum", IntegerType) :: StructField("count", IntegerType) :: Nil)
    
    
      // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
      //TODO MutableAggregationBuffer 是Row子类
      override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        //TODO 合并分区,将结果更新到buffer1
        buffer1(0) = buffer1.getInt(0) + buffer2.getInt(0)
        buffer1(1) = buffer1.getInt(1) + buffer2.getInt(1)
    
        println()
      }
    
    
      // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
      // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
      // the opportunity to update its values. Note that arrays and maps inside the buffer are still
      // immutable.
      override def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = 0
        buffer(1) = 0
      }
    
      // Whether this function always returns the same output on the identical input
      override def deterministic: Boolean = true
    
      // Calculates the final result
      override def evaluate(buffer: Row): Int = buffer.getInt(0) / buffer.getInt(1)
    
      // The data type of the returned value,返回值类型
      override def dataType: DataType = IntegerType // 1
    }

    测试代码:

      spark.udf.register("myAverage", MyAverage)
      val result = spark.sql("SELECT myAverage(age)  FROM TempView_Person")
      result.show()

     8:关于机器学习中的DataFrame的schema定:

    一列名字为 label,另一列名字为  features。一般可以使用case class完成转换

    case class UDLabelpOint(label: Double, features: org.apache.spark.ml.linalg.Vector)
  • 相关阅读:
    python中类方法、类实例方法、静态方法的使用与区别
    在python里如何动态添加类的动态属性呢?
    PYTHON基础
    EXCEL 写入
    thread 多线程
    Python 常用函数
    列表减列表
    04_Linux搭建Jdk和tomcat环境
    自动生成和安装requirements.txt依赖
    python+selenium面试题
  • 原文地址:https://www.cnblogs.com/leodaxin/p/7858018.html
Copyright © 2020-2023  润新知