Original article: http://blog.csdn.net/Gavin_chun/article/details/78663826
1. Method 1: the reflection approach, which is not recommended in production: a case class can only define up to 22 fields (a Scala limit lifted only in 2.11), so this approach is restrictive.
2. Method 2: the programmatic approach, which generally takes three steps:
1) Create an RDD[Row] from the original RDD;
2) Create a schema (a StructType) matching the structure of the Rows in the RDD created in step 1;
3) Apply the schema to the RDD[Row] via the createDataFrame method provided by SparkSession.
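Both examples below read a comma-separated text file of name,age records. The original post does not show the contents of D:/ruoze/people.txt; a minimal input in the format the code expects (hypothetical sample data) would be:

Michael, 29
Andy, 30
Justin, 19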
package com.spark.sql
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
/**
* Created with IntelliJ IDEA.
* Description:
* Author: A_ChunUnique
* Date: 2017/11/28
* Time: 14:27
*
**/
object CovertRdd {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("RDD Covert DF").getOrCreate()
    //runInferSchemaExample(spark)
    runProgrammaticSchemaExample(spark)
    spark.stop() // release the session when finished
  }
  /*
   * Method 1: convert the RDD to a DataFrame via reflection
   * (the schema is inferred from the Person case class)
   */
  private def runInferSchemaExample(spark: SparkSession): Unit = {
    // Enables implicit RDD-to-DataFrame conversion (toDF) and Dataset encoders
    import spark.implicits._
    val peopleDF = spark.sparkContext
      .textFile("file:///D:/ruoze/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toLong))
      .toDF()
    // Register the DataFrame as a temporary view so it can be queried with SQL
    peopleDF.createOrReplaceTempView("people")
    val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
    teenagersDF.map(teenager => "Name: " + teenager(0) + ", Age: " + teenager(1)).show()
  }
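  // With the hypothetical sample data above, only Justin (19) falls in the
  // 13-19 range, so this example would print a single row: "Name: Justin, Age: 19"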
  /*
   * Method 2: convert the RDD to a DataFrame programmatically
   */
  private def runProgrammaticSchemaExample(spark: SparkSession): Unit = {
    // Needed for the Dataset map over the SQL results below
    import spark.implicits._
    // Create an RDD
    val peopleRDD = spark.sparkContext.textFile("file:///D:/ruoze/people.txt")
    // The schema is encoded in a string
    val schemaString = "name age"
    // Generate the schema based on the schema string
    val fields = schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)
    // Convert records of the RDD (people) to Rows
    val rowRDD = peopleRDD
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).trim))
    // Apply the schema to the RDD
    val peopleDF = spark.createDataFrame(rowRDD, schema)
    // Create a temporary view using the DataFrame
    peopleDF.createOrReplaceTempView("people")
    // SQL can be run over a temporary view created using DataFrames
    val results = spark.sql("SELECT name FROM people")
    // The results of SQL queries are DataFrames and support all the normal RDD operations;
    // columns of a row can be accessed by field index or by name, e.g. attributes.getAs[String]("name")
    results.map(attributes => "Name: " + attributes(0)).show()
  }
}
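For reference, with the hypothetical sample data above, the final show() of the programmatic example would print something like the following (mapping a DataFrame to strings yields a Dataset[String] whose single column is named value):

+-------------+
|        value|
+-------------+
|Name: Michael|
|   Name: Andy|
| Name: Justin|
+-------------+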