• Spark SQL notes: load and save operations


    load: loads data from a source and creates a DataFrame

    save: writes the data in a DataFrame out to files

    Code example (the default data source type is Parquet)

    
    package wujiadong_sparkSQL
    
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
      * Created by Administrator on 2017/2/3.
      */
    object GenericLoadSave {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("GenericLoadSave")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
    // read.load expects the Parquet format by default
        val usersDF = sqlContext.read.load("hdfs://master:9000/student/2016113012/spark/users.parquet")
        usersDF.write.save("hdfs://master:9000/student/2016113012/parquet_out1")
    
      }
    
    }
    
    
    

    Submit to the cluster and run

    hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.GenericLoadSave  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
    
    

    After running, check whether the output was saved successfully

    hadoop@slave01:~$ hadoop fs -ls /student/2016113012/parquet_out1
    17/02/03 12:06:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 4 items
    -rw-r--r--   3 hadoop supergroup          0 2017-02-03 12:05 /student/2016113012/parquet_out1/_SUCCESS
    -rw-r--r--   3 hadoop supergroup        476 2017-02-03 12:05 /student/2016113012/parquet_out1/_common_metadata
    -rw-r--r--   3 hadoop supergroup        841 2017-02-03 12:05 /student/2016113012/parquet_out1/_metadata
    -rw-r--r--   3 hadoop supergroup        864 2017-02-03 12:05 /student/2016113012/parquet_out1/part-r-00000-8025e2a8-ab06-4558-9d76-bb2cad0042cf.gz.parquet
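
    To verify the contents as well as the file listing, the output can be read back; a minimal sketch, run in spark-shell (where sqlContext is predefined), assuming the same HDFS path as above:

    // read the Parquet output back and inspect it
    val checkDF = sqlContext.read.load("hdfs://master:9000/student/2016113012/parquet_out1")
    checkDF.printSchema()  // should match the schema of users.parquet
    checkDF.show()         // display the first rows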
    
    

    Manually specifying the data source type (convenient for converting between formats)
    If no data source type is specified, Parquet is assumed.
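
    This default is simply the value of the spark.sql.sources.default configuration; a minimal sketch of changing it, run in spark-shell (where sc is predefined), using JSON as an illustrative choice not taken from the original post:

    import org.apache.spark.sql.SQLContext

    // spark.sql.sources.default controls the format assumed by read.load / write.save
    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.sources.default", "json")
    // from here on, read.load without format(...) parses files as JSON
    val df = sqlContext.read.load("hdfs://master:9000/student/2016113012/people.json")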

    Code example (manually specifying the data source type)

    package wujiadong_sparkSQL
    
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
      * Created by Administrator on 2017/2/3.
      */
    object ManuallySpecifyOptions {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ManuallySpecifyOptions")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
    // to load a file in another format such as JSON, first specify the format with format(...)
        val peopleDF = sqlContext.read.format("json").load("hdfs://master:9000/student/2016113012/people.json")
        peopleDF.select("name").write.format("parquet").save("hdfs://master:9000/sudent/2016113012/people_out1")
        
    
      }
    
    }
    
    

    Submit to the cluster and run

    hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.ManuallySpecifyOptions  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
    
    

    Check whether the run succeeded

    hadoop@master:~/wujiadong$ hadoop fs -ls hdfs://master:9000/sudent/2016113012/people_out1
    17/02/03 12:24:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 4 items
    -rw-r--r--   3 hadoop supergroup          0 2017-02-03 12:22 hdfs://master:9000/sudent/2016113012/people_out1/_SUCCESS
    -rw-r--r--   3 hadoop supergroup        207 2017-02-03 12:22 hdfs://master:9000/sudent/2016113012/people_out1/_common_metadata
    -rw-r--r--   3 hadoop supergroup        327 2017-02-03 12:22 hdfs://master:9000/sudent/2016113012/people_out1/_metadata
    -rw-r--r--   3 hadoop supergroup        352 2017-02-03 12:22 hdfs://master:9000/sudent/2016113012/people_out1/part-r-00000-4d1a62a4-f550-4bde-899f-35e9aabfdc0c.gz.parquet
    
    
    

    Save Mode

    SaveMode.ErrorIfExists (the default): if data already exists at the target location, throw an exception
    SaveMode.Append: if data already exists at the target location, append the new data to it
    SaveMode.Overwrite: if data already exists at the target location, delete it and replace it with the new data
    SaveMode.Ignore: if data already exists at the target location, do nothing (a usage sketch follows this list)
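
    In code, the mode is attached through the DataFrameWriter API; a minimal self-contained sketch (the helper name saveWithMode is hypothetical):

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // wrap the write so the mode can be passed in; the string forms
    // "error", "append", "overwrite" and "ignore" are accepted as well,
    // e.g. df.write.mode("overwrite").save(path)
    def saveWithMode(df: DataFrame, path: String, mode: SaveMode): Unit = {
      df.write.mode(mode).save(path)
    }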
    
    

    Code example 1

    package wujiadong_sparkSQL
    
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
      * Created by Administrator on 2017/2/3.
      */
    object SaveModelTest {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SaveModelTest")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        val peopleDF = sqlContext.read.format("json").load("hdfs://master:9000/student/2016113012/people.json")
        peopleDF.save("hdfs://master:9000/student/2016113012/people.json",SaveMode.ErrorIfExists)
      }
    
    }
    
    With this save mode, the job throws an exception because the target path already exists
    
    package wujiadong_sparkSQL
    
    import org.apache.spark.sql.{SQLContext, SaveMode}
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
      * Created by Administrator on 2017/2/3.
      */
    object SaveModelTest {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SaveModelTest")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        val peopleDF = sqlContext.read.format("json").load("hdfs://master:9000/student/2016113012/people.json")
        peopleDF.save("hdfs://master:9000/student/2016113012/people.json",SaveMode.Overwrite)
      }
    
    }
    
    This one simply overwrites the existing data.
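
    The remaining two modes follow the same pattern; a sketch, assuming the same peopleDF as above and a separate output directory (people_out2 is a hypothetical path, not from the original post):

    // Append adds the new data alongside whatever is already at the path
    peopleDF.write.mode(SaveMode.Append).save("hdfs://master:9000/student/2016113012/people_out2")
    // Ignore silently skips the write if the path already exists
    peopleDF.write.mode(SaveMode.Ignore).save("hdfs://master:9000/student/2016113012/people_out2")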
    
    