• Spark SQL Notes (Data Source: Parquet)


    Parquet is a columnar storage format designed for analytical workloads.

    Loading data programmatically

    Code example

    package wujiadong_sparkSQL
    
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
      * Created by Administrator on 2017/2/3.
      */
    object ParquetLoadData {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ParquetLoadData")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        val usersDF = sqlContext.read.parquet("hdfs://master:9000/student/2016113012/spark/users.parquet")
        // Register the DataFrame as a temporary table so it can be queried with SQL
        usersDF.registerTempTable("t_users")
        // Query the name column
        val usersNameDF = sqlContext.sql("select name from t_users")
        // Convert to an RDD and print each name
        usersNameDF.rdd.map(row => "Name:" + row(0)).collect().foreach(println)
      }
    }
    
    

    Run result

    hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.ParquetLoadData  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
    17/02/03 14:36:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    17/02/03 14:36:02 INFO Slf4jLogger: Slf4jLogger started
    17/02/03 14:36:03 INFO Remoting: Starting remoting
    17/02/03 14:36:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:40895]
    17/02/03 14:36:07 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    17/02/03 14:36:20 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
    17/02/03 14:36:21 INFO CodecPool: Got brand-new decompressor [.snappy]
    Name:Alyssa
    Name:Ben
    17/02/03 14:36:21 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
    17/02/03 14:36:21 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
    
    

    Automatic partition discovery

    hadoop@master:~$ hadoop fs -mkdir /student/2016113012/spark/users
    hadoop@master:~$ hadoop fs -mkdir /student/2016113012/spark/users/gender=male/
    hadoop@master:~$ hadoop fs -mkdir /student/2016113012/spark/users/gender=male/country=us
    hadoop@master:~/wujiadong$ hadoop fs -put users.parquet /student/2016113012/spark/users/gender=male/country=us
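    The ParquetPartitionTest class invoked below is not listed in the original post. A minimal sketch consistent with the output shown might look like the following (the class name is taken from the submit command; everything else is an assumption based on the HDFS layout created above):

    ```scala
    package wujiadong_sparkSQL

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    object ParquetPartitionTest {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ParquetPartitionTest")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        // Point the reader at the top-level directory; Spark walks the
        // gender=male/country=us subdirectories and exposes gender and
        // country as extra partition columns
        val usersDF = sqlContext.read.parquet("hdfs://master:9000/student/2016113012/spark/users")
        usersDF.printSchema()
        usersDF.show()
      }
    }
    ```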
    
    
    hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.ParquetPartitionTest  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
    17/02/03 15:13:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    17/02/03 15:13:43 INFO Slf4jLogger: Slf4jLogger started
    17/02/03 15:13:43 INFO Remoting: Starting remoting
    17/02/03 15:13:44 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:37709]
    17/02/03 15:13:46 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    17/02/03 15:13:59 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
    17/02/03 15:13:59 INFO CodecPool: Got brand-new decompressor [.snappy]
    +------+--------------+----------------+------+-------+
    |  name|favorite_color|favorite_numbers|gender|country|
    +------+--------------+----------------+------+-------+
    |Alyssa|          null|  [3, 9, 15, 20]|  male|     us|
    |   Ben|           red|              []|  male|     us|
    +------+--------------+----------------+------+-------+
    
    17/02/03 15:14:00 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
    17/02/03 15:14:00 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
    
    Spark automatically inferred the gender and country partition columns from the directory structure.
    

    Schema merging

    1) When reading Parquet files, set the data source option mergeSchema to true
    2) Call SQLContext.setConf() to set the spark.sql.parquet.mergeSchema property to true

    Example: merging the schemas of students' basic information and their scores
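    The code for this case is not included in the post. A minimal sketch of both approaches, with hypothetical student data and HDFS paths chosen to match the earlier examples, could look like this:

    ```scala
    package wujiadong_sparkSQL

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    object ParquetMergeSchema {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ParquetMergeSchema")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Approach 2: enable schema merging globally for this SQLContext
        // sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")

        // Write two Parquet datasets with different but compatible schemas
        val basicInfo = Seq(("leo", 23), ("jack", 25)).toDF("name", "age")
        basicInfo.write.parquet("hdfs://master:9000/student/2016113012/spark/students/key=1")
        val scores = Seq(("leo", 90), ("jack", 80)).toDF("name", "score")
        scores.write.parquet("hdfs://master:9000/student/2016113012/spark/students/key=2")

        // Approach 1: enable schema merging for this read only
        val students = sqlContext.read.option("mergeSchema", "true")
          .parquet("hdfs://master:9000/student/2016113012/spark/students")
        // The merged schema contains name, age, score (and the key partition
        // column); rows lacking a column get null in that column
        students.printSchema()
        students.show()
      }
    }
    ```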

    
    
    
    
  • Original article: https://www.cnblogs.com/wujiadong2014/p/6516574.html