• [Spark][Python][DataFrame][Write] An example of writing a DataFrame


    $ hdfs dfs -cat people.json
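The file contents printed by this command are not preserved in this copy. Judging from the table displayed at the end of the post, the file plausibly holds one JSON object per line, with some keys missing from some records. The sketch below is a hypothetical reconstruction, not the author's original file: the record order and exact formatting are assumptions, although this particular layout does total 179 bytes, which matches the `0+179` input split reported in the log. It uses plain Python (no Spark) to show why the resulting table has a `pcoe` column and so many nulls: the column set is the union of all keys, and a record that lacks a key yields a null for it.

```python
import json

# Hypothetical reconstruction of people.json (one JSON object per line).
# Values are taken from the table shown at the end of the post; the
# record order and the absence of spaces are assumptions.
PEOPLE_JSON = """\
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
"""


def infer_columns(records):
    """Return the sorted union of all keys that appear in any record."""
    return sorted({key for record in records for key in record})


def to_rows(records, columns):
    """Turn each record into a tuple, using None where a key is missing."""
    return [tuple(record.get(column) for column in columns) for record in records]


records = [json.loads(line) for line in PEOPLE_JSON.splitlines() if line.strip()]
columns = infer_columns(records)
rows = to_rows(records, columns)

print(columns)
for row in rows:
    print(row)
```

Note that the misspelled key `pcoe` in one record is enough to create a separate column, which is why it shows up in both the Parquet schema and the final table.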



    sqlContext = HiveContext(sc)

    peopleDF = sqlContext.read.json("people.json")
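The command that produced the log below is not preserved in this copy. The log itself mentions `saveAsTable`, a Parquet output committer, a table named `people`, and `age=...` partition directories, so the call was presumably along the following lines. This is a sketch under those assumptions: the helper name `save_partitioned_table` and its default arguments are illustrative, and the save mode is left at Spark's default because the log does not reveal which mode the author used.

```python
def save_partitioned_table(df, table_name="people", partition_col="age"):
    """Write df as a Parquet-backed metastore table partitioned by one column.

    df is expected to be a Spark DataFrame; only the DataFrameWriter
    methods format(), partitionBy() and saveAsTable() are used.
    """
    (df.write
       .format("parquet")            # store the table data as Parquet files
       .partitionBy(partition_col)   # one sub-directory per distinct value
       .saveAsTable(table_name))     # register the table in the metastore


# In the pyspark shell, with peopleDF defined as above:
# save_partitioned_table(peopleDF)
```

Wrapping the call in a function keeps the sketch importable without a running Spark session; in the shell the chained call can simply be typed directly.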


    17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.5 KB, free 338.2 KB)
    17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.4 KB, free 359.6 KB)
    17/10/07 00:58:18 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:59616 (size: 21.4 KB, free: 208.8 MB)
    17/10/07 00:58:18 INFO spark.SparkContext: Created broadcast 2 from saveAsTable at NativeMethodAccessorImpl.java:-2
    17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 251.1 KB, free 610.7 KB)
    17/10/07 00:58:18 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.6 KB, free 632.4 KB)
    17/10/07 00:58:18 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:59616 (size: 21.6 KB, free: 208.7 MB)
    17/10/07 00:58:18 INFO spark.SparkContext: Created broadcast 3 from saveAsTable at NativeMethodAccessorImpl.java:-2
    17/10/07 00:58:19 INFO parquet.ParquetRelation: Using default output committer for Parquet: parquet.hadoop.ParquetOutputCommitter
    17/10/07 00:58:19 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/10/07 00:58:19 INFO datasources.DynamicPartitionWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
    17/10/07 00:58:19 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/10/07 00:58:19 INFO mapred.FileInputFormat: Total input paths to process : 1
    17/10/07 00:58:19 INFO spark.SparkContext: Starting job: saveAsTable at NativeMethodAccessorImpl.java:-2
    17/10/07 00:58:19 INFO scheduler.DAGScheduler: Got job 1 (saveAsTable at NativeMethodAccessorImpl.java:-2) with 1 output partitions
    17/10/07 00:58:19 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (saveAsTable at NativeMethodAccessorImpl.java:-2)
    17/10/07 00:58:19 INFO scheduler.DAGScheduler: Parents of final stage: List()
    17/10/07 00:58:19 INFO scheduler.DAGScheduler: Missing parents: List()
    17/10/07 00:58:19 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[7] at saveAsTable at NativeMethodAccessorImpl.java:-2), which has no missing parents
    17/10/07 00:58:19 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 72.7 KB, free 705.0 KB)
    17/10/07 00:58:20 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 26.4 KB, free 731.4 KB)
    17/10/07 00:58:20 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:59616 (size: 26.4 KB, free: 208.7 MB)
    17/10/07 00:58:20 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
    17/10/07 00:58:20 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[7] at saveAsTable at NativeMethodAccessorImpl.java:-2)
    17/10/07 00:58:20 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
    17/10/07 00:58:20 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
    17/10/07 00:58:20 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
    17/10/07 00:58:20 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
    17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 314.888218 ms
    17/10/07 00:58:20 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/10/07 00:58:20 INFO datasources.DynamicPartitionWriterContainer: Using user defined output committer class parquet.hadoop.ParquetOutputCommitter
    17/10/07 00:58:20 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 46.978197 ms
    17/10/07 00:58:20 INFO codegen.GenerateUnsafeProjection: Code generated in 64.665839 ms
    17/10/07 00:58:21 INFO codegen.GenerateUnsafeProjection: Code generated in 94.259071 ms
    17/10/07 00:58:21 INFO codec.CodecConfig: Compression: GZIP
    17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
    17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
    17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
    17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Dictionary is on
    17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Validation is off
    17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
    17/10/07 00:58:21 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
    17/10/07 00:58:21 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
    {
      "type" : "struct",
      "fields" : [ {
        "name" : "name",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "pcode",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "pcoe",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    }
    and corresponding Parquet message type:
    message spark_schema {
      optional binary name (UTF8);
      optional binary pcode (UTF8);
      optional binary pcoe (UTF8);
    }

    17/10/07 00:58:21 INFO compress.CodecPool: Got brand-new compressor [.gz]
    17/10/07 00:58:21 INFO datasources.DynamicPartitionWriterContainer: Maximum partitions reached, falling back on sorting.
    17/10/07 00:58:21 INFO codegen.GenerateUnsafeProjection: Code generated in 34.281133 ms
    17/10/07 00:58:21 INFO codegen.GenerateOrdering: Code generated in 85.573905 ms
    17/10/07 00:58:21 INFO datasources.DynamicPartitionWriterContainer: Sorting complete. Writing out partition files one at a time.
    17/10/07 00:58:21 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 54
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-hadoop-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-pig-bundle-1.5.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-format-2.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-jdbc-1.1.0-cdh5.7.0-standalone.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-exec-1.1.0-cdh5.7.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [shaded.parquet.org.slf4j.helpers.NOPLoggerFactory]
    17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 80B for [name] BINARY: 2 values, 26B raw, 43B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 73B for [pcode] BINARY: 2 values, 24B raw, 38B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:21 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcoe] BINARY: 2 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO codec.CodecConfig: Compression: GZIP
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Dictionary is on
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Validation is off
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
    17/10/07 00:58:22 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
    {
      "type" : "struct",
      "fields" : [ {
        "name" : "name",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "pcode",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "pcoe",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    }
    and corresponding Parquet message type:
    message spark_schema {
      optional binary name (UTF8);
      optional binary pcode (UTF8);
      optional binary pcoe (UTF8);
    }

    17/10/07 00:58:22 INFO compress.CodecPool: Got brand-new compressor [.gz]
    17/10/07 00:58:22 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 26
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [name] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcode] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [pcoe] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO codec.CodecConfig: Compression: GZIP
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Dictionary is on
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Validation is off
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
    17/10/07 00:58:22 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
    {
      "type" : "struct",
      "fields" : [ {
        "name" : "name",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "pcode",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "pcoe",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    }
    and corresponding Parquet message type:
    message spark_schema {
      optional binary name (UTF8);
      optional binary pcode (UTF8);
      optional binary pcoe (UTF8);
    }

    17/10/07 00:58:22 INFO compress.CodecPool: Got brand-new compressor [.gz]
    17/10/07 00:58:22 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 28
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 74B for [name] BINARY: 1 values, 17B raw, 35B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [pcode] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcoe] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on localhost:59616 in memory (size: 21.4 KB, free: 208.7 MB)
    17/10/07 00:58:22 INFO codec.CodecConfig: Compression: GZIP
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Dictionary is on
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Validation is off
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
    17/10/07 00:58:22 INFO hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
    17/10/07 00:58:22 INFO parquet.CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
    {
      "type" : "struct",
      "fields" : [ {
        "name" : "name",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "pcode",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      }, {
        "name" : "pcoe",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    }
    and corresponding Parquet message type:
    message spark_schema {
      optional binary name (UTF8);
      optional binary pcode (UTF8);
      optional binary pcoe (UTF8);
    }

    17/10/07 00:58:22 INFO compress.CodecPool: Got brand-new compressor [.gz]
    17/10/07 00:58:22 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 13
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 68B for [name] BINARY: 1 values, 15B raw, 33B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcode] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO hadoop.ColumnChunkPageWriteStore: written 47B for [pcoe] BINARY: 1 values, 6B raw, 26B comp, 1 pages, encodings: [RLE, BIT_PACKED, PLAIN]
    17/10/07 00:58:22 INFO output.FileOutputCommitter: Saved output of task 'attempt_201710070058_0001_m_000000_0' to hdfs://localhost:8020/user/hive/warehouse/people/_temporary/0/task_201710070058_0001_m_000000
    17/10/07 00:58:22 INFO mapred.SparkHadoopMapRedUtil: attempt_201710070058_0001_m_000000_0: Committed
    17/10/07 00:58:22 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 2057 bytes result sent to driver
    17/10/07 00:58:22 INFO scheduler.DAGScheduler: ResultStage 1 (saveAsTable at NativeMethodAccessorImpl.java:-2) finished in 2.797 s
    17/10/07 00:58:22 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 2797 ms on localhost (1/1)
    17/10/07 00:58:22 INFO scheduler.DAGScheduler: Job 1 finished: saveAsTable at NativeMethodAccessorImpl.java:-2, took 3.236619 s
    17/10/07 00:58:22 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
    17/10/07 00:58:23 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
    17/10/07 00:58:23 INFO datasources.DynamicPartitionWriterContainer: Job job_201710070058_0000 committed.
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=19 on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=30 on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=46 on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=__HIVE_DEFAULT_PARTITION__ on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=19 on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=30 on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=46 on driver
    17/10/07 00:58:23 INFO parquet.ParquetRelation: Listing hdfs://localhost:8020/user/hive/warehouse/people/age=__HIVE_DEFAULT_PARTITION__ on driver
    17/10/07 00:58:24 WARN hive.HiveContext$$anon$2: Persisting partitioned data source relation `people` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): 
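The `Listing ...` lines near the end of the log show how the partitioned table is laid out on HDFS: one `age=<value>` directory per distinct age, plus `age=__HIVE_DEFAULT_PARTITION__` for rows whose age is null. The Parquet schema in the log also contains only `name`, `pcode` and `pcoe`, because the partition column is encoded in the directory name rather than stored in the data files. The plain-Python sketch below imitates that grouping for the five rows shown at the end of the post. The function names are illustrative, and the sketch ignores the escaping Spark applies to special characters in partition values.

```python
from collections import defaultdict

# Directory-name placeholder used for rows whose partition value is NULL.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_dir(column, value):
    """Return the Hive-style directory name for one partition value."""
    if value is None:
        value = HIVE_DEFAULT_PARTITION
    return "{0}={1}".format(column, value)


def group_by_partition(rows, column):
    """Group rows by partition directory, dropping the partition column."""
    groups = defaultdict(list)
    for row in rows:
        # The partition column lives in the path, not in the data file.
        data = {key: val for key, val in row.items() if key != column}
        groups[partition_dir(column, row.get(column))].append(data)
    return dict(groups)


# The five rows from the table displayed at the end of the post.
people = [
    {"name": "Brayden", "pcode": "94304", "pcoe": None, "age": 30},
    {"name": "Diana", "pcode": None, "pcoe": None, "age": 46},
    {"name": "Carla", "pcode": None, "pcoe": "10036", "age": 19},
    {"name": "Alice", "pcode": "94304", "pcoe": None, "age": None},
    {"name": "Etienne", "pcode": "94104", "pcoe": None, "age": None},
]

layout = group_by_partition(people, "age")
for directory in sorted(layout):
    print(directory, [row["name"] for row in layout[directory]])
```

This matches the log, where one of the four Parquet files receives two values per column (the two rows without an age) and the other three receive one each.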

    [training@localhost ~]$ hive

    > show tables like 'people';
    Time taken: 5.046 seconds, Fetched: 1 row(s)
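The same check can be made from the PySpark side instead of the Hive shell. A minimal sketch, assuming `sql_context` is a `HiveContext` (or `SQLContext`) like the one created earlier; the helper name `table_exists` is illustrative, while `tableNames()` is the context method that returns the table names in a database.

```python
def table_exists(sql_context, table_name, db_name=None):
    """Return True if table_name is listed in the given (or current) database."""
    if db_name is None:
        names = sql_context.tableNames()
    else:
        names = sql_context.tableNames(db_name)
    # Hive table names are case-insensitive, so compare in lower case.
    return table_name.lower() in [name.lower() for name in names]


# In the pyspark shell:
# table_exists(sqlContext, "people")
```

Keep in mind the warning at the end of the log: the table is registered in the metastore in a Spark SQL specific format, so Hive can list it but is not expected to read its data correctly.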

    sqlContext = HiveContext(sc)
    newPeopleDF = sqlContext.read.table("people")
    newPeopleDF.show()


    +-------+-----+-----+----+
    |   name|pcode| pcoe| age|
    +-------+-----+-----+----+
    |Brayden|94304| null|  30|
    |  Diana| null| null|  46|
    |  Carla| null|10036|  19|
    |  Alice|94304| null|null|
    |Etienne|94104| null|null|
    +-------+-----+-----+----+

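    As the output shows, the DataFrame read from the JSON file was indeed written to a Parquet-format table named people.

    Reading that table back through HiveContext then returns its data, which is displayed above.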

  • Original article: https://www.cnblogs.com/gaojian/p/dataframe_write.html