• Hive file formats


    Hive's default storage format is plain text (TEXTFILE). Text files are easy to inspect and easy to share with other tools, but they take up more space than binary formats.

    hive> create table tb_test(id int,name string) stored as textfile;
    OK
    Time taken: 0.968 seconds
    hive> show create table tb_test;
    OK
    createtab_stmt
    CREATE TABLE `tb_test`(
      `id` int,
      `name` string)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_test'
    TBLPROPERTIES (
      'COLUMN_STATS_ACCURATE'='{"BASIC_STATS":"true"}',
      'numFiles'='0',
      'numRows'='0',
      'rawDataSize'='0',
      'totalSize'='0',
      'transient_lastDdlTime'='1536636132')
    Time taken: 0.275 seconds, Fetched: 18 row(s)
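
    Because a TEXTFILE table stores its rows as plain text under the warehouse directory, the data can be inspected directly. A minimal sketch (the inserted values are made up) of populating tb_test and viewing the raw file from the Hive CLI:

    -- hypothetical example: insert one row into the TEXTFILE table above
    insert into tb_test values (1, 'tom');
    -- the Hive CLI can run dfs commands; the raw file is readable text,
    -- with the default field delimiter \001 (Ctrl-A) between the columns
    dfs -cat /user/hive/warehouse/gamedw.db/tb_test/*;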

    SEQUENCEFILE is a binary key-value file format that Hadoop supports natively as one of its standard formats. It can be compressed at the record or block level, which matters for disk usage and I/O, and it is splittable at the block level, which makes parallel processing convenient.

    hive> create table tb_test2(id int,name string) stored as sequencefile;
    OK
    Time taken: 0.264 seconds
    hive> show create table tb_test2;
    OK
    createtab_stmt
    CREATE TABLE `tb_test2`(
      `id` int,
      `name` string)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.SequenceFileInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
    LOCATION
      'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_test2'
    TBLPROPERTIES (
      'COLUMN_STATS_ACCURATE'='{"BASIC_STATS":"true"}',
      'numFiles'='0',
      'numRows'='0',
      'rawDataSize'='0',
      'totalSize'='0',
      'transient_lastDdlTime'='1536636180')
    Time taken: 0.222 seconds, Fetched: 18 row(s)
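
    The block-level compression mentioned above is enabled through session settings rather than the DDL. A hedged sketch, assuming the GZip codec is available on the cluster (verify the property names against your Hive/Hadoop versions):

    -- compress the output of subsequent INSERTs into SequenceFile tables
    SET hive.exec.compress.output=true;
    -- compress at the block level (alternatives: NONE, RECORD)
    SET io.seqfile.compression.type=BLOCK;
    -- assumed codec; any codec installed on the cluster can be used
    SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;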

    RCFILE is another efficient binary storage format supported by Hive. Most Hadoop and Hive storage is row-oriented, which is efficient in most cases, but for certain kinds of data and applications columnar storage works better. If a table has hundreds or thousands of columns and most queries read only a few of them, scanning every row and throwing away most of the data is wasteful; with columnar storage a query reads only the columns it needs, which improves performance.

    Columnar storage also tends to compress well, and some columnar formats do not need to physically store a column's NULL values.
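
    As a hypothetical illustration (tb_wide and its columns are made up), a query against a columnar table that touches only a couple of columns avoids reading the rest of each row:

    -- imagine a table with hundreds of columns, stored in a columnar format
    create table tb_wide(c1 int, c2 string, c3 string) stored as rcfile;
    -- only the c1 and c2 column data has to be read for this query
    select c1, c2 from tb_wide where c1 > 100;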

    One of Hive's strengths is how simple it is to convert data between storage formats: an INSERT INTO ... SELECT from a table in one format into a table in another format performs the conversion automatically.
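
    For example, converting the rows of the TEXTFILE table tb_test into the SEQUENCEFILE table tb_test2 is a single statement (a sketch assuming the two tables created above):

    -- Hive reads the text rows and rewrites them in the target table's SequenceFile format
    insert into tb_test2 select id, name from tb_test;

    The same pattern works for any pair of storage formats, including the RCFile table created next.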

    hive> create table tb_test3(id int,name string) stored as RCfile;
    OK
    Time taken: 0.438 seconds
    hive> show create table tb_test3;
    OK
    createtab_stmt
    CREATE TABLE `tb_test3`(
      `id` int,
      `name` string)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
    LOCATION
      'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_test3'
    TBLPROPERTIES (
      'COLUMN_STATS_ACCURATE'='{"BASIC_STATS":"true"}',
      'numFiles'='0',
      'numRows'='0',
      'rawDataSize'='0',
      'totalSize'='0',
      'transient_lastDdlTime'='1536645316')
    Time taken: 0.244 seconds, Fetched: 18 row(s)

    Hive ships with the rcfilecat tool for displaying the contents of an RCFile:

    [root@host ~]# hdfs dfs -ls /user/hive/warehouse/gamedw.db/tb_test3
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Found 4 items
    -rwx-wx-wx   1 root supergroup         82 2018-09-11 13:57 /user/hive/warehouse/gamedw.db/tb_test3/part-00000-3b1d31b4-cde8-4054-b5d3-28179d2a4cc8-c000
    -rwx-wx-wx   1 root supergroup         87 2018-09-11 14:05 /user/hive/warehouse/gamedw.db/tb_test3/part-00000-565683b3-738e-4627-8c7a-53d43d819a0e-c000
    -rwx-wx-wx   1 root supergroup         88 2018-09-11 14:05 /user/hive/warehouse/gamedw.db/tb_test3/part-00000-b2ccd667-954d-4f8e-8110-04915d810e17-c000
    -rwx-wx-wx   1 root supergroup         82 2018-09-11 13:57 /user/hive/warehouse/gamedw.db/tb_test3/part-00000-bc3be7c3-86de-4873-8390-a850872fe0c7-c000

    [root@host ~]# hive --service rcfilecat /user/hive/warehouse/gamedw.db/tb_test3/part-00000-565683b3-738e-4627-8c7a-53d43d819a0e-c000
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/root/hive/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/spark/spark-2.2.0-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
           wang    2

  • Original article: https://www.cnblogs.com/playforever/p/9627118.html