• Apache Hive 存储方式、压缩格式



    简介:

    Apache hive 存储方式跟压缩格式!

    1、Text File

    hive> create external table tab_textfile (
    host string comment 'client ip address', 
    local_time string comment 'client access time', 
    api string comment 'request api', 
    request_type string comment 'request method, http version', 
    http_code int, body_bytes int, request_body map<string, string>, 
    referer string, user_agent string, upstr string, response_time string, request_time string) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '=';
    OK
    Time taken: 0.162 seconds

    # 创建一张 Text File 存储格式、不压缩的外部表

    hive> load data local inpath '/data/logs/api/201711/tvlog_20171101/bftvapi.20171101.log' overwrite into table tab_textfile;
    Loading data to table tmpdb.tab_textfile
    OK
    Time taken: 1015.974 seconds

    # 原始文件 9.8G,加载到该表中需要花费 1015.974 秒 ( 这里可以优化,不使用 load 指令,直接 put 文件到数据表目录 )

    hive> select count(*) from tab_textfile;
    ...
    Stage-Stage-1: Map: 39  Reduce: 1   Cumulative CPU: 269.51 sec   HDFS Read: 10463240195 HDFS Write: 108 SUCCESS
    Total MapReduce CPU Time Spent: 4 minutes 29 seconds 510 msec
    OK
    27199202
    Time taken: 95.68 seconds, Fetched: 1 row(s)

    # 总共 27199202 行数据,用时 95.68 秒
    # 优化点:set [ hive.exec.reducers.bytes.per.reducer=<number>, hive.exec.reducers.max=<number>, mapreduce.job.reduces=<number> ]

    2、ORC File

    # 官方文档:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

    # ORC文档:https://orc.apache.org/docs

    hive> create external table tab_orcfile (
    host string comment 'client ip address', 
    local_time string comment 'client access time', 
    api string comment 'request api', 
    request_type string comment 'request method, http version', 
    http_code int, body_bytes int, request_body map<string, string>, 
    referer string, user_agent string, upstr string, response_time string, request_time string) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '=' 
    STORED AS ORC tblproperties ("orc.compress"="NONE");
    OK
    Time taken: 0.058 seconds

    # 创建一张 ORC File 存储格式、不压缩的外部表

    hive> insert overwrite table tab_orcfile select * from tab_textfile;
    ...
    Stage-Stage-1: Map: 39   Cumulative CPU: 2290.24 sec   HDFS Read: 10463288479 HDFS Write: 2474228733 SUCCESS
    Total MapReduce CPU Time Spent: 38 minutes 10 seconds 240 msec
    OK
    Time taken: 289.954 seconds

    # 向 tab_orcfile 中加载数据,注意:ORC File 不能直接 load data !!!

    # 可以先创建 Text File 的临时表,将数据手动上传到该表指定目录,然后转换成 ORC File 格式。

    hive> select count(*) from tab_orcfile;
    OK
    27199202
    Time taken: 2.555 seconds, Fetched: 1 row(s)

    # 额,同样的语句,上面执行花费 95.68 秒,现在只要 2.555 秒。
    # 换一种方式测试,先查 tab_orcfile 表,然后再查 tab_textfile 表。

    hive> select count(host) from tab_orcfile;
    ...
    Stage-Stage-1: Map: 9  Reduce: 1   Cumulative CPU: 81.02 sec   HDFS Read: 96908995 HDFS Write: 108 SUCCESS
    Total MapReduce CPU Time Spent: 1 minutes 21 seconds 20 msec
    OK
    27199202
    Time taken: 33.55 seconds, Fetched: 1 row(s)

    # ORC File 花费 33.55 秒

    hive> select count(host) from tab_textfile;
    ...
    Stage-Stage-1: Map: 39  Reduce: 1   Cumulative CPU: 349.77 sec   HDFS Read: 10463246048 HDFS Write: 108 SUCCESS
    Total MapReduce CPU Time Spent: 5 minutes 49 seconds 770 msec
    OK
    27199202
    Time taken: 87.308 seconds, Fetched: 1 row(s)

    # Text File 花费 87.308 秒,高下立见!

    3、启用压缩

    # ORC 文档:https://orc.apache.org/docs/hive-config.html

    hive> create external table tab_orcfile_zlib (
    host string comment 'client ip address', 
    local_time string comment 'client access time', 
    api string comment 'request api', 
    request_type string comment 'request method, http version', 
    http_code int, body_bytes int, request_body map<string, string>, 
    referer string, user_agent string, upstr string, response_time string, request_time string) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY '&' MAP KEYS TERMINATED BY '=' 
    STORED AS ORC;

    # 默认的 ORC 压缩方式为 ZLIB,还支持 LZO、SNAPPY 等

    hive> insert overwrite table tab_orcfile_zlib select * from tab_textfile;
    ...
    Stage-Stage-1: Map: 39   Cumulative CPU: 2344.68 sec   HDFS Read: 10463292808 HDFS Write: 1077757683 SUCCESS
    Total MapReduce CPU Time Spent: 39 minutes 4 seconds 680 msec
    OK
    Time taken: 299.204 seconds

    # 数据加载完成

    hive> select count(host) from tab_orcfile_zlib;
    ...
    Stage-Stage-1: Map: 4  Reduce: 1   Cumulative CPU: 43.66 sec   HDFS Read: 66760966 HDFS Write: 108 SUCCESS
    Total MapReduce CPU Time Spent: 43 seconds 660 msec
    OK
    27199202
    Time taken: 31.369 seconds, Fetched: 1 row(s)

    # 查询速度不受影响

    hive> dfs -ls -h /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/
    Found 39 items
    -rwxrwxrwx   3 root supergroup     24.6 M 2017-11-10 16:55 /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000000_0
    -rwxrwxrwx   3 root supergroup     23.0 M 2017-11-10 16:56 /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000001_0
    -rwxrwxrwx   3 root supergroup     25.9 M 2017-11-10 16:55 /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000002_0
    -rwxrwxrwx   3 root supergroup     26.5 M 2017-11-10 16:55 /user/hive/warehouse/tmpdb.db/tab_orcfile_zlib/000003_0

    # 总共分成 39 个文件,每个平均 25M,总过不到 1G,原始文件 9.8G,这压缩比如何 ?

  • 相关阅读:
    Vue 计算属性(四)
    Vue 方法与事件(三)
    Vue 基本指令使用(二)
    Vue 项目开发环境搭建(一)
    SpringBoot 整合 Dubbo
    Nginx 中 include 指令使用
    Nginx 中 root 和 alias 的使用区别
    JS动态修改网站图标以及标题
    vue中使用轮播图插件carousel,克隆的图片点击事件无效的解决办法
    根据 url + fileName下载文件,并更改文件名
  • 原文地址:https://www.cnblogs.com/wangxiaoqiangs/p/7815560.html
Copyright © 2020-2023  润新知