• File size differences across Hive table storage formats (data with no duplicates)


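    -- Key point: the target tables contain no duplicate data

    -- dbName.num_result contains no duplicate records
    -- Create the tables and insert the data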
    CREATE TABLE dbName.test_textfile(
      `key` string, 
      `value` string,
      `p_key` string, 
      `p_key2` string)
    STORED AS textfile
    ;
    insert overwrite table dbName.test_textfile select * from dbName.num_result where p_key='9' and p_key2='0';
    
    drop table dbName.test_orcfile;
    CREATE TABLE dbName.test_orcfile(
      `key` string, 
      `value` string,
      `p_key` string, 
      `p_key2` string)
    STORED AS orc
    ;
    insert overwrite table dbName.test_orcfile select * from dbName.test_textfile;
    
    CREATE TABLE dbName.test_rcfile(
      `key` string, 
      `value` string,
      `p_key` string, 
      `p_key2` string)
    STORED AS rcfile
    ;
    insert overwrite table dbName.test_rcfile select * from dbName.test_textfile;
    
    CREATE TABLE dbName.test_parquet(
      `key` string, 
      `value` string,
      `p_key` string, 
      `p_key2` string)
    STORED AS parquet
    ;
    insert overwrite table dbName.test_parquet select * from dbName.test_textfile;
    
    -- Count the rows in each table
    select count(1) as cnt from dbName.test_textfile;
    select count(1) as cnt from dbName.test_orcfile;
    select count(1) as cnt from dbName.test_rcfile;
    select count(1) as cnt from dbName.test_parquet;
    
    -- Measure the file size of each table
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_text*;
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_par*;
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_rc*;
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_orc*;
    1.0 G  3.1 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_textfile
    1.1 G  3.3 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_parquet
    984.0 M  2.9 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_rcfile
    470.0 M  1.4 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_orcfile

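    As the results show, when the data contains no duplicates, Parquet's compression gains nothing: the Parquet table (1.1 G) takes even more space than the textfile table (1.0 G). ORC compresses best of the four formats, at 470.0 M.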

    hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_text*;
    1110741501  3332224503  hdfs://nameNode/user/hive/warehouse/dbName.db/test_textfile
    hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_par*;
    1167366639  3502099917  hdfs://nameNode/user/hive/warehouse/dbName.db/test_parquet
    hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_rc*;
    1031774688  3095324064  hdfs://nameNode/user/hive/warehouse/dbName.db/test_rcfile
    hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_orc*;
    492795434  1478386302  hdfs://nameNode/user/hive/warehouse/dbName.db/test_orcfile
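
To put numbers on the comparison, here is a minimal Python sketch that expresses each table's size as a fraction of the textfile table. The byte counts are copied from the `dfs -du -s` output above; the helper names `ratio_to_textfile` and `replica_multiple` are made up for this illustration. It assumes the second column of the `du` output is the disk space consumed across all HDFS replicas, which is how recent Hadoop versions report it.

```python
# Byte counts copied from the `dfs -du -s` output above.
# Each entry: (logical size, disk space consumed incl. replicas -- assumed meaning)
sizes = {
    "textfile": (1110741501, 3332224503),
    "parquet": (1167366639, 3502099917),
    "rcfile": (1031774688, 3095324064),
    "orcfile": (492795434, 1478386302),
}


def ratio_to_textfile(fmt):
    """Logical size of `fmt` as a fraction of the textfile table's size."""
    return sizes[fmt][0] / sizes["textfile"][0]


def replica_multiple(fmt):
    """Second du column divided by the first (integer division)."""
    logical, consumed = sizes[fmt]
    return consumed // logical


for fmt in sizes:
    print(f"{fmt:<9} {ratio_to_textfile(fmt):7.1%} of textfile, "
          f"x{replica_multiple(fmt)} on disk")
```

On these figures Parquet comes out at roughly 105% of the textfile size, RCFile at roughly 93%, and ORC at roughly 44%. The second column is exactly three times the first for every table, which is consistent with a replication factor of 3.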
  • Original article: https://www.cnblogs.com/chenzechao/p/10072555.html