• HiveServer2, Beeline, and Data Compression and Storage in Hive


      1. Using HiveServer2 and Beeline

      What HiveServer2 does: it exposes Hive as a server so that multiple clients can connect to it.

      Start the namenode, datanode, resourcemanager, and nodemanager first.

      In one terminal: hive-0.13.1]$ bin/hiveserver2 starts the HiveServer2 service; this is equivalent to $ bin/hive --service hiveserver2

      In a second terminal: ~]$ ps -ef | grep java checks that the HiveServer2 process is running

      In the second terminal: hive-0.13.1]$ bin/beeline starts Beeline

      In the second terminal: beeline> !connect jdbc:hive2://hadoop-senior.ibeifeng.com:10000 beifeng beifeng org.apache.hive.jdbc.HiveDriver connects Beeline to the HiveServer2 service

      HiveServer2's default port is 10000. To change the port temporarily: $ bin/hiveserver2 --hiveconf hive.server2.thrift.port=14000
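
      If the port has been changed this way, point Beeline at the new port as well; a one-line sketch using the same host as above:

      $ bin/beeline -u jdbc:hive2://hadoop-senior.ibeifeng.com:14000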

      Using Beeline:

      0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> show databases;

      0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> use default;

      0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> show tables;

      0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> select * from emp; (does not launch a MapReduce job)

      0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> select empno from emp; (launches a MapReduce job)

      [beifeng@hadoop-senior hive-0.13.1]$ bin/beeline -u jdbc:hive2://hadoop-senior.ibeifeng.com:10000/default

      Connects directly to HiveServer2 and opens the default database in Beeline.

      beeline> !connect jdbc:hive2://bigdata-senior01.ibeifeng.com:10000

      $ bin/beeline -u jdbc:hive2://bigdata-senior01.ibeifeng.com:10000 -n beifeng -p 123456

      -u: the JDBC connection URL to connect to (-n is the username, -p the password)

      $ bin/beeline --help lists the commonly used options and parameters

      Using HiveServer2 over JDBC:

      Analysis results are stored in Hive tables, and the front end queries them through DAO code. Note that HiveServer2's concurrency is limited, so concurrent access has to be handled separately. HiveServer2 must be running before you connect over JDBC.

      The Hive JDBC usage pattern:

      Class.forName("org.apache.hive.jdbc.HiveDriver");

      Connection conn = DriverManager.getConnection("jdbc:hive2://<host>:<port>/<db>", "<user>", "<password>");
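
      For reference, a minimal self-contained sketch of this pattern. It reuses the host, port, user, and the emp table from the Beeline examples above; the class name is illustrative, and the hive-jdbc driver jar is assumed to be on the classpath.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class HiveJdbcDemo {
          public static void main(String[] args) throws Exception {
              // Register the HiveServer2 JDBC driver
              Class.forName("org.apache.hive.jdbc.HiveDriver");
              // Same connection details as the Beeline examples above
              Connection conn = DriverManager.getConnection(
                      "jdbc:hive2://hadoop-senior.ibeifeng.com:10000/default", "beifeng", "beifeng");
              Statement stmt = conn.createStatement();
              // Simple query against the emp table used earlier
              ResultSet rs = stmt.executeQuery("select empno from emp");
              while (rs.next()) {
                  System.out.println(rs.getString(1));
              }
              rs.close();
              stmt.close();
              conn.close();
          }
      }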

      2. Hive Runtime Configuration

      Temporary, session-level setting from the Hive shell (querying the property prints its current value):

      set hive.fetch.task.conversion;

      hive.fetch.task.conversion=minimal

      Permanent setting in the hive-site.xml configuration file:

      <property>
          <name>hive.fetch.task.conversion</name>
          <value>minimal</value>
          <description>
              Some select queries can be converted to single FETCH task minimizing latency.
              Currently the query should be single sourced not having any subquery and should not have
              any aggregations or distincts (which incurs RS), lateral views and joins.
              1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
              2. more    : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
          </description>
      </property>
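
      A short illustration of the difference between the two modes, using the emp table from section 1; the behavior follows the description above:

      set hive.fetch.task.conversion=minimal;
      select empno from emp;   -- projecting a non-partition column still launches a MapReduce job
      set hive.fetch.task.conversion=more;
      select empno from emp;   -- now served by a single FETCH task, no MapReduce job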

      3. Virtual Columns

      One is 【INPUT__FILE__NAME】, which is the input file's name for a mapper task.

      The other is 【BLOCK__OFFSET__INSIDE__FILE】, which is the current global file position.

      【INPUT__FILE__NAME】 tells you which input file a given row comes from:

      select deptno,dname,INPUT__FILE__NAME from dept;

      【BLOCK__OFFSET__INSIDE__FILE】 gives the block's byte offset within the file:

      select deptno,dname,BLOCK__OFFSET__INSIDE__FILE from dept;

      Note that both virtual column names are written with double underscores.

      4. Installing the Snappy Compression Format

      (1) Install Snappy: download the Snappy package, then extract and install it.

      Snappy download page: http://google.github.io/snappy/.

      (2) Build the Hadoop 2.x source with native Snappy support:

      mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

      The compiled native libraries end up under /opt/modules/hadoop-2.5.0-src/target/hadoop-2.5.0/lib/native.

      [beifeng@hadoop-senior hadoop-2.5.0]$ bin/hadoop checknative

      15/08/31 23:10:16 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native

      15/08/31 23:10:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

      Native library checking:

      hadoop: true /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so

      zlib: true /lib64/libz.so.1

      snappy: true /opt/modules/hadoop-2.5.0/lib/native/libsnappy.so.1

      lz4: true revision:99

      bzip2: true /lib64/libbz2.so.1

      Run the wordcount MapReduce example with the newly installed Snappy support:

      Without compression:

      bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output

      With Snappy-compressed map output:

      bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output2

      5. Data Compression in Hive

      Compression formats: bzip2, gzip, lzo, snappy, etc.

      Compression ratio: bzip2 > gzip > lzo (bzip2 saves the most storage space)

      Decompression speed: lzo > gzip > bzip2 (lzo decompresses fastest)

      Benefits of data compression:

      (1) Saves disk I/O (map output compression) and network transfer I/O (reduce output compression).

      (2) The data itself becomes smaller.

      (3) Jobs run faster, because each task handles less data.

      (4) Splittability must be considered: the compressed file should be splittable, so that each split can be processed by an independent task.

      Compression and decompression during the MapReduce pipeline:

      Compression codecs supported in Hadoop:

      Compression format  Codec class

      Zlib  org.apache.hadoop.io.compress.DefaultCodec

      Gzip  org.apache.hadoop.io.compress.GzipCodec

      Bzip2  org.apache.hadoop.io.compress.BZip2Codec

      Lzo  com.hadoop.compression.lzo.LzoCodec

      Lz4  org.apache.hadoop.io.compress.Lz4Codec

      Snappy  org.apache.hadoop.io.compress.SnappyCodec

      MapReduce compression property settings:

      Usage            Properties (set in core-site.xml / mapred-site.xml, or passed per job)

      Map output       mapreduce.map.output.compress = true
                       mapreduce.map.output.compress.codec = CodecName

      Reduce output    mapreduce.output.fileoutputformat.compress = true
                       mapreduce.output.fileoutputformat.compress.codec = CodecName

      Example: running MapReduce with Snappy map-output compression:

      bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/beifeng/mr/input /user/beifeng/mr/output2

      Hive compression property settings:

      <property>
          <name>hive.exec.compress.intermediate</name>
          <value>true</value>
      </property>

      Usage            SET commands in the Hive session

      Map output       SET hive.exec.compress.intermediate = true;
                       SET mapreduce.map.output.compress = true;
                       SET mapred.map.output.compression.codec = CodecName;
                       SET mapred.map.output.compression.type = BLOCK/RECORD;

      Reduce output    SET hive.exec.compress.output = true;
                       SET mapred.output.compression.codec = CodecName;
                       SET mapred.output.compression.type = BLOCK/RECORD;
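
      A hedged end-to-end sketch of these settings in a Hive session: enable Snappy-compressed job output before writing a table. The target table name emp_snappy_copy is hypothetical; emp is the table queried in section 1.

      SET hive.exec.compress.output=true;
      SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
      SET mapred.output.compression.type=BLOCK;
      -- the files written into this table's warehouse directory will be Snappy-compressed
      CREATE TABLE emp_snappy_copy AS SELECT * FROM emp;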

      6. Data File Storage Formats

      file_format:

      : SEQUENCEFILE

      | TEXTFILE -- (Default, depending on hive.default.fileformat configuration)

      | RCFILE -- (Note: Available in Hive 0.6.0 and later)

      | ORC -- (Note: Available in Hive 0.11.0 and later)

      | PARQUET -- (Note: Available in Hive 0.13.0 and later)

      | AVRO -- (Note: Available in Hive 0.14.0 and later)

      | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

      Storage formats fall into row-oriented and column-oriented layouts.

      (1) ORCFile (Optimized Row Columnar File): supported by Hive, Shark, and Spark. Use ORCFile for tables with many columns.

      (2) Parquet (open-sourced by Twitter and Cloudera; supported by Hive, Spark, Drill, Impala, Pig, and others). Parquet is more complex; it is mainly inspired by Dremel, and the highlights of its storage structure are support for nested data structures and an efficient, varied set of algorithms for compressing data with different value distributions.

      (1) Stored as TEXTFILE

      create table page_views(

      track_time string,

      url string,

      session_id string,

      referer string,

      ip string,

      end_user_id string,

      city_id string

      )

      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '

      STORED AS TEXTFILE ;

      load data local inpath '/opt/datas/page_views.data' into table page_views ;

      dfs -du -h /user/hive/warehouse/page_views/ ;

      18.1 M /user/hive/warehouse/page_views/page_views.data

      (2) Stored as ORC

      create table page_views_orc(

      track_time string,

      url string,

      session_id string,

      referer string,

      ip string,

      end_user_id string,

      city_id string

      )

      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '

      STORED AS orc ;

      insert into table page_views_orc select * from page_views ;

      dfs -du -h /user/hive/warehouse/page_views_orc/ ;

      2.6 M /user/hive/warehouse/page_views_orc/000000_0

      (3) Stored as PARQUET

      create table page_views_parquet(

      track_time string,

      url string,

      session_id string,

      referer string,

      ip string,

      end_user_id string,

      city_id string

      )

      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '

      STORED AS PARQUET ;

      insert into table page_views_parquet select * from page_views ;

      dfs -du -h /user/hive/warehouse/page_views_parquet/ ;

      13.1 M /user/hive/warehouse/page_views_parquet/000000_0

      (4) Stored as ORC with Snappy compression

      create table page_views_orc_snappy(

      track_time string,

      url string,

      session_id string,

      referer string,

      ip string,

      end_user_id string,

      city_id string

      )

      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '

      STORED AS orc tblproperties ("orc.compress"="SNAPPY");

      insert into table page_views_orc_snappy select * from page_views ;

      dfs -du -h /user/hive/warehouse/page_views_orc_snappy/ ;

      (5) Stored as ORC without compression

      create table page_views_orc_none(

      track_time string,

      url string,

      session_id string,

      referer string,

      ip string,

      end_user_id string,

      city_id string

      )

      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '

      STORED AS orc tblproperties ("orc.compress"="NONE");

      insert into table page_views_orc_none select * from page_views ;

      dfs -du -h /user/hive/warehouse/page_views_orc_none/ ;

      (6) Stored as PARQUET with Snappy compression

      set parquet.compression=SNAPPY ;

      create table page_views_parquet_snappy(

      track_time string,

      url string,

      session_id string,

      referer string,

      ip string,

      end_user_id string,

      city_id string

      )

      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '

      STORED AS parquet;

      insert into table page_views_parquet_snappy select * from page_views ;

      dfs -du -h /user/hive/warehouse/page_views_parquet_snappy/ ;

      In real projects, Hive tables are usually stored as ORC or Parquet, and Snappy is the usual choice for data compression.
