BG.Hive - 润新知

BG.Hive
1. 将mysql的订单数据导入hive的分区表（桶、倾斜）【partition，bucket，skew】

　　a> 在Hive中新建分区表　　
```
CREATE TABLE IF NOT EXISTS HelloHive.orders
(customernumber int COMMENT 'customernumber in orders',    #列注释
customername string,
orderdate date,
country string,
state string)
COMMENT 'orders table'    #表注释
PARTITIONED BY (orderyear int)    #分区字段
CLUSTERED BY (country) SORTED BY (orderdate) INTO 4 BUCKETS    #桶
SKEWED BY (state) ON ('null')    #倾斜字段
ROW FORMAT DELIMITED
STORED AS TEXTFILE;
```
　　b> 在Hive中创建和上述分区表格式一样的非分区表
```
hive> create table ori_orders
    > (customernumber int,
    > customername string,
    > orderdate date,
    > country string,
    > state string,
    > orderyear int);
```
　　c> 将mysql的数据导入到非分区表

　　　　sqoop --options-file load_mysql_to_hive.hql
```
#file name: load_mysql_to_hive.hql
import
--connect
jdbc:mysql://bigdata.mysql:3306/classicmodels
--username
bigdata
-P
--query
"select A.customernumber, B.customername, A.orderdate, B.country, B.state, year(A.orderDate) as orderyear from orders A left join customers B on A.customerNumber = B.customerNumber where $CONDITIONS"
--target-dir
'/usee/myql2hive/ori_orders'
--split-by
"A.customernumber"
--hive-import
--hive-overwrite
--hive-table
HelloHive.ori_orders
```
　　d> 将ori_order的数据导入到order中

　　set hive.exec.dynamic.partition=true; 　　#开启动态分区

　　set hive.exec.dynamic.partition.mode=nonstrict;　　#设置动态分区模式，strict:必须制定一个静态分区列；nonstrict:可以不用指定动态分区列
```
hive> insert overwrite table orders partition (orderyear) 
    > select customernumber,customername,orderdate,country,state,orderyear 
    > from ori_orders;
```
　　e> 在hdfs上查看order表的目录，会发现已经根据数据成功生成了3个分区。且分为了4个bucket和是按照state字段倾斜的表。

　　　　hive> show partitions orders;

　　f> 对比分区表和非分区表的查询速度

　　　　select * from ori_orders where orderyear=2003; 耗时0.781s；select * from orders where orderyear=2003; 耗时0.127s

　　　　select count(1) from ori_orders where orderyear=2003; 耗时17.747s; select count(1) from orders where orderyear=2003; 耗时13.7s

　　　　由于对表进行分区后，按照分区字段查询数据不需要对整个hdfs目录进行扫描，查询速度会快很多。

　　e> 该表的桶，字段为country，桶的作用

　　　　对于每一个表（table）或者分区， Hive可以进一步组织成桶，也就是说桶是更为细粒度的数据范围划分。Hive也是针对某一列进行桶的组织。Hive采用对列值哈希，然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中。

把表（或者分区）组织成桶（Bucket）有两个理由：

（1）获得更高的查询处理效率。桶为表加上了额外的结构，Hive 在处理有些查询时能利用这个结构。具体而言，连接两个在（包含连接列的）相同列上划分了桶的表，可以使用 Map 端连接（Map-side join）高效的实现。比如JOIN操作。对于JOIN操作两个表有一个相同的列，如果对这两个表都进行了桶操作。那么将保存相同列值的桶进行JOIN操作就可以，可以大大较少JOIN的数据量。

（2）使取样（sampling）更高效。在处理大规模数据集时，在开发和修改查询的阶段，如果能在数据集的一小部分数据上试运行查询，会带来很多方便。

2. 通过sqoop将mysql表导入Hive特定库中，生成新表

　　sqoop import --connect jdbc:mysql://bigdata.mysql:3306/classicmodels --table customers --username bigdata -P --hive-import --target-dir /user/hive/warehouse/tmpCustomer --hive-database HelloHive --hive-table customer

　　生成的Hive表为内表（Managed Table)

　　上面虽然指定的--target-dir，但生成的HDFS文件实在HelloHive数据库的目录下；

3. 创建Json格式表

　　错误：Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerde，解决方案如下

　　http://maven.apache.org/download.cgi　　#下载maven

　　tar zxf apache-maven-3.3.9-bin.tar.gz　　#复制到虚拟机并解压

　　vim /etc/profile　　#配置环境变量

　　　　MAVEN_HOME=/opt/apache-maven-3.3.9

　　　　PATH=$MAVEN_HOME/bin:(...)$PATH

　　source /etc/profile　　#使环境变量生效

　　mvn -version　　#检查maven是否成功安装

　　git clone https://github.com/rcongiu/Hive-JSON-Serde.git　　#下载Hive-JSON-Serde

　　cd /usr/local/Hive-JSON-Serde/　　#进入下载目录

　　hive> add jar /usr/local/Hive-JSON-Serde/json-udf/target/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar;　　#添加Json Jar包到Hive

　　hive> create table JsonTable(　　#创建表
    > id string,
    > name string)
    > row format serde 'org.openx.data.jsonserde.JsonSerDe'
    > stored as textfile;

　　hive> insert into JsonTable(id,name) values('port','8080')　　#插入数据

　　hdfs dfs -cat /user/hive/warehouse/jsontable/000000_0　　#查看hdfs文件，可发现数据结构为Json格式：{"name":"8080","id":"port"}

4. 多分区表

　　hive> create table partTable(id string, name string) partitioned by (iYear int, sCity string);　　#创建分区表，并定义分区字段

　　hive> alter table parttable add partition(iYear=2017, sCity='Beijing');　　#创建分区/iYear=2017/Beijing

　　hive> alter table parttable add partition(iYear=2017, sCity='Shanghai);　　#创建分区/iYear=2017/Shanghai

　　hive> insert into parttable partition(iYear=2017,sCity='Shanghai') select '1','2017ShangHai'　　#向表中插入数据

　　hive> insert into parttable partition(iYear=2017,sCity='Beijing') select '1','2017Beijing';

　　hive> select * from partTable;　　#结果为： 1 2017Beijing 2017 Beijing　　1 2017ShangHai 2017 Shanghai

　　select * from parttable where sCity='Shanghai';　　#结果为： select * from parttable where sCity='Shanghai';

5. Alter操作

　　hive> alter table orders rename to altorder;　　#重命名表

　　hive> alter table altorder set tblproperties('sort columns'='customernumber');　　#修改表属性，更改排序列；最终结果被保存在元数据库中

　　hive> alter table altorder set tblproperties ('comment'='alter orders comment');　　#修改表注释

　　hive> alter table altorder clustered by (customernumber) into 4 buckets;　　#修改桶属性

　　hive> alter table altorder clustered by (customernumber) sorted by (orderdate) into 4 buckets;　　#修改桶属性--增加排序字段

　　hive> alter table altorder skewed by (customername) on ('') stored as directories;　　#修改表的倾斜，

　　hive> alter table altorder add if not exists

　　　　>partition (orderyear=2006) location '/user/hive/warehouse/hellohive.db/altorder/orderyear=2006'

　　　　>partition (orderyear=2007) location '/user/hive/warehouse/hellohive.db/altorder/orderyear=2007'　　　　　　#增加两个新的分区

　　hive> alter table altorder partition (orderyear=2007) rename to partition (orderyear=2008);　　#重命名分区名，将分区2007改为2008

　　hive> create table order2 like altorder;　　#创建一个和altorder结构一样的表

　　hive> alter table order2 exchange partition (orderyear=2003) with table altorder;　　#交换分区，该语句允许将一个分区中的数据移动另一个拥有相同schema但没有那个分区的表中。

　　hdfs dfs -mkdir '/user/hive/warehouse/hellohive.db/order2/orderyear=2004'　　#修复分区 - 1；在hdfs中，增加分区目录

　　hive> msck repair table order2;　　#修复分区 - 2；在Hive中进行修复分区操作，将新分区加入元数据表

　　hive> show partitions order2;　　#修复分区 - 3；查看分区，可以发现，order表的新分区已经被加入

　　hive> alter table order2 drop if exists partition (orderyear=2004) purge;　　#删除分区，会删除分区的数据和分区的元数据

　　hive> alter table order2 archive partition (orderyear=2003);　　#归档分区　　SemanticException [Error 10107]: Archiving methods are currently disabled.

　　hive> alter table order2 unarchive partition (orderyear=2003);

　　hive> alter table order2 partition (orderyear=2003) set fileformat ORC;　　#修改分区文件格式；textfile/ orc/ rcfile/ parquet/ avro/ sequencefile

　　hive> alter table order2 partition (orderyear=2003) set location '/user/hive/warehouse/hellohive.db/order2/orderyear=2008'　　#修改分区路径，在新数据插入后，会放入新的路径；就数据不变

　　hive> create table t(a int, b int, c int);　　#创建测试表

　　hive> alter table t change a a1 int;　　#将列名a改成a1

　　hive> alter table t change a1 a2 string comment 'a2' after b;　　#j将a1改成a2，类型改为string，放到字段b后，增加描述

　　hive> alter table t add columns (d int);　　#增加d字段，位置为所有列后，分区列之前

　　hive> alter table t replace columns (a int, b string, c int, e int);　　#更改列名。重命名所有字段名；如果列数少于原表，则后面缺少的字段将在元数据中被删除，hdfs中数据不会删除。

6. Hive视图

　　视图如果不指明列名，视图的列名会按照AS子句中的字段命名；如果子句中存在未命名的字段，则在视图中显示为默认的字段名；如果视图指定了列名，则子句的列会按照顺序指定到视图列名。

　　hive> create view if not exists v_order(cid, cname)
    > comment 'view of order'
    > as select customernumber, customername from altorder;

　　hive> desc v_order;

　　hive> desc formatted v_order;

　　hive> alter view v_order set tblproperties ('comment' = 'view');　　#修改视图属性

　　hive> alter view v_order as select customernumber a, customername b from altorder;　　#修改视图子句，并重命名了视图字段名

　　hive> drop view if exists v_order;　　#删除视图

　　如果当前视图被其他视图引用，当前视图删除后，其他引用该视图的视图将会无效。

7. Hive索引

　　hive> create index idx_order2_customernumber　　#创建索引，索引会在hdfs上创建一个新的目录，在元数据中会创建一张[DBName]__[TableName]__[IndexName]__的表
    > on table order2 (customernumber)
    > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > with deferred rebuild

　　hive> show index on order2;

　　hive> hive> desc formatted hellohive__order2_idx_order2_customernumber__;

　　hive> alter index idx_order2_customernumber on order2 rebuild;　　#修改（重建）索引

　　hive> alter index idx_order2_customernumber on order2 partition (orderyear=2003) rebuild;　　#修改（重建）某个分区上的索引

　　hive> drop index idx_order2_customernumber on order2;　　#删除索引，会删除元数据中的表，hdfs中的目录也会被删除

8. Show命令

　　hive> show databases like 'h*';　　#like中*表示匹配任意字符；

　　hive> show databases like 'h*';　　#like中|表示或者

　　hive> show tables in hellohive like 'o*';　　#显示数据库中的所有表和视图

　　hive> show partitions altorder;　　#显示表分区情况

　　hive> show tblproperties altorder;　　#显示表的属性

　　hive> show tblproperties altorder ('sort columns');　　#显示表指定属性信息

　　hive> show create table hellohive.altorder;　　#显示表的创建信息，也可以显示View的

　　hive> show formatted index on order2 in hellohive;　　#显示某个表中所有的索引的详细信息

　　hive> show columns in altorder in hellohive;　　#显示表中所有字段信息，包括分区字段

　　hive> show locks　　#会显示如下信息

　　　　数据库名、表名、分区名

　　　　锁状态：获得（请求者已经获得锁）、等待（请求者等待获得锁）、终止（锁已经超时但还没有被清理）

　　　　锁类型：独占锁（其他用户不能再当前时间获得锁）、共享读锁（任意数量的共享读锁可以锁定相同的资源）、共享写锁（其他共享写锁不能锁定已经被共享写锁的资源）

　　　　与锁关联的事务ID

　　　　请求锁的用户

　　　　用户运行的主机

　　hive> show transactions;　　#会显示如下信息

　　　　事务ID、状态

　　hive> show compactions;　　#显示当Hive事务被使用时，所有正在被压缩或者即将被压缩的所有事务信息

　　　　数据库名称、表名、分区名

　　　　主压缩还是次压缩

　　　　压缩状态：初始化（队列中等待被压缩）、工作（正在被压缩）、清除（压缩已经结束）

　　

　　hive> desc database extended hellohive;　　#显示数据库的信息，路径、拥有者

　　hive> desc extended order2;　　#显示表信息；extended，以序列化格式显示；formatted，已格式化的格式显示　　

　　

　　

　　

　　

　　

　　

　　
相关阅读:
perf-stat
perf原理
 ubuntu中Docker的安装与使用
 NVM相关手册及新特性理解
 #2018BIT软件工程基础#结对项目：四则运算题目生成
 #2018BIT软件工程基础#个人项目：数独
 第一篇博文：自我介绍&新学期展望
 越早明白这些道理，越能少走一些弯路
 把知识连接起来就是创意
 【翻译】24款界面精美的免费UI工具包
原文地址：https://www.cnblogs.com/Niko12230/p/6518330.html