Analysis of Hive bucket features

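1. Hive bucket concepts

A bucket is a subdivision of a table or partition: the value of a specified column is used as a key and hashed, and each row is placed in the bucket its hash maps to. This layout supports efficient sampling.

Sampling can be run over the entire data set, but that is inefficient because every row still has to be read. If the table is already bucketed on a column, a query can instead read just one bucket, selected by its number, out of all the buckets, which reduces the amount of data accessed.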
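
The bucket assignment described above can be sketched in a few lines of Python. This is a minimal model, not Hive's own code: it assumes an integer column hashes to its own value and that the bucket index is the non-negative hash modulo the bucket count. The function name bucket_for and the sample ids are made up for illustration.

```python
def bucket_for(value: int, num_buckets: int) -> int:
    """Model of bucket assignment for an integer column.

    Assumption: the hash of an int is the int itself, and the bucket
    index is (hash & 0x7FFFFFFF) % num_buckets.
    """
    return (value & 0x7FFFFFFF) % num_buckets


# Nine sample ids spread over 4 buckets.
ids = range(9)
buckets = {b: [i for i in ids if bucket_for(i, 4) == b] for b in range(4)}
for b, members in buckets.items():
    print(b, members)
```

Reading one bucket here touches two or three ids instead of all nine, which is the saving that bucket sampling relies on.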

2. What buckets are for

1) Data sampling

2) Speeding up certain queries, for example map-side joins
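
Why bucketing can help a map-side join is easier to see with a small Python model. The sketch assumes both tables are bucketed on the join key with the same number of buckets, so a row in bucket b of one table can only match rows in bucket b of the other. The table contents and the helper names bucketize and bucketed_join are invented for illustration and say nothing about how Hive implements the join.

```python
from collections import defaultdict

NUM_BUCKETS = 4


def bucketize(rows, key_index=0, num_buckets=NUM_BUCKETS):
    """Group rows by (key mod num_buckets), a stand-in for bucket files."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key_index] % num_buckets].append(row)
    return buckets


def bucketed_join(left, right):
    """Join two bucketed tables on their first column, bucket by bucket."""
    out = []
    for b in range(NUM_BUCKETS):
        # Only the matching bucket of the right table is held in memory.
        lookup = defaultdict(list)
        for r in right.get(b, []):
            lookup[r[0]].append(r)
        for l in left.get(b, []):
            for r in lookup.get(l[0], []):
                out.append(l + r[1:])
    return sorted(out)


users = bucketize([(0, "joe"), (1, "nat"), (2, "kay"), (5, "ads")])
ages = bucketize([(0, 19), (1, 18), (5, 20), (7, 19)])
joined = bucketed_join(users, ages)
print(joined)
```

Each step of the loop needs only one bucket of the second table in memory, rather than the whole table.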

3. Using buckets

-- Scenarios 1 and 2 use this table

Bucket by id, to test whether load can place data directly into the corresponding buckets.

create table tb_user (id int ,name string,age int) partitioned by (timeflag bigint) clustered by (id) sorted by (age asc) into 4 buckets row format delimited fields terminated by ',';

Test data:

[hadoop@mwtec-50 tmp]$ hadoop fs -cat /user/hive/warehouse2/tb_user/timeflag=130730/tb_user;

1,nat,18,130731

0,joe,19,130730

2,kay,14,130729

3,ann,18,130730

4,add,19,130730

5,ads,20,130821

6,dsf,19,130901

7,ll,19,130721

8,aas,15,130721

-- Scenario 7

Bucket by name, to test whether a string column can be used for bucketing.

create table tb_stu_1(id int,age int, name string,timeflag bigint) clustered by (name) sorted by (age) into 5 buckets row format delimited fields terminated by ',';

-- Other scenarios:

Bucket by id, to test some properties of buckets.

create table tb_tmp(id int,age int, name string ,timeflag bigint) row format delimited fields terminated by ',';

create table tb_stu(id int,age int, name string,timeflag bigint) clustered by (id) sorted by (age) into 5 buckets row format delimited fields terminated by ',';

Test data:

1,20,zxm,20130730

2, 21, ljz,20130730

3, 19, cds,20130730

4, 18, mac,20130730

5, 22, android,20130730

6, 23, symbian,20130730

7, 25, wp, 20130730

Related statements:

1. [hadoop@mwtec-50 tmp]$ vi tb_tmp

1,20,zxm,20130730

2, 21, ljz,20130730

3, 19, cds,20130730

4, 18, mac,20130730

5, 22, android,20130730

6, 23, symbian,20130730

7, 25, wp, 20130730

2. hadoop fs -put /tmp/tb_tmp /user/hadoop/output

3. load data inpath '/user/hadoop/output/tb_tmp' into table tb_tmp;

Scenario 1: import with load data inpath

Statement:

load data inpath '/user/hadoop/output/tb_user' into table tb_user partition(timeflag=130730);

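Note: when using load data, the fields in the data file must not be separated by spaces; otherwise the loaded values come back as NULL.

Result:

Analysis:

Using load data inpath directly does not split the data into four buckets; all the data sits under the tb_user directory.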

Scenario 2: run set hive.enforce.bucketing = true; first, then import with load data inpath

Note: exit the Hive client first.

Statements:

set hive.enforce.bucketing = true;

load data inpath '/user/hadoop/output/tb_user' into table tb_user partition(timeflag=130730);

Result:

Analysis:

Even after running set hive.enforce.bucketing = true; first, load data inpath does not split the data into four buckets; all the data still sits under the tb_user directory.

Scenario 3: exit the Hive client and start it again; without running set hive.enforce.bucketing = true, write data into tb_stu with insert into table.

Statement:

insert into table tb_stu select id,age,name,timeflag from tb_tmp where timeflag=20130730 sort by age;

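Execution:

Analysis:

Without set hive.enforce.bucketing = true there is only one job, and HDFS shows only one file under the table directory, not five. This shows that set hive.enforce.bucketing = true has to be run before executing an insert into a bucketed table.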

Scenario 4: after running set hive.enforce.bucketing = true, write data into tb_stu with insert into table.

Statement:

insert into table tb_stu select id,age,name,timeflag from tb_tmp where timeflag=20130730 sort by age;

Execution:

Analysis:

With set hive.enforce.bucketing = true in effect, the insert statement runs as 2 jobs, and HDFS contains 5 buckets.
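
The layout that the enforced insert produces can be modelled in Python. The sketch assumes the tb_tmp rows with surrounding whitespace stripped, assigns each row to one of 5 buckets by id modulo 5, and sorts each bucket by age, mirroring clustered by (id) sorted by (age) into 5 buckets. It illustrates the resulting layout only, not how Hive schedules the jobs.

```python
rows = [
    (1, 20, "zxm", 20130730),
    (2, 21, "ljz", 20130730),
    (3, 19, "cds", 20130730),
    (4, 18, "mac", 20130730),
    (5, 22, "android", 20130730),
    (6, 23, "symbian", 20130730),
    (7, 25, "wp", 20130730),
]
NUM_BUCKETS = 5

# One list per bucket stands in for one bucket file in HDFS.
bucket_files = {b: [] for b in range(NUM_BUCKETS)}
for row in rows:
    bucket_files[row[0] % NUM_BUCKETS].append(row)

# "sorted by (age)": rows inside each bucket are ordered by age.
for b in bucket_files:
    bucket_files[b].sort(key=lambda r: r[1])

for b, content in bucket_files.items():
    print(f"bucket {b}: ids {[r[0] for r in content]}")
```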

Scenario 5: bucket sampling

Statement:

select * from tb_stu tablesample(bucket 1 out of 5 on id);

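Execution:

Analysis:

Sampling the table with the tablesample clause gives the same result, because the clause restricts the query to a subset of the table's buckets instead of the whole table. In this example the query returns all the rows of the first of the five buckets. tablesample numbers buckets from 1, so bucket 1 holds the rows whose id modulo 5 is 0.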
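
The selection rule behind bucket x out of y can be written down as a short Python model. It assumes 1-based bucket numbering, as described in the Hive sampling documentation, and an integer key that hashes to itself. The function name tablesample_bucket is hypothetical.

```python
def tablesample_bucket(rows, x, y, key=lambda r: r[0]):
    """Keep rows whose key falls in bucket x of y (x is counted from 1)."""
    return [r for r in rows if (key(r) & 0x7FFFFFFF) % y == x - 1]


# ids 1..7, as in the tb_tmp test data
ids = [(i,) for i in range(1, 8)]
first = tablesample_bucket(ids, 1, 5)
second = tablesample_bucket(ids, 2, 5)
print(first, second)
```

Under these assumptions bucket 1 out of 5 selects id 5 and bucket 2 out of 5 selects ids 1 and 6.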

Scenario 6: bucket sampling with the rand() function

Statement:

select * from tb_stu tablesample(bucket 1 out of 5 on rand());

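Execution:

Analysis:

A query only has to read the buckets that match the tablesample clause, so sampling a bucketed table is a very efficient operation. If rand() is used to sample a table that has not been divided into buckets, the whole input data set has to be scanned even when only a very small sample is needed.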
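
The difference in work between the two sampling styles can be illustrated by counting the rows each one reads. This is a simplified Python model with invented data (ids 1 to 100 in 5 buckets) and hypothetical function names. rand() is imitated with Python's random module, so the sampled rows themselves are not meaningful, only the scan counts.

```python
import random

NUM_BUCKETS = 5
# Bucketed table: bucket b holds the ids with id % 5 == b.
table = {
    b: [i for i in range(1, 101) if i % NUM_BUCKETS == b]
    for b in range(NUM_BUCKETS)
}


def sample_on_bucket_column(table, x):
    """Read only bucket x (1-based); return (sample, rows_read)."""
    rows = table[x - 1]
    return list(rows), len(rows)


def sample_on_rand(table, x, y, seed=0):
    """Draw a random bucket number for every row; return (sample, rows_read)."""
    rng = random.Random(seed)
    sample, rows_read = [], 0
    for rows in table.values():
        for row in rows:
            rows_read += 1  # every row has to be visited
            if rng.randrange(y) == x - 1:
                sample.append(row)
    return sample, rows_read


_, read_bucketed = sample_on_bucket_column(table, 1)
_, read_rand = sample_on_rand(table, 1, 5)
print(read_bucketed, read_rand)  # prints: 20 100
```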

Scenario 7:

Statement:

insert into table tb_stu_1 select id,age,name,timeflag from tb_tmp;

Result:

Analysis:

A Hive table can be bucketed not only on a numeric column but also on a string column.
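
How a string column can map to a bucket number is sketched below in Python. The sketch assumes an older hashing scheme in which an ASCII string hashes like Java's String.hashCode (multiply by 31, add each byte, wrap to 32 bits). Newer Hive versions may use a different hash function, so the bucket numbers printed here are illustrative only. Both function names are hypothetical.

```python
def legacy_string_hash(text: str) -> int:
    """31-multiplier hash over the bytes, wrapped to 32 bits (ASCII input assumed)."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h


def bucket_for_name(name: str, num_buckets: int) -> int:
    """Drop the sign bit, then take the remainder by the bucket count."""
    return (legacy_string_hash(name) & 0x7FFFFFFF) % num_buckets


names = ["zxm", "ljz", "cds", "mac", "android", "symbian", "wp"]
for name in names:
    print(name, bucket_for_name(name, 5))
```

Whatever hash is used, the principle is the same: the string is reduced to an integer, and that integer modulo the bucket count picks the bucket.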

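Summary:

1. Defining buckets on a table is not enough to produce bucketed data; the data has to be inserted from another table with insert into or insert overwrite. If the table is partitioned, only insert overwrite can be used.

2. Buckets can be defined on an integer column or on a string column.

3. A table without buckets can still be sampled randomly.

4. set hive.enforce.bucketing = true must be run first for data to be written into the buckets correctly.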

Unresolved issues:

Issue 1: a bucketed table with partitions does not support insert into

Issue 2:

In this case the table is created with:

create table tb_stu(id int,age int, name string) partitioned by (timeflag bigint) clustered by (id) sorted by (age) into 4 buckets row format delimited fields terminated by ',';

Using:

insert into table partition(timeflag=130801) select id,age,name,timeflag from tb_tmp;

The following exception occurs:
Original article: https://www.cnblogs.com/aukle/p/3233826.html