当数据量很大时,需要查找一个数据的子集用于加快数据的分析,这种技术就是抽样技术。Hive中,数据抽样分为以下三种:
- 随机抽样;
- 桶表抽样;
- 块抽样;
1 随机抽样
1)语法结构
使用Rand()和LIMIT关键字得到抽样数据,Distribute和Sort关键字确保数据在mappers和reducers之间高效的随机分布,也可以使用order by rand()实现,但是性能不好。
语法:
SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT <N rows to sample>;
2)示例
0: jdbc:hive2://localhost:10000/hive> select * from ana
. . . . . . . . . . . . . . . . . . > distribute by rand() sort by rand() limit 3;
+-----------+-------------+-------------+
| ana.name | ana.depart | ana.salary |
+-----------+-------------+-------------+
| Mike | 1001 | 6400 |
| Will | 1000 | 4000 |
| Richard | 1002 | 8000 |
+-----------+-------------+-------------+
3 rows selected (123.999 seconds)
2 桶表抽样
1)语法结构
桶表抽样是桶表已优化的特殊的抽样方法,colname指定列在哪里取样数据,当在整个行取样时,可以使用Rand()函数,如果抽样列是Clustered By列,Tablesample语句会更高效。
语法:
SELECT * FROM <Table_Name>
TABLESAMPLE(BUCKET <specified bucket number to sample> OUT OF <total
number of buckets> ON [colname|RAND()]) table_alias;
2)示例
略
3 块抽样
1)语法结构
块抽样允许Hive随机从数据中挑选N行、数据量的百分比或者数据的N字节大小。该种抽样的粒度是HDFS的块大小。
语法:
SELECT *
FROM <Table_Name> TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;
-- ByteLengthLiteral
-- (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
2)示例
示例一:按行抽样
0: jdbc:hive2://localhost:10000/hive> select name from ana tablesample(4 rows) a;
+----------+
| name |
+----------+
| Lucy |
| Michael |
| Steven |
| Will |
+----------+
4 rows selected (0.29 seconds)
示例二:按数据大小的百分比抽样
0: jdbc:hive2://localhost:10000/hive> select name from ana tablesample(10 percent) a;
+----------+
| name |
+----------+
| Lucy |
| Michael |
+----------+
2 rows selected (0.345 seconds)
示例三:按数据大小抽样
0: jdbc:hive2://localhost:10000/hive> select name from ana tablesample(2M) a;
+----------+
| name |
+----------+
| Lucy |
| Michael |
| Steven |
| Will |
| Will |
| Jess |
| Lily |
| Mike |
| Richard |
| Wei |
| Yun |
+----------+
11 rows selected (0.264 seconds)