Hive学习小记-（7）group by原理&tips

原理

参考：https://blog.csdn.net/u013668852/article/details/79866931

group by多字段

select rank, isonline, count(*) from city group by rank, isonline;

将GroupBy的字段组合为map的输出key值，利用MapReduce的排序，在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下（当然这里只是说明Reduce端的非Hash聚合过程）

group by单字段

group by 单字段和多字段时的不同在于key上，以如下例子为例（出处太多）：
SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;

hive> SELECT * FROM logs;
a   苹果  5
a   橙子  3
a   苹果  2
b   烧鸡  1

hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
a   10
b   1

其过程如下图：

默认设置了hive.map.aggr=true，所以会在mapper端先group by一次，最后再把结果merge起来，为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash，reducer是mergepartial。如果把hive.map.aggr=false，那将groupby放到reducer才做，他的mode是complete.

TIPS

减少group by字段

措施一：先group by ；再关联

参考：https://blog.51cto.com/imysqldba/1364808

措施二：若是其中有些字段是固定取值，可以限定死

参考：https://www.cnblogs.com/hoojjack/p/7460574.html

措施三：本身在其他字段取了group by 就已经能确保取值唯一的字段，如张三项目组的职责一定是洗碗，李四组的职责一定是刷盘子，可以给字段打上聚合函数max,min等：select proj_group, duty, sum(salary) from test_table group by proj_group, duty 可以改写为：select proj_group, max(duty), sum(salary) from test_table group by proj_group

hive与MySQL group by的区别

1.hive执行顺序group 在前，select在后；MySQL select在前，group 在后

参考：https://blog.csdn.net/m0_48714980/article/details/109136659

eg:

--在year(date)起别名
-- 1.在Mysql中顺利执行
select count(*),year(date) a from student group by a ;

-- 2.但在hive中出现报错
select count(*),year(date) a from student group by a ;
Invalid table alias or column reference ‘a’: (possible column names are: s_id, s_name, s_birth, s_sex

-- 原因分析：select执行顺序中，group by 在select前先执行，所以此时别名在解析器中没法识别

2.hive要将非聚合字段都加到group中；MySQL没加到group的字段可能会随机取：

eg:

mysql> select * from youhua1;
+--------+--------------+----------------+----------------+
| custid | jijin_change | baoxian_change | cunkuan_change |
+--------+--------------+----------------+----------------+
|      1 |          1.1 |            1.2 |            1.3 |
|      2 |         2.67 |           2.34 |            2.1 |
|      3 |          3.1 |            3.4 |           3.78 |
|      3 |          1.1 |            2.3 |            7.8 |
+--------+--------------+----------------+----------------+
4 rows in set (0.00 sec)

mysql> select custid,jijin_change,sum(baoxian_change) from youhua1 group by custid;;
+--------+--------------+---------------------+
| custid | jijin_change | sum(baoxian_change) |
+--------+--------------+---------------------+
|      1 |          1.1 |  1.2000000476837158 |
|      2 |         2.67 |  2.3399999141693115 |
|      3 |          3.1 |   5.700000047683716 |
+--------+--------------+---------------------+
3 rows in set (0.00 sec)

相关阅读:
使用docker试用各种软件及docker-ES设置
 Atom读写MarkDown插件选择，以及墙内安装markdown-preview-enhanced，及markdown和mermaid使用教程
 杂谈迁移tomcat项目到docker，以及遇到的问题
 MongoSpark 28799错误
 MongoDB运维心得（一）
Ubuntu16.04 appstreamcli错误
 MongoExport后的负载均衡问题查询及解决：can't accept new chunks because there are still 2 deletes from previous migration
MongoDB集群单mongos的问题总结
 Linux上不了网的几个排查点
 git忽略某些文件的几种方法
原文地址：https://www.cnblogs.com/foolangirl/p/14208537.html