• Hive学习小记-(7)group by原理&tips


    原理

    参考:https://blog.csdn.net/u013668852/article/details/79866931

    group by多字段

    select rank, isonline, count(*) from city group by rank, isonline;

    将GroupBy的字段组合为map的输出key值,利用MapReduce的排序,在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下(当然这里只是说明Reduce端的非Hash聚合过程)

    image

    group by单字段

    group by 单字段和多字段时的不同在于key上,以如下例子为例(出处太多):
    SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;

    hive> SELECT * FROM logs;
    a   苹果  5
    a   橙子  3
    a   苹果  2
    b   烧鸡  1
    
    hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
    a   10
    b   1

    其过程如下图:
    image

    默认设置了hive.map.aggr=true,所以会在mapper端先group by一次,最后再把结果merge起来,为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash,reducer是mergepartial。如果把hive.map.aggr=false,那将groupby放到reducer才做,他的mode是complete.

    image

    TIPS

    减少group by字段

    措施一:先group by ;再关联

    参考:https://blog.51cto.com/imysqldba/1364808

    措施二:若是其中有些字段是固定取值,可以限定死

    参考:https://www.cnblogs.com/hoojjack/p/7460574.html

    措施三:本身在其他字段取了group by 就已经能确保取值唯一的字段,如张三项目组的职责一定是洗碗,李四组的职责一定是刷盘子,可以给字段打上聚合函数max,min等:select proj_group, duty, sum(salary) from test_table group by proj_group, duty  可以改写为:select proj_group, max(duty), sum(salary) from test_table group by proj_group

    hive与MySQL group by的区别

    1.hive执行顺序group 在前,select在后;MySQL select在前,group 在后

    参考:https://blog.csdn.net/m0_48714980/article/details/109136659

    eg:

    --在year(date)起别名
    -- 1.在Mysql中顺利执行
    select count(*),year(date) a from student group by a ;
    
    -- 2.但在hive中出现报错
    select count(*),year(date) a from student group by a ;
    Invalid table alias or column reference ‘a’: (possible column names are: s_id, s_name, s_birth, s_sex
    
    -- 原因分析:select执行顺序中,group by 在select前先执行,所以此时别名在解析器中没法识别

    2.hive要将非聚合字段都加到group中;MySQL没加到group的字段可能会随机取:

    eg:

    mysql> select * from youhua1;
    +--------+--------------+----------------+----------------+
    | custid | jijin_change | baoxian_change | cunkuan_change |
    +--------+--------------+----------------+----------------+
    |      1 |          1.1 |            1.2 |            1.3 |
    |      2 |         2.67 |           2.34 |            2.1 |
    |      3 |          3.1 |            3.4 |           3.78 |
    |      3 |          1.1 |            2.3 |            7.8 |
    +--------+--------------+----------------+----------------+
    4 rows in set (0.00 sec)
    
    mysql> select custid,jijin_change,sum(baoxian_change) from youhua1 group by custid;;
    +--------+--------------+---------------------+
    | custid | jijin_change | sum(baoxian_change) |
    +--------+--------------+---------------------+
    |      1 |          1.1 |  1.2000000476837158 |
    |      2 |         2.67 |  2.3399999141693115 |
    |      3 |          3.1 |   5.700000047683716 |
    +--------+--------------+---------------------+
    3 rows in set (0.00 sec)
  • 相关阅读:
    使用docker试用各种软件及docker-ES设置
    Atom读写MarkDown插件选择,以及墙内安装markdown-preview-enhanced,及markdown和mermaid使用教程
    杂谈迁移tomcat项目到docker,以及遇到的问题
    MongoSpark 28799错误
    MongoDB运维心得(一)
    Ubuntu16.04 appstreamcli错误
    MongoExport后的负载均衡问题查询及解决:can't accept new chunks because there are still 2 deletes from previous migration
    MongoDB集群单mongos的问题总结
    Linux上不了网的几个排查点
    git忽略某些文件的几种方法
  • 原文地址:https://www.cnblogs.com/foolangirl/p/14208537.html
Copyright © 2020-2023  润新知