• [Hive


    Group By Syntax

    groupByClause: GROUP BY groupByExpression (, groupByExpression)*
     
    groupByExpression: expression
     
    groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

    Simple Examples

    In order to count the number of rows in a table:

    SELECT COUNT(*) FROM table2;

    Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

    In order to count the number of distinct users by gender one could write the following query:

    INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count (DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;

    Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns. For example, the following is possible because count(DISTINCT) and sum(DISTINCT) specify the same column:

    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;

    Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

    However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.

    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
    FROM pv_users
    GROUP BY pv_users.gender;

    Select statement and group by clause

    When using group by clause, the select statement can only include columns included in the group by clause. Of course, you can have as many aggregation functions (e.g. count) in the select statement as well.
    Let's take a simple example

    CREATE TABLE t1(a INTEGER, b INTGER);

    A group by query on the above table could look like:

    SELECT
       a,
       sum(b)
    FROM
       t1
    GROUP BY
       a;

    The above query works because the select clause contains a (the group by key) and an aggregation function (sum(b)).

    However, the query below DOES NOT work:

    SELECT
       a,
       b
    FROM
       t1
    GROUP BY
       a;

    This is because the select clause has an additional column (b) that is not included in the group by clause (and it's not an aggregation function either). This is because, if the table t1 looked like:

    a    b
    ------
    100  1
    100  2
    100  3

    Since the grouping is only done on a, what value of b should Hive display for the group a=100? One can argue that it should be the first value or the lowest value but we all agree that there are multiple possible options. Hive does away with this guessing by making it invalid SQL (HQL, to be precise) to have a column in the select clause that is not included in the group by clause.

    Advanced Features  高级特性

    Multi-Group-By Inserts

    The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilitites). (可将输出结果插入新表中或者直接覆盖到HDFS上的文件目录)

    e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

    FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
      SELECT pv_users.gender, count(DISTINCT pv_users.userid)
      GROUP BY pv_users.gender
    INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
      SELECT pv_users.age, count(DISTINCT pv_users.userid)
      GROUP BY pv_users.age;

    Map-side Aggregation for Group By

    hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task.
    This usually provides better efficiency, but may require more memory to run successfully.

    hive.map.aggr控制如何聚合,默认是false,如果设置为true, Hive将会在map端做第一级的聚合,这通常提供更好的效果,但是要求更多的内存才能运行成功。

    set hive.map.aggr=true;
    SELECT COUNT(*) FROM table2;

    Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

    Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function

    Version

    Icon

    Grouping sets, CUBE and ROLLUP operators, and the GROUPING__ID function were added in Hive release 0.10.0.

    See Enhanced Aggregation, Cube, Grouping and Rollup for information about these aggregation operators.

    Also see the JIRAs:

    • HIVE-2397 Support with rollup option for group by
    • HIVE-3433 Implement CUBE and ROLLUP operators in Hive
    • HIVE-3471 Implement grouping sets in Hive
    • HIVE-3613 Implement grouping_id function

    New in Hive release 0.11.0:

    • HIVE-3552 HIVE-3552 performant manner for performing cubes/rollups/grouping sets for a high number of grouping set keys
     
    谨言慎行,专注思考 , 工作与生活同乐
  • 相关阅读:
    【sqli-labs】 less26 GET- Error based -All you SPACES and COMMENTS belong to us(GET型基于错误的去除了空格和注释的注入)
    【sqli-labs】 less25a GET- Blind based -All you OR&AND belong to us -Intiger based(GET型基于盲注的去除了or和and的整型注入)
    【sqli-labs】 less25 GET- Error based -All you OR&AND belong to us -string single quote(GET型基于错误的去除了or和and的单引号注入)
    Apache rewrite地址重写
    PHP安装-phpMyAdmin+Discuz
    SElinux解决web网站无法访问
    Squid代理服务部署
    Apache动态加载模块
    Apache虚拟主机+AD压力测试
    nginx php-fpm conf文件编写
  • 原文地址:https://www.cnblogs.com/tmeily/p/4248731.html
Copyright © 2020-2023  润新知