• Hive函数:CUME_DIST,PERCENT_RANK


    参考自:大数据田地http://lxw1234.com/archives/2015/04/185.htm

    数据准备:

    d1,user1,1000
    d1,user2,2000
    d1,user3,3000
    d2,user4,4000
    d2,user5,5000
     
    CREATE EXTERNAL TABLE test_data (
    dept STRING,
    userid string,
    sal INT
    ) ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    stored as textfile location '/user/jc_rc_ftp/test_data';
     
     
    hive> select * from test_data;
    +-----------------+-------------------+----------------+--+
    | test_data.dept  | test_data.userid  | test_data.sal  |
    +-----------------+-------------------+----------------+--+
    | d1              | user1             | 1000           |
    | d1              | user2             | 2000           |
    | d1              | user3             | 3000           |
    | d2              | user4             | 4000           |
    | d2              | user5             | 5000           |
    +-----------------+-------------------+----------------+--+

    CUME_DIST

    –CUME_DIST 小于等于当前值的行数/分组内总行数
    –比如,统计小于等于当前薪水的人数,所占总人数的比例

    SELECT 
    dept,
    userid,
    sal,
    CUME_DIST() OVER(ORDER BY sal) AS rn1,
    CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2 
    FROM test_data;
    +-------+---------+-------+------+---------------------+--+
    | dept  | userid  |  sal  | rn1  |         rn2         |
    +-------+---------+-------+------+---------------------+--+
    | d1    | user1   | 1000  | 0.2  | 0.3333333333333333  |
    | d1    | user2   | 2000  | 0.4  | 0.6666666666666666  |
    | d1    | user3   | 3000  | 0.6  | 1.0                 |
    | d2    | user4   | 4000  | 0.8  | 0.5                 |
    | d2    | user5   | 5000  | 1.0  | 1.0                 |
    +-------+---------+-------+------+---------------------+--+
    rn1: 没有partition,所有数据均为1组,总行数为5,
         第一行:小于等于1000的行数为1,因此,1/5=0.2
         第三行:小于等于3000的行数为3,因此,3/5=0.6
    rn2: 按照部门分组,dpet=d1的行数为3,
         第二行:小于等于2000的行数为2,因此,2/3=0.6666666666666666

    PERCENT_RANK

    –PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
    应用场景不了解,可能在一些特殊算法的实现中可以用到吧。

    SELECT 
    dept,
    userid,
    sal,
    PERCENT_RANK() OVER(ORDER BY sal) AS rn1,   --分组内
    RANK() OVER(ORDER BY sal) AS rn11,          --分组内RANK值
    SUM(1) OVER(PARTITION BY NULL) AS rn12,     --分组内总行数
    PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2 
    FROM test_data;
    +-------+---------+-------+-------+-------+-------+------+--+
    | dept  | userid  |  sal  |  rn1  | rn11  | rn12  | rn2  |
    +-------+---------+-------+-------+-------+-------+------+--+
    | d1    | user1   | 1000  | 0.0   | 1     | 5     | 0.0  |
    | d1    | user2   | 2000  | 0.25  | 2     | 5     | 0.5  |
    | d1    | user3   | 3000  | 0.5   | 3     | 5     | 1.0  |
    | d2    | user4   | 4000  | 0.75  | 4     | 5     | 0.0  |
    | d2    | user5   | 5000  | 1.0   | 5     | 5     | 1.0  |
    +-------+---------+-------+-------+-------+-------+------+--+
    
    这样只要排序字段为null,就会放在最后,而不会影响排序结果
    
    rn1: rn1 = (rn11-1) / (rn12-1) 
           第一行,(1-1)/(5-1)=0/4=0
           第二行,(2-1)/(5-1)=1/4=0.25
           第四行,(4-1)/(5-1)=3/4=0.75
    rn2: 按照dept分组,
         dept=d1的总行数为3
         第一行,(1-1)/(3-1)=0
         第三行,(3-1)/(3-1)=1
  • 相关阅读:
    一些精简的小技巧
    POJ题目分类(转)
    【慢慢学算法】:连通图
    【菜鸟做水题】: 杭电1004
    杭电ACM试题分类,一步一个脚印!(转)
    【慢慢学Android】:获得当前时间
    【慢慢学算法】:小白鼠排队
    【慢慢学Android】:12.Notification示例
    【慢慢学Android】:13.打电话代码
    “/”应用程序中的服务器错误。
  • 原文地址:https://www.cnblogs.com/yy3b2007com/p/8582860.html
Copyright © 2020-2023  润新知