• Hive函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE


    参考自大数据田地:http://lxw1234.com/archives/2015/04/190.htm

    测试数据准备:

    create external table test_data (
    cookieid string,
    createtime string,  --页面访问时间
    url string       --被访问页面
    ) ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    stored as textfile location '/user/jc_rc_ftp/test_data';
    
     select * from test_data l;
    +-------------+----------------------+---------+--+
    | l.cookieid  |     l.createtime     |  l.url  |
    +-------------+----------------------+---------+--+
    | cookie1     | 2015-04-10 10:00:02  | url2    |
    | cookie1     | 2015-04-10 10:00:00  | url1    |
    | cookie1     | 2015-04-10 10:03:04  | 1url3   |
    | cookie1     | 2015-04-10 10:50:05  | url6    |
    | cookie1     | 2015-04-10 11:00:00  | url7    |
    | cookie1     | 2015-04-10 10:10:00  | url4    |
    | cookie1     | 2015-04-10 10:50:01  | url5    |
    | cookie2     | 2015-04-10 10:00:02  | url22   |
    | cookie2     | 2015-04-10 10:00:00  | url11   |
    | cookie2     | 2015-04-10 10:03:04  | 1url33  |
    | cookie2     | 2015-04-10 10:50:05  | url66   |
    | cookie2     | 2015-04-10 11:00:00  | url77   |
    | cookie2     | 2015-04-10 10:10:00  | url44   |
    | cookie2     | 2015-04-10 10:50:01  | url55   |
    +-------------+----------------------+---------+--+

    LAG
    LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值

    第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
    LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
    FROM test_data;
    +-----------+----------------------+---------+-----+----------------------+----------------------+--+
    | cookieid  |      createtime      |   url   | rn  |     last_1_time      |     last_2_time      |
    +-----------+----------------------+---------+-----+----------------------+----------------------+--+
    | cookie1   | 2015-04-10 10:00:00  | url1    | 1   | 1970-01-01 00:00:00  | NULL                 |
    | cookie1   | 2015-04-10 10:00:02  | url2    | 2   | 2015-04-10 10:00:00  | NULL                 |
    | cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 2015-04-10 10:00:02  | 2015-04-10 10:00:00  |
    | cookie1   | 2015-04-10 10:10:00  | url4    | 4   | 2015-04-10 10:03:04  | 2015-04-10 10:00:02  |
    | cookie1   | 2015-04-10 10:50:01  | url5    | 5   | 2015-04-10 10:10:00  | 2015-04-10 10:03:04  |
    | cookie1   | 2015-04-10 10:50:05  | url6    | 6   | 2015-04-10 10:50:01  | 2015-04-10 10:10:00  |
    | cookie1   | 2015-04-10 11:00:00  | url7    | 7   | 2015-04-10 10:50:05  | 2015-04-10 10:50:01  |
    | cookie2   | 2015-04-10 10:00:00  | url11   | 1   | 1970-01-01 00:00:00  | NULL                 |
    | cookie2   | 2015-04-10 10:00:02  | url22   | 2   | 2015-04-10 10:00:00  | NULL                 |
    | cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 2015-04-10 10:00:02  | 2015-04-10 10:00:00  |
    | cookie2   | 2015-04-10 10:10:00  | url44   | 4   | 2015-04-10 10:03:04  | 2015-04-10 10:00:02  |
    | cookie2   | 2015-04-10 10:50:01  | url55   | 5   | 2015-04-10 10:10:00  | 2015-04-10 10:03:04  |
    | cookie2   | 2015-04-10 10:50:05  | url66   | 6   | 2015-04-10 10:50:01  | 2015-04-10 10:10:00  |
    | cookie2   | 2015-04-10 11:00:00  | url77   | 7   | 2015-04-10 10:50:05  | 2015-04-10 10:50:01  |
    +-----------+----------------------+---------+-----+----------------------+----------------------+--+

    LEAD

    与LAG相反
    LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
    第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
    LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time 
    FROM test_data;
    +-----------+----------------------+---------+-----+----------------------+----------------------+--+
    | cookieid  |      createtime      |   url   | rn  |     next_1_time      |     next_2_time      |
    +-----------+----------------------+---------+-----+----------------------+----------------------+--+
    | cookie1   | 2015-04-10 10:00:00  | url1    | 1   | 2015-04-10 10:00:02  | 2015-04-10 10:03:04  |
    | cookie1   | 2015-04-10 10:00:02  | url2    | 2   | 2015-04-10 10:03:04  | 2015-04-10 10:10:00  |
    | cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 2015-04-10 10:10:00  | 2015-04-10 10:50:01  |
    | cookie1   | 2015-04-10 10:10:00  | url4    | 4   | 2015-04-10 10:50:01  | 2015-04-10 10:50:05  |
    | cookie1   | 2015-04-10 10:50:01  | url5    | 5   | 2015-04-10 10:50:05  | 2015-04-10 11:00:00  |
    | cookie1   | 2015-04-10 10:50:05  | url6    | 6   | 2015-04-10 11:00:00  | NULL                 |
    | cookie1   | 2015-04-10 11:00:00  | url7    | 7   | 1970-01-01 00:00:00  | NULL                 |
    | cookie2   | 2015-04-10 10:00:00  | url11   | 1   | 2015-04-10 10:00:02  | 2015-04-10 10:03:04  |
    | cookie2   | 2015-04-10 10:00:02  | url22   | 2   | 2015-04-10 10:03:04  | 2015-04-10 10:10:00  |
    | cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 2015-04-10 10:10:00  | 2015-04-10 10:50:01  |
    | cookie2   | 2015-04-10 10:10:00  | url44   | 4   | 2015-04-10 10:50:01  | 2015-04-10 10:50:05  |
    | cookie2   | 2015-04-10 10:50:01  | url55   | 5   | 2015-04-10 10:50:05  | 2015-04-10 11:00:00  |
    | cookie2   | 2015-04-10 10:50:05  | url66   | 6   | 2015-04-10 11:00:00  | NULL                 |
    | cookie2   | 2015-04-10 11:00:00  | url77   | 7   | 1970-01-01 00:00:00  | NULL                 |
    +-----------+----------------------+---------+-----+----------------------+----------------------+--+

    FIRST_VALUE

    取分组内排序后,截止到当前行,第一个值

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1 
    FROM test_data;
    
    
    +-----------+----------------------+---------+-----+---------+--+
    | cookieid  |      createtime      |   url   | rn  | first1  |
    +-----------+----------------------+---------+-----+---------+--+
    | cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
    | cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url1    |
    | cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | url1    |
    | cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url1    |
    | cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url1    |
    | cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url1    |
    | cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url1    |
    | cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
    | cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url11   |
    | cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | url11   |
    | cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url11   |
    | cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url11   |
    | cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url11   |
    | cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url11   |
    +-----------+----------------------+---------+-----+---------+--+

    LAST_VALUE

    取分组内排序后,截止到当前行,最后一个值

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1 
    FROM test_data;
    +-----------+----------------------+---------+-----+---------+--+
    | cookieid  |      createtime      |   url   | rn  |  last1  |
    +-----------+----------------------+---------+-----+---------+--+
    | cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
    | cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    |
    | cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   |
    | cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    |
    | cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    |
    | cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    |
    | cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    |
    | cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
    | cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   |
    | cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  |
    | cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   |
    | cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   |
    | cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   |
    | cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   |
    +-----------+----------------------+---------+-----+---------+--+
    
    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1 
    FROM test_data;
    +-----------+----------------------+---------+-----+---------+--+
    | cookieid  |      createtime      |   url   | rn  |  last1  |
    +-----------+----------------------+---------+-----+---------+--+
    | cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    |
    | cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    |
    | cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    |
    | cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    |
    | cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   |
    | cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    |
    | cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    |
    | cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   |
    | cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   |
    | cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   |
    | cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   |
    | cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  |
    | cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   |
    | cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   |
    +-----------+----------------------+---------+-----+---------+--+

    如果不指定ORDER BY,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果

    SELECT cookieid,
    createtime,
    url,
    FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2  
    FROM test_data;
    +-----------+----------------------+---------+---------+--+
    | cookieid  |      createtime      |   url   | first2  |
    +-----------+----------------------+---------+---------+--+
    | cookie1   | 2015-04-10 10:00:02  | url2    | url2    |
    | cookie1   | 2015-04-10 10:50:01  | url5    | url2    |
    | cookie1   | 2015-04-10 10:10:00  | url4    | url2    |
    | cookie1   | 2015-04-10 11:00:00  | url7    | url2    |
    | cookie1   | 2015-04-10 10:50:05  | url6    | url2    |
    | cookie1   | 2015-04-10 10:03:04  | 1url3   | url2    |
    | cookie1   | 2015-04-10 10:00:00  | url1    | url2    |
    | cookie2   | 2015-04-10 10:50:01  | url55   | url55   |
    | cookie2   | 2015-04-10 10:10:00  | url44   | url55   |
    | cookie2   | 2015-04-10 11:00:00  | url77   | url55   |
    | cookie2   | 2015-04-10 10:50:05  | url66   | url55   |
    | cookie2   | 2015-04-10 10:03:04  | 1url33  | url55   |
    | cookie2   | 2015-04-10 10:00:00  | url11   | url55   |
    | cookie2   | 2015-04-10 10:00:02  | url22   | url55   |
    +-----------+----------------------+---------+---------+--+
    SELECT cookieid,
    createtime,
    url,
    LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2  
    FROM test_data;
    +-----------+----------------------+---------+--------+--+
    | cookieid  |      createtime      |   url   | last2  |
    +-----------+----------------------+---------+--------+--+
    | cookie1   | 2015-04-10 10:00:02  | url2    | url1   |
    | cookie1   | 2015-04-10 10:50:01  | url5    | url1   |
    | cookie1   | 2015-04-10 10:10:00  | url4    | url1   |
    | cookie1   | 2015-04-10 11:00:00  | url7    | url1   |
    | cookie1   | 2015-04-10 10:50:05  | url6    | url1   |
    | cookie1   | 2015-04-10 10:03:04  | 1url3   | url1   |
    | cookie1   | 2015-04-10 10:00:00  | url1    | url1   |
    | cookie2   | 2015-04-10 10:50:01  | url55   | url22  |
    | cookie2   | 2015-04-10 10:10:00  | url44   | url22  |
    | cookie2   | 2015-04-10 11:00:00  | url77   | url22  |
    | cookie2   | 2015-04-10 10:50:05  | url66   | url22  |
    | cookie2   | 2015-04-10 10:03:04  | 1url33  | url22  |
    | cookie2   | 2015-04-10 10:00:00  | url11   | url22  |
    | cookie2   | 2015-04-10 10:00:02  | url22   | url22  |
    +-----------+----------------------+---------+--------+--+
    14 rows selected (78.058 seconds)

    如果想要取分组内排序后最后一个值,则需要变通一下:

    SELECT cookieid,
    createtime,
    url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
    FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2 
    FROM test_data 
    ORDER BY cookieid,createtime;
    +-----------+----------------------+---------+-----+---------+--------+--+
    | cookieid  |      createtime      |   url   | rn  |  last1  | last2  |
    +-----------+----------------------+---------+-----+---------+--------+--+
    | cookie1   | 2015-04-10 10:00:00  | url1    | 1   | url1    | url7   |
    | cookie1   | 2015-04-10 10:00:02  | url2    | 2   | url2    | url7   |
    | cookie1   | 2015-04-10 10:03:04  | 1url3   | 3   | 1url3   | url7   |
    | cookie1   | 2015-04-10 10:10:00  | url4    | 4   | url4    | url7   |
    | cookie1   | 2015-04-10 10:50:01  | url5    | 5   | url5    | url7   |
    | cookie1   | 2015-04-10 10:50:05  | url6    | 6   | url6    | url7   |
    | cookie1   | 2015-04-10 11:00:00  | url7    | 7   | url7    | url7   |
    | cookie2   | 2015-04-10 10:00:00  | url11   | 1   | url11   | url77  |
    | cookie2   | 2015-04-10 10:00:02  | url22   | 2   | url22   | url77  |
    | cookie2   | 2015-04-10 10:03:04  | 1url33  | 3   | 1url33  | url77  |
    | cookie2   | 2015-04-10 10:10:00  | url44   | 4   | url44   | url77  |
    | cookie2   | 2015-04-10 10:50:01  | url55   | 5   | url55   | url77  |
    | cookie2   | 2015-04-10 10:50:05  | url66   | 6   | url66   | url77  |
    | cookie2   | 2015-04-10 11:00:00  | url77   | 7   | url77   | url77  |
    +-----------+----------------------+---------+-----+---------+--------+--+
  • 相关阅读:
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param) pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm
    Hive与Hbase的区别
    HIVE—索引、分区和分桶的区别
    MapReduce编程之Semi Join多种应用场景与使用
    MapReduce编程之Map Join多种应用场景与使用
    MapReduce编程之Reduce Join多种应用场景与使用
    Mapreduce——视频播放数据分类统计
    Docker-compose实战——Django+PostgreSQL
    Docker基础教程
    1.node接口搭建--express搭建服务器
  • 原文地址:https://www.cnblogs.com/yy3b2007com/p/8582831.html
Copyright © 2020-2023  润新知