• Hive-03 Common Functions


    An overview of common Hive functions

    Usage of split, coalesce and collect_list in Hive (with examples)
    split converts a string into an array.
    split('a,b,c,d' , ',') ==> ["a","b","c","d"]
    COALESCE(T v1, T v2, …) returns the first non-NULL argument; if every argument is NULL, it returns NULL.
    collect_list gathers every value of a column, without deduplication:  select collect_list(id) from table;
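    A quick COALESCE sketch (the table and column names here are made up for illustration):

    ```sql
    -- the first non-NULL argument wins
    SELECT COALESCE(NULL, NULL, 'a');   -- 'a'

    -- typical use: substitute a default when a column is NULL
    SELECT COALESCE(nickname, username, 'anonymous') AS display_name
    FROM users;   -- hypothetical table
    ```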

    Common date and time functions:

    Converting MongoDB times from UTC to local time (GMT+8 is 8 hours ahead of UTC). Here updatetime is an ISO-8601 string such as 2017-06-14T05:23:59.000Z: substring(...,1,10) takes the date part, substring(...,12,8) takes the time part, CONCAT_WS joins them with a space, and from_utc_timestamp shifts the result to GMT+8:

    date_format(from_utc_timestamp( CONCAT_WS(' ',substring(updatetime,1,10),substring(updatetime,12,8) ) ,'GMT+8'),'yyyy-MM-dd HH:mm:ss') updatetime
    
    select date_format(from_utc_timestamp(create_time,"UTC"),'yyyy-MM-dd HH:mm:ss') as local_time
    
    select date_format(from_utc_timestamp(create_time,"GMT+8"),'yyyy-MM-dd HH:mm:ss') as local_time
    Unix timestamps: seconds are 10 digits; milliseconds are 13 digits.
    
    date_format(from_unixtime(cast(h.updatetime as bigint)),'yyyy-MM-dd HH:mm:ss')  -- updatetime is a 10-digit (seconds) timestamp
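    For a 13-digit millisecond timestamp, divide by 1000 first, since from_unixtime expects seconds (a sketch; h.updatetime stands in for the millisecond column):

    ```sql
    date_format(from_unixtime(cast(h.updatetime / 1000 as bigint)),'yyyy-MM-dd HH:mm:ss')
    ```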
    
    substring(h.id, 10, 24) extracts the 24-character hex string 59409d1d2cdcc90b91c62be5 from a MongoDB ObjectId(59409d1d2cdcc90b91c62be5) literal: the prefix "ObjectId(" is 9 characters, so the hex starts at position 10.
    
    Today: select date_format(current_timestamp,'yyyy-MM-dd')
    Yesterday: select date_sub(current_date,1);

    Concatenating columns in Hive

    concat: simple concatenation;

    CONCAT_WS("/",r.province,r.city,a.area) channel_address  => 北京/北京市/朝阳区 ; the arguments must be of type string;
    
    concat(payid,' ',carlogid)  => 01a893092b914703b75941b713767ebf 408693
    concat(substr(summary_time,9,2),'',substr(summary_time,6,2),'',substr(summary_time,1,4),'_',concat(market_id),'_' ,concat(mid))
    09022019_108_0
    12022019_108_0
    21022019_108_0
    
    concat_ws("_" ,substr(summary_time,9,2),substr(summary_time,6,2),substr(summary_time,1,4),concat(market_id),concat(mid))
    09_03_2019_108_0
    13_03_2019_108_0
    21_03_2019_108_0

    concat(sum(total_amount_pur),'&&',sum(total_amount_sig),'&&',sum(total_amount_jump))
    0.0&&16665.0&&0.0
    A multi-column ORDER BY example (date descending, then market_id ascending):
    order by date desc, market_id asc;
    date    market_id    total_fee
    2019-10-08    110    23000
    2019-10-08    110    13000
    2019-10-08    141    1400
    2019-10-08    141    2250
    2019-10-08    218    4000
    2019-10-08    218    1100
    2019-10-08    218    2300
    2019-10-08    218    4500
    2019-10-08    234    0
    2019-10-08    234    0

    Finding values that appear exactly once

    SELECT name, count(name) AS count_times FROM tb_test GROUP BY name HAVING count(name) = 1;

    Returning every record whose value in a column appears more than once

    select * from stu where sname in (select sname from stu group by sname having count(sname)>1);
    To also see how many duplicates each sname has (Hive cannot select non-grouped columns, so select only the grouped column and its count):
    select sname, count(sname) as count from stu group by sname having count(sname) > 1;

    Selecting non-GROUP-BY columns in Hive (aggregate functions such as count, sum, avg and max are always allowed)

      Method 1

      The input maps one ID to several names; the output should contain each ID exactly once, paired with any one of its names.

       Hive's collect_set is roughly analogous to MySQL's GROUP_CONCAT: within each group it gathers a column's values, but it returns a deduplicated array rather than a delimiter-joined string:

    collect_set(col)
    Return type: array
    Description: returns the group's values as a deduplicated collection

        

    select sid,collect_set(class_id) from table1 group by sid;
    ===>>
    1 [11,12,13]
    2 [11,14]
    3 [12,15]
    You can then count or sum over that array column with Hive's aggregate functions count and sum.
    For our purpose here, taking the first element of the array is enough:
    select sid,collect_set(class_id)[0] from table1 group by sid;

    Method 2: GROUP BY every column you put in the SELECT list

    select sid,class_id from table1 group by sid, class_id;

     Rows share a user_id but differ in create_time, and the requirement is to keep the earliest row per user: partition by user_id, order by create_time ascending, and keep the row whose row_number is 1.

    SELECT * FROM (
        SELECT *, row_number() OVER (PARTITION BY user_id ORDER BY create_time) AS rk FROM test_table
    ) t
    WHERE t.rk = 1;

    Running totals in Hive queries

    1. Requirements

    Given a Hive table record with the following contents:

    Column meanings: channel_type (string) is the channel type, dt (string) the date, and num (int) that day's count. Sample data:

    select * from record;
    channel_type  dt     num
    A    2015-01-01    8
    A    2015-01-02    4
    A    2015-01-02    5
    C    2015-02-01    1
    A    2015-01-04    5
    A    2015-01-05    6
    B    2015-01-03    2
    B    2015-01-02    3
    A    2015-01-03    2
    C    2015-01-30    8
    C    2015-01-30    7
    B    2015-01-02    9
    B    2015-01-01    1
    C    2015-02-02    3

    For each channel, compute the running maximum single-day count and the running total through each day:

    -- First compute each channel's daily total:
    create table record_nj as  
    select
    channel_type,dt,sum(num) as new_join
    from record
    group by channel_type,dt;
    channel_type    dt    new_join
    A    2015-01-01    8
    A    2015-01-02    9
    A    2015-01-03    2
    A    2015-01-04    5
    A    2015-01-05    6
    B    2015-01-01    1
    B    2015-01-02    12
    B    2015-01-03    2
    C    2015-01-30    15
    C    2015-02-01    1
    C    2015-02-02    3
    Method 1: the window functions sum() and max() with OVER
    
    select 
        channel_type, dt, new_join, 
        sum(new_join) over(partition by channel_type order by dt) as sum_count,
        max(new_join) over(partition by channel_type order by dt) as max_count
    from record_nj;
    
    Method 2: GROUP BY with a self-join
    select 
        t1.channel_type,t1.dt,t1.new_join, 
        sum(t2.new_join) sum_count, max(t2.new_join) max_count 
    from record_nj t1 join record_nj t2 on t1.channel_type = t2.channel_type 
    where t1.dt >= t2.dt 
    group by t1.channel_type,t1.dt,t1.new_join order by t1.channel_type,t1.dt;
    
    The results:
    channel_type    dt    new_join    sum_count    max_count
    A    2015-01-01    8    8    8
    A    2015-01-02    9    17    9
    A    2015-01-03    2    19    9
    A    2015-01-04    5    24    9
    A    2015-01-05    6    30    9
    B    2015-01-01    1    1    1
    B    2015-01-02    12    13    12
    B    2015-01-03    2    15    12
    C    2015-01-30    15    15    15
    C    2015-02-01    1    16    15
    C    2015-02-02    3    19    15

     Running sum, running product, and running max:

    select 
        channel_type,
        new_join, 
        sum(new_join) over(partition by channel_type order by dt) as sum_count, -- running sum (default frame)
        sum(new_join) over(partition by channel_type order by dt rows between unbounded preceding and current row) as sum_count2, -- running sum (explicit ROWS frame)
        round(power(10, sum(log(10, new_join)) over(partition by channel_type order by dt rows between unbounded preceding and current row))) as tired, -- running product via logarithms
        max(new_join) over(partition by channel_type order by dt) as max_count -- running max
    from record_nj;
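    The running-product line relies on sum(log10(x)) = log10(x1 * x2 * …), so power(10, …) recovers the product and round() cleans up floating-point error; it requires every new_join to be positive, since the logarithm is undefined at zero and below. A minimal check:

    ```sql
    -- product of 2, 3 and 4 via logarithms: 10^(log10(2)+log10(3)+log10(4)) = 24
    SELECT round(power(10, log10(2) + log10(3) + log10(4)));
    ```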

    Rows to columns (pivot)

    The data:

    select * from score;
    name    subject    score
    孙悟空    语文    87
    孙悟空    数学    95
    孙悟空    英语    68
    大海    语文    94
    大海    数学    56
    大海    英语    84
    kris    语文    64
    kris    数学    86
    kris    英语    84
    婷婷    语文    65
    婷婷    数学    85
    婷婷    英语    78

    Find the students whose Chinese (语文) score beats their math (数学) score:

    Method 1: a self-join
    select s1.name, s1.subject, s1.score
    from score s1
    inner join score s2 on s1.name = s2.name
    where s1.score > s2.score and s1.subject = '语文' and s2.subject = '数学';
    s1.name    s1.subject    s1.score
    大海    语文    94
    Method 2: pivot the rows into columns
    create table t1 AS
    select 
     name,
     case subject when '语文' then score else 0 end as chinese_score,
     case subject when '数学' then score else 0 end as math_score 
    from score;
    name    chinese_score    math_score
    孙悟空    87    0
    孙悟空    0    95
    孙悟空    0    0
    大海    94    0
    大海    0    56
    大海    0    0
    kris    64    0
    kris    0    86
    kris    0    0
    婷婷    65    0
    婷婷    0    85
    婷婷    0    0
    create table t2 AS
    select
    name,max(chinese_score) chinese_score,max(math_score) math_score
    from t1 group by name;
    name    chinese_score    math_score
    kris    64    86
    大海    94    56
    婷婷    65    85
    孙悟空    87    95
    select 
      name, chinese_score, math_score
    from t2 where chinese_score > math_score;
    Or the three statements combined into one:
    select name, chinese_score, math_score
    from (
        select name,
               max(chinese_score) chinese_score,
               max(math_score) math_score
        from (
            select name,
                   case subject when '语文' then score else 0 end as chinese_score,
                   case subject when '数学' then score else 0 end as math_score
            from score
        ) t1
        group by t1.name
    ) t2
    where chinese_score > math_score;
    name    chinese_score    math_score
    大海    94    56

    Columns to rows (unpivot):

    The data (a second table, also named score, with one column per subject):

    id  sname   math    computer    english
    1   Jed     34      58          58
    2   Tony    45      87          45
    3   Tom     76      34          89
    select id, sname, 'math' as course, math as score from score
    union 
    select id, sname, 'computer' as course, computer as score from score
    union 
    select id, sname, 'english' as course, english as score from score
    order by id, sname, course;

    The result:

    id  sname   course      score
    1   Jed     computer    58
    1   Jed     english     58
    1   Jed     math        34
    2   Tony    computer    87
    2   Tony    english     45
    2   Tony    math        45
    3   Tom     computer    34
    3   Tom     english     89
    3   Tom     math        76
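    As a Hive-idiomatic alternative to the UNION approach, the same unpivot can be written with a map plus LATERAL VIEW explode (a sketch against the same table):

    ```sql
    -- build a course -> score map per row, then explode it into (course, score) pairs
    SELECT id, sname, course, score
    FROM score
    LATERAL VIEW explode(map('math', math, 'computer', computer, 'english', english)) t AS course, score;
    ```

    This scans the table once instead of three times.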
     
     

     Window functions

    1. Aggregate functions: sum, avg, max, min

    Create the table and load the data:
    create table cookie(
      cookie_id  string,
      create_time string,
      pv int
    )row format delimited fields terminated by ",";
    
    a_cookie,2019-06-10,1
    a_cookie,2019-06-11,9
    a_cookie,2019-06-12,7
    a_cookie,2019-06-13,3
    a_cookie,2019-06-14,2
    a_cookie,2019-06-15,4
    a_cookie,2019-06-16,4
    b_cookie,2019-08-17,6
    b_cookie,2019-08-18,9
    b_cookie,2019-08-19,5
    b_cookie,2019-08-17,2
    load data local inpath  "/opt/module/datas/cookie.txt" into table cookie;
    select 
        cookie_id,create_time,pv,
        sum(pv) over(partition by cookie_id) as sum1, -- total pv across the whole partition
        sum(pv) over(partition by cookie_id order by create_time) as accu_sum -- running total from the first row to the current row
    from cookie;

    The results (note the two b_cookie rows for 2019-08-17 both show a running total of 8: with ORDER BY and no explicit frame, rows with equal sort keys are peers and are accumulated together):

    a_cookie    2019-06-10    1    30    1
    a_cookie    2019-06-11    9    30    10
    a_cookie    2019-06-12    7    30    17
    a_cookie    2019-06-13    3    30    20
    a_cookie    2019-06-14    2    30    22
    a_cookie    2019-06-15    4    30    26
    a_cookie    2019-06-16    4    30    30
    b_cookie    2019-08-17    2    22    8
    b_cookie    2019-08-17    6    22    8
    b_cookie    2019-08-18    9    22    17
    b_cookie    2019-08-19    5    22    22
    select
        cookie_id,create_time,pv,
        sum(pv) over(partition by cookie_id) as sum1, -- total pv across the whole partition
        sum(pv) over(partition by cookie_id order by create_time) as accu_sum, -- running total (default RANGE frame: tied rows share a value)
        sum(pv) over(partition by cookie_id order by create_time rows between unbounded preceding and current row) as sum3,
                             -- ROWS frame from the partition start to the current row; unlike the default frame, tied rows are not merged
        sum(pv) over(partition by cookie_id order by create_time rows between current row and unbounded following) as sum4,
                           -- from the current row to the end of the partition: the reverse accumulation
        sum(pv) over(partition by cookie_id order by create_time rows between 3 preceding and current row) as sum5,
                           -- from 3 rows before the current row through the current row: 4 rows in total
        sum(pv) over(partition by cookie_id order by create_time rows between 3 preceding and 1 following) as sum6
                            -- from 3 rows before through 1 row after the current row
    from cookie;

    The results:

    a_cookie    2019-06-10    1    30    1    1    30    1    10
    a_cookie    2019-06-11    9    30    10    10    29    10    17
    a_cookie    2019-06-12    7    30    17    17    20    17    20
    a_cookie    2019-06-13    3    30    20    20    13    20    22
    a_cookie    2019-06-14    2    30    22    22    10    21    25
    a_cookie    2019-06-15    4    30    26    26    8    16    20
    a_cookie    2019-06-16    4    30    30    30    4    13    13
    b_cookie    2019-08-17    2    22    8    2    22    2    8
    b_cookie    2019-08-17    6    22    8    8    20    8    17
    b_cookie    2019-08-18    9    22    17    17    14    17    22
    b_cookie    2019-08-19    5    22    22    22    5    22    22

    sum, avg, max and min all share this same window syntax.

    2. ntile (slicing)

     ntile(n) splits each partition's ordered rows into n buckets and returns the current row's bucket number. ntile does not support a ROWS BETWEEN frame. When the rows do not divide evenly, the leading buckets each take one extra row.
    Split the partitions into 1, 2, 3 and 4 buckets:
    select 
        cookie_id,create_time,pv,
        ntile(1) over(partition by  cookie_id order by create_time) as nt1,
        ntile(2) over(partition by  cookie_id order by create_time) as nt2,
        ntile(3) over(partition by  cookie_id order by create_time) as nt3,
        ntile(4) over(partition by  cookie_id order by create_time) as nt4
    from cookie;
    Result: the a_cookie group has 7 rows, so the leading buckets each take one extra row:
    cookie_id    create_time pv nt1 nt2 nt3 nt4
    a_cookie    2019-06-10    1    1    1    1    1
    a_cookie    2019-06-11    9    1    1    1    1
    a_cookie    2019-06-12    7    1    1    1    2
    a_cookie    2019-06-13    3    1    1    2    2
    a_cookie    2019-06-14    2    1    2    2    3
    a_cookie    2019-06-15    4    1    2    3    3
    a_cookie    2019-06-16    4    1    2    3    4
    b_cookie    2019-08-17    6    1    1    1    1
    b_cookie    2019-08-17    2    1    1    1    2
    b_cookie    2019-08-18    9    1    2    2    3
    b_cookie    2019-08-19    5    1    2    3    4

     Use case: for each cookie, find the top one-third of days by pv

    select 
        cookie_id,create_time,pv
    from(
    select 
        cookie_id,create_time,pv,
        ntile(3) over(partition by cookie_id order by pv desc) as nt1 -- order by pv descending; bucket 1 is the answer
    from cookie
    )t where nt1 = 1;
    a_cookie    2019-06-11    9
    a_cookie    2019-06-12    7
    a_cookie    2019-06-16    4
    b_cookie    2019-08-18    9
    b_cookie    2019-08-17    6

     3. Ranking: row_number(), rank(), dense_rank()

        row_number() numbers each partition's rows sequentially starting from 1; tied values still get distinct numbers.
        rank(): tied values share a rank, and the ranks that follow skip, leaving gaps.
        dense_rank(): tied values share a rank, with no gaps.
    select 
        cookie_id,create_time,pv,
        row_number() over(partition by  cookie_id order by pv desc) as row_index,
        rank() over(partition by  cookie_id order by pv desc) as rank_index,
        dense_rank() over(partition by  cookie_id order by pv desc) as dense_index
    from cookie;
    cookie_id    create_time pv  row_index rank_index dense_index
    a_cookie    2019-06-11    9    1    1    1
    a_cookie    2019-06-12    7    2    2    2
    a_cookie    2019-06-16    4    3    3    3
    a_cookie    2019-06-15    4    4    3    3
    a_cookie    2019-06-13    3    5    5    4
    a_cookie    2019-06-14    2    6    6    5
    a_cookie    2019-06-10    1    7    7    6

    Common use case: top-N per group, e.g. each cookie's top three days by pv

    create table tmp as
    select cookie_id, create_time, pv,
    row_number() over(partition by cookie_id order by pv desc) as rn  -- avoid naming this `index`, a Hive reserved word
    from cookie;
    
    select * from tmp where rn <= 3;

     4. lag | lead | first_value | last_value

    LAG(col, n, DEFAULT) returns the value from the row n rows above the current row within the window.

    The first argument is the column name.
    The second argument is how many rows up to look (optional; defaults to 1).
    The third argument is the default value (returned when the value n rows up is NULL or missing; NULL if omitted).

    select 
        cookie_id,create_time,pv, 
        lag(pv,1) over (partition by cookie_id order by create_time) as last_1_pv -- a default of 0 could be set with lag(pv,1,0) over(...)
     from cookie;

    The results:

    Result: no default was given, so rows with no previous row show NULL
    cookie_id  create_time  pv  last_1_pv
    a_cookie    2019-06-10    1    NULL
    a_cookie    2019-06-11    9    1
    a_cookie    2019-06-12    7    9
    a_cookie    2019-06-13    3    7
    a_cookie    2019-06-14    2    3
    a_cookie    2019-06-15    4    2
    a_cookie    2019-06-16    4    4
    b_cookie    2019-08-17    6    NULL
    b_cookie    2019-08-17    2    6
    b_cookie    2019-08-18    9    2
    b_cookie    2019-08-19    5    9
    lead does the opposite of lag.
    LEAD(col, n, DEFAULT) returns the value from the row n rows below the current row within the window.
    The first argument is the column name.
    The second argument is how many rows down to look (optional; defaults to 1).
    The third argument is the default value (returned when the value n rows down is NULL or missing; NULL if omitted).
    select 
        cookie_id,create_time,pv, 
        lead(pv,1) over(partition by cookie_id order by create_time) as next_1_pv,
        lead(create_time,1,'1970-01-01') over(partition by cookie_id order by create_time) as next_1_time 
    from cookie;
    The results:
    cookie_id create_time pv next_1_pv next_1_time 
    a_cookie    2019-06-10    1    9    2019-06-11
    a_cookie    2019-06-11    9    7    2019-06-12
    a_cookie    2019-06-12    7    3    2019-06-13
    a_cookie    2019-06-13    3    2    2019-06-14
    a_cookie    2019-06-14    2    4    2019-06-15
    a_cookie    2019-06-15    4    4    2019-06-16
    a_cookie    2019-06-16    4    NULL    1970-01-01
    b_cookie    2019-08-17    6    2    2019-08-17
    b_cookie    2019-08-17    2    9    2019-08-18
    b_cookie    2019-08-18    9    5    2019-08-19
    b_cookie    2019-08-19    5    NULL    1970-01-01

    first_value: within the sorted partition, the first value up to the current row

    select 
        cookie_id,create_time,pv, 
        first_value(pv) over (partition by cookie_id  order by pv) as first_pv 
    from cookie;
    The results:
    cookie_id create_time  pv first_pv 
    a_cookie    2019-06-10    1    1
    a_cookie    2019-06-14    2    1
    a_cookie    2019-06-13    3    1
    a_cookie    2019-06-16    4    1
    a_cookie    2019-06-15    4    1
    a_cookie    2019-06-12    7    1
    a_cookie    2019-06-11    9    1
    b_cookie    2019-08-17    2    2
    b_cookie    2019-08-19    5    2
    b_cookie    2019-08-17    6    2
    b_cookie    2019-08-18    9    2

    last_value: within the sorted partition, the last value up to the current row (which is just the current row itself)

    select 
        cookie_id,create_time,pv, 
        last_value(pv) over (partition by cookie_id  order by pv) as last_pv 
    from cookie;
    cookie_id create_time  pv last_pv 
    a_cookie    2019-06-10    1    1
    a_cookie    2019-06-14    2    2
    a_cookie    2019-06-13    3    3
    a_cookie    2019-06-16    4    4
    a_cookie    2019-06-15    4    4
    a_cookie    2019-06-12    7    7
    a_cookie    2019-06-11    9    9
    b_cookie    2019-08-17    2    2
    b_cookie    2019-08-19    5    5
    b_cookie    2019-08-17    6    6
    b_cookie    2019-08-18    9    9
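    The reason last_value just echoes the current row: with ORDER BY and no explicit frame, the window ends at the current row. To get the true last value of each partition, widen the frame explicitly:

    ```sql
    select 
        cookie_id, create_time, pv, 
        last_value(pv) over (
            partition by cookie_id order by pv
            rows between unbounded preceding and unbounded following  -- span the whole partition
        ) as last_pv 
    from cookie;
    ```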

    See also:

    grouping sets, grouping__id, cube and rollup;

    cume_dist, percent_rank and other window functions:

    https://www.jianshu.com/nb/19948302

     
     
     
     
  • Original article: https://www.cnblogs.com/shengyang17/p/11204631.html