• Hive 学习(五) Hive之HSQL基础


    一,前言

    二,连接查询

      2.1 内连接(join | inner join)

      2.2 左外连接(left join | left outer join

      2.3 右外连接(right join | right outer join)

      2.4 全外连接(full join | full outer join)

      2.5 左半连接(left semi join)

    三,分组聚合

      3.1 分组聚合须知

      3.2 分组聚合练习1

      3.3 分组聚合练习2

    四,条件判断

      4.1 case...when...

      4.2 if判断

    正文

    一,前言

      和其他SQL一样,HSQL与其他的SQL差别不大,在这就进行一些简单的介绍。

    二,连接查询

      数据准备和表创建如下:

    create table t_a(name string,numb int)
    row format delimited
    fields terminated by ',';
    
    create table t_b(name string,nick string)
    row format delimited
    fields terminated by ',';
    
    load data local inpath '/root/hivetest/a.txt' into table t_a;
    load data local inpath '/root/hivetest/b.txt' into table t_b;
    
    0: jdbc:hive2://localhost:10000> select * from t_a;
    +-----------+-----------+--+
    | t_a.name  | t_a.numb  |
    +-----------+-----------+--+
    | a         | 4         |
    | b         | 5         |
    | c         | 6         |
    | d         | 8         |
    +-----------+-----------+--+
    4 rows selected (0.197 seconds)
    0: jdbc:hive2://localhost:10000> select * from t_b;
    +-----------+-----------+--+
    | t_b.name  | t_b.nick  |
    +-----------+-----------+--+
    | a         | laowang   |
    | b         | qiangzi   |
    | d         | paoge     |
    | g         | haizi     |
    +-----------+-----------+--+
    4 rows selected (0.135 seconds)

      2.1 内连接(join | inner join)

    -- 各类join
    --1/ 内连接
    -- 笛卡尔积
    select a.*,b.*
    from t_a a inner join t_b b;  // inner join == join
    
    +-----------+-----------+-----------+-----------+--+
    | t_a.name  | t_a.numb  | t_b.name  | t_b.nick  |
    +-----------+-----------+-----------+-----------+--+
    | a         | 4         | a         | laowang   |
    | b         | 5         | a         | laowang   |
    | c         | 6         | a         | laowang   |
    | d         | 8         | a         | laowang   |
    | a         | 4         | b         | qiangzi   |
    | b         | 5         | b         | qiangzi   |
    | c         | 6         | b         | qiangzi   |
    | d         | 8         | b         | qiangzi   |
    | a         | 4         | d         | paoge     |
    | b         | 5         | d         | paoge     |
    | c         | 6         | d         | paoge     |
    | d         | 8         | d         | paoge     |
    | a         | 4         | g         | haizi     |
    | b         | 5         | g         | haizi     |
    | c         | 6         | g         | haizi     |
    | d         | 8         | g         | haizi     |
    +-----------+-----------+-----------+-----------+--+
    16 rows selected (14.104 seconds)
    
    
    -- 指定join条件
    select a.*,b.*
    from 
    t_a a join t_b b on a.name=b.name;
    
    +---------+---------+---------+----------+--+
    | a.name  | a.numb  | b.name  |  b.nick  |
    +---------+---------+---------+----------+--+
    | a       | 4       | a       | laowang  |
    | b       | 5       | b       | qiangzi  |
    | d       | 8       | d       | paoge    |
    +---------+---------+---------+----------+--+
    3 rows selected (13.429 seconds)

      2.2 左外连接(left join | left outer join)

    -- 2/ 左外连接(左连接)
    select a.*,b.*
    from 
    t_a a left outer join t_b b on a.name=b.name;
    
    +---------+---------+---------+----------+--+
    | a.name  | a.numb  | b.name  |  b.nick  |
    +---------+---------+---------+----------+--+
    | a       | 4       | a       | laowang  |
    | b       | 5       | b       | qiangzi  |
    | c       | 6       | NULL    | NULL     |
    | d       | 8       | d       | paoge    |
    +---------+---------+---------+----------+--+
    4 rows selected (9.25 seconds)

      2.3 右外连接(right join | right outer join)

    -- 3/ 右外连接(右连接)
    select a.*,b.*
    from 
    t_a a right outer join t_b b on a.name=b.name;
    +---------+---------+---------+----------+--+
    | a.name  | a.numb  | b.name  |  b.nick  |
    +---------+---------+---------+----------+--+
    | a       | 4       | a       | laowang  |
    | b       | 5       | b       | qiangzi  |
    | d       | 8       | d       | paoge    |
    | NULL    | NULL    | g       | haizi    |
    +---------+---------+---------+----------+--+
    4 rows selected (8.914 seconds)

      2.4 全外连接(full join | full outer join)

    -- 4/ 全外连接
    select a.*,b.*
    from
    t_a a full outer join t_b b on a.name=b.name;
    
    +---------+---------+---------+----------+--+
    | a.name  | a.numb  | b.name  |  b.nick  |
    +---------+---------+---------+----------+--+
    | a       | 4       | a       | laowang  |
    | b       | 5       | b       | qiangzi  |
    | c       | 6       | NULL    | NULL     |
    | d       | 8       | d       | paoge    |
    | NULL    | NULL    | g       | haizi    |
    +---------+---------+---------+----------+--+
    5 rows selected (1.74 seconds)

      2.5 左半连接(left semi join)

      注意:left semi是HSQL特有,且没有右半连接

    -- 5/ 左半连接 :注意没有右半连接
    select a.*
    from 
    t_a a left semi join t_b b on a.name=b.name;
    
    +---------+---------+--+
    | a.name  | a.numb  |
    +---------+---------+--+
    | a       | 4       |
    | b       | 5       |
    | d       | 8       |
    +---------+---------+--+
    3 rows selected (12.236 seconds)

    三,分组聚合

      数据下载:access.log

      表创建和数据插入:

    create table t_access(ip string,url string,access_time string)
    partitioned by (dt string)
    row format delimited fields terminated by ',';
    
    
    -- 导入数据
    load data local inpath '/root/hivetest/access.log.0804' into table t_access partition(dt='2017-08-04');
    load data local inpath '/root/hivetest/access.log.0805' into table t_access partition(dt='2017-08-05');
    load data local inpath '/root/hivetest/access.log.0806' into table t_access partition(dt='2017-08-06');
    
    -- 查看表的分区
    show partitions t_access;

      3.1 分组聚合须知

       一旦有group by子句,那么,在select子句中就不能有 (分组字段,聚合函数) 以外的字段

      3.2 分组聚合练习1

    -- 针对每一行进行运算
    select ip,upper(url),access_time  -- 该表达式是对数据中的每一行进行逐行运算
    from t_access;
    
    
    
    -- 求每条URL的访问总次数
    
    select url,count(1) as cnts   -- 该表达式是对分好组的数据进行逐组运算
    from t_access 
    group by url;
    
    +---------------------------------+------+--+
    |               url               | _c1  |
    +---------------------------------+------+--+
    | http://www.edu360.cn/excersize  | 3    |
    | http://www.edu360.cn/job        | 7    |
    | http://www.edu360.cn/pay        | 1    |
    | http://www.edu360.cn/register   | 2    |
    | http://www.edu360.cn/stu        | 4    |
    | http://www.edu360.cn/teach      | 2    |
    +---------------------------------+------+--+
    6 rows selected (1.683 seconds)
    
    
    -- 求每个URL的访问者中ip地址最大的
    
    select url,max(ip)
    from t_pv_log
    group by url;
    
    
    +---------------------------------+----------------+--+
    |               url               |      _c1       |
    +---------------------------------+----------------+--+
    | http://www.edu360.cn/excersize  | 192.168.33.46  |
    | http://www.edu360.cn/job        | 192.168.33.55  |
    | http://www.edu360.cn/pay        | 192.168.34.44  |
    | http://www.edu360.cn/register   | 192.168.133.3  |
    | http://www.edu360.cn/stu        | 192.168.33.44  |
    | http://www.edu360.cn/teach      | 192.168.44.3   |
    +---------------------------------+----------------+--+
    6 rows selected (1.595 seconds)
    
    
    -- 求每个用户访问同一个页面的所有记录中,时间最晚的一条
    
    select ip,url,max(access_time) 
    from  t_pv_log
    group by ip,url;
    
    +----------------+---------------------------------+----------------------+--+
    |       it       |               url               |         _c2          |
    +----------------+---------------------------------+----------------------+--+
    | 192.168.111.3  | http://www.edu360.cn/register   | 2017-08-06 15:35:20  |
    | 192.168.133.3  | http://www.edu360.cn/register   | 2017-08-06 15:30:20  |
    | 192.168.33.25  | http://www.edu360.cn/job        | 2017-08-06 15:40:20  |
    | 192.168.33.3   | http://www.edu360.cn/stu        | 2017-08-05 15:30:20  |
    | 192.168.33.3   | http://www.edu360.cn/teach      | 2017-08-04 15:35:20  |
    | 192.168.33.36  | http://www.edu360.cn/excersize  | 2017-08-06 16:30:20  |
    | 192.168.33.4   | http://www.edu360.cn/job        | 2017-08-04 16:30:20  |
    | 192.168.33.4   | http://www.edu360.cn/stu        | 2017-08-04 15:30:20  |
    | 192.168.33.44  | http://www.edu360.cn/stu        | 2017-08-05 15:30:20  |
    | 192.168.33.46  | http://www.edu360.cn/excersize  | 2017-08-06 16:30:20  |
    | 192.168.33.46  | http://www.edu360.cn/job        | 2017-08-05 16:30:20  |
    | 192.168.33.5   | http://www.edu360.cn/job        | 2017-08-04 15:40:20  |
    | 192.168.33.55  | http://www.edu360.cn/job        | 2017-08-06 15:40:20  |
    | 192.168.34.44  | http://www.edu360.cn/pay        | 2017-08-06 15:30:20  |
    | 192.168.44.3   | http://www.edu360.cn/teach      | 2017-08-05 15:35:20  |
    +----------------+---------------------------------+----------------------+--+
    15 rows selected (1.597 seconds)

      3.3 分组聚合练习二

    -- 求8月4号以后,每天http://www.edu360.cn/job的总访问次数,及访问者中ip地址中最大的
    
    
    select dt,'http://www.edu360.cn/job',count(1),max(it)
    from t_access
    where url='http://www.edu360.cn/job'
    group by dt having dt>'2017-08-04';
    
    
    select dt,max(url),count(1),max(ip)
    from t_access
    where url='http://www.edu360.cn/job'
    group by dt having dt>'2017-08-04';
    
    
    select dt,url,count(1),max(ip)
    from t_access
    where url='http://www.edu360.cn/job'
    group by dt,url having dt>'2017-08-04';
    
    
    
    select dt,url,count(1),max(ip)
    from t_access
    where url='http://www.edu360.cn/job' and dt>'2017-08-04'
    group by dt,url;
    
    +-------------+---------------------------+------+----------------+--+
    |     dt      |            _c1            | _c2  |      _c3       |
    +-------------+---------------------------+------+----------------+--+
    | 2017-08-05  | http://www.edu360.cn/job  | 2    | 192.168.33.55  |
    | 2017-08-06  | http://www.edu360.cn/job  | 3    | 192.168.33.55  |
    +-------------+---------------------------+------+----------------+--+
    2 rows selected (1.609 seconds)
    
    -- 求8月4号以后,每天每个页面的总访问次数,及访问者中ip地址中最大的,且,只查询出总访问次数>2 的记录
    -- 方式1:
    select dt,url,count(1) as cnts,max(ip)
    from t_access
    where dt>'2017-08-04'
    group by dt,url having cnts>2;
    
    
    -- 方式2:用子查询
    select dt,url,cnts,max_ip
    from
    (select dt,url,count(1) as cnts,max(ip) as max_ip
    from t_access
    where dt>'2017-08-04'
    group by dt,url) tmp
    where cnts>2;
    
    
    +----------------+---------------------------------+-----------------------+--------------+--+
    |  t_access.ip   |          t_access.url           | t_access.access_time  | t_access.dt  |
    +----------------+---------------------------------+-----------------------+--------------+--+
    
    | 192.168.33.46  | http://www.edu360.cn/job        | 2017-08-05 16:30:20   | 2017-08-05   |
    | 192.168.33.55  | http://www.edu360.cn/job        | 2017-08-05 15:40:20   | 2017-08-05   |
    | 192.168.33.55  | http://www.edu360.cn/job        | 2017-08-06 15:40:20   | 2017-08-06   |
    | 192.168.33.25  | http://www.edu360.cn/job        | 2017-08-06 15:40:20   | 2017-08-06   |
    | 192.168.33.55  | http://www.edu360.cn/job        | 2017-08-06 15:40:20   | 2017-08-06   |
    +----------------+---------------------------------+-----------------------+--------------+

    四,条件判断

      数据下载:点击下载

      4.1 case...when...

      语法结构:

    case  // 开始提示符
        when 条件1 then 符合条件1的结果
        when 条件2 then 符合条件2的结果
        else 上述都不符合的结果
    end   //结束提示符

      表创建和数据导入:

    create table t_user(id int,name string,info struct<age:int,sex:string,addr:string>)
    row format delimited fields terminated by ','
    collection items terminated by ':';
    
    -- 导入数据
    load data local inpath '/root/hivetest/user.dat' into table t_user;

      数据查询:

    select * from t_user;
    +------------+--------------+----------------------------------------+--+
    | t_user.id  | t_user.name  |              t_user.info               |
    +------------+--------------+----------------------------------------+--+
    | 1          | zhangsan     | {"age":18,"sex":"male","addr":"深圳"}    |
    | 2          | lisi         | {"age":28,"sex":"female","addr":"北京"}  |
    | 3          | wangwu       | {"age":38,"sex":"male","addr":"广州"}    |
    | 4          | 赵六           | {"age":26,"sex":"female","addr":"上海"}  |
    | 5          | 钱琪           | {"age":35,"sex":"male","addr":"杭州"}    |
    | 6          | 王八           | {"age":48,"sex":"female","addr":"南京"}  |
    +------------+--------------+----------------------------------------+--+

      需求:查询出用户的id、name、年龄(如果年龄在30岁以下,显示年轻人,30-40之间,显示中年人,40以上老年人)

      需求实现:

    select id,name,
    case
      when info.age<30 then '青年'
      when info.age>=30 and info.age<40 then '中年'
      else '老年'
    end
    from t_user;

      4.2 if判断

      语法结构:

    if(条件,条件成立结果,条件不成立结果)

      表创建和数据导入:

    -- 建表映射:
    create table t_movie(movie_name string,actors array<string>,first_show date)
    row format delimited fields terminated by ','
    collection items terminated by ':';
    
    -- 导入数据
    load data local inpath '/root/hivetest/actor.dat' into table t_movie;

      数据查询:

    select * from t_movie;
    
    +---------------------+------------------------+---------------------+--+
    | t_movie.movie_name  |     t_movie.actors     | t_movie.first_show  |
    +---------------------+------------------------+---------------------+--+
    | 战狼2                 | ["吴京","吴刚","龙母"]       | 2017-08-16          |
    | 三生三世十里桃花            | ["刘亦菲","痒痒"]           | 2017-08-20          |
    | 普罗米修斯               | ["苍老师","小泽老师","波多老师"]  | 2017-09-17          |
    | 美女与野兽               | ["吴刚","加藤鹰"]           | 2017-09-17          |
    +---------------------+------------------------+---------------------+--+

      需求:查询电影信息,并且如果主演中有吴刚的,显示好电影,否则烂片

      需求实现:

    select movie_name,actors,first_show,
    if(array_contains(actors,'吴刚'),'好片儿','烂片儿')
    from t_movie;
  • 相关阅读:
    吹气球
    Leetcode 235
    什么是BPMN网关?
    BPMN中的任务和活动之间有什么区别?
    两款流程图设计推荐
    Activiti7.1, jBPM7.25, Camunda, Flowable6.3技术组成对比
    Flowable与activiti对比
    机器学习中的数学
    WopiServerTutorial
    如何整合Office Web Apps至自己开发的系统(二)
  • 原文地址:https://www.cnblogs.com/tashanzhishi/p/10901234.html
Copyright © 2020-2023  润新知