• HQL Exercises


    Hive Study Notes: Summary

    05. HQL Exercises

    1. Basic HQL exercises

    Exercises and data source: http://www.w2b-c.com/article/150326 (remove the hyphen)

    create and load

    create table students(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',' stored as textfile;
    create table course(Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile;
    create table sc(Sno int,Cno int,Grade int) row format delimited fields terminated by ',' stored as textfile;

    load data local inpath '/home/hadoop/hivedata/students.txt' overwrite into table students;
    load data local inpath '/home/hadoop/hivedata/sc.txt' overwrite into table sc;
    load data local inpath '/home/hadoop/hivedata/course.txt' overwrite into table course;
    

    1. List the student number and name of every student

    select Sno,Sname from students;
    

    2. List the names of students who are enrolled in at least one course

    select distinct Sname from students, sc where students.Sno = sc.Sno;


    Or:

    select distinct Sname from students inner join sc on students.Sno = sc.Sno;
    

    3. Count the total number of students

    select count(*) from students;
    

    4. Compute the average grade for course 1

    select avg(Grade) from sc where Cno = 1;
    

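    5. Compute the average grade for each course

    select Cname,avg(Grade) from sc, course where sc.Cno = course.Cno group by sc.Cno,Cname;


    // Every selected column must either appear in the group by clause or be wrapped in an aggregate function. Cname is therefore listed in group by, and Grade is aggregated with avg.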
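
    The group by rule can be illustrated on the sc table alone. This is a minimal sketch, assuming the sc table created above; the aliases student_count, avg_grade and max_grade are names introduced here for illustration.

```sql
-- Rejected by Hive: Sno is selected but is neither a group by key nor aggregated
-- select Cno, Sno, avg(Grade) from sc group by Cno;

-- Accepted: every selected column is a group by key or sits inside an aggregate
select Cno,
       count(Sno) as student_count,
       avg(Grade) as avg_grade,
       max(Grade) as max_grade
from sc
group by Cno;
```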

    6. Find the highest grade among students who took course 1

    select max(Grade) from sc where Cno = 1;
    

    7. List each course number with the number of students enrolled in it

    select Cno,count(*) from sc group by Cno;
    

    8. List the student numbers of students enrolled in more than 3 courses

    select Sno from sc group by Sno having count(Cno) >3 ;
    

    9. List student information, globally ordered by student number

    select * from students order by Sno;
    

    10. List student information, partitioned by sex and sorted by age within each partition

    set mapred.reduce.tasks=2;
    select * from students distribute by sex sort by sage;
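
    distribute by decides which reducer each row is sent to, and sort by orders rows only within each reducer. One way to see this is to write the result to a directory and look at one output file per reducer. This is a minimal sketch; the output directory is a hypothetical path. Two Sex values are not guaranteed to hash to different reducers, so the sketch sorts by Sex first to keep each group contiguous even if they share a reducer.

```sql
set mapred.reduce.tasks=2;

-- '/home/hadoop/hivedata/students_by_sex' is a hypothetical output directory
insert overwrite local directory '/home/hadoop/hivedata/students_by_sex'
select * from students distribute by Sex sort by Sex, Sage;

-- When the distribute and sort columns are the same, cluster by is shorthand:
-- the next two statements are equivalent
select * from students distribute by Sage sort by Sage;
select * from students cluster by Sage;
```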
    

    11. List each student together with their course enrollments

    select students.*,sc.* from students join sc on (students.Sno =sc.Sno);
    

    12. List the students' grades
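
    No query is recorded for item 12. A minimal sketch of one possible reading (each student's name, the course name and the grade), assuming the three tables created above; the intended query may differ:

```sql
-- Join students to their grades, then to the course names
select students.Sname, course.Cname, sc.Grade
from students
join sc on students.Sno = sc.Sno
join course on sc.Cno = course.Cno;
```
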
    13. Find all students who took course 2 with a grade above 90.

    select students.Sname from sc,students where sc.Cno = 2 and sc.Grade > 90 and sc.Sno = students.Sno;
    

    Or:

    select students.Sname,sc.Grade from students join sc on students.Sno=sc.Sno where  sc.Cno=2 and sc.Grade>90;
    

    14. List all students; for students who have rows in the grade table, also output the course number

    select students.Sname,sc.Cno from students left join sc on students.Sno=sc.Sno;
    

    15. Rewrite the following subquery as a LEFT SEMI JOIN

    SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);


    Purpose: find the rows of a whose key also exists in b.
    It can be rewritten as:

    select a.key,a.value from a left semi join b on a.key = b.key;
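
    For comparison, the same lookup can be written with exists or with an inner join. This is a minimal sketch against the same placeholder tables a and b; correlated exists subqueries assume Hive 0.13 or later.

```sql
-- Same rows as the left semi join: rows of a whose key exists in b
select a.key, a.value
from a
where exists (select b.key from b where b.key = a.key);

-- An inner join repeats a row of a once per matching row of b, so distinct is
-- needed; this matches the semi join only if (key, value) pairs in a are unique
select distinct a.key, a.value
from a join b on a.key = b.key;
```

    Note that with left semi join, columns of the right table (b) can be referenced only in the on clause, not in select or where.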
    

    16. Find the students who are in the same department as "刘晨"

    select s1.Sname from students s1 where s1.Sdept in (select s2.Sdept from students s2 where s2.Sname = '刘晨');
    

    Or:

    select s1.Sname from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
    

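    Compare the following:

    -- left join: every s1 row is kept; s2 columns are NULL where nothing matches
    select * from students s1 left join students s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
    -- right join: every s2 row is kept; s1 columns are NULL unless s2 is '刘晨'
    select * from students s1 right join students s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
    -- inner join: only the matching (s1, s2) pairs are returned
    select * from students s1 inner join students s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
    -- left semi join: only matching s1 rows, and only the s1 columns
    select * from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';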
    

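    2. Execution order

    Standard order in which the clauses are written:
    select--from--where--group by--having--order by
    Logical evaluation order:
    from--where--group by--having--select--order by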
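
    One practical consequence of clause ordering is that where is evaluated before the select list, so a column alias defined in select is not visible in where. This is a minimal sketch against the students table from section 1; the alias next_age and the threshold 20 are made up for illustration.

```sql
-- Fails: next_age is defined in the select list, which is evaluated after where
-- select Sname, Sage + 1 as next_age from students where next_age > 20;

-- Works: compute the alias in a subquery, then filter in the outer query
select t.Sname, t.next_age
from (select Sname, Sage + 1 as next_age from students) t
where t.next_age > 20;
```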

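    Difference between the on condition and the where condition in a join

    When a database joins two or more tables to return records, it first builds an intermediate temporary table and then returns rows from that table to the user.
    The join is evaluated before the where clause. With a left join, the on and where conditions differ as follows:

    1. The on condition is applied while the intermediate table is being built. Whether or not it is true for a given row of the left table, that row is still returned (with the right-table columns set to NULL when nothing matches).

    2. The where condition filters the intermediate table after it has been built. At that point the left join guarantee (every left-table row is returned) no longer applies, and every row for which the condition is not true is filtered out.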

    Suppose there are two tables:

    Table 1: tab1

    id size 
    1  10 
    2  20 
    3  30 


    Table 2: tab2

    size name 
    10   AAA 
    20   BBB 
    20   CCC 


    Two SQL statements:

    1. select * from tab1 left join tab2 on tab1.size = tab2.size where tab2.name='AAA'
    2. select * from tab1 left join tab2 on tab1.size = tab2.size and tab2.name='AAA'
    

    How the first statement is evaluated:
    1. Intermediate table
    on condition:

    tab1.size = tab2.size 
    tab1.id tab1.size tab2.size tab2.name 
    1 10 10 AAA 
    2 20 20 BBB 
    2 20 20 CCC 
    3 30 (null) (null) 
    

    2. The intermediate table is then filtered
    where condition:

    tab2.name='AAA'
    tab1.id tab1.size tab2.size tab2.name 
    1 10 10 AAA 
    

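    How the second statement is evaluated:
    1. Intermediate table
    on condition:

    tab1.size = tab2.size and tab2.name='AAA'
    (rows of the left table are returned even when the condition is not true)
    tab1.id tab1.size tab2.size tab2.name 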
    1 10 10 AAA 
    2 20 (null) (null) 
    3 30 (null) (null) 
    

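    The results above come down to the special behavior of left join, right join and full join:
    whether or not the on condition is true, the rows of the left (or right) table are always returned, and full join combines the behavior of both.

    ** inner join has no such behavior, so a condition returns the same result set whether it is placed in on or in where. **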
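
    The comparison can be reproduced directly in Hive. This is a minimal sketch, assuming Hive 0.14 or later for insert ... values (on older versions, load the rows from a file instead). If your Hive version rejects size as a column name, quote it with backticks.

```sql
create table tab1(id int, size int);
create table tab2(size int, name string);
insert into table tab1 values (1, 10), (2, 20), (3, 30);
insert into table tab2 values (10, 'AAA'), (20, 'BBB'), (20, 'CCC');

-- left join, filter in where: only the row (1, 10, 10, AAA) survives
select * from tab1 left join tab2 on tab1.size = tab2.size where tab2.name = 'AAA';

-- left join, filter in on: all three tab1 rows are kept, ids 2 and 3 padded with NULL
select * from tab1 left join tab2 on (tab1.size = tab2.size and tab2.name = 'AAA');

-- inner join: both placements return the single row (1, 10, 10, AAA)
select * from tab1 join tab2 on tab1.size = tab2.size where tab2.name = 'AAA';
select * from tab1 join tab2 on (tab1.size = tab2.size and tab2.name = 'AAA');
```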

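    3. Hive in practice: cascading sum (cumulative report)

    Requirement:
    Given the following visitor access-count table t_access_times (created below as t_access_time)

    visitor month visits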
    A 2015-01 5
    A 2015-01 15
    B 2015-01 5
    A 2015-01 8
    B 2015-01 25
    A 2015-01 5
    A 2015-02 4
    A 2015-02 6
    B 2015-02 10
    B 2015-02 5

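    Required output report: t_access_times_accumulate
    Monthly total: the total number of visits in that month; cumulative total: the sum of the monthly totals up to and including that month.

    visitor month monthly_total cumulative_total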
    A 2015-01 33 33
    A 2015-02 10 43
    B 2015-01 30 30
    B 2015-02 15 45

    Prepare the data:
    A,2015-01,5
    A,2015-01,15
    B,2015-01,5
    A,2015-01,8
    B,2015-01,25
    A,2015-01,5
    A,2015-02,4
    A,2015-02,6
    B,2015-02,10
    B,2015-02,5

    create table t_access_time(username string,month string,salary int)
    row format delimited fields terminated by ',';
    
    load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_time;
    

    1. Step one: compute each user's monthly total (the count column is named salary in this table)

    select username,month,sum(salary) from t_access_time group by username,month;
    

    +-----------+----------+---------+
    | username  | month    | salary  |
    +-----------+----------+---------+
    | A         | 2015-01  | 33      |
    | A         | 2015-02  | 10      |
    | B         | 2015-01  | 30      |
    | B         | 2015-02  | 15      |
    +-----------+----------+---------+

    2. Step two: join the monthly-total table with itself (self join)

    select * from 
    (select username,month,sum(salary) as salary from t_access_time group by username,month) TabA 
    inner join 
    (select username,month,sum(salary) as salary from t_access_time group by username,month) TabB 
    on TabA.username = TabB.username;
    

    +-------------+----------+-----------+-------------+----------+-----------+
    | a.username  | a.month  | a.salary  | b.username  | b.month  | b.salary  |
    +-------------+----------+-----------+-------------+----------+-----------+
    | A           | 2015-01  | 33        | A           | 2015-01  | 33        |
    | A           | 2015-01  | 33        | A           | 2015-02  | 10        |
    | A           | 2015-02  | 10        | A           | 2015-01  | 33        |
    | A           | 2015-02  | 10        | A           | 2015-02  | 10        |
    | B           | 2015-01  | 30        | B           | 2015-01  | 30        |
    | B           | 2015-01  | 30        | B           | 2015-02  | 15        |
    | B           | 2015-02  | 15        | B           | 2015-01  | 30        |
    | B           | 2015-02  | 15        | B           | 2015-02  | 15        |
    +-------------+----------+-----------+-------------+----------+-----------+

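    3. Step three: run a grouped query over the result of the previous step,
    grouping by a.username and a.month (TabA.username and TabA.month in the query).
    To get the cumulative value for each month, sum b.salary over all rows where b.month <= a.month.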

    select TabA.username,TabA.month,max(TabA.salary) as month_salary,sum(TabB.salary) as sum_salary 
    from 
    (select username,month,sum(salary) as salary from t_access_time group by username,month) TabA 
    inner join 
    (select username,month,sum(salary) as salary from t_access_time group by username,month) TabB 
    on TabA.username = TabB.username 
    where TabB.month<= TabA.month 
    group by TabA.username,TabA.month;
    

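    max(TabA.salary) cannot be written as plain TabA.salary, because that column neither appears in group by nor is wrapped in an aggregate function. Every joined row within a group carries the same TabA.salary, so max simply returns that value.

    Result: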
    A 2015-01 33 33
    A 2015-02 10 43
    B 2015-01 30 30
    B 2015-02 15 45
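
    The same report can also be produced without a self join by using a window function. This is a sketch of an alternative technique, not the method used in these notes; it assumes a Hive version with windowing support (0.11 or later) and reuses the t_access_time table from above. Months in the form YYYY-MM sort correctly as strings, which the order by relies on.

```sql
select username,
       month,
       month_salary,
       -- running total of the monthly sums, restarted for each user
       sum(month_salary) over (partition by username
                               order by month
                               rows between unbounded preceding and current row) as sum_salary
from (
  select username, month, sum(salary) as month_salary
  from t_access_time
  group by username, month
) t;
```

    This avoids the intermediate join, whose row count per user grows with the square of the number of months.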

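    Reference: http://www.w2b-c.com/article/150326 (remove the hyphen)

    I am new to this and these are my study notes. There are surely still many problems in them, so corrections are welcome. Thanks.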

  • Original article: https://www.cnblogs.com/wangrd/p/6275604.html