• Hive Basics: Common Hive Table Operations


    All data in these examples comes from the emp and dept sample tables that ship with Oracle.

    Creating tables

    Syntax:

    CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
      [(col_name data_type [COMMENT col_comment], ...)]
      [COMMENT table_comment]
      [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
      [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
      [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES] (Note: Only available starting with Hive 0.10.0)]
      [
       [ROW FORMAT row_format] [STORED AS file_format]
       | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] (Note: Only available starting with Hive 0.6.0)
      ]
      [LOCATION hdfs_path]
      [TBLPROPERTIES (property_name=property_value, ...)] (Note: Only available starting with Hive 0.6.0)
      [AS select_statement] (Note: Only available starting with Hive 0.5.0, and not supported when creating external tables.)
    create table emp(
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string,
    sal double,
    comm double,
    deptno int
    )
    row format delimited fields terminated by '\t' lines terminated by '\n'
    stored as textfile;
    
    create table dept(
    deptno int,
    dname string,
    loc string
    )
    row format delimited fields terminated by '\t' lines terminated by '\n'
    stored as textfile;

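    Note: when a table is created without an explicit row format, the default column delimiter is \001 (Ctrl-A) and the default row delimiter is \n.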
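
    To make the default layout concrete, here is a minimal Python sketch of how rows look in a text file that uses \001 between fields and \n between rows. The helper names encode_rows and decode_rows are hypothetical and are not part of Hive; the sketch only illustrates the delimiters named in the note above.

```python
# Default Hive text layout: fields separated by \001 (Ctrl-A), rows by \n.
FIELD_DELIM = "\x01"
ROW_DELIM = "\n"


def encode_rows(rows):
    """Serialize rows of strings using the default delimiters (hypothetical helper)."""
    return "".join(FIELD_DELIM.join(row) + ROW_DELIM for row in rows)


def decode_rows(text):
    """Split delimited text back into rows of fields (hypothetical helper)."""
    return [line.split(FIELD_DELIM) for line in text.split(ROW_DELIM) if line]


# Two rows from the dept table; note that "NEW YORK" contains a space.
dept = [["10", "ACCOUNTING", "NEW YORK"], ["20", "RESEARCH", "DALLAS"]]
raw = encode_rows(dept)
print(repr(raw))
print(decode_rows(raw))
```

    Because the delimiter is a non-printing character, values that contain spaces, such as NEW YORK, survive the round trip as a single field.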

    Loading data into Hive tables

    Data sources Hive can work with: files, other tables, and other databases.

    1) load: load a local or HDFS file into a Hive table

    LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename 
    [PARTITION (partcol1=val1, partcol2=val2 ...)]

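    By default, table data is stored on HDFS under /user/hive/warehouse; this directory can be changed in hive-site.xml (property hive.metastore.warehouse.dir).

    load data inpath loads an HDFS file into a Hive table;

    load data local inpath loads a file from the local file system into a Hive table;

    overwrite controls whether data already in the table is replaced: with overwrite the existing data is replaced, without it the new files are added to the table.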

    load data local inpath '/home/spark/software/data/emp.txt' overwrite into table emp;
    load data local inpath '/home/spark/software/data/dept.txt' overwrite into table dept;

    2) insert: write query results into a table, or export from a table to an HDFS or local directory

    Standard syntax:
    INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
    INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
     
    Hive extension (multiple inserts):
    FROM from_statement
    INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
    [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
    [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
    FROM from_statement
    INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
    [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
    [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
    
    Standard syntax:
    INSERT OVERWRITE [LOCAL] DIRECTORY directory1
      [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
      SELECT ... FROM ...
     
    Hive extension (multiple inserts):
    FROM from_statement
    INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
    [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

    3) sqoop: import/export between relational databases and HDFS files

    See the sqoop chapter for details.

    select operations

    select * from emp;
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
    7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
    7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
    7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
    7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
    7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
    7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
    7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
    7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
    7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
    7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
    7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
    7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
    
    select * from dept;
    10      ACCOUNTING      NEW YORK
    20      RESEARCH        DALLAS
    30      SALES   CHICAGO
    40      OPERATIONS      BOSTON

    Using where

    select * from emp where deptno =10;
    7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
    7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
    7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
    
    select * from emp where deptno <>10;     
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
    7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
    7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
    7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
    7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
    7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
    7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
    7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
    7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
    7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
    
    select * from emp where ename ='SCOTT';
    7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
    
    select ename,sal from emp where sal between 800 and 1500;  
    SMITH   800.0
    WARD    1250.0
    MARTIN  1250.0
    TURNER  1500.0
    ADAMS   1100.0
    JAMES   950.0
    MILLER  1300.0

    Using limit

    select * from emp limit 4;
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
    7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
    7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20

    Using (not) in

    select ename,sal,comm from emp where ename in ('SMITH','KING');
    SMITH   800.0   NULL
    KING    5000.0  NULL
    
    select ename,sal,comm from emp where ename not in ('SMITH','KING');
    ALLEN   1600.0  300.0
    WARD    1250.0  500.0
    JONES   2975.0  NULL
    MARTIN  1250.0  1400.0
    BLAKE   2850.0  NULL
    CLARK   2450.0  NULL
    SCOTT   3000.0  NULL
    TURNER  1500.0  0.0
    ADAMS   1100.0  NULL
    JAMES   950.0   NULL
    FORD    3000.0  NULL
    MILLER  1300.0  NULL

    Using is (not) null

    select ename,sal,comm from emp where comm is null;
    SMITH   800.0   NULL
    JONES   2975.0  NULL
    BLAKE   2850.0  NULL
    CLARK   2450.0  NULL
    SCOTT   3000.0  NULL
    KING    5000.0  NULL
    ADAMS   1100.0  NULL
    JAMES   950.0   NULL
    FORD    3000.0  NULL
    MILLER  1300.0  NULL
    
    select ename,sal,comm from emp where comm is not null;
    ALLEN   1600.0  300.0
    WARD    1250.0  500.0
    MARTIN  1250.0  1400.0
    TURNER  1500.0  0.0

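    Using order by

    order by works the same way as in a relational database: the output is sorted by one or more columns.

    The difference from a relational database is that when hive.mapred.mode=strict, a limit must be specified, otherwise the query fails.

    hive.mapred.mode defaults to nonstrict in the Hive version used here.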

    select * from dept;
    10      ACCOUNTING      NEW YORK
    20      RESEARCH        DALLAS
    30      SALES   CHICAGO
    40      OPERATIONS      BOSTON
    select * from dept order by deptno desc;
    40      OPERATIONS      BOSTON
    30      SALES   CHICAGO
    20      RESEARCH        DALLAS
    10      ACCOUNTING      NEW YORK
    
    select ename,sal,deptno from emp order by deptno asc,ename desc;
    MILLER  1300.0  10
    KING    5000.0  10
    CLARK   2450.0  10
    SMITH   800.0   20
    SCOTT   3000.0  20
    JONES   2975.0  20
    FORD    3000.0  20
    ADAMS   1100.0  20
    WARD    1250.0  30
    TURNER  1500.0  30
    MARTIN  1250.0  30
    JAMES   950.0   30
    BLAKE   2850.0  30
    ALLEN   1600.0  30
    set hive.mapred.mode=strict;
    select * from emp order by empno desc;

    Error: FAILED: SemanticException 1:27 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'empno'

    Correct form:

    select * from emp order by empno desc limit 4;
    7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
    7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
    7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
    7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20

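    Why does this fail?

    With order by, all data is sent to a single node for the reduce step, so there is only one reduce task. On a large data set that single reducer may never produce a result. With limit n, each map task passes on at most n records, so the reducer receives only n * (number of map tasks) records, which a single reducer can handle.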
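
    The reasoning above can be sketched in plain Python. This is only an illustration of the idea, not Hive's actual implementation: the split of the empno values into three map tasks is an assumption made for the example, and map_side_top_n and reduce_top_n are hypothetical names.

```python
import heapq


def map_side_top_n(partition, n):
    """Each map task keeps only its n largest records (hypothetical helper)."""
    return heapq.nlargest(n, partition)


def reduce_top_n(map_outputs, n):
    """The single reducer merges at most n * (number of map tasks) records."""
    merged = [record for output in map_outputs for record in output]
    return heapq.nlargest(n, merged)


# empno values from emp, split across three assumed map tasks.
partitions = [
    [7369, 7499, 7521, 7566, 7654],
    [7698, 7782, 7788, 7839, 7844],
    [7876, 7900, 7902, 7934],
]
n = 4

map_outputs = [map_side_top_n(p, n) for p in partitions]
received = sum(len(output) for output in map_outputs)  # records sent to the reducer
top = reduce_top_n(map_outputs, n)

print(received)  # 12, i.e. n * number of map tasks
print(top)       # [7934, 7902, 7900, 7876]
```

    The four empno values the reducer returns match the rows of the order by empno desc limit 4 query shown earlier, while the reducer only ever sees 12 records instead of the whole table.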

    Nested select queries and aliases

    from(select ename, sal from emp) e
    select e.ename, e.sal
    where e.sal>1000;

    is equivalent to

    select ename, sal from emp where sal>1000;
    ALLEN   1600.0
    WARD    1250.0
    JONES   2975.0
    MARTIN  1250.0
    BLAKE   2850.0
    CLARK   2450.0
    SCOTT   3000.0
    KING    5000.0
    TURNER  1500.0
    ADAMS   1100.0
    FORD    3000.0
    MILLER  1300.0

    Aggregate functions: max(), min(), avg(), sum(), count(), etc.

    select count(*) from emp where deptno=10;
    3
    
    select count(ename) from emp where deptno=10;  -- count(col) counts only the rows where col is not NULL
    3
    
    select count(distinct deptno) from emp;
    3
    
    select sum(sal) from emp;
    29025.0
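
    The difference between count(*) and count(column) only shows up on a column that contains NULLs, such as comm. The following Python sketch mimics the aggregates over the emp rows listed earlier, with None standing in for NULL. It is a model of the semantics under that assumption, not a Hive query.

```python
# (ename, sal, comm, deptno) copied from the emp listing; None models NULL.
emp = [
    ("SMITH", 800.0, None, 20),
    ("ALLEN", 1600.0, 300.0, 30),
    ("WARD", 1250.0, 500.0, 30),
    ("JONES", 2975.0, None, 20),
    ("MARTIN", 1250.0, 1400.0, 30),
    ("BLAKE", 2850.0, None, 30),
    ("CLARK", 2450.0, None, 10),
    ("SCOTT", 3000.0, None, 20),
    ("KING", 5000.0, None, 10),
    ("TURNER", 1500.0, 0.0, 30),
    ("ADAMS", 1100.0, None, 20),
    ("JAMES", 950.0, None, 30),
    ("FORD", 3000.0, None, 20),
    ("MILLER", 1300.0, None, 10),
]

count_star = len(emp)                                   # count(*): every row
count_comm = sum(1 for row in emp if row[2] is not None)  # count(comm): non-NULL only
count_distinct_deptno = len({row[3] for row in emp})    # count(distinct deptno)
sum_sal = sum(row[1] for row in emp)                    # sum(sal)

print(count_star, count_comm, count_distinct_deptno, sum_sal)  # 14 4 3 29025.0
```

    count(comm) yields 4, the same four employees returned by the comm is not null query, and a commission of 0.0 still counts because it is a value, not NULL.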

    Using group by

    Any column in the select list that is not inside an aggregate function must appear in the group by clause.

    Average salary per department:

    select deptno, avg(sal) from emp group by deptno;
    10      2916.6666666666665
    20      2175.0
    30      1566.6666666666667

    Highest salary for each job within each department:

    select deptno,job,max(sal) from emp group by deptno,job;
    10      CLERK   1300.0
    10      MANAGER 2450.0
    10      PRESIDENT       5000.0
    20      ANALYST 3000.0
    20      CLERK   1100.0
    20      MANAGER 2975.0
    30      CLERK   950.0
    30      MANAGER 2850.0
    30      SALESMAN        1600.0

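    Using having

    having filters grouped results and is followed by a condition on an aggregate function; the Hive language manual lists it as supported from Hive 0.7.0 onward. where filters individual rows, whereas having filters groups.

    Departments whose average salary is greater than 2000: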

    select avg(sal),deptno from emp group by deptno having avg(sal)>2000;
    2916.6666666666665      10
    2175.0  20

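    If having is not available in the Hive version in use, how can the same result be obtained without it? Aggregate in a subquery, then filter its result with where: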

    select deptno, e.avg_sal from (select deptno, avg(sal) as avg_sal from emp group by deptno) e where e.avg_sal > 2000;
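
    The distinction between filtering rows and filtering groups can be illustrated with a small Python model of the two steps. avg_by_dept is a hypothetical helper, and the (deptno, sal) pairs are taken from the emp listing above.

```python
from collections import defaultdict

# (deptno, sal) pairs copied from the emp listing.
emp = [
    (20, 800.0), (30, 1600.0), (30, 1250.0), (20, 2975.0), (30, 1250.0),
    (30, 2850.0), (10, 2450.0), (20, 3000.0), (10, 5000.0), (30, 1500.0),
    (20, 1100.0), (30, 950.0), (20, 3000.0), (10, 1300.0),
]


def avg_by_dept(rows):
    """Model of: select deptno, avg(sal) ... group by deptno (hypothetical helper)."""
    totals = defaultdict(lambda: [0.0, 0])
    for deptno, sal in rows:
        totals[deptno][0] += sal
        totals[deptno][1] += 1
    return {deptno: total / count for deptno, (total, count) in totals.items()}


# having avg(sal) > 2000: the filter sees each group's aggregate.
having_result = {d: a for d, a in avg_by_dept(emp).items() if a > 2000}

# where sal > 2000: the filter sees individual rows before grouping.
where_result = avg_by_dept([(d, s) for d, s in emp if s > 2000])

print(sorted(having_result))  # [10, 20]
print(sorted(where_result))   # [10, 20, 30]
```

    Filtering the aggregated result, whether through having or through where on a subquery, keeps departments 10 and 20. Moving the condition into a where on the raw rows answers a different question: department 30 reappears because one of its employees earns more than 2000.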

    Using CASE...WHEN...THEN

    select ename, sal,
    case
    when sal > 1 and sal <=1000 then 'LOWER'
    when sal >1000 and sal <=2000 then 'MIDDLE'
    when sal >2000 and sal <=4000 then 'HIGH'
    ELSE 'HIGHEST' end
    from emp;
    
    SMITH   800.0   LOWER
    ALLEN   1600.0  MIDDLE
    WARD    1250.0  MIDDLE
    JONES   2975.0  HIGH
    MARTIN  1250.0  MIDDLE
    BLAKE   2850.0  HIGH
    CLARK   2450.0  HIGH
    SCOTT   3000.0  HIGH
    KING    5000.0  HIGHEST
    TURNER  1500.0  MIDDLE
    ADAMS   1100.0  MIDDLE
    JAMES   950.0   LOWER
    FORD    3000.0  HIGH
    MILLER  1300.0  MIDDLE
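
    One detail of the query above is easy to miss: the ELSE branch catches every row that no WHEN condition matches, not only salaries above 4000. A Python rendering of the same branches makes this visible. salary_band is a hypothetical name, and None stands in for a NULL sal, for which no SQL comparison evaluates to true.

```python
def salary_band(sal):
    """Python model of the CASE expression; None models a NULL salary."""
    # A comparison against NULL is never true in SQL, so each branch
    # is guarded with an explicit None check.
    if sal is not None and 1 < sal <= 1000:
        return "LOWER"
    if sal is not None and 1000 < sal <= 2000:
        return "MIDDLE"
    if sal is not None and 2000 < sal <= 4000:
        return "HIGH"
    return "HIGHEST"  # ELSE: everything the WHEN branches did not match


for sal in (800.0, 1600.0, 2975.0, 5000.0, 1.0, None):
    print(sal, salary_band(sal))
```

    With these branches a salary of 1 or less, or a NULL salary, would also be labelled HIGHEST. The emp data contains no such rows, so the output above is unaffected, but an explicit WHEN for those cases would be safer on other data.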
  • Original article: https://www.cnblogs.com/luogankun/p/3910442.html