• Hive LanguageManual DDL


    Hive syntax rules: LanguageManual DDL

    SQL DML and DDL: Data Manipulation Language (DML) and Data Definition Language (DDL)

    I. Databases: create/drop/alter are all covered clearly in the official docs; no need to reinvent the wheel here

    II. Tables

    1. Creating a table: key points explained below

    Create Table

    eg1: basic form
    create table if not exists default.cenzhongman
    (
    ip string COMMENT 'this is ip',
    name string
    )
    COMMENT 'this is log of cenzhongman.com'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    --------------------------------------------
    eg2: often used to split off a new table (create-table-as-select)
    create table if not exists default.cenzhongman_2
    AS select ip,date from default.cenzhongman;
    --------------------------------------------
    eg3: often used to copy a table's schema
    create table if not exists default.cenzhongman_3
    like default.cenzhongman;
    
    CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
    
      # column definitions
      [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
    
      # table comment
      [COMMENT table_comment]
    
      # partitioned table: partitions on the given columns, i.e. each partition value is stored in its own directory
      [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
    
      [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
    
      [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
         ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
         [STORED AS DIRECTORIES]
    
      # data formatting
      [
       # row format / delimiters
       [ROW FORMAT row_format]
       # file format of the stored data
       [STORED AS file_format]
         | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
      ]
    
      # location of the table's data in the HDFS file system
      [LOCATION hdfs_path]
    
      [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
      
      # create from the result of a query on another table
      [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
    
    # create from another table or view, with identical columns
    CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
      LIKE existing_table_or_view_name
      [LOCATION hdfs_path];
     
    data_type
      : primitive_type
      | array_type
      | map_type
      | struct_type
      | union_type  -- (Note: Available in Hive 0.7.0 and later)
     
    primitive_type
      : TINYINT
      | SMALLINT
      | INT
      | BIGINT
      | BOOLEAN
      | FLOAT
      | DOUBLE
      | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
      | STRING
      | BINARY      -- (Note: Available in Hive 0.8.0 and later)
      | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
      | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
      | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
      | DATE        -- (Note: Available in Hive 0.12.0 and later)
      | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
      | CHAR        -- (Note: Available in Hive 0.13.0 and later)
     
    array_type
      : ARRAY < data_type >
     
    map_type
      : MAP < primitive_type, data_type >
     
    struct_type
      : STRUCT < col_name : data_type [COMMENT col_comment], ...>
     
    union_type
       : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
     
    row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char] 		# field and collection-item delimiters
            [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
            [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
      | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
     
    file_format:
      : SEQUENCEFILE
      | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
      | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
      | ORC         -- (Note: Available in Hive 0.11.0 and later)
      | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
      | AVRO        -- (Note: Available in Hive 0.14.0 and later)
      | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
     
    constraint_specification:
      : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE]
    
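    A sketch combining several of the optional clauses above in one statement (the table, column, and property names here are hypothetical, not from the original post):

```sql
-- Hypothetical example: a partitioned, bucketed, ORC-backed table
CREATE TABLE IF NOT EXISTS default.page_views (
  user_id BIGINT,
  url STRING COMMENT 'requested URL',
  referrer STRING
)
COMMENT 'web access log'
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```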

    2. Removing all data from a table

    TRUNCATE TABLE table_name [PARTITION partition_spec];
         
    partition_spec:
      : (partition_column = partition_col_value, partition_column = partition_col_value, ...)
    
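    With a partition_spec, TRUNCATE clears only the named partition's data while leaving the partition itself defined; a minimal sketch (the table and partition column are hypothetical):

```sql
-- Hypothetical: clear one day's data, leave the other partitions untouched
TRUNCATE TABLE default.logs PARTITION (dt = '20170714');
```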

    III. Hive table types

    Managed table (MANAGED_TABLE)

    Dropping the table also deletes the table's data
    

    External table (EXTERNAL_TABLE)

    Usually created with LOCATION pointing at a data directory so the data can be shared
    Dropping the table keeps the data (the files in HDFS) and deletes only the metadata (in the metastore)
    Files placed directly into the table's directory are loaded automatically
    
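    The points above can be sketched as follows (the directory and table names are hypothetical):

```sql
-- Hypothetical external table over an existing, shared HDFS directory
CREATE EXTERNAL TABLE IF NOT EXISTS default.raw_logs (
  ip STRING,
  request STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/shared/raw_logs';

-- DROP TABLE default.raw_logs removes only the metastore entry;
-- the files under /user/shared/raw_logs remain in HDFS.
```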

    Partitioned table (not parallel to the two types above; either kind of table can be partitioned)

    #create a partitioned table
    create table emp_partition(id int, name string, job string, mgr int, hiredate string, sal double, comm double, deptno int) partitioned by (month string, day string);
    
    #load data
    load data local inpath '/opt/datas/xxx.txt' into table default.emp_partition partition (month = '201707', day = '14');
    
    #query data
    select * from emp_partition where month = '201707' and day = '14';
    
    #Internally, when data is loaded into a partitioned table, a row describing the partition
    #is added to the partition table in the metastore database
    #at query time, Hive reads the partition information from that metastore table
    #if you put files into HDFS yourself, no partition row is added to the metastore, so the
    #query returns nothing; in that case use msck to repair the table (see the official DDL docs)
    msck repair table table_name;        #automatic repair
    alter table emp_partition add partition(month = '201707', day = '14');     #manual repair (more common)
    
    #show partitions
    show partitions tablename;
    

    IV. Query syntax

    LanguageManualSelect

    eg: select everything
    select * from tablename ;
    
    eg2: t is a table alias (shorter to write; also shown when the result is stored or viewed)
    select t.id,t.name,t.xxx from tablename t;
    
    eg3: simple conditional query
    select * from tablename t where id = '1234';
    =  !=  <  <=  >  >=
    is null  /  is not null  /  in  /  not in
    
    eg4: range query
    select * from tablename t where t.money between 800 and 1500;
    
    eg5: queries using functions
    select count(*) from tablename;
    select max(money) from tablename;
    select min(money) from tablename;
    select sum(money) from tablename;
    select avg(money) from tablename;
    ....
    
    eg6: grouped query (**any selected column not inside an aggregate function must appear in group by**)
    select t.deptId, avg(money) avg_money from tablename t group by t.deptId; 	#group by deptId: average salary per department (avg_money is an optional alias)
    select t.job, t.deptId, avg(money) avg_money from tablename t group by t.deptId, t.job; 	#average salary per job within each department
    
    eg7: having
    	where filters individual rows
    	having filters grouped results > group first, then test each group
    select deptid, avg(sal) avg_sal from tablename group by deptid having avg_sal > 8000; 	#departments whose average salary exceeds 8000
    
    
    
    SELECT [ALL (default) | DISTINCT (drop duplicates)] select_expr, select_expr, ...
      FROM table_reference
      [WHERE where_condition]
      [GROUP BY col_list]	#grouping
      [ORDER BY col_list]	#output order
      [CLUSTER BY col_list
        | [DISTRIBUTE BY col_list] [SORT BY col_list]
      ]
     [LIMIT [offset,] rows]	#limit the number of rows returned
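    Several of these clauses combined in one statement (using the emp table that appears in the examples elsewhere in this post; the threshold is illustrative):

```sql
-- departments' average salary, highest first, top 10 only
SELECT deptid, AVG(sal) AS avg_sal
FROM emp
WHERE sal IS NOT NULL
GROUP BY deptid
HAVING AVG(sal) > 1000
ORDER BY avg_sal DESC
LIMIT 10;
```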
    

    join queries: connect two tables m and n, combining their matching rows into one record

    Equi-join

    select e.id, e.name, d.deptid, d.name from emp e join dept d on e.deptid = d.deptid; 	#rows where the two tables' deptid values match, combined into one result
    

    Left join: left join keeps every row of the table on the left of join (employees without a department still appear)

    select e.id, e.name, d.deptid, d.name from emp e left join dept d on e.deptid = d.deptid;
    

    Right join: right join keeps every row of the table on the right of join (departments without employees still appear)

    select e.id, e.name, d.deptid, d.name from emp e right join dept d on e.deptid = d.deptid;
    

    Full join: full join = left + right

        select e.id, e.name, d.deptid, d.name from emp e full join dept d on e.deptid = d.deptid;
    

    Order, Sort, Cluster, and Distribute By

    #order by (ASC | DESC): global ordering, ascending | descending; uses only a single reducer
    select * from tablename order by id desc;
    
    #sort by: sorts the data within each reducer
    set mapreduce.job.reduces = 3;
    select * from tablename sort by id desc;    #printing to the console makes the effect hard to see
    insert overwrite local directory '/opt/datas/sortby-res' select * from tablename sort by id desc;    #write the result to the local file system, split across three result files
    
    #cluster by: when the distribute by and sort by columns are the same, this is equivalent to cluster by; it spreads rows across the reducers by that column (using a hash rule) and sorts within each one
    insert overwrite local directory '/opt/datas/sortby-res' select * from tablename cluster by id;
    
    #distribute by: chooses the partitioning column, i.e. how rows are spread across reducers
    insert overwrite local directory '/opt/datas/sortby-res' select * from tablename distribute by job sort by id desc;    #partition by job, sort by id within each partition, save the result locally
    #!!Note: if the number of reducers > the number of distinct values, some output files are empty; if it is smaller, several values share one file
    

    !!Summary (important):

    order by

    Global sort; uses a single reducer
    

    sort by

    Sorts within each reducer; no global order
    

    distribute by

    Like the partition step in MapReduce: picks the partitioning column; used together with sort by
    

    cluster by

    Used when the distribute by and sort by columns are the same: partitions by that column and sorts within each partition
    
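    When one column is both the distribution key and the sort key, the two forms are interchangeable:

```sql
-- these two statements produce the same reducer layout and per-reducer order
SELECT * FROM tablename DISTRIBUTE BY id SORT BY id;
SELECT * FROM tablename CLUSTER BY id;
-- note: CLUSTER BY sorts ascending only; for descending order use
-- DISTRIBUTE BY id SORT BY id DESC
```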

    Note: Hive virtual columns

    Virtual column attributes can be used to assist work with Hive

        select id, name, INPUT__FILE__NAME from tablename;
    

    This shows the HDFS file each row was read from
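    Hive also provides the virtual column BLOCK__OFFSET__INSIDE__FILE; the two can be selected together:

```sql
-- source file and byte offset of each row
select id, name, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE from tablename;
```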

  • Original: https://www.cnblogs.com/cenzhongman/p/7163257.html