• 【HIVE】(2)分区表、二级分区、动态分区、分桶、抽样


    1. 分区表:
    建表语句中添加:partitioned by (col1 string, col2 string)
    		create table emp_pt(id int, name string, job string, mgr int, hiredate string, salary double, bonus double, deptid int) partitioned by (year string) row format delimited fields terminated by "	";
    	 
    	 重点:partitioned by (year string)
    	 
    	导入数据:
    		 insert overwrite table emp_pt partition(year="1980") select * from emp where year(hiredate)=1980;    // hdfs会在emp_pt下,新建year=1980目录
    		 insert overwrite table emp_pt partition(year="1981") select * from emp where year(hiredate)=1981;
    		 insert overwrite table emp_pt partition(year="1982") select * from emp where year(hiredate)=1982;
    	 
    	 查看HDFS目录结构,每个分区一个目录,目录名称:year=1980
    		 hdfs dfs -ls /user/hive/warehouse/functiontest.db/emp_pt/
    		 hdfs dfs -ls /user/hive/warehouse/functiontest.db/emp_pt/year=1980
    	 
    	 查询:
    		 show partitions emp_pt;
    		 select * from emp_pt where year=1980;
    	 
    	 分区之前,select需要遍历所有数据进行查找;
    	 分区后,通过分区进行查找,只需要查找对应目录下的数据即可。
    	 
    	 删除分区:
    		alter table emp_pt drop partition( year=1980);
    

    2.二级分区:

    建表时,partitioned by 包含两个字段即可。
    		create table emp_pt2(id int, name string, job string, mgr int, hiredate string, salary double, bonus double, deptid int) partitioned by (year string, month string) row format delimited fields terminated by "	";
    	 
    	 导入数据:
    		 insert overwrite table emp_pt2 partition(year=1980, month=12) select * from emp where year(hiredate)=1980 and month(hiredate)=12;
    		 insert overwrite table emp_pt2 partition(year=1981, month=11) select * from emp where year(hiredate)=1981 and month(hiredate)=11;
    		 insert overwrite table emp_pt2 partition(year=1981, month=12) select * from emp where year(hiredate)=1981 and month(hiredate)=12;
    	 
    	 查询,根据分区进行查询。
    
    1. 动态分区
    静态分区需要手动指定分区名称,当分区非常多时,无法一一指定。此时可以使用动态分区表。
    	建表语句不变,导入数据的方式改变了。
    	
    	新建表:
    		create table emp_pt3 like emp_pt2;
    		
    	导入数据:
    		insert overwrite table emp_pt3 partition(year, month) select *, year(hiredate) as year, month(hiredate) as month from emp;
    		
    		提示:
    			Error: Error while compiling statement: FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict (state=42000,code=10096)
    		
    		再执行load,导入成功。
    	
    	查看HDFS目录结构:
    		hdfs dfs -ls -R /user/hive/warehouse/functiontest.db/emp_pt3
    	动态分区默认是以严格模式执行的,在这种模式下需要至少一个分区字段是静态的。这有助于阻止因设计错误导致导致查询差生大量的分区。列如:用户可能错误使用时间戳作为分区表字段。然后导致每秒都对应一个分区!
    	关闭严格分区模式
    		动态分区模式时是严格模式,也就是至少有一个静态分区。
    		set hive.exec.dynamic.partition.mode=nonstrict	 //分区模式,默认strict
    		set hive.exec.dynamic.partition=true			//开启动态分区,默认true
    		set hive.exec.max.dynamic.partitions=1000  // 默认1000
    
    
    1. 分桶
    建表:
    		create table emp_pt4(id int, name string, job string, mgr int, hiredate string, salary double, bonus double, deptid int) partitioned by (year string) clustered by (salary) sorted by (id) into 5 buckets row format delimited fields terminated by "	";
    	
    	重点:clustered by (salary) sorted by (id) into 5 buckets
    	
    	导入数据:
    		insert overwrite table emp_pt4 partition(year=1981) select * from emp where year(hiredate)=1981;
    	
    	查看HDFS目录结构,目录下产生多个文件,每个文件就是一个桶;
    		hdfs dfs -ls -R /user/hive/warehouse/functiontest.db/emp_pt4
    		 
    
    1. 抽样
    没有分桶的表也能抽样,但需全表扫描,效率不高。而如果通过分桶字段进行抽样,则避免了全表扫描。
    	
    	按桶抽样:
    		select * from emp tablesample(bucket 1 out of 5 on salary) s;			// 返回空,第一个桶本来就没数据
    		select * from emp tablesample(bucket 2 out of 5 on salary) s;			// 返回第二个桶的数据
    		hdfs dfs -cat /user/hive/warehouse/functiontest.db/emp_pt4/year=1981/000004_0		// 查看HDFS文件,跟上面select内容是一样的
    	
    	按百分比抽样,抽取一半数据:
    		select * from emp_pt4 tablesample(50 percent);
    		
    	按行抽样:
    		select * from emp_pt4 tablesample(2 rows);
    
  • 相关阅读:
    StratifiedKFold和KFold的区别(几种常见的交叉验证)
    剑指offer:用栈来建立队列
    剑指offer:斐波那契数列
    树状数组 gcd 查询 Different GCD Subarray Query
    Loadrunner的使用
    Loadrunner的使用
    MySQL Windows 环境安装
    RobotFrameWork 自动化环境搭建(基于 python3.6)
    MySQL Linux 环境安装
    【读书笔记】状态模式代码C#
  • 原文地址:https://www.cnblogs.com/BIG-BOSS-ZC/p/11807313.html
Copyright © 2020-2023  润新知