一、分区表及作用:
数据分区的概念以及存在很久了,通常使用分区来水平分散压力,将数据从物理上移到和使用最频繁的用户更近的地方,以及实现其目的。 hive中有分区表的概念,我们可以看到分区具重要性能优势,而且分区表还可以将数据以一种符合逻辑的方式进行组织,比如分层存储,分区表分别有静态分区和动态分区。
二、 静态分区
1,创建静态分区格式:
create table employees
(
name string,
salary float,
subordinated array<string>,
deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int>
)
partitioned by (country string,state string)
row format delimited
fields terminated by "\t"
collection items terminated by ","
map keys terminated by ":"
lines terminated by "\n"
stored as textfile;
select * from employees;
创建成果后发现他的存储路径和普通的内部表的路径是一样的而且多了分区表的字段,因为我们创建的分区表并没内容,事实上,除非需要优化查询性能,否则实现表的用户不需要关系"字段是否是分区字段"
2,添加分区表:
alter table employees add partition (country="china",state="Asia");
查看分区表信息: show partitions employees;
hdfs上的路径:/user/hive/warehouse/zxz.db/employees/country=china/state=Asia他们都是以目录及子目录形式存储的
3,插入数据
格式:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row …];
格式2:(推荐使用)
load data local inpath '/home/had/data1.txt' into table employees partition (country =china,state=Asia)
4,利用分区表查询:(一般分区表都是利用where语句查询的)
select * from employees where salary>10000 and country='china';
5,AS语句和like建表
创建表,携带数据
create table employees1 as select * from employees1
创建表,携带表结构
create table employees2 like employees
6,外部分区表:
外部表同样可以使用分区,事实上,用户会发现,只是管理大型生产数据集最常见的情况,这种结合给用户提供一个和其他工具共享数据的方式,同时也可以优化查询性能
create external table employees_ex
(
name string,
salary float,
subordinated array<string>,
deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int>
)
partitioned by (country string,state string)
row format delimited
fields terminated by "\t"
collection items terminated by ","
map keys terminated by ":"
lines terminated by "\n"
stored as textfile;
location "/user/had/data/" //他其实和普通的静态分区表一样就是多了一个external关键字
这样我们就可以把数据路径改变而不影响数据的丢失,这是内部分区表远远不能做的事情:
1,(因为我们创建的是外部表)所有我们可以把表数据放到hdfs上的随便一个地方这里自动数据加载到/user/had/data/下(当然我们之前在外部表上指定了路径)
load data local inpath '/home/had/data.txt' into table employees_ex partition (country="china",state="Asia");
2,如果我们加载的数据要分离一些旧数据的时候就可以hadoop的distcp命令来copy数据到某个路径
hadoop distcp /user/had/data/country=china/state=Asia
/user/had/data_old/country=china/state=Asia
3,修改表,把移走的数据的路径在hive里修改
alter table employees partition(country="china",state="Asia") set location '/user/had/data_old/country=china/state=Asia'
4,使用hdfs的rm命令删除之前路径的数据
hdfs dfs -rmr /user/had/data/country=china/state=Asia
这样我们就完成一次数据迁移
如果觉得突然忘记了数据的位置使用使用下面的方式查看
describe extend employees_ex partition (country="china",state="Asia");
7,删除分区表
alter table employees drop partition(country="china",state="Asia");
8,众多修改语句
1,把一个分区打包成一个har包
alter table employees archive partition (country="china",state="Asia")
2, 把一个分区har包还原成原来的分区
alter table employees unarchive partition (country="china",state="Asia")
3, 保护分区防止被删除
alter table employees partition (country="china",state="Asia") enable no_drop
4,保护分区防止被查询
alter table employees partition (country="china",state="Asia") enable offline
5,允许分区删除和查询
alter table employees partition (country="china",state="Asia") disable no_drop
alter table employees partition (country="china",state="Asia") disable offline
9,通过查询语句向表中插入数据
insert overwrite/into table copy_employees partition (country="china",state="Asia") select * from employees es where es.country="china" and es.state ="Asia"
三、动态分区:
为什么要使用动态分区呢,我们举个例子,假如中国有50个省,每个省有50个市,每个市都有100个区,那我们都要使用静态分区要使用多久才能搞完。所有我们要使用动态分区。
动态分区默认是没有开启。开启后默认是以严格模式执行的,在这种模式下需要至少一个分区字段是静态的。这有助于阻止因设计错误导致导致查询差生大量的分区。列如:用户可能错误使用时间戳作为分区表字段。然后导致每秒都对应一个分区!这样我们也可以采用相应的措施:
关闭严格分区模式
动态分区模式时是严格模式,也就是至少有一个静态分区。
set hive.exec.dynamic.partition.mode=nonstrict //分区模式,默认nostrict
set hive.exec.dynamic.partition=true //开启动态分区,默认true
set hive.exec.max.dynamic.partitions=1000 //最大动态分区数,默认1000
1,创建一个普通动态分区表:
create table if not exists zxz_5(
name string,
nid int,
phone string,
ntime date
)
partitioned by (year int,month int)
row format delimited
fields terminated by "|"
lines terminated by "\n"
stored as textfile;
现在还看不出来有什么不一样
insert overwrite table zxz_5 partition (year,month) select name,nid,phone,ntime,year(ntime) as year ,month(ntime) as month from zxz_dy;
zxz_5 这个表里面存放着数据。
我们利用year,和month函数来获取ntime列的年和月来作为分区,这个是靠我们查询到数据来分区是不是很舒服
hive修改 表/分区语句
这里主要列一些常用操作。
添加分区
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1'; //示例
ALTER TABLE table_name ADD IF NOT EXISTS PARTITION (dt='20130101') LOCATION '/user/hadoop/warehouse/table_name/dt=20130101'; //一次添加一个分区
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808' PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809'; //一次添加多个分区
删除分区
ALTER TABLE login DROP IF EXISTS PARTITION (dt='2008-08-08');
ALTER TABLE page_view DROP IF EXISTS PARTITION (dt='2008-08-08', country='us');
修改分区
ALTER TABLE table_name PARTITION (dt='2008-08-08') SET LOCATION "new location";
ALTER TABLE table_name PARTITION (dt='2008-08-08') RENAME TO PARTITION (dt='20080808');
添加列
ALTER TABLE table_name ADD COLUMNS (col_name STRING); //在所有存在的列后面,但是在分区列之前添加一列
修改列
CREATE TABLE test_change (a int, b int, c int);
// will change column a's name to a1
ALTER TABLE test_change CHANGE a a1 INT;
// will change column a's name to a1, a's data type to string, and put it after column b. The new table's structure is: b int, a1 string, c int
ALTER TABLE test_change CHANGE a a1 STRING AFTER b;
// will change column b's name to b1, and put it as the first column. The new table's structure is: b1 int, a string, c int
ALTER TABLE test_change CHANGE b b1 INT FIRST;
修改表属性:
alter table table_name set TBLPROPERTIES ('EXTERNAL'='TRUE'); //内部表转外部表
alter table table_name set TBLPROPERTIES ('EXTERNAL'='FALSE'); //外部表转内部表
表的重命名
ALTER TABLE table_name RENAME TO new_table_name
例子:
use test_txd;
drop table test_partition;
create table if not exists test_partition(
name string,
nid int,
phone string,
ntime date
)
partitioned by (month int,day int)
row format delimited
fields terminated by "|"
lines terminated by "\n"
stored as textfile;
ALTER TABLE test_partition ADD PARTITION (month=202202,day=22) location 'loc1';
ALTER TABLE test_partition ADD PARTITION (month=202202,day=23) location 'loc1';
ALTER TABLE test_partition ADD PARTITION (month=202202,day=24) location 'loc1';
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',1,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',2,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',3,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',4,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',5,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',6,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',7,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',8,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',9,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=22) VALUES ('xm',10,'1517415415','2022-02-22 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',11,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',12,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',13,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',14,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',15,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',16,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',17,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',18,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',19,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=23) VALUES ('xm',20,'1517415415','2022-02-23 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',21,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',22,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',23,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',24,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',25,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',26,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',27,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',28,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',29,'1517415415','2022-02-24 00:00:01');
INSERT INTO TABLE test_partition PARTITION (month=202202,day=24) VALUES ('xm',30,'1517415415','2022-02-24 00:00:01');
项目中实例:
1. Hive:
非分区:
CREATE TABLE `hive_test`(
`id` bigint COMMENT 'ID,主键',
`name` string COMMENT '姓名',
`address` string COMMENT '地址',
`mobile` int COMMENT '手机号',
`other_info` string COMMENT '其他信息',
`tenant_id` string COMMENT '租户ID',
`created_by` string COMMENT '创建者',
`created_time` string COMMENT '创建时间',
`updated_time` string COMMENT '更新时间',
`updated_by` string COMMENT '更新者',
`version` bigint COMMENT '版本号,乐观锁',
`is_deleted` bigint COMMENT '是否删除')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/modify_test.db/hive_test'
TBLPROPERTIES (
'last_modified_by'='daas',
'last_modified_time'='1629881529',
'transient_lastDdlTime'='1629881529');
写数据:
insert into hive_test values(1,'new','cq',15824678542,'other',54614541,'txd','20211216','20211216','txd','v1','0');
insert into hive_test values(2,'new','cq',15824678542,'other',54614541,'txd','20211216','20211216','txd','v1','0');
insert into hive_test values(3,'new','cq',15824678542,'other',54614541,'txd','20211216','20211216','txd','v1','0');
insert into hive_test values(4,'new','cq',15824678542,'other',54614541,'txd','20211216','20211216','txd','v1','0');
insert into hive_test values(5,'new','cq',15824678542,'other',54614541,'txd','20211216','20211216','txd','v1','0');
分区:
USE default;
CREATE TABLE `hive_test_partition2`(
`req_date` STRING COMMENT '请求日期',
`req_time` STRING COMMENT '请求时间',
`user_id` BIGINT COMMENT '用户id',
`url` STRING COMMENT 'url',
`module` STRING COMMENT '模块名',
`operation` STRING COMMENT '操作名',
`task_id` BIGINT COMMENT '任务id'
)
COMMENT '清洗后的日志'
PARTITIONED BY(day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE ;
修改表名:
ALTER TABLE table_name RENAME TO new_table_name;
修改字段:
ALTER TABLE hive_test_partition2 change `request_date` `req_date` int;
新增字段
ALTER table hive_test add columns(comment string);
修改分区
ALTER TABLE hive_test_partition2 PARTITION (day='2021-08-25') RENAME TO PARTITION (`day`='20210826');
分区表写数据:
use modify_test;
insert into hive_test_partition2 PARTITION(day='20210825') VALUES ('20210825','17:40',1,'dass.com','modify','check',1);
修改分区:
ALTER TABLE hive_test_partition2 PARTITION (day='20210825') RENAME TO PARTITION (day='2021-08-25');