Hive系列博文,持续更新~~~
大数据系列之数据仓库Hive原理
大数据系列之数据仓库Hive安装
大数据系列之数据仓库Hive中分区Partition如何使用
大数据系列之数据仓库Hive命令使用及JDBC连接
本文介绍Hive的使用原理及命令行、Java JDBC对于Hive的使用。
在Hadoop项目中,HDFS解决了文件分布式存储的问题,MapReduce解决了数据处理分布式计算问题,之前介绍过Hadoop生态中MapReduce(以下统称MR)的使用,大数据系列之分布式计算批处理引擎MapReduce实践。HBase解决了一种数据的存储和检索。那么要对存在HDFS上的文件或HBase中的表进行查询时,是要手工写一堆MapReduce类的。一方面,很麻烦,另一方面只能由懂MapReduce的程序员类编写。对于业务人员或数据科学家,非常不方便,这些人习惯了通过sql与rdbms打交道,因此如果有sql方式查询文件和数据就很有必要,这就是hive要满足的要求。
比如说采用MR处理WordCount统计词频时,我们如果用hql语句进行处理如下:
select word,count(*) as totalNum from t_word group by word order by totalNum desc
关于Hive的典型应用场景:
1.日志分析
2.统计网站一个时间段的pv,uv;
3.多维度数据分析;
4.海量结构化数据离线分析;
5.低成本进行数据分析(无须编写MR).
介绍Hive中分区-Partition的意义
1.Hive的数据类型
1.1 基本数据类型:
1.2 hive的集合类型:
2.hive的命令练习:
连接hive:
beeline !connect jdbc:hive2://master:10000/dbmfz mfz 111111
2.1 基本数据类型命令使用
#用户表创建 create table if not EXISTS user_dimension( uid String, name string, gender string, birth date, province string ) row format delimited fields terminated by ','; describe user_dimension; show create table user_dimension; #品牌表创建 create table if not EXISTS brand_dimension( bid string, category string, brand string )row format delimited fields terminated by ','; #交易表创建 create table if not EXISTS record_dimension( rid string, uid string, bid string, price int, source_province string, target_province string, site string, express_number string, express_company string, trancation_date date )row format delimited fields terminated by ','; show tables; #创建数据 user.DATA brand.DATA record.DATA #载入数据 LOAD DATA LOCAL INPATH '/home/mfz/apache-hive-2.1.1-bin/hivedata/user.data' OVERWRITE INTO TABLE user_dimension; LOAD DATA LOCAL INPATH '/home/mfz/apache-hive-2.1.1-bin/hivedata/brand.data' OVERWRITE INTO TABLE brand_dimension; LOAD DATA LOCAL INPATH '/home/mfz/apache-hive-2.1.1-bin/hivedata/record.data' OVERWRITE INTO TABLE record_dimension; #验证 select * from user_dimension; select * from brand_dimension; select * from record_dimension; #载入HDFS上数据 load data inpath 'user.data_HDFS_PATH' OVERWRITE INTO TABLE user_dimension; #查询 select count(*) from record_dimension where trancation_date = '2017-09-01'; +-----+--+ | c0 | +-----+--+ | 6 | +-----+--+ #不同年龄消费的情况 select cast(datediff(CURRENT_DATE ,birth)/365 as int ) as age,sum(price) as totalPrice from record_dimension rd JOIN user_dimension ud on rd.uid = ud.uid group by cast(datediff(CURRENT_DATE ,birth)/365 as int) order by totalPrice DESC ; +------+-------------+--+ | age | totalprice | +------+-------------+--+ | 5 | 944 | | 25 | 877 | | 24 | 429 | | 28 | 120 | +------+-------------+--+ #不同品牌被消费的情况 select brand,sum(price) as totalPrice from record_dimension rd join brand_dimension bd on bd.bid = rd.bid group by bd.brand order by totalPrice desc; +------------+-------------+--+ | brand | totalprice | +------------+-------------+--+ | SAMSUNG | 944 | | OPPO | 625 | | WULIANGYE | 429 | | DELL | 252 | | NIKE | 120 | +------------+-------------+--+ #统计2017-09-01 当天各个品牌的交易笔数,按照倒序排序 select brand,count(*) as sumCount from record_dimension rd join brand_dimension bd on bd.bid=rd.bid where rd.trancation_date='2017-09-01' group by bd.brand order by sumCount desc +------------+-----------+--+ | brand | sumcount | +------------+-----------+--+ | SAMSUNG | 2 | | WULIANGYE | 1 | | OPPO | 1 | | NIKE | 1 | | DELL | 1 | +------------+-----------+--+ #不同性别消费的商品类别情况 select ud.gender as gender,bd.category shangping,sum(price) totalPrice,count(*) FROM record_dimension rd join user_dimension ud on rd.uid = ud.uid join brand_dimension bd on rd.bid = bd.bid group by ud.gender,bd.category; +---------+------------+-------------+-----+--+ | gender | shangping | totalprice | c3 | +---------+------------+-------------+-----+--+ | F | telephone | 944 | 2 | | M | computer | 252 | 1 | | M | food | 429 | 1 | | M | sport | 120 | 1 | | M | telephone | 625 | 1 | +---------+------------+-------------+-----+--+
2.3. 集合数据类型的命令操作
#data employees.txt create database practice2; show databases; use practice2; create table if not EXISTS employees( name string, salary string, subordinates array<String>, deductions map<String,Float>, address struct<street:string,city:string,state:string,zip:int> ) row format delimited fields terminated by '