hive:
================
A data warehouse used to analyze massive data sets; Hadoop underneath.
Hive generally does not support transactions or row-level updates, though transaction support can be enabled when necessary.
Typical use cases for data warehouses vs. relational databases:
    data warehouse: OLAP
    relational database: OLTP          //transactional

Array:  select array[0] from t;
Map:    select map['height'] from t;
Struct: select struct.DB from t;
Union:  select union['height'][0] from t;

Hive workflow:
=======================
The user enters an HQL (HiveQL) statement in the interface; the compiler translates it into an MR job, which the execution engine hands to Hadoop.
    hive1       //MR is the recommended execution engine
    hive2       //Spark is the recommended execution engine

Hive's data layout:
==========================
    metadata: MySQL
    actual data: HDFS
    table: a folder
    database: a folder
    data: HDFS files, plain text by default

Creating a table:
============================
create table t1(id int, name string)
row format delimited
fields terminated by ' '
lines terminated by '\n'
collection items terminated by ','
map keys terminated by ':'
stored as textfile;

The hive script:
================================
hive    --version  --service  --rcfilecat  --orcfiledump  --llapdump  --help  --debug
Help for a second-level command:
    hive --service serviceName --help

Hive services: hive --service
====================================
    beeline         //second-generation user interface
    cli             //first-generation user interface
    help
    hiveserver2     //second-generation hive service
    version

beeline and hiveserver2:
===================================
hiveserver2                     //hive's JDBC service
hive --service hiveserver2      //start hiveserver2
hiveserver2                     //also starts hiveserver2
s101:10002                      //hiveserver2 webUI
s101:10000                      //hiveserver2 RPC address
beeline: the new hive2 client

Problem: User: centos is not allowed to impersonate anonymous (state=08S01,code=0)
===========================================================================================
Solution: add the following to core-site.xml
<property>
    <name>hadoop.proxyuser.centos.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.centos.groups</name>
    <value>*</value>
</property>
After adding it, distribute the file to the other machines and restart Hadoop.

Problem 2: after the configuration, create statements still fail    //the anonymous user has no write permission
1. First option: edit hive-site.xml and change the user to centos
<property>
    <name>hive.server2.thrift.client.user</name>
    <value>centos</value>
    <description>Username to use against thrift client</description>
</property>
2. Second option: specify the username in the connection string     //-u is the url, -n is the name
    ~]$ beeline -u jdbc:hive2://localhost:10000 -n centos

Connecting to hive's JDBC service:
=======================================
1. Start hiveserver2
    hive --service hiveserver2      //start hiveserver2
    hiveserver2                     //start hiveserver2
2. Start a client (JDBC from IDEA, or beeline)
Code:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>2.1.1</version>
</dependency>

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public static void main(String[] args) throws Exception {
    String driver = "org.apache.hive.jdbc.HiveDriver";
    String url = "jdbc:hive2://192.168.23.101:10000/myhive";
    //register the driver
    Class.forName(driver);
    //get a connection through DriverManager
    Connection conn = DriverManager.getConnection(url);
    //create a statement from the connection
    Statement st = conn.createStatement();
    //run CRUD through st; queries use executeQuery, which returns a result set
    ResultSet rs = st.executeQuery("select * from tt");
    while (rs.next()) {
        int id = rs.getInt(1);
        String name = rs.getString(2);
        int age = rs.getInt(3);
        System.out.println(id + "/" + name + "/" + age);
    }
    rs.close();
    st.close();
    conn.close();
}

beeline:
1. Enter beeline on the linux command line
    ~]$ beeline
2. Connect from the beeline prompt
    beeline> !connect jdbc:hive2://localhost:10000

Managed (internal) tables and external tables:
==========================================
1. Tables are managed (internal) by default; dropping the table also deletes the data.
    create table xxx
2. External tables: dropping the table only deletes the metadata, not the data.    //only removes the table's info from mysql
    create external table xxx

Hive functions:
==================================
desc function [extended] substr;        //show usage [extended usage] for a function
select current_database();             //show the current database
select current_date();                 //show the current date
select current_timestamp();            //show the full timestamp
select explode(map or array) from xxx; //explode a map or array, turning one row into many
select substr(str,index,len)
    //select substr('hello',2,3)   ====> ell
    //select substr('hello',-4);   ====> ello
    //select substr('hello',-4,3); ====> ell
select split('hello world',' ')        //returns an array
1. ordinary functions
2. table-generating functions
3. aggregate functions
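The delimiters in the create table statement above and the Array/Map/Struct access syntax at the top of this section fit together; below is a minimal sketch showing both ends. The table and column names (t_complex, scores, info, addr) are invented for illustration, and the '\t' field delimiter is an assumption:

-- Hypothetical table; one data row might look like:
-- tom<TAB>90,85,70<TAB>height:180,weight:70<TAB>henan,kaifeng
create table t_complex(
    name   string,
    scores array<int>,
    info   map<string,string>,
    addr   struct<province:string, city:string>
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;

-- Element access: index into the array, key into the map, dot into the struct.
select scores[0], info['height'], addr.city from t_complex;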
Wordcount with hive's explode function:
=============================================
1. Column: line =====> array type ====> collection items terminated by ' '
create table t5(line ARRAY<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ' '
lines terminated by '\n'
stored as textfile;
Explode:       select explode(line) as word from t5;
Group:         select word, count(1) from (select explode(line) as word from t5) t6 group by word;
Group + sort:  select word, count(1) as c from (select explode(line) as word from t5) t6 group by word order by c desc;
2. Column: line =====> string type ====> split(line,' ')
create table t6(line string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Split:         select split(line,' ') from t6;
Explode:       select explode(split(line,' ')) as word from t6;
Group:         select word, count(1) from (select explode(split(line,' ')) as word from t6) t7 group by word;
Group + sort:  select word, count(1) as c from (select explode(split(line,' ')) as word from t6) t7 group by word order by c desc;
!!!!!!!!!!!!! A select cannot be placed inside a function call; write nested function calls directly.

hive's .hiverc file:
===================================
Hive's per-user startup file, located in the user's home directory (~); every time hive starts, it automatically finds and loads this file.
Content:
    set hive.cli.print.header=true;

Max-temperature statistics with hive:
===================================
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/      //NCDC weather bureau data
Substring out the year and temperature ===> cast the temperature to a number ====> group and aggregate (max)
1. Create the table
create table temp(line string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
2. Load the data
    load data local inpath 'Temp/*' into table temp;
3. Write the sql
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
cast(col as int)    //cast(line as int) converts the line column to an int
Year:  substr(line,16,4)
Temp:  cast(substr(line,88,5) as int) != 9999
Year and temperature:
    select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp;
Filter out invalid temperature values:
    select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp where cast(substr(line,88,5) as int) != 9999;
Max per year:
    select a.year, max(a.temp) from (select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp where cast(substr(line,88,5) as int) != 9999) as a group by a.year;

Table operations:
=========================================
like
    create table ep like employee;      //copy only the table structure
as
    create table ep2 as select * from employee;                    //copy the table with its data
    create table ep3 as select ep2.name, ep2.sex_age from ep2;     //copy the table, selecting specific columns
truncate
    truncate table ep3;                 //truncate (empty) the table
column
    add columns:     alter table ep3 add columns(id int, age int);
    drop a column:   alter table ep3 drop column(id);      //does not work; hive cannot drop a single column (see the sketch after this section)
    change a column: alter table ep3 change id no string;  //rename id to no and set its type to string
    replace columns: alter table ep3 replace columns(name string, id int, age int);    //replace all columns
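Since hive cannot drop a single column directly, replace columns works as a workaround: list every column except the one to remove. A minimal sketch against the ep3 table above, assuming it currently holds (name string, id int, age int); note replace columns only rewrites the metadata, the underlying data files are untouched:

-- Keep name and age; id disappears from the table definition.
alter table ep3 replace columns(name string, age int);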
Partitioned tables: each partition is a directory
=========================================
select ... where <partition column> = ... only scans the matching partition directories.
Raw data layout: 1,tom,20,henan,kaifeng
Create a partitioned table:
create table users(id int, name string, age int)
partitioned by (province string, city string)
row format delimited
fields terminated by ' '
lines terminated by '\n'
stored as textfile;
Statically add partitions to the table:
alter table users add partition(province='henan', city='luoyang');
alter table users add
    partition(province='beijing', city='beijing')
    partition(province='hebei', city='xiongan')
    partition(province='hebei', city='shijiazhuang');
Drop a partition:       //[IF EXISTS]: drop if present, no error if absent
ALTER TABLE users DROP [IF EXISTS] PARTITION (province='beijing', city='beijing');
List a table's partitions:
    show partitions users;
load data into a partitioned table:
    load data local inpath '' overwrite into table users partition(province='beijing', city='beijing');

insert:
===========================================
Dynamically inserting partitions with insert:
1. Turn off dynamic-partition strict mode:
    Background: to avoid performance problems from too many partitions, strict mode requires at least one static partition.
    Turn strict mode off:
    set hive.exec.dynamic.partition.mode=nonstrict;
    Unpartitioned table: id,name,age,province,city
    Partitioned table:   id,name,age    partitions: province,city
2. Insert dynamically; by default the partition columns are the last n columns of the select statement:
    insert into table users partition(province, city) select user0.id, user0.name, user0.age, user0.province, user0.city from user0;
Exporting data with insert:     //note: the export can only overwrite; the overwrite keyword cannot be dropped
insert overwrite local directory '/home/centos/users'
row format delimited fields terminated by ' '
select * from users;
Writing the data to multiple directories in one statement:
from users
insert overwrite local directory '/home/centos/users'
    row format delimited fields terminated by ' '
    select *
insert overwrite local directory '/home/centos/user2'
    row format delimited fields terminated by ' '
    select *;
Inserting one table's data into another with insert:
    insert into t1 select t4.name from t4;

order by: total ordering        //uses a single reducer
1. Manually set the number of reducers in hive:
    set mapreduce.job.reduces=2;
2. A group by query:
    select province, count(*) from users group by province;
3. An order by query:
    select province, count(*) as c from users group by province order by c;

sort by: partial ordering (within each reducer)
1. Manually set the number of reducers in hive:
    set mapreduce.job.reduces=2;
2. group by + sort by:
    select province, count(*) as c from users group by province sort by c;
    tianjin 1
    hebei   2
    beijing 9
    henan   2

distribute by: partitions the data by the given column
    select province, age from users distribute by province sort by age;
    select province, count(*) as c from users group by province distribute by province sort by c;
Difference between distribute by and group by:
1. group by is used together with aggregate functions, and the partitioning is not under your control
2. distribute by can be used without aggregate functions, and the partitioning is by the given column

cluster by: equivalent to distribute by + sort by
Partial ordering within each partition, but the partition column and sort column must be the same:
    select province, count(*) as c from users group by province cluster by province;

How a hive job maps to the MR flow:
==========================================
    select ...      ->  map
    from ...        ->  input
    where ...       ->  map
    having ...      ->  reduce
    group by ...    ->  partition
    order by ...    ->  a second job
    limit           ->  the second job's cleanup
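To check how a statement actually compiles into MR stages, explain prints the execution plan without running the query; handy for confirming the clause-to-phase mapping above. A minimal sketch against the users table from this section:

-- A group by followed by order by typically compiles into two MR jobs:
-- one for the aggregation, then a single-reducer job for the total ordering.
explain
select province, count(*) as c
from users
group by province
order by c;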
limit: paged queries
==========================
Caps the number of output rows:
    select * from users limit 10;       //first ten rows
    select * from users limit 10,10;    //skip the first ten rows, return the next ten

join:
==============================
1. Inner join
    select a.id, a.name, a.age, a.province, b.orderno, b.price from user0 a, orders b where a.id=b.uid;
    select a.id, a.name, a.age, a.province, b.orderno, b.price from user0 a inner join orders b on a.id=b.uid;
2. Left outer join
    select a.id, a.name, a.age, a.province, b.orderno, b.price from user0 a left outer join orders b on a.id=b.uid;
3. Right outer join
    select a.id, a.name, a.age, a.province, b.orderno, b.price from user0 a right outer join orders b on a.id=b.uid;
4. Full outer join      //not supported by mysql
    select a.id, a.name, a.age, a.province, b.orderno, b.price from user0 a full outer join orders b on a.id=b.uid;

user0 table:
    100  yyy       100  beijing  beijing
    7    tt        7    tianjin  tianjin
    1    tom       20   hebei    langfang
    2    tomas     30   henan    xinyang
    3    tomson    45   hebei    shijiazhuang
    4    tomason   6    beijing  beijing
    5    tomasLee  7    henan    kaifeng
orders table:
    1    order001  20    100
    2    order002  30    5
    3    order003  90    7
    4    order004  30.6  6
    5    order005  4.2   1
    6    order006  5     2

Manually specifying the large and small table; implementing a reduce-side join (see the MAPJOIN sketch below)
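On the large/small-table note above: besides a reduce-side join, hive can run a map-side join when one table is small enough to fit in memory, which skips the shuffle entirely. A minimal sketch using the MAPJOIN hint, assuming orders is the small table:

-- Load orders (aliased b) into memory on each mapper and join map-side.
select /*+ MAPJOIN(b) */ a.id, a.name, b.orderno, b.price
from user0 a
join orders b on a.id = b.uid;

-- Newer hive versions can also convert small-table joins automatically:
set hive.auto.convert.join=true;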