《OD大数据实战》Hive入门实例

官方参考文档：https://cwiki.apache.org/confluence/display/Hive/LanguageManual

一、命令行和客户端

1. 命令窗口

1）进入命令窗口

hive

2）在hive cli命令窗口查看hdfs文件系统：

dfs -ls /

3）在hive cli命令窗口如何查看本地文件系统

 !ls /

2. hive交互方式

1） hive -e "select * from test.student;"

$HIVE_HOME/bin/hive -e 'select a.col from tab1 a'

2）Example of running a script non-interactively from local disk

hive -f <filename>

hive -f hivef.sql > result.txt

$HIVE_HOME/bin/hive -f /home/my/hive-script.sql

3）Example of running an initialization script before entering interactive mode

$HIVE_HOME/bin/hive -i /home/my/hive-init.sql

hive -i <filename>

二、文件格式

三、数据类型

四、DDL

1. database

1）通过命令行创建

create database hive_0805;

2）在脚本中创建

新建一个create_database.sh文件，运行命令：

sh create_database.sh

create_database.sh文件内容如下：

#!/bin/bash

HIVE_HOME=/opt/modules/cdh/hive-0.13.1-cdh5.3.6

DATABASE_NAME=hive_0805

$HIVE_HOME/bin/hive -e "drop database if exists $DATABASE_NAME;create database if not exists $DATABASE_NAME;" > /dev/null 2>&1

echo "create database $DATABASE_NAME success"

3）通过外部sql文件创建

新建一个create_database.sql文件，运行命令

hive -f create_database.sql

create_database.sql文件内容如下：

create database db_hive_01;
create database if not exists db_hive_02;
create database if not exists db_hive_03 location '/user/hive/warehouse/db_hive_03.db';

create table db_hive_03.user(ip int, name string);

show databases;
show databases like 'db_hive*';

desc database extended db_hive_03;

use db_hive_03;

drop database db_hive_03;

drop database db_hive_03 cascade;

drop database if exists db_hive_03;

2. table

1）通过命令行创建

create table docs(line string);

2）通过shell脚本创建

新建一个create_table.sh文件，运行命令sh create_table.sh

#!/bin/bash

HIVE_HOME=/opt/modules/cdh/hive-0.13.1-cdh5.3.6

$HIVE_HOME/bin/hive --hiveconf table_name=cust_info --hiveconf data_input_path=/opt/datas/hive/hiveshell/cust.info -f '/opt/datas/hive/hiveshell/create_table.sql';

create_table.sh文件内容如下：

use default;

create table if not exists "${hiveconf:table_name}"(
custNo bigint,
custName string,
gender char(1),
custBirth string,
salary double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '	'
STORED AS TEXTFILE;

load data local inpath "${hiveconf:data_input_path}" overwrite into "${hiveconf:table_name}";

3）通过外部SQL文件创建

hive -f dept.sql

hive -f emp.sql

dept.sql文件内容如下：

create table IF NOT EXISTS default.dept
(
deptno int,
dname string,
loc string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '	';

emp.sql文件内容如下：

create table IF NOT EXISTS default.emp
(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '	';

4. Hive表结构操作命令

除了创建hive表的命令外，常用的hive表相关DDL命令有：describe(查看表结构信息)、drop(删除表)、truncate(重构表，也就是先删除再重新建立)、alter(修改表结构)等。常用命令的格式如下所示：

describe (extended|formatted) table_name;
drop table [if exists] table_name;
truncate table table_name;
alter table table_name rename to new_table_name;
alter table table_name add colums (new-cls type,....);
alter table table_name replace colums (new-cls type,....);

参考链接：https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli

五、DML

六、 Hive表介绍

Hive中的表可以分为内部表(托管表)和外部表，区别在于：

外部表的数据不是有hive进行管理的，也就是说当删除外部表的时候，外部表的数据不会从hdfs中删除。

而内部表是由hive进行管理的，在删除表的时候，数据也会删除。

一般情况下，我们在创建外部表的时候会将表数据的存储路径定义在hive的数据仓库路径之外。

Hive创建表主要有三种方式，

（1）第一种直接使用create table命令

（2）第二种使用create table .... AS select ....(会产生数据)

（3）第三种使用create table tablename like exist_tablename.命令

1. 案例一

分别使用默认命令创建内部表test_manager、外部表test_external以及指定location的内部表test_location，然后分别删除这两个表，查看变化情况：命令如下：

create table test_manager(id int);
create external table test_external(id int);
create table test_location(id int) location '/test_location';
drop table test_manager;
drop table test_external;
drop table test_location;

删除表的时候，内部表不管是否指定location，均会删除文件夹，外部表一定不会删除。

[ROW FORMAT row_format], row_format格式：
delimited fields terminated by '01' collection terminated by '02' map keys terminated by '03' lines terminated by '04' NULL DEFINED AS 'N'
[STORED AS file_format], file_format格式：
sequencefile
textfile(default)
rcfile
orc
parquet
avro

2. 案例二

先将customers.txt文件上传到hdfs的/customers文件夹中。

hdfs dfs -mkdir -p /user/hive/warehouse/customers

hdfs dfs -put customers.txt /user/hive/warehouse/customers

分别使用三种命令格式创建客户表customers，customers2，customers3，然后分别查看三张表(全部为内部表)的区别，命令如下：

CREATE TABLE customers(id int,name string,phone string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs://beifeng-hadoop-02:9000/user/hive/warehouse/customers';
CREATE TABLE customers2 like customers;
CREATE TABLE customers3 AS SELECT * FROM customers;

3. 案例三

先将complex_table_test.txt.txt文件上传到hdfs的/complex_table_test文件夹中。

hdfs dfs -mkdir -p /user/hive/warehouse/complex_table_test
hdfs dfs -put complex_table_test.txt /user/hive/warehouse/complex_table_test

创建一个有复杂数据类型的hive表，并明确指定表数据的分隔符号，命令如下：

create table complex_table_test(id int, name string, flag boolean, score array<int>, tech map<string,string>, other struct<phone:string,email:string>) row format delimited fields terminated by ';' collection items terminated by ',' map keys terminated by ':' LOCATION 'hdfs://beifeng-hadoop-02:9000/user/hive/warehouse/complex_table_test';

4. 案例四

创建一个使用hbase的外部表，也就是说hive表数据为hbase的数据，创建命令格式如下：
CREATE EXTERNAL TABLE `hive_table_name`(key string, col1 type comment, col2 type comment,.....) -- 指定hive表的列名和表名
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping'=':key,family1:col1,f1:col2,...fn:coln') -- 指定格式化信息
TBLPROPERTIES ('hbase.table.name'='hbase_tablename'); -- 指定hbase表名
注意：必须指定hbase的rowkey。

create external table hive_users(key string,id int, name string, phone string) row format serde 'org.apache.hadoop.hive.hbase.HBaseSerDe' stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties('hbase.columns.mapping'=':key,f:id,f:name,f:phone') tblproperties('hbase.table.name'='users');

七、Hive导入数据

1. 导入数据

1）前置准备

create database hive_0901;
create table students(studentId int comment 'this is student id, is not null', classId int comment 'this is class id, can set to null', studentName string comment 'this is student name') row format delimited fields terminated by ',';
create table classes(classId int comment 'this is class id, is not null', className string comment 'this is class name') row format delimited fields terminated by ',';

2）分别导入local和hdfs的数据
a. 分别从linux机器上导入数据

load data local inpath '/home/hadoop/datas/13/classes.txt' into table classes;
load data local inpath '/home/hadoop/datas/13/students.txt' into table students;
load data local inpath '/home/hadoop/datas/13/classes.txt' overwrite into table classes;

b. 从hdfs上导入数据

load data inpath '/beifeng/13/students.txt' into table students;
dfs -put /home/hadoop/datas/13/students.txt /beifeng/13/
load data inpath '/beifeng/13/students.txt' overwrite into table students;

3）导入其他表的数据(多表插入)
将学生表的学生id和classid分别导出到不同表中，

create table test1(id int);
create table test2(id int);
from students insert into table test1 select studentid insert overwrite table test2 select distinct classid where classid is not null;

八、select语法介绍

1.from语法

1）正常from:

select * from students;

2）from语句提前:

from students select *;

2. cte语法

1）获取班级号为1的学生信息:

with tmp as (select studentid as sid,classid as cid,studentname as name from students where classid=1) from tmp select *;

2）获取总学生数、已经分配班级的学生数、未分配班级的学生数(作业1)。
分析；
总学生数：studentid的总数
分配班级的学生数：classid不为空的学生总数
未分配的学生数: classid为空的学生数
结果: 12 7 5

with tmp as (select '1' as id,'total',count(distinct studentid) as count from students), tmp2 as (select '1' as id,'depart',count(distinct studentid) as count from students where classid is not null), tmp3 as (select '1' as id,'non-depart', (tmp.count - tmp2.count) as count from tmp, tmp2) select tmp.count, tmp2.count ,tmp3.count from tmp join tmp2 on tmp.id=tmp2.id join tmp3 on tmp.id=tmp3.id;

3. where & group by语法实例

group语句只能返回对于的group列&进行聚合的value。
1）获取学生数大于3的班级id

from students select classid where classid is not null group by classid having count(studentid) > 3;

排序语法：

sort by 保证单reduce有序，order by保证全局有序。
1）使用order by根据学生id倒序。

select * from students order by studentid desc;

2）设置hive.mapred.mode为strict，然后在进行order by操作。

set hive.mapred.mode=strict;
select * from students order by studentid desc; -- 会出现异常
select * from students order by studentid desc limit 5;

3）使用sort by根据学生id排序。

select * from students sort by studentid desc;

4）设置mapreduce.job.reduces个数为两个，然后再使用sort by进行排序。

set mapreduce.job.reduces=2;
select * from students sort by studentid desc;

九、join语法

1. 内连接语法

1）获取学生和班级之间完全匹配的数据。

select students.*,classes.* from classes join students on classes.classid=students.classid;
select students.*,classes.* from classes cross join students on classes.classid=students.classid;

2. 外链接语法

1）获取全部学生的班级信息，如果该学生没有分配班级，那么班级信息显示为null。

select students.*, classes.* from students left join classes on students.classid = classes.classid;

2）获取全部班级的学生信息，如果某个班级没有学生，那么学生信息显示为null。(作业2)

select * from students right join classes on students.classid=classes.classid;

3）获取全部信息，如果没有匹配数据的显示null。(作业3)

select * from students full join classes on students.classid=classes.classid;

3. 半连接

1）获取学生表中班级id在班级表中的所有学生信息。

sql: select students.* from students where classid in (select distinct classid from classes);
原hql: select students.* from students join classes on students.classid = classes.classid;
新hql: select students.* from students left semi join classes on students.classid=classes.classid;

4. mapjoin

select /*+ mapjoin(classes) */ * from students join classes on students.classid=classes.classid;

十、子查询

1. 获取学生数最多的班级，学生的个数。
第一步：获取每个班级的学生总数
第二步：选择学生数最多的班级学生数

from (select count(studentid) as sc from students where classid is not null group by classid) as tmp select max(sc);

2. 获取学生数最多的班级信息。(作业4)
第一步：获取每个班级的学生总数
第二步：选择学生数最多的班级学生数
第三步：根据最多的学生数和第一步获取的表数据进行比较，获取班级信息。

with tmp1 as (select count(studentid) as student_count,classid from students group by classid having classid is not null), tmp2 as (select max(student_count) as student_count from tmp1) from classes join tmp1 on classes.classid=tmp1.classid join tmp2 on tmp1.student_count=tmp2.student_count select classes.*;

十一、导出数据

1. 导出表关联后的班级名称和学生名称(loca&hdfs)。（导出全部不为空的信息）
班级1，学生1

from (select classes.classname as col1, students.studentname as col2 from classes join students on classes.classid = students.classid) as tmp insert overwrite local directory '/home/hadoop/result/13/01' select col1,col2 insert overwrite directory '/beifeng/result/13/01/' select col1,col2 ;

格式化：

from (select classes.classname as col1, students.studentname as col2 from classes join students on classes.classid = students.classid) as tmp insert overwrite local directory '/home/hadoop/result/13/01' row format delimited fields terminated by ',' select col1,col2 ;

2. 同时分别将已经分配班级的学生和未分配班级的学生导出到不同的文件夹中。(作业5)

from students insert overwrite local directory  '/home/hadoop/result/13/02' select students.studentid,students.classid,students.studentname where students.studentid is not null insert overwrite local directory '/home/hadoop/result/13/03' select students.studentid,students.classid,students.studentname where students.studentid is null;

十二、其他命令

1. 在students和classes表上创建一个视图，视图包含两列分别是：班级名称，学生名称

create view viewname as select classes.classname as cname, students.studentname as sname from classes join students on classes.classid = students.classid

2. 在linux系统中通过命令hive -f/-e将所有学生信息保存到一个文件中。
新建一个文件，文件内容为:select * from students
执行：hive --database beifeng13 -f test.sql >> result.txt

十三、Hive函数

1. HQL内嵌函数

195个函数（包括操作符，使用命令show functions查看）。

2. Hive自定义函数

UDF(User-Defined Function)

UDAF(User-Defined Aggregate Function)

UDTF(User-Defined Table-Generating Function)

1）函数创建

（1）添加jar(0.13.*不支持hdfs上的jar添加，14版本才开始支持)

add jar linux_jar_path

（2）创建function，语法规则如下：

create [temporary][function][dbname.]function_name AS class_name;

（3）使用function，和使用其他函数一样。

2）删除函数

drop [temporary] function [if exists] [dbname.] function_name;

相关阅读:
selenium 难定位元素、时间插件
 列表生成式
 三元表达式
 监控日志
 非空即真
 深拷贝浅拷贝
 元组
 list字典嵌套
 2021
布尔类型
原文地址：https://www.cnblogs.com/yeahwell/p/5742158.html