初步使用hive - 润新知

初步使用hive
- 登陆 bin/hive
- create database hadoop1 //创建数据库
- show databases; //显示所有创建的数据库
- use hadoop1 //使用之前建立的数据库
- create table stu_info(.//创建表
　　　　id int,

　　　　name string

　　)

　　row format delimited fields terminated by ' '； //显示表的每一列的分隔符
- load data local inpath '/opt/datas/students.txt' into table stu_info;//这里的local的含义是加载本地文件到hive中，
- select * from stu_info; //查看数据
- hive在hdfs的目录结构：
user/hive/warehouse/ dousil.db /stu_info

仓库地址　　　　数据库表名
- hive 和mysql是不同的
1. hive属于读时模式，你在想表中插入数据的时不会做校验，在读数据的时候会校验。
2. mysql oracle数据库属于写实模式在想表里插入数据得时候就会做数据的校验，一旦数据模式不匹配，我们就出错
3. bin/hive --datase 'dousil' 会用某个数据库
4. bin/hive -e 'show databses;' 他可以在linux命令下使用hive的sql语句。当我们用i定义化的脚本执行的改语句很管用。
5. bin/hive -e "show database;>/tmp/dousil" 重定向到某个文件的下边
6. bin/hive -f /opt.hql //注意，每条sql语句结束之后一定要加分号。
7. --hiveconif可以修改hive临时启动参数
- Hive常用的扩展配置
  ---------------------------
  1、日志输出目录
  -》默认位置：/tmp/coder/hive.log
  -》在 conf/hive-log4j.properties 文件里配置，内容如下：
  hive.log.dir=/opt/modules/hive-1.2.1/logs
  2、显示正在使用的数据库名称
  -》在 conf/hive-site.xml 文件里配置，添加如下内容：
  
  
  <property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
  </property>
  3、显示表的列名
  -》在 conf/hive-site.xml 文件里配置，添加如下内容：
  
  
  <property>
  <name>hive.cli.print.header</name>
  <value>true</value>
  </property>
  
  Hive常用的Linux Shell参数
  -----------------------------------
  1、查看支持哪些linux参数：bin/hive -help
  2、--database 当我们连接hive时可以指定使用哪一个数据库
  $ bin/hive --database hadoop31
  3、-e 在linux命令行执行Hive的SQL语句
  -》示例1：$ bin/hive -e 'show databases;'
  -》示例2：$ bin/hive -e 'show databases;use hadoop31;select * from new_stu_info;'
  -》示例3：$ bin/hive -e 'select * from hadoop31.new_stu_info;' > /tmp/hivefile.txt
  4、-f 执行文件里的SQL语句
  -》示例：$ bin/hive -f /opt/datas/file.hql
  -》/opt/datas/file.hql文件内容如下：
  show databases;
  use hadoop31;
  select * from new_stu_info;
  
  #注意事项：每条SQL语句结束之后一定要加分号(;)
  5、--hiveconf 启动hive时，可以临时修改hive的配置参数
  -》示例：$ bin/hive --hiveconf hive.cli.print.current.db=false
  -》临时修改hive的配置参数，还有一个更为常用的方法
  - set hive.cli.print.current.db=false; #临时修改属性的值
  - set hive.cli.print.current.db; #查看属性的值
  6、--hivevar 传递参数
  -》示例：$ bin/hive --hivevar v_name='zhangsan' -f /opt/datas/file.hql
  -》/opt/datas/file.hql 文件内容如下：
  
  select * from hadoop31.stu_info where name='${hivevar:v_name}';
  
  Hive常用的三种建表方式
  -------------------------------
  1、普通建表方式
  
  create table stu_info(
  id int,
  name string
  )
  row format delimited fields terminated by ' ';
  
  2、子查询方式建表（克隆一张的字段结构并且复制这张表的数据赋予新表）
  
  create table stu_info_sel as select * from stu_info;
  create table stu_info_sel2 as select * from stu_info where id < 3;
  create table stu_info_sel3 as select name from stu_info;
  
  3、like方式建表（克隆一张表的所有字段结构，但是不复制表的数据）
  
  create table stu_info_like like stu_info;
  
  4、创建员工表和部门表
  
  # 建库
  create database company;
  
  # 创建员工表
  create table emp(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
  )
  row format delimited fields terminated by ' ';
  
  # 向emp表加载数据
  load data local inpath '/opt/datas/emp.txt' into table emp;
  
  # 创建部门表
  create table dept(
  deptno int,
  dname string,
  loc string
  )
  row format delimited fields terminated by ' ';
  
  # 向dept表加载数据
  load data local inpath '/opt/datas/dept.txt' into table dept;
  
  5、hive是允许向表中重复加载同一份数据的
  
  Hive外部表
  -------------------------
  1、hive常见的表的类型
  -》托管表（MANAGED_TABLE）：在删除（drop）表的时候，既删除表的元数据信息，同时也删除表的目录（数据）
  -》外部表（EXTERNAL_TABLE）：在删除表的时候，仅删除表的元数据信息，不删除表的目录（数据）
  2、如何查看表的类型
  hive> desc formatted emp;
  
  # 打印的表的信息（截取一部分，其中Table Type就是表的类型信息）
  # Detailed Table Information
  Database: company
  Owner: coder
  CreateTime: Sun Oct 13 11:02:31 CST 2019
  LastAccessTime: UNKNOWN
  Protect Mode: None
  Retention: 0
  Location: hdfs://bd-server1:8020/user/hive/warehouse/company.db/emp
  Table Type: MANAGED_TABLE
  Table Parameters:
  COLUMN_STATS_ACCURATE true
  numFiles 1
  numRows 0
  rawDataSize 0
  totalSize 656
  transient_lastDdlTime 1570936384
  3、创建外部表
  
  # 创建员工表的外部表（和员工表共用同一份数据）
  create external table emp_external(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
  )
  row format delimited fields terminated by ' '
  location '/user/hive/warehouse/company.db/emp';
  
  # hive在建表的时候，是允许我们重新指定表的目录位置的，如果不指定，那么就是默认的位置
  # 创建外部表时，通常会使用location重新指定表的目录
  
  Hive分区表
  ----------------------------
  1、应用场景
  -》通常都是应对日志文件分析，日志文件都是用日期来命令的
  
  # 普通表的目录结构
  /usr/hive/warehouse/company.db/web_logs/2019100100
  ...
  2019100106
  ...
  2019100123
  2019100200
  ....
  2019100223
  ....
  
  # 需求：统计2019年10月1日6点钟的日志数据有多少条
  select count(1) from web_logs where date='2019100106';
  
  # 注意：上面这条SQL语句的效率极其低
  
  # 针对上面的问题，创建分区表可以轻松的解决，因为分区表可以在表的目录下继续创建二级目录（分区目录）
  
  # 分区表的目录结构（假如创建2个分区：day和hour）
  
  表目录一级分区目录二级分区目录日志数据
  /opt/datas/web_logs /20191001 /00 /2019100100
  /01 /2019100101
  /02 /2019100102
  ...
  /06 /2019100106
  ...
  /23 /2019100123
  # 分区表可以提高where根据日期过滤查询的效率
  2、演示分区表用法
  
  # 创建分区表
  create table web_logs(
  id string,
  url string,
  referer string,
  keyword string,
  type string,
  guid string,
  pageId string,
  moduleId string,
  linkId string,
  attachedInfo string,
  sessionId string,
  trackerU string,
  trackerType string,
  ip string,
  trackerSrc string,
  cookie string,
  orderCode string,
  trackTime string,
  endUserId string,
  firstLink string,
  sessionViewNo string,
  productId string,
  curMerchantId string,
  provinceId string,
  cityId string,
  fee string,
  edmActivity string,
  edmEmail string,
  edmJobId string,
  ieVersion string,
  platform string,
  internalKeyword string,
  resultSum string,
  currentPage string,
  linkPosition string,
  buttonPosition string
  )
  partitioned by(day string, hour string)
  row format delimited fields terminated by ' ';
  
  # 向分区表加载数据
  load data local inpath '/opt/datas/2015082818' into table web_logs partition(day='20150828', hour='18');
  load data local inpath '/opt/datas/2015082819' into table web_logs partition(day='20150828', hour='19');
  
  3、关于分区表的注意事项
  -》分区列不能和表的列重名
  -》分区列并不属于表的列
  -》分区列在查询语句使用的时候，就可以当做表的列去使用
  
  4、【扩展】外部分区表
  
  # 建表
  create external table stu_info(
  id int,
  name string
  )
  partitioned by(day string)
  row format delimited fields terminated by ' ';
  
  # 加载数据（外部分区表比较特殊）
  alter table stu_info add partition(day='20191001') location '/user/hive/warehouse/company.db/web_logs/day=20150828/hour=18';
  
  练习（分析函数）
  ---------------------
  1、练习所使用的表是员工表（emp）
  emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
  7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
  7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
  7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
  7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
  7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
  7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
  7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
  7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
  7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
  7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
  7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
  7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
  7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
  7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
  
  # empno 员工编号
  # ename 员工姓名
  # sal 薪资
  # comm 奖金
  # deptno 部门编号
  
  2、练习题1：查询部门编号为10的员工的信息（员工编号，部门编号，姓名，薪资），按照薪资降序排列
  
  SQL：select empno,deptno,ename,sal from emp where deptno='10' order by sal desc;
  
  结果： empno deptno ename sal
  7839 10 KING 5000.0
  7782 10 CLARK 2450.0
  7934 10 MILLER 1300.0
  
  3、练习题2：查询所有部门员工信息，按照部门薪资降序排列，在最后一列显示每个部门的最高薪资
  
  SQL：select empno,ename,deptno,sal,max(sal) over(partition by deptno order by sal desc) as max_sal from emp;
  
  结果： empno ename deptno sal max_sal
  7839 KING 10 5000.0 5000.0
  7782 CLARK 10 2450.0 5000.0
  7934 MILLER 10 1300.0 5000.0
  
  7788 SCOTT 20 3000.0 3000.0
  7902 FORD 20 3000.0 3000.0
  7566 JONES 20 2975.0 3000.0
  7876 ADAMS 20 1100.0 3000.0
  7369 SMITH 20 800.0 3000.0
  
  7698 BLAKE 30 2850.0 2850.0
  7499 ALLEN 30 1600.0 2850.0
  7844 TURNER 30 1500.0 2850.0
  7654 MARTIN 30 1250.0 2850.0
  7521 WARD 30 1250.0 2850.0
  7900 JAMES 30 950.0 2850.0
  
  4、练习题2：查询所有部门员工信息，按照部门薪资降序排列，并且在最后一列显示排名编号
  
  SQL：select empno,ename,deptno,sal,row_number() over(partition by deptno order by sal desc) as rank from emp;
  
  结果： empno ename deptno sal rank
  7839 KING 10 5000.0 1
  7782 CLARK 10 2450.0 2
  7934 MILLER 10 1300.0 3
  
  7788 SCOTT 20 3000.0 1
  7902 FORD 20 3000.0 2
  7566 JONES 20 2975.0 3
  7876 ADAMS 20 1100.0 4
  7369 SMITH 20 800.0 5
  
  7698 BLAKE 30 2850.0 1
  7499 ALLEN 30 1600.0 2
  7844 TURNER 30 1500.0 3
  7654 MARTIN 30 1250.0 4
  7521 WARD 30 1250.0 5
  7900 JAMES 30 950.0 6
  
  5、练习题3：查询所有部门员工信息，按照部门薪资降序排列，并且筛选每个部门排名前2名员工信息
  
  SQL：select empno,ename,deptno,sal from (select empno,ename,deptno,sal,row_number() over(partition by deptno order by sal desc) as rank from emp) temp where rank <= 2;
  
  结果： empno ename deptno sal
  7839 KING 10 5000.0
  7782 CLARK 10 2450.0
  
  7788 SCOTT 20 3000.0
  7902 FORD 20 3000.0
  
  7698 BLAKE 30 2850.0
  7499 ALLEN 30 1600.0
  
  向Hive表导入数据的多种方式
  --------------------------------
  1、本地加载（数据是复制的过程）
  load data local inpath 'directory path' into table table_name;
  2、从HDFS上向Hive表加载数据（注意：数据是移动到表的目录下，不是复制）
  load data inpath 'hdfs path' into table table_name;
  3、直接使用hdfs的-put命令
  bin/hdfs dfs -put /opt/datas/students.txt /user/hive/warehouse/hadoop31.db/stu_info
  4、overwrite（覆盖加载）
  load data local inpath '/opt/datas/students.txt' overwrite into table stu_info;
  5、子查询
  6、insert（比较常用，和mysql的有区别）
  
  # 建表
  create table stu_info2 like stu_info;
  
  # insert加载数据
  insert into table stu_info2 select * from stu_info where id < 4;
  
  7、使用协作框架上传数据：sqoop、flume
  
  从Hive表导出数据的多种方式
  ------------------------------
  1、insert overwrite
  -》命令格式：insert overwrite [local] directory 'path' select ...
  -》导出到本地
  insert overwrite local directory '/tmp/stu_info' select * from stu_info where name='zhangsan';
  -》思考：输出目录可以提前存在吗？可以的，只不过会被覆盖
  -》指定导出数据的分隔符
  insert overwrite local directory '/tmp/stu_info' row format delimited fields terminated by ',' select * from stu_info;
  
  查询结果：
  [coder@bd-server1 stu_info]$ cat 000000_0
  1,zhangsan
  2,lisi
  3,wangwu
  4,zhaoliu
  5,dongqi
  2、通过hive -e将查询的结果输出重定向指定到本地文件下
  bin/hive -e 'select * from hadoop31.stu_info' > /tmp/stu_info.txt
  3、通过hdfs的-get命令（几乎不会使用）
  4、使用sqoop工具
  
  Hive常用的SQL语句
  -------------------------------
  1、过滤
  -》where
  -》limit
  -》distinct
  -》between and
  -》is null 或 is not null
  -》having
  # 求平均薪资大于2000的部门
  select deptno,avg(sal) as avg_sal from emp group by deptno having avg_sal > 2000;
  
  # 注意：这里只能使用having，不能使用where。having是针对分组之后的数据做过滤查询，
  而where针对分组之前的数据做过滤查询
  2、聚合函数
  -》count
  -》sum
  -》avg
  -》max
  -》min
  3、JOIN
  -》内关联（交集）
  select a.id,a.name,b.city from tab_a a join tab_b b on a.id=b.id;
  -》左关联
  select a.id,a.name,b.city from tab_a a left join tab_b b on a.id=b.id;
  -》右关联
  select a.id,a.name,b.city from tab_a a right join tab_b b on a.id=b.id;
  -》全关联（并集）
  select a.id,a.name,b.city from tab_a a full join tab_b b on a.id=b.id;
  
  tab_a tab_b
  id name id city
  1 zhangsan 1 shanghai
  2 lisi 2 beijing
  3 wangwu 4 shenzhen
  
  Hive排序函数讲解
  ------------------------------
  1、hive与mapreduce相关的参数
  -》设置每个reduce task所能处理的数据量大小
  set hive.exec.reducers.bytes.per.reducer=<number>
  -》最多能运行的reduce task的数量是多少（默认值1009）
  set hive.exec.reducers.max=<number>
  -》设置当前运行的reduce task的数量（默认值1个）
  set mapreduce.job.reduces=<number>
  2、order by（全局排序）
  3、sort by（局部排序）
  -》设置mapreduce使用3个reduce task
  set mapreduce.job.reduces=3;
  -》查询结果导出到本地文件
  insert overwrite local directory '/opt/datas/hive_export/sortby' row format delimited fields terminated by ' ' select * from emp sort by sal;
  
  [coder@bd-server1 sortby]$ cat 000000_0
  7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
  7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
  7782 CLARK MANAGER 7839 1981-6-9 2450.0 N 10
  7698 BLAKE MANAGER 7839 1981-5-1 2850.0 N 30
  7788 SCOTT ANALYST 7566 1987-4-19 3000.0 N 20
  7839 KING PRESIDENT N 1981-11-17 5000.0 N 10
  
  [coder@bd-server1 sortby]$ cat 000001_0
  7900 JAMES CLERK 7698 1981-12-3 950.0 N 30
  7876 ADAMS CLERK 7788 1987-5-23 1100.0 N 20
  7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
  7934 MILLER CLERK 7782 1982-1-23 1300.0 N 10
  7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
  7566 JONES MANAGER 7839 1981-4-2 2975.0 N 20
  
  [coder@bd-server1 sortby]$ cat 000002_0
  7369 SMITH CLERK 7902 1980-12-17 800.0 N 20
  7902 FORD ANALYST 7566 1981-12-3 3000.0 N 20
  4、验证设置3个reduce task，并且使用order by函数排序
  # SQL语句：
  insert overwrite local directory '/opt/datas/hive_export/orderby' row format delimited fields terminated by ' ' select * from emp order by sal;
  
  [coder@bd-server1 orderby]$ cat 000000_0
  7369 SMITH CLERK 7902 1980-12-17 800.0 N 20
  7900 JAMES CLERK 7698 1981-12-3 950.0 N 30
  7876 ADAMS CLERK 7788 1987-5-23 1100.0 N 20
  7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
  7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
  7934 MILLER CLERK 7782 1982-1-23 1300.0 N 10
  7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
  7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
  7782 CLARK MANAGER 7839 1981-6-9 2450.0 N 10
  7698 BLAKE MANAGER 7839 1981-5-1 2850.0 N 30
  7566 JONES MANAGER 7839 1981-4-2 2975.0 N 20
  7788 SCOTT ANALYST 7566 1987-4-19 3000.0 N 20
  7902 FORD ANALYST 7566 1981-12-3 3000.0 N 20
  7839 KING PRESIDENT N 1981-11-17 5000.0 N 10
  5、总结：如果想对reduce多个分区文件进行排序的话，必须使用sort by函数；或者设置多个reduce task的时候，排序只能
  使用sort by函数；如果reduce task数量是1个的话，那么使用order by和sort by没有任何区别
  6、distribute by（指定分区字段）
  
  SQL语句：
  insert overwrite local directory '/opt/datas/hive_export/distribute' row format delimited fields terminated by ' ' select * from emp distribute by deptno sort by sal;
  
  [coder@bd-server1 distribute]$ cat 000000_0
  7900 JAMES CLERK 7698 1981-12-3 950.0 N 30
  7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
  7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
  7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
  7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
  7698 BLAKE MANAGER 7839 1981-5-1 2850.0 N 30
  
  [coder@bd-server1 distribute]$ cat 000001_0
  7934 MILLER CLERK 7782 1982-1-23 1300.0 N 10
  7782 CLARK MANAGER 7839 1981-6-9 2450.0 N 10
  7839 KING PRESIDENT N 1981-11-17 5000.0 N 10
  
  [coder@bd-server1 distribute]$ cat 000002_0
  7369 SMITH CLERK 7902 1980-12-17 800.0 N 20
  7876 ADAMS CLERK 7788 1987-5-23 1100.0 N 20
  7566 JONES MANAGER 7839 1981-4-2 2975.0 N 20
  7788 SCOTT ANALYST 7566 1987-4-19 3000.0 N 20
  7902 FORD ANALYST 7566 1981-12-3 3000.0 N 20
  7、cluster by（同时指定分区和排序字段，注意：只有分区字段和排序字段是同一个字段时才会使用）
  
  SQL语句：
  insert overwrite local directory '/opt/datas/hive_export/cluster' row format delimited fields terminated by ' ' select * from emp cluster by sal;
  
  [coder@bd-server1 cluster]$ cat 000000_0
  7369 SMITH CLERK 7902 1980-12-17 800.0 N 20
  7900 JAMES CLERK 7698 1981-12-3 950.0 N 30
  7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
  
  [coder@bd-server1 cluster]$ cat 000001_0
  7934 MILLER CLERK 7782 1982-1-23 1300.0 N 10
  7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
  7698 BLAKE MANAGER 7839 1981-5-1 2850.0 N 30
  7788 SCOTT ANALYST 7566 1987-4-19 3000.0 N 20
  7902 FORD ANALYST 7566 1981-12-3 3000.0 N 20
  7839 KING PRESIDENT N 1981-11-17 5000.0 N 10
  
  [coder@bd-server1 cluster]$ cat 000002_0
  7876 ADAMS CLERK 7788 1987-5-23 1100.0 N 20
  7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
  7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
  7782 CLARK MANAGER 7839 1981-6-9 2450.0 N 10
  7566 JONES MANAGER 7839 1981-4-2 2975.0 N 20
相关阅读:
树莓派配置Jdk环境并设置开机启动jar
缓存架构之路（一）缓存概述及应用
 并发编程学习历程（零）JMM内存模型
 并发编程学习历程（一）Synchronized
深入Disruptor源码分析（一）入门指南
 JAVA Synchronized和Lock实现原理
 Vscode断点调试PHP
php环境搭建
 github使用总结
 面试笔记
原文地址：https://www.cnblogs.com/dousil/p/12199531.html