hive common functions - creating tables - JDBC


    hive:
    ================
        A data warehouse, used for analyzing massive data sets.
        Runs on top of hadoop.
        Hive generally does not support transactions or row-level updates, though transactional behavior can be enabled when necessary.
    
    Use cases for data warehouses vs. relational databases:
        data warehouse: OLAP    //analytics
        relational database: OLTP    //transactions
    
    
    Array:    select arr[0] from t;          //index into an array column
    Map:      select m['height'] from t;     //look up a map key
    Struct:   select s.db from t;            //access a struct field
    Union:    uniontype columns have only limited query support; see the sketch below
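
    A minimal sketch of the first three (the table name t and column names arr/m/s are
    hypothetical; uniontype is omitted because delimited text tables don't round-trip it well):

        create table t(
            arr ARRAY<int>,
            m   MAP<string,int>,
            s   STRUCT<db:string, tbl:string>
        ) row format delimited
        fields terminated by '\t'
        collection items terminated by ','
        map keys terminated by ':';

        select arr[0], m['height'], s.db from t;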
    
    
    hive workflow:
    =======================
        The user enters HQL (HiveQL) statements at the interface; the compiler translates them into MR jobs, which the execution engine submits to hadoop.
    
    
    hive1    //MR recommended as the execution engine
    hive2    //spark recommended as the execution engine
    
    
    hive's data layout:
    ==========================
        metadata: mysql
        actual data: HDFS

        table: a folder
        database: a folder
        data: HDFS files, plain text by default (example layout below)
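
        Example paths, assuming the default warehouse directory (hive.metastore.warehouse.dir);
        the names mydb and t1 are hypothetical:
            /user/hive/warehouse/mydb.db             //database "mydb" -> folder
            /user/hive/warehouse/mydb.db/t1          //table "t1" -> folder inside it
            /user/hive/warehouse/mydb.db/t1/a.txt    //rows -> text file(s) in the table folder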
    
    
    hive: create table example 0001
    ============================
        create table t1(id int, name string) row format delimited
        fields terminated by '\t'
        lines terminated by '\n'
        collection items terminated by ','
        map keys terminated by ':' 
        stored as textfile;
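
        A sample input line for t1 (the two fields separated by a literal tab) and a load
        command; the file path is hypothetical:
            1	tom                //i.e. "1\ttom"
            load data local inpath '/home/centos/t1.txt' into table t1;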
    
    
    hive launcher script options:
    ================================
    hive
        --version
         
        --service
         
        --rcfilecat
         
        --orcfiledump
        
        --llapdump
        
        --help
        
        --debug
    
    
        help for a second-level command:
        hive --service serviceName --help
    
    
    hive services: hive --service 
    ====================================
        beeline        second-generation user interface
        
        cli        first-generation user interface
        
        help        
        
        hiveserver2    second-generation hive service
        
        version
    
    
    beeline and hiveserver2    
    ===================================
        hiveserver2    //hive's JDBC service

        hive --service hiveserver2    //start hiveserver2
        hiveserver2            //also starts hiveserver2
            
         
        s101:10002    //hiveserver2 web UI
        s101:10000    //hiveserver2 RPC address


        beeline: the new hive2 client
    
    If the error "User: centos is not allowed to impersonate anonymous (state=08S01,code=0)" appears:
    ===========================================================================================
        Solution: add the following configuration to core-site.xml
        <property>     
            <name>hadoop.proxyuser.centos.hosts</name>     
            <value>*</value> 
        </property> 
        <property>     
            <name>hadoop.proxyuser.centos.groups</name>     
            <value>*</value> 
        </property>
    
        After adding it, distribute the file to the other machines and restart hadoop.
    
    
        Problem 2: after the configuration change, creating anything still fails    //the anonymous user has no write permission
            1. First option: edit the hive-site.xml configuration file and change the user to centos
              <property>
                <name>hive.server2.thrift.client.user</name>
                <value>centos</value>
                <description>Username to use against thrift client</description>
              </property>
    
            2. Second option: specify the username in the connection string    //-u is the url, -n is the name
               ~]$  beeline -u jdbc:hive2://localhost:10000 -n centos
    
            
    
    
    
    Connecting to hive's JDBC service:
    =======================================
        1. Start hiveserver2
            hive --service hiveserver2    //start hiveserver2
            hiveserver2            //also starts hiveserver2
        
        2. Start a client (JDBC from IDEA, or beeline)
            
            Code:
            <dependency>
                <groupId>org.apache.hive</groupId>
                <artifactId>hive-jdbc</artifactId>
                <version>2.1.1</version>
            </dependency>
    
            import java.sql.Connection;
            import java.sql.DriverManager;
            import java.sql.ResultSet;
            import java.sql.Statement;

            public static void main(String[] args) throws Exception {
                String driver = "org.apache.hive.jdbc.HiveDriver";
                String url = "jdbc:hive2://192.168.23.101:10000/myhive";
                
                //register the driver
                Class.forName(driver);
                //obtain a connection through DriverManager
                Connection conn = DriverManager.getConnection(url);
                //create a statement from the connection
                Statement st = conn.createStatement();
                //use the statement for inserts, updates, deletes and queries;
                //queries use executeQuery, which returns a result set
                ResultSet rs = st.executeQuery("select * from tt");
                while (rs.next()){
                    int id = rs.getInt(1);
                    String name = rs.getString(2);
                    int age = rs.getInt(3);
                    System.out.println(id + "/" + name + "/" + age);
                }
                rs.close();
                st.close();
                conn.close();
            }
    
            beeline:
                1. enter beeline at the linux command line
                    ~]$ beeline
                2. connect from the beeline prompt
                    beeline> !connect jdbc:hive2://localhost:10000
                
            
    Managed (internal) tables and external tables:
    ==========================================
        1. By default a managed (internal) table is created; dropping the table also deletes its data
            create table xxx

        2. External table: dropping the table deletes only the metadata, not the data    //i.e. only the table's entry is removed from mysql
            create external table xxx    //see the sketch below
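
        A minimal external-table sketch; the table name and location path are hypothetical:
            create external table logs(id int, line string)
            row format delimited fields terminated by '\t'
            location '/data/logs';
            -- drop table logs;   removes only the metastore entry; /data/logs stays on HDFS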
    
    
    
    hive functions:
    ==================================
         desc function [extended] substr;    //show how to use a function [extended form]
        
         select current_database();    //show the current database
         select current_date();        //show the current date
         select current_timestamp();    //show the full timestamp

         select explode(map or array) from xxx;    //explode a map or array: turns one row into many

         select substr(str,index,len)    //select substr('hello',2,3) ====> ell
                        //select substr('hello',-4); ====> ello
                        //select substr('hello',-4,3); ====> ell

         select split('hello world',' ')    //returns an array
    
         1. regular (scalar) functions
         2. table-generating functions
         3. aggregate functions
         (one example of each below)
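
         A minimal sketch, one example per category (users is the partitioned table created
         later in these notes; the FROM-less explode assumes a recent hive):
             select concat(name, '!') from users;                       //regular: one row in, one value out
             select explode(array(1,2,3));                              //table-generating: one row in, many rows out
             select province, count(*) from users group by province;    //aggregate: many rows in, one value per group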
    
    
    wordcount with hive's explode function:
    =============================================

        1. Field: line =====> ARRAY type ====> collection items terminated by ' '

        create table t5(line ARRAY<string>) row format delimited
        fields terminated by '\t'
        collection items terminated by ' '
        lines terminated by '\n'
        stored as textfile;
    
    
        explode: select explode(line) as word from t5;
        group: select word, count(1) from (select explode(line) as word from t5) t6 group by word;
        group + sort: select word, count(1) as c from (select explode(line) as word from t5) t6 group by word order by c desc;
    
    
        2. Field: line =====> string type ====> split(line,' ')
        create table t6(line string) row format delimited
        fields terminated by '\t'
        lines terminated by '\n'
        stored as textfile;
    
        split: select split(line,' ') from t6;
        explode: select explode(split(line,' ')) as word from t6;
        group: select word, count(1) from (select explode(split(line,' ')) as word from t6) t7 group by word;
        group + sort: select word, count(1) as c from (select explode(split(line,' ')) as word from t6) t7 group by word order by c desc;
    
    
        Note: a select cannot be nested inside a function call; nest the functions directly, e.g. explode(split(line,' '))
    
    hive's .hiverc file:
    ===================================
        Holds hive environment settings; it lives in the user's home directory (~), and every time hive starts it automatically finds and loads this file.
        Example content:
            set hive.cli.print.header=true
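
        A slightly fuller sketch; both are standard hive settings:
            set hive.cli.print.header=true;        -- print column headers with query results
            set hive.cli.print.current.db=true;    -- show the current database in the prompt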
    
        
    
    Computing the yearly maximum temperature with hive:
    ===================================
        ftp://ftp.ncdc.noaa.gov/pub/data/noaa/    //NCDC weather data

        
        substring out the year and temperature ===> cast the temperature to int ====> group and aggregate (max) 
        
        1. Create the table
            create table temp(line string) row format delimited
            fields terminated by '\t'
            lines terminated by '\n'
            stored as textfile;
    
        2. Load the data
            load data local inpath 'Temp/*' into table temp;
    
        3. Write the SQL
            0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
        
            cast(col as int)        //cast(line as int) casts the line column to an int 

            get the year: substr(line,16,4)
            get the temperature: cast(substr(line,88,5) as int)   != 9999
            year and temperature: select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp;
            filter invalid temperatures: select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp where cast(substr(line,88,5) as int) != 9999;
    
            select a.year, max(a.temp) from 
            (select substr(line,16,4) as year, cast(substr(line,88,5) as int) as temp from temp where cast(substr(line,88,5) as int) != 9999) as a 
            group by a.year;
        
    
    
    Table operations:
    =========================================
        like        
            create table ep like employee;        //copy only the table structure
            
        as    
            create table ep2 as select * from employee;            //copy the table with its data
            create table ep3 as select ep2.name, ep2.sex_age from ep2;    //copy the table, selected columns only

        truncate
            truncate table ep3;            //truncate (empty) the table
    
        column
            add columns
                alter table ep3 add columns(id int, age int);

            drop a column
                //hive has no "drop column"; remove columns with "replace columns" below

            change a column
                alter table ep3 change id no string;        //rename column id to no and declare its type as string

            replace columns
                alter table ep3 replace columns(name string, id int, age int);    //replace the full column list
    
    Partitioned tables: each partition is a directory
        
        select ... where ... ;    //partition columns can be filtered in where, which prunes whole directories
    
    
    1,tom,20,henan,kaifeng
    
        
        Create a partitioned table:
            create table users(id int, name string, age int) partitioned by (province string, city string)
            row format delimited 
            fields terminated by '\t'
            lines terminated by '\n'
            stored as textfile;   
    
        
        Statically add partitions to the partitioned table:
            alter table users add partition(province='henan', city='luoyang') ;
    
            alter table users add 
            partition(province='beijing', city='beijing') 
            partition(province='hebei', city='xiongan') 
            partition(province='hebei', city='shijiazhuang') 
            ;
    
        Drop a partition: [IF EXISTS] drops it if present, without an error if absent
            ALTER TABLE users DROP [IF EXISTS] PARTITION (province='beijing', city='beijing') ;
    
        List the table's partitions:
            show partitions users;
    
        
        Load data into a specific partition:
            load data local inpath '' overwrite into table users partition(province='beijing', city='beijing');
    
            
    
    insert:
    ===========================================
        Dynamically inserting partitions with insert:
            
            1. Turn off dynamic-partition strict mode:
                reason: to avoid performance problems caused by too many partitions, strict mode requires at least one static partition
                disable strict mode:
                set hive.exec.dynamic.partition.mode=nonstrict 
            
            non-partitioned table: id,name,age,province,city
            partitioned table:     id,name,age    partitions: province,city

            2. Dynamic insert; by default the partition columns are the last n columns of the select
                insert into table users partition(province, city) 
                select user0.id, user0.name, user0.age, user0.province, user0.city from user0;
            
        
        Exporting data with insert: note the export can only overwrite; overwrite cannot be omitted
            
            insert overwrite local directory '/home/centos/users' row format delimited 
            fields terminated by '\t'
            select * from users;

            write the data to several directories at once (a multi-insert):
            from users
            insert overwrite local directory '/home/centos/users' 
            row format delimited fields terminated by '\t'
            select *
            insert overwrite local directory '/home/centos/user2' 
            row format delimited fields terminated by '\t'
            select *;
    
    
        Inserting one table's data into another with insert:
            insert into t1 select t4.name from t4;
        
    
    order by: total order
        uses a single reducer regardless of the configured reducer count
        1. manually set the number of reducers in hive:
            set mapreduce.job.reduces=2

        2. a group by query
            select province, count(*) from users group by province;

        3. add an order by
            select province, count(*) as c from users group by province order by c;
    
    
    sort by: partial order (each reducer's output is sorted independently)
        1. manually set the number of reducers in hive:
            set mapreduce.job.reduces=2

        2. group by + sort by
            select province, count(*) as c from users group by province sort by c;
        
        tianjin  1 
        hebei    2 
        beijing  9 
        henan    2 
    
    
    
    distribute by: partitions the data by the given column(s)
    
        select province, age  from users distribute by province sort by age;
        
        select province, count(*) as c from users group by province distribute by province sort by c;
    
    
    
    Difference between distribute by and group by:
        1. group by is used together with aggregate functions, and the partitioning column is implicit
        2. distribute by needs no aggregate function and partitions by exactly the named column(s)
    
    
    
    cluster by: shorthand for distribute by + sort by on the same column(s); per-partition sorting, but the partition and sort columns must match
    
        select province, count(*) as c from users group by province cluster by province;
    
        
    
    
    
    How hive query clauses map onto the MR pipeline:
    ==========================================
        select    -> map
        from      -> input
        where     -> map
        having    -> reduce
        group by  -> partition
        order by  -> a second job
        limit     -> the second job's cleanup
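
        A sketch annotating one query with the stages above (columns from the users examples):
            select province, count(*) as c    -- map: project
            from users                        -- input
            where age > 10                    -- map: filter rows
            group by province                 -- partition/shuffle on province
            having count(*) > 1               -- reduce: filter groups
            order by c                        -- second MR job: total order
            limit 5;                          -- second job's cleanup: cut off the output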
    
        
    limit: paging
    ==========================
        caps the number of rows output
        select * from users limit 10;        //first ten rows
        select * from users limit 10,10;    //skip ten rows, return the next ten
    
    
    join:
    ==============================
        1. inner join
            select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a, orders b where a.id=b.uid;
            select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a inner join orders b on a.id=b.uid; 
        2. left outer join
            select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a left outer join orders b on a.id=b.uid; 
        3. right outer join
            select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a right outer join orders b on a.id=b.uid; 
        4. full outer join    //not supported by mysql
            select a.id, a.name,a.age,a.province,b.orderno, b.price from user0 a full outer join orders b on a.id=b.uid; 
    
    
    user0 table:
    100    yyy    100    beijing    beijing
    7    tt    7    tianjin    tianjin
    1    tom    20    hebei    langfang
    2    tomas    30    henan    xinyang
    3    tomson    45    hebei    shijiazhuang
    4    tomason    6    beijing    beijing
    5    tomasLee    7    henan    kaifeng
    
    
    orders table:
    1    order001    20    100
    2    order002    30    5
    3    order003    90    7
    4    order004    30.6    6
    5    order005    4.2    1
    6    order006    5    2
    
    
    
    Manually designating the big and small tables: the reduce-side join (see the sketch below)
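
        A sketch using hive's /*+ STREAMTABLE(...) */ hint to mark which table is streamed
        (i.e. treated as the big table) in a reduce-side join; here orders is assumed to be the larger table:
            select /*+ STREAMTABLE(b) */ a.id, a.name, b.orderno, b.price
            from user0 a join orders b on a.id = b.uid;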