• UDF, UDAF, and UDTF in Hive


    1. Basic operations in Hive

      DDL, DML

    2. Functions in Hive

    User-Defined Functions: UDF (user-defined functions)
    UDF: one row in, one row out, e.g. upper, lower, substring (one record goes in, one record comes out)
    UDAF: User-Defined Aggregation Function, many rows in, one row out, e.g. count, max, min, sum, ...
    UDTF: User-Defined Table-Generating Function, one row in, many rows out, e.g. explode
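
    A minimal sketch of one function from each category (assumes Hive 0.13+, where SELECT without a FROM clause is allowed; hive_wc is the table created in the example below):

    -- UDF: one row in, one row out
    select upper('hello');                -- HELLO
    -- UDAF: many rows in, one row out
    select count(*) from hive_wc;         -- a single total row
    -- UDTF: one row in, many rows out
    select explode(array('a','b','c'));   -- three rows: a, b, c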

    3. Examples

    show functions lists the functions the system supports
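
    To check how a specific function is used, Hive also provides desc function (extended adds examples):

    hive> desc function split;
    hive> desc function extended explode;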

    Function examples: split(), explode()

    Exercise: use Hive to count word occurrences

    explode turns an array into multiple rows of data

    [hadoop@hadoop000 data]$ vi hive-wc.txt
    hello,world,welcome
    hello,welcome
    hive> create table hive_wc(sentence string);
    OK
    Time taken: 1.083 seconds
     
    hive> load data local inpath '/home/hadoop/data/hive-wc.txt' into table hive_wc;
    Loading data to table default.hive_wc
    Table default.hive_wc stats: [numFiles=1, totalSize=35]
    OK
    Time taken: 1.539 seconds
     
    hive> select * from hive_wc;
    OK
    hello,world,welcome
    hello,welcome
     
    Time taken: 0.536 seconds, Fetched: 3 row(s)
    hive> select split(sentence,",") from hive_wc;
    OK
    ["hello","world","welcome"]
    ["hello","welcome"]
    [""]
    Time taken: 0.161 seconds, Fetched: 3 row(s)
    "hello"
    "world"
    "welcome"    
    "hello"
    "welcome"

    Complete the word count with a single SQL statement:

    hive> select word, count(1) as c 
        > from (select explode(split(sentence,",")) as word from hive_wc) t
        > group by word ;
    Query ID = hadoop_20180613094545_920c2e72-5982-47eb-9a9c-5e5a30ebb1ae
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1528851144815_0001, Tracking URL = http://hadoop000:8088/proxy/application_1528851144815_0001/
    Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528851144815_0001
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2018-06-13 10:18:53,155 Stage-1 map = 0%,  reduce = 0%
    2018-06-13 10:18:59,605 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.42 sec
    2018-06-13 10:19:07,113 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.31 sec
    MapReduce Total cumulative CPU time: 4 seconds 310 msec
    Ended Job = job_1528851144815_0001
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.31 sec   HDFS Read: 7333 HDFS Write: 29 SUCCESS
    Total MapReduce CPU Time Spent: 4 seconds 310 msec
    OK
            1
    hello   2
    welcome 2
    world   1
    Time taken: 26.859 seconds, Fetched: 4 row(s)
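
    The trailing empty line in the input file surfaces as an empty-string word with count 1 (the unnamed row above). A minimal variant, using LATERAL VIEW (the idiomatic companion to a UDTF) and filtering out the empty token:

    select t.word, count(1) as c
    from hive_wc
    lateral view explode(split(sentence, ',')) t as word
    where t.word <> ''
    group by t.word;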

    4. JSON data

    File used: rating.json

    Create a table rating_json, load the data into it, and view the first ten rows:

    hive> create table rating_json(json string);
    OK
     
    hive> load data local inpath '/home/hadoop/data/rating.json' into table rating_json;
    Loading data to table default.rating_json
    Table default.rating_json stats: [numFiles=1, totalSize=34967552]
    OK
     
     
    hive> select * from rating_json limit 10;
    OK
    {"movie":"1193","rate":"5","time":"978300760","userid":"1"}
    {"movie":"661","rate":"3","time":"978302109","userid":"1"}
    {"movie":"914","rate":"3","time":"978301968","userid":"1"}
    {"movie":"3408","rate":"4","time":"978300275","userid":"1"}
    {"movie":"2355","rate":"5","time":"978824291","userid":"1"}
    {"movie":"1197","rate":"3","time":"978302268","userid":"1"}
    {"movie":"1287","rate":"5","time":"978302039","userid":"1"}
    {"movie":"2804","rate":"5","time":"978300719","userid":"1"}
    {"movie":"594","rate":"4","time":"978302268","userid":"1"}
    {"movie":"919","rate":"4","time":"978301368","userid":"1"}
    Time taken: 0.195 seconds, Fetched: 10 row(s)

    Process the JSON data with json_tuple, a UDTF introduced in Hive 0.7:

    hive> select 
        > json_tuple(json,"movie","rate","time","userid") as (movie,rate,time,userid) 
        > from rating_json limit 10;
    OK
    1193    5       978300760       1
    661     3       978302109       1
    914     3       978301968       1
    3408    4       978300275       1
    2355    5       978824291       1
    1197    3       978302268       1
    1287    5       978302039       1
    2804    5       978300719       1
    594     4       978302268       1
    919     4       978301368       1
    Time taken: 0.189 seconds, Fetched: 10 row(s)
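
    The time field is a Unix timestamp. As a sketch (assuming the values are seconds since the epoch), from_unixtime converts it to a readable datetime; since a UDTF like json_tuple must stand alone in its SELECT list, the conversion goes in an outer query:

    select movie, rate, from_unixtime(cast(time as bigint)) as rate_time, userid
    from (
      select json_tuple(json, 'movie', 'rate', 'time', 'userid') as (movie, rate, time, userid)
      from rating_json
    ) t
    limit 10;

    To pull out a single field without a UDTF, the get_json_object UDF is an alternative, e.g. select get_json_object(json, '$.movie') from rating_json limit 10;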

    5. Top-N per group with the row_number window function:

    [hadoop@hadoop000 data]$ more hive_row_number.txt 
    1,18,ruoze,M
    2,19,jepson,M
    3,22,wangwu,F
    4,16,zhaoliu,F
    5,30,tianqi,M
    6,26,wangba,F
    [hadoop@hadoop000 data]$ 
    hive> create table hive_rownumber(id int,age int, name string, sex string)
        > row format delimited fields terminated by ',';
    OK
    Time taken: 0.451 seconds
    hive> load data local inpath '/home/hadoop/data/hive_row_number.txt' into table hive_rownumber;
    Loading data to table hive3.hive_rownumber
    Table hive3.hive_rownumber stats: [numFiles=1, totalSize=84]
    OK
    Time taken: 1.381 seconds
    hive> select * from hive_rownumber ;
    OK
    1       18      ruoze   M
    2       19      jepson  M
    3       22      wangwu  F
    4       16      zhaoliu F
    5       30      tianqi  M
    6       26      wangba  F
    Time taken: 0.455 seconds, Fetched: 6 row(s)

    Requirement: query the two oldest records for each sex, i.e. a top-N query:

    Analysis: ORDER BY is a global sort and cannot sort within groups; to sort inside each group you need a window function (analytic function).

    select id, age, name, sex
    from (
      select id, age, name, sex,
             row_number() over (partition by sex order by age desc) as rank
      from hive_rownumber
    ) t
    where rank <= 2;

    hive> select id,age,name,sex 
        > from
        > (select id,age,name,sex,
        > row_number() over(partition by sex order by age desc) as rank
        > from hive_rownumber) t
        > where rank<=2;
    Query ID = hadoop_20180614202525_9829dc42-3c37-4755-8b12-89c416589ebc
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1528975858636_0001, Tracking URL = http://hadoop000:8088/proxy/application_1528975858636_0001/
    Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1528975858636_0001
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2018-06-14 20:26:18,582 Stage-1 map = 0%,  reduce = 0%
    2018-06-14 20:26:24,010 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.48 sec
    2018-06-14 20:26:31,370 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.86 sec
    MapReduce Total cumulative CPU time: 3 seconds 860 msec
    Ended Job = job_1528975858636_0001
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.86 sec   HDFS Read: 8586 HDFS Write: 56 SUCCESS
    Total MapReduce CPU Time Spent: 3 seconds 860 msec
    OK
    6       26      wangba  F
    3       22      wangwu  F
    5       30      tianqi  M
    2       19      jepson  M
    Time taken: 29.262 seconds, Fetched: 4 row(s)
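
    row_number is one of several ranking functions. A minimal sketch comparing it with rank and dense_rank (the three differ only when ages tie within a sex):

    select id, age, name, sex,
           row_number() over (partition by sex order by age desc) as rn,
           rank()       over (partition by sex order by age desc) as rk,
           dense_rank() over (partition by sex order by age desc) as drk
    from hive_rownumber;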