• [Hive_add_11] Hive 使用 UDTF 实现日志降维



    0. 说明

      对日志进行降维处理,将日志分为几个小表

      通过编写 UDTF ,对日志降维,将日志聚合体相关字段抽取出来,形成新表。


    1. 操作流程

      1.0 日志部分内容

    1532868065.518#192.168.23.1#1532868067548#200#{"appChannel":"appstore","appErrorLogs":

    [{"createdAtMs":1530455040000,"errorBrief":"at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)",
    "errorDetail":"at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67) at sun.reflect.DelegatingMethodAccessorImpl.
    invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606)"},{"createdAtMs":1530393180000,
    "errorBrief":"at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)","errorDetail":
    "at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)"}
    ],"appEventLogs":[{"createdAtMs":1530412800000,"eventId":
    "share","logType":"event","mark":"4","musicID":"傲红尘"}
    ],"appPageLogs":[{"createdAtMs":1530449520000,
    "logType":"page","nextPage":"list.html","pageId":"list.html","pageViewCntInSession":0,"visitIndex":"2"}
    ],
    "appPlatform":"ios","appStartupLogs":[{"brand":"联想","carrier":"中国联通","country":"china","createdAtMs":
    1530385560000,"logType":"startup","network":"cell","province":"hebei","screenSize":"960 * 640"},
    {"brand":"魅族","carrier":"中国铁通","country":"america","createdAtMs":1530412860000,"logType":"startup",
    "network":"3g","province":"guangxi","screenSize":"480 * 320"}
    ],"appUsageLogs":[{"createdAtMs":1530379200000,
    "logType":"usage","singleDownloadTraffic":"12800","singleUploadTraffic":"128","singleUseDurationSecs":"123"}
    ],
    "appVersion":"1.0.0","deviceId":"Device000099","deviceStyle":"oppo 1","osType":"1.4.0"}

      1.1 创建 logAgg表

      创建 logAgg表,分区表 => year, month, day

        create table logAgg(serverTime string,remoteIp string,clientTime string,status string, json string)
        partitioned by(year string, month string, day string)
        row format delimited
        fields terminated by '#' ;

      1.2 load 数据到 logAgg表

      load data local inpath '/home/centos/files/2018-07-01.log' into table logagg partition(year='2018',month='07',day='01');

      1.3 降维处理

      1. 代码编写

      ParseJsonUtil.java

      ParseEvent.java

      2. 上传并同步

      先打包再放入 /soft/hive/lib 中

        cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/
    xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

      3. 注册临时函数

        create temporary function parseEvent as 'com.share.udtf.ParseEvent';

      4.测试

        select parseEvent(json) from logAgg;

      1.4 创建 logEvent表

        create table logevent(deviceId string, createdAtMs string, eventId string, logType string , mark string, musicID string)
        stored as parquet tblproperties('parquet.compression'='GZIP');

      1.5 转储

        insert into logevent 
        select parseEvent(json) from logagg
        where year='2018' and month='07' and day='01';

      1.6 对 logEvent表进行操作

      1. 计算每个用户对每首歌的评分

    select deviceid, musicid, sum(cast(mark as int)) as sum
    from logevent
    where musicId is not null
    group by deviceid, musicid;

      2. 计算每个用户对每首歌的评分与最高评分

    select deviceid , musicid, sum, max(sum)over(partition by deviceid) as sum2
    from (
    select deviceid, musicid, sum(cast(mark as int)) as sum
    from logevent where musicId is not null group by deviceid, musicid
    )a;

      3. 使用 sql 计算出,每个用户最喜欢(评分最高)的歌曲,及其评分

    select deviceid , musicid, sum from (
    select deviceid , musicid, sum, max(sum)over(partition by deviceid) as sum2
    from (
    select deviceid, musicid, sum(cast(mark as int)) as sum
    from logevent where musicId is not null group by deviceid, musicid
    )a
    )b 
    where sum=sum2 ;

    2. 附加内容 

      2.1 Hive 加载 jar 方式

      1. 将 jar 放在 Hive 的 lib 下,并重启 Hive
        /soft/hive/lib

      2. 通过配置文件指定 jar,并重启 Hive
        hive.aux.jars.path=/x/x/x.jar

      3. 临时加载 jar
        hive> add jar /x/x/x.jar

      2.2 注册函数方法

      1. create function as '';    // 永久

      2. create temporary function as '';    // 临时

      3. create function xxx as '' using jar 'hdfs://mycluster/xxx.jar';    // 将 jar 包放入 HDFS 中,避免重启


  • 相关阅读:
    summary
    谷歌浏览器Software Reporter Tool长时间占用CPU解决办法
    进栈 出栈 标准用法
    C语言保证,0永远不是有效的数据地址,因此,返回址0可用来表示发生的异常事件
    寄存器是内存阶层中的最顶端,也是系统获得操作资料的最快速途径。 存于寄存器内的地址可用来指向内存的某个位置,即寻址
    对内存分配的理解 自动变量 局部变量 临时变量 外部变量 字符串长度 C语言可以看成由一些列的外部对象构成
    ORM Active Record Data Mapper
    summary
    c预处理器
    #include<stdio.h> #include "stdio.h"
  • 原文地址:https://www.cnblogs.com/share23/p/10361382.html
Copyright © 2020-2023  润新知