• Hive自己定义函数的使用——useragent解析


    想要从日志数据中分析一下操作系统、浏览器、版本号使用情况。可是hive中的函数不能直接解析useragent,于是能够写一个UDF来解析。useragent用于表示用户的当前操作系统,浏览器版本号信息,形如:

    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 180.173.196.29

    当中解析ua能够用一个开源的工具包,叫做useragentutils.jar来处理,可是不能直接引入这个包,由于hadoop和hive都不支持直接引用第三方的包,要导入源代码。项目结构应该例如以下图


    以下的代码用来打印出操作系统、浏览器版本号信息:

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    
    import eu.bitwalker.useragentutils.UserAgent;
    
    public class ParseUserAgent_UDF extends UDF{
    	public Text evaluate(final Text userAgent){
    		StringBuilder builder = new StringBuilder();
    		UserAgent ua = new UserAgent(userAgent.toString());
    		builder.append(ua.getOperatingSystem()+"	"+ua.getBrowser()+"	"+ua.getBrowserVersion());
    		return new Text(builder.toString());
    	}
    }
    
    使用:打成jar包,hive中add jar xx.jar;

    create temporary function ua_parse as 'com.xx.ParseUserAgent_UDF';

    select ua_parse(ua) from table_name limit 3;

    结果:

    WINDOWS_7       CHROME21        21.0.1180.89
    WINDOWS_7       CHROME33        33.0.1750.146
    WINDOWS_7       CHROME21        21.0.1180.89

    此种方式仅仅能处理一行。生成一行,无法进行统计分析。

    以下使用UDTF(User Defined Table Generating Function),处理一行,生成多列。

    import java.util.ArrayList;
    
    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    
    import eu.bitwalker.useragentutils.UserAgent;
    
    public class ParseUserAgent_UDTF extends GenericUDTF{
    	@Override
    	public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    		if (args.length != 1) {
    			throw new UDFArgumentLengthException("ExplodeMap takes only one argument");
    		}
    		if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
    			throw new UDFArgumentException("ExplodeMap takes string as a parameter");
    		}
    		ArrayList<String> fieldNames = new ArrayList<String>();
    		ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
    		fieldNames.add("system");
    		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    		fieldNames.add("browser");
    		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    		fieldNames.add("version");
    		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    		return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    	}
    	@Override
    	public void process(Object[] arg){
    		try {
    			if(arg == null || arg.length == 0)
    				return;
    			String input = arg[0].toString();
    			String result[] = ua_parse(input).split("	");
    			forward(result);
    		} catch (Exception e) {
    			e.printStackTrace();
    		}
    	}
    	
    	@Override
    	public void close() throws HiveException {
    		
    	}
    	public String ua_parse(String userAgent){
    		StringBuilder builder = new StringBuilder();
    		UserAgent ua = new UserAgent(userAgent.toString());
    		builder.append(ua.getOperatingSystem()+"	"+ua.getBrowser()+"	"+ua.getBrowserVersion());
    		return builder.toString();
    	}
    }
    
    select t.browser,count(*) c from (select ua_parse(ua) as (system,browser,version) from table_name) t group by t.browser order by c desc;
    前十名:

    CHROME31        987220571
    UNKNOWN 708890045
    IE8     420021677
    IE7     411500373
    MOBILE_SAFARI   291920740
    IE6     217574865
    IE11    179582201
    IE9     165160040
    CHROME30        158623163
    CHROME21        155192489

    未识别的还是非常多!


    參考:http://blog.csdn.net/ruidongliu/article/details/8791865

    http://computerdragon.blog.51cto.com/6235984/1288567


  • 相关阅读:
    SQLHelper访问类
    visual studio 2017安装教程以及各类问题解决方案
    EasyUI表格删除多个表的多条数据
    配置Java环境JDK与jre
    javascript动态结算购物车
    Linux 命令--vi/vim/yum
    Linux 命令--磁盘管理
    Linux 命令--用户和用户组管理
    Linux 命令--文件与目录管理
    Linux 目录结构说明
  • 原文地址:https://www.cnblogs.com/mengfanrong/p/5125607.html
Copyright © 2020-2023  润新知