• Hive UDF: handling special characters such as [\x22, URL-encoded text, and similar encoding issues


    If your function reads and returns only primitive types (the basic Hadoop and Hive writable types such as Text, IntWritable, LongWritable, DoubleWritable, and so on), the simple API (org.apache.hadoop.hive.ql.exec.UDF) is sufficient.

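    If, however, you want a UDF that operates on nested data structures such as Map, List, and Set, you need to get familiar with the org.apache.hadoop.hive.ql.udf.generic.GenericUDF API.
    Simple API: org.apache.hadoop.hive.ql.exec.UDF
    Complex API: org.apache.hadoop.hive.ql.udf.generic.GenericUDF
    The example below builds a UDF on the simple API and comes with the code and a test run.
    Note: a UDF does not check for null arguments on your behalf, and nulls are very common in large datasets, so be strict about it. The example therefore adds an explicit null check.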
    Reference pom.xml:
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.1.1</version>
        </dependency>
    
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.1</version>
        </dependency>
    
        <!--        <dependency>-->
        <!--            <groupId>com.aliyun.odps</groupId>-->
        <!--            <artifactId>odps-sdk-udf</artifactId>-->
        <!--            <version>0.29.10-public</version>-->
        <!--        </dependency>-->
    </dependencies>
    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>3.0.0</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <artifactSet>
                                </artifactSet>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>6</source>
                    <target>6</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    

      

    DEMO:

    package udf;
    
    import jodd.util.URLDecoder;
    import org.apache.hadoop.hive.ql.exec.UDF;
    
    import java.io.UnsupportedEncodingException;
    
    public class TestDecodeX extends UDF {
    
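        // Stand-alone helper that decodes the input and prints it; Hive itself only calls evaluate()
        public static void decodeX (String s) throws UnsupportedEncodingException {
    
            // replace() matches the two characters \x literally;
            // replaceAll() would parse "\\x" as a regex, where it is an invalid hex escape
            String s1 = s.replace("\\x", "%");
            String decode = URLDecoder.decode(s1, "utf-8");
            System.out.println(decode);
    
        }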
    
        public String evaluate(String input) throws Exception {
            // A UDF does not check for null arguments on your behalf, and nulls are
            // very common in large datasets, so guard against them explicitly.
            if (input == null) return null;
            String decode = null;
            try {
                // Turn each literal \xHH escape into %HH so that a URL decoder can handle it.
                // replace() is used because replaceAll() would parse "\\x" as a regex,
                // where it is an invalid hex escape.
                String s1 = input.replace("\\x", "%");
                decode = URLDecoder.decode(s1, "utf-8");
            } catch (Exception e) {
                // Malformed input: return NULL for this row instead of failing the query
            }
            return decode;
        }
    
        public static void main(String[] args) throws Exception {
    
            // URL-encoded sample
            String s1 = "G977N%7C7.1.2%7Cwifi%7C%7Cgamepubgoogle%7CGetHashed%7Ccom.gamepub.ft2.g%7Candroid%7C%7C%7C1.0.2%7Csamsung%7C1547548%7C1%7CAsia%2FSeoul%7CARM%7C%7C19d1b5cdf01341e99c670f254765148d%22%5D";
            // Sample with \x escapes (backslashes must be doubled inside a Java string literal);
            // pass s instead of s1 below to test this form
            String s = "172.31.35.210|21/04/2021:10:59:01|[\\x22TakeSample|0bb9f14b1041a8d9|32550283-4DF6-4CC5-9922-E4F9CFAFD7FD|iPhone13,1|14.2.1|wifi||gamepubappstore|GetHashed|com.gamepub.fr2|ios|BAB3A467-A4D0-4900-80F7-BCB9D53757B1||0.26.87|\\xE8\\x8B\\xB9\\xE6\\x9E\\x9C|3.63|0|Asia/Seoul|ARM64||\\x22]";
            TestDecodeX t = new TestDecodeX();
            System.out.println(t.evaluate(s1));
    
        }
    
    
    }  

    Sample output:

    G977N|7.1.2|wifi||gamepubgoogle|GetHashed|com.gamepub.ft2.g|android|||1.0.2|samsung|1547548|1|Asia/Seoul|ARM||19d1b5cdf01341e99c670f254765148d"]
    
    Process finished with exit code 0
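
    One detail in this approach is easy to get wrong: String.replaceAll treats its first argument as a regular expression, and the two-character pattern \x on its own is an incomplete hex escape that the regex compiler rejects, whereas String.replace matches the characters literally. The stand-alone sketch below checks both behaviours on a shortened log line. It relies only on the JDK (java.net.URLDecoder instead of jodd), and the class and method names are made up for illustration; they are not part of the UDF.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, used only to illustrate replace() vs replaceAll().
public class EscapeReplaceCheck {

    // Literal replacement of the two characters '\' and 'x', followed by URL decoding.
    public static String decode(String raw) {
        try {
            return URLDecoder.decode(raw.replace("\\x", "%"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if replaceAll() refuses to compile the pattern \x as a regex.
    public static boolean regexRejected(String raw) {
        try {
            raw.replaceAll("\\x", "%");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String raw = "[\\x22ios|\\xE8\\x8B\\xB9\\xE6\\x9E\\x9C|Asia/Seoul\\x22]";
        System.out.println(regexRejected(raw)); // prints true
        System.out.println(decode(raw));        // prints ["ios|苹果|Asia/Seoul"]
    }
}
```

    Keep in mind that java.net.URLDecoder also turns every '+' into a space, so the two decoders are not guaranteed to agree on every input.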

    In the Hive client:

    hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;  
    hive> CREATE TEMPORARY FUNCTION decodeX as 'udf.TestDecodeX';  
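
    Rewriting \x to % and then URL-decoding also decodes any %HH sequences that were already present in the data. If that matters for your logs, one alternative is to decode only the \xHH escapes as raw bytes and read the result as UTF-8. The following is a minimal sketch of that idea using only the JDK; it is a different technique from the UDF in this post, and HexEscapeDecoder is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical alternative decoder: handles \xHH escapes only, leaves '%' and '+' alone.
public class HexEscapeDecoder {

    // Value of a single ASCII hex digit, or -1 if the character is not one.
    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static String decode(String input) {
        if (input == null) {
            return null; // same null contract as the UDF's evaluate()
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length());
        int i = 0;
        while (i < input.length()) {
            // A complete escape is four characters long: '\', 'x', and two hex digits.
            if (input.charAt(i) == '\\' && i + 3 < input.length() && input.charAt(i + 1) == 'x') {
                int hi = hex(input.charAt(i + 2));
                int lo = hex(input.charAt(i + 3));
                if (hi >= 0 && lo >= 0) {
                    out.write(hi * 16 + lo); // emit the escaped byte
                    i += 4;
                    continue;
                }
            }
            // Not an escape: copy the code point through as UTF-8 bytes.
            int cp = input.codePointAt(i);
            byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            i += Character.charCount(cp);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "50%+off|\\xE8\\x8B\\xB9\\xE6\\x9E\\x9C|\\x22ok\\x22";
        System.out.println(decode(raw)); // prints 50%+off|苹果|"ok"
    }
}
```

    The trade-off is that URL-encoded input such as %7C is no longer decoded, so this variant only suits columns that carry \x escapes.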

    Reference:

    Hive UDF Development Guide
  • Original article: https://www.cnblogs.com/-courage/p/14714727.html