背景
URL 的编码 是ASCII十六进制格式。数仓接受到前端上报的URL,要对URL字段解码。
如要将
https%3A%2F%2Fmywebsite%2Fdocs%2Fenglish%2Fsite%2"Fmybook.do%3Frequest_type%3D%26type%3Dprivate
解码为:
https://mywebsite/docs/english/site/mybook.do?request_type=&type=private
方式
hive sql中,通过反射调用java.net.URLDecoder方法
select reflect('java.net.URLDecoder', 'decode',url, 'UTF-8') from table_name;
问题
实践中发现,部分https开头的url解析不完整。例如解码下面的URL,
https%253A%252F%252Fwww.cnblogs.com/drjava
通过上面的方法,结果是:
https%3A%2F%2Fwww.cnblogs.com/drjava
解决方案
通过google找到了可行的方法,解码两次,由于sql写起来比较难以阅读,所以封装了udf。代码如下:
1 import org.apache.hadoop.hive.ql.exec.UDF; 2 import java.io.UnsupportedEncodingException; 3 import java.net.URLDecoder; 4 5 public class UrlDecode extends UDF { 6 /** 7 * 8 * 9 * @param component 编码的url 10 * @return 解码url 11 */ 12 public String evaluate(String component) { 13 if(component == null || component.length() <= 0){ 14 return ""; 15 } 16 17 String result = ""; 18 component = component.replaceAll("%(?![0-9a-fA-F]{2})", "%25"); 19 try { 20 result = URLDecoder.decode(component, "UTF-8"); 21 result = URLDecoder.decode(result, "UTF-8"); 22 } catch (UnsupportedEncodingException e) { 23 result = component; 24 } 25 return result; 26 } 27 }
做一下单元测试
1 import org.junit.Assert; 2 import org.junit.Test; 3 import udf.UrlDecode; 4 5 public class UrlDecodeTest { 6 @Test 7 public void UrlDecodeMethod(){ 8 UrlDecode ud = new UrlDecode(); 9 Assert.assertEquals(ud.evaluate("https%3A%2F%2Fmywebsite%2Fdocs%2Fenglish%2Fsite%2" + 10 "Fmybook.do%3Frequest_type%3D%26type%3Dprivate"), 11 "https://mywebsite/docs/english/site/mybook.do?request_type=&type=private"); 12 13 Assert.assertEquals(ud.evaluate("你是谁,为了谁"),"你是谁,为了谁"); 14 15 Assert.assertEquals(ud.evaluate(null),""); 16 } 17 }