• Filtering Garbled Chinese Text in Java


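    While cleaning log data recently, I ran into garbled Chinese text (mojibake). The simplest fix is to discard any string that contains a non-Chinese character, but that is not acceptable, because legitimate strings such as Xperia™主題 and 天天四川麻将Ⅱ would be discarded as well.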

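    1. Unicode Encoding

    Unicode is a character encoding standard that covers the letters, punctuation, and symbols of essentially all the world's languages; put simply, it is a universal character set. Its code space runs from U+0000 to U+10FFFF. Partitioned by fixed, contiguous code point ranges, Unicode is divided into blocks (Unicode block); each assigned code point belongs to exactly one block, and blocks do not overlap. Classified by the properties of the characters themselves, Unicode is also divided into scripts (Unicode script). For example, the Han script, which covers Chinese characters and related symbols, draws on the following blocks: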

    • CJK Radicals Supplement
    • Kangxi Radicals
    • 15 characters from CJK Symbols and Punctuation
    • CJK Unified Ideographs Extension A
    • CJK Unified Ideographs
    • CJK Compatibility Ideographs
    • CJK Unified Ideographs Extension B
    • CJK Unified Ideographs Extension C
    • CJK Unified Ideographs Extension D
    • CJK Unified Ideographs Extension E
    • CJK Compatibility Ideographs Supplement

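    The common Chinese characters are in the CJK Unified Ideographs block. To cover traditional and rarely used characters as well, CJK also has five extensions, A through E. The Basic Latin block contains exactly the ASCII characters: control characters, punctuation, digits, and English letters.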

    For the detailed mapping between Unicode code points, blocks, and scripts, see here.
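    The difference between a block and a script is easiest to see on a few concrete code points. The following is a minimal sketch using only `Character.UnicodeBlock` and `Character.UnicodeScript` from the JDK; the class name `BlockScriptSketch` and the helpers `describe` and `countHan` are made up for this illustration. It shows that the ideographic full stop (U+3002) and the ideographic number zero (U+3007) share a block but not a script, and it counts how many code points in the CJK Symbols and Punctuation range (U+3000 to U+303F) the JDK assigns to the Han script. That count is expected to agree with the 15 characters mentioned in the list above, although the exact figure depends on the Unicode version bundled with the JDK.

```java
class BlockScriptSketch {
    // Formats a code point together with its block and script as reported by the JDK.
    static String describe(int cp) {
        return String.format("U+%04X block=%s script=%s",
                cp, Character.UnicodeBlock.of(cp), Character.UnicodeScript.of(cp));
    }

    // Counts the code points in [first, last] whose script is HAN.
    static int countHan(int first, int last) {
        int n = 0;
        for (int cp = first; cp <= last; cp++) {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x4E2D)); // a common ideograph
        System.out.println(describe(0x3002)); // ideographic full stop
        System.out.println(describe(0x3007)); // ideographic number zero
        System.out.println(describe('A'));    // ASCII letter
        // Han-script code points inside the CJK Symbols and Punctuation range
        System.out.println(countHan(0x3000, 0x303F));
    }
}
```

    Because a block is only a code point range, testing the script is the more direct way to ask "is this a Chinese character?", while testing the block is convenient for whitelisting whole ranges of symbols.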

    2. Character Encoding in Java

    The JDK implements both Unicode blocks and Unicode scripts:

    char c = '☎';
    Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);   // MISCELLANEOUS_SYMBOLS
    Character.UnicodeScript uc = Character.UnicodeScript.of(c); // COMMON
    

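    A Java char is a 16-bit UTF-16 code unit. Casting a char to int yields the value of that code unit, which equals the Unicode code point for characters in the Basic Multilingual Plane. UTF-8 bytes only appear when the string is encoded, for example with getBytes(StandardCharsets.UTF_8); the no-argument getBytes() uses the platform default charset, which is often UTF-8 but is not guaranteed to be: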

    import java.nio.charset.StandardCharsets;
    import org.apache.commons.codec.binary.Hex;

    String s = "\u00a0";
    String.format("\\u%04x", (int) s.charAt(0)); // --> \u00a0
    Hex.encodeHexString(s.getBytes(StandardCharsets.UTF_8)); // --> c2a0
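    The same relationship can be checked without a third-party dependency. The sketch below is an illustration only: the class name `EncodingViewsSketch` and the helper `utf8Hex` are hypothetical names, not part of the original code. It prints the UTF-8 bytes of three characters and shows that a character outside the Basic Multilingual Plane (here U+20000 from CJK Unified Ideographs Extension B) occupies two char values but is a single code point.

```java
import java.nio.charset.StandardCharsets;

class EncodingViewsSketch {
    // Returns the UTF-8 encoding of s as a lowercase hex string.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0";                               // NO-BREAK SPACE
        String han = "\u4e2d";                                // ideograph U+4E2D
        String extB = new String(Character.toChars(0x20000)); // Extension B ideograph

        System.out.println(utf8Hex(nbsp)); // c2a0
        System.out.println(utf8Hex(han));  // e4b8ad
        System.out.println(utf8Hex(extB)); // f0a08080

        System.out.println(extB.length());                            // 2 (UTF-16 code units)
        System.out.println(extB.codePointCount(0, extB.length()));    // 1 (code point)
        System.out.println(Integer.toHexString(extB.codePointAt(0))); // 20000
    }
}
```

    The last three lines matter for any code that walks a string with charAt: a supplementary character is seen as two surrogate code units rather than as one character.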
    

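    UTF-8 is one variable-length, prefix-free encoding of Unicode code points; the correspondence between the two is described here. Returning to the opening problem of filtering garbled Chinese text, a basic approach is: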

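    • Skip punctuation and control characters.
    • Compute the proportion of non-Chinese characters among those that remain; if it exceeds a threshold, treat the string as garbled.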

    The complete code is as follows:

    public class ChineseUtill {
    	 
        // True if c belongs to the Han script (Chinese ideographs).
        private static boolean isChinese(char c) {
            return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
        }
        
        // Despite its name, this also matches every ASCII character (letters and
        // digits included), because the whole Basic Latin block is accepted.
        public static boolean isPunctuation(char c) {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
            return
                    // punctuation, spacing, and formatting characters
                    ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                    // symbols and punctuation in the unified Chinese, Japanese and Korean script
                    || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                    // fullwidth character or a halfwidth character
                    || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                    // vertical glyph variants for east Asian compatibility
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                    // vertical punctuation for compatibility characters with the Chinese Standard GB 18030
                    || ub == Character.UnicodeBlock.VERTICAL_FORMS
                    // ascii
                    || ub == Character.UnicodeBlock.BASIC_LATIN;
        }
        
        // Extra characters accepted as harmless; adjust this table to your own logs.
        private static boolean isUserDefined(char c) {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
            return ub == Character.UnicodeBlock.NUMBER_FORMS
                    || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
                    || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
                    || c == '\ufeff'   // ZERO WIDTH NO-BREAK SPACE (byte order mark)
                    || c == '\u00a0';  // NO-BREAK SPACE
        }
        
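        // Returns true if more than 30% of the characters that remain after skipping
        // punctuation and whitelisted characters are not Chinese.
        public static boolean isMessy(String str) {
            int total = 0;       // characters left after skipping
            int nonChinese = 0;  // of those, characters outside the Han script
            for (int i = 0; i < str.length(); i++) {
                char c = str.charAt(i);
                if (isPunctuation(c) || isUserDefined(c)) {
                    continue;
                }
                if (!isChinese(c)) {
                    nonChinese++;
                }
                total++;
            }
            if (total == 0) {
                return false;    // nothing left to judge
            }
            return (double) nonChinese / total > 0.3;
        }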
        
    }
    

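    To build a more complete table of acceptable characters, the isUserDefined method is defined (the exact table depends on which characters appear in your logs). It adds three blocks, Number Forms, Enclosed Alphanumerics, and Letterlike Symbols, plus the characters U+00A0 (NO-BREAK SPACE) and U+FEFF (ZERO WIDTH NO-BREAK SPACE).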
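    One limitation of the isMessy method above is that it walks the string one char at a time, so an ideograph from Extension B or later is seen as two surrogate code units and would likely be counted as non-Chinese. The following is a sketch of a code-point-based variant, not the original implementation: the names `MessyTextSketch`, `isIgnorable`, and `THRESHOLD` are assumptions introduced here, and the whitelist simply merges the blocks used by isPunctuation and isUserDefined. It also returns false explicitly when nothing is left to judge.

```java
import java.nio.charset.StandardCharsets;

class MessyTextSketch {
    // Assumed threshold, taken from the 0.3 used in the code above.
    static final double THRESHOLD = 0.3;

    // Code points that are skipped before the ratio is computed.
    static boolean isIgnorable(int cp) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(cp);
        return ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || ub == Character.UnicodeBlock.VERTICAL_FORMS
                || ub == Character.UnicodeBlock.BASIC_LATIN
                || ub == Character.UnicodeBlock.NUMBER_FORMS
                || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
                || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
                || cp == 0xFEFF
                || cp == 0x00A0;
    }

    // True if the share of non-Han code points among the kept ones exceeds THRESHOLD.
    static boolean isMessy(String s) {
        int kept = 0;
        int nonHan = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 or 2 code units
            if (isIgnorable(cp)) {
                continue;
            }
            kept++;
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.HAN) {
                nonHan++;
            }
        }
        return kept > 0 && (double) nonHan / kept > THRESHOLD;
    }

    public static void main(String[] args) {
        // "Xperia", TRADE MARK SIGN, eight Han characters, ROMAN NUMERAL TWO
        String ok = "Xperia\u2122\u4e3b\u984c\u5929\u5929\u56db\u5ddd\u9ebb\u5c06\u2161";
        // Two Han characters whose UTF-8 bytes were wrongly decoded as ISO-8859-1
        String garbled = new String("\u4e2d\u6587".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(isMessy(ok));      // false
        System.out.println(isMessy(garbled)); // true
    }
}
```

    The garbled sample is produced in code rather than pasted in, which keeps the source file free of invisible control characters. Whether 0.3 is a good threshold depends on the data, as does the whitelist.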

    3. References

    [1] Wikipedia, Unicode block.
    [2] Tong Zeng, "Java 中文字符判断 中文标点符号判断" (Detecting Chinese characters and Chinese punctuation in Java).

  • Original article: https://www.cnblogs.com/en-heng/p/5320024.html