• Filtering four-byte and six-byte special characters in Java


    In Java 7, you can write:

    source.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "*");

    In both Java 6 and Java 7, you can write:

    source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");
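
As a minimal sketch of the second form in use, the class below masks each astral code point with a single `*`. The names `AstralMask` and `mask` are made up for illustration, and the behaviour assumes Oracle's or OpenJDK's regex implementation.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
class AstralMask {
    // Replaces every astral code point and every lone surrogate with "*".
    static String mask(String source) {
        return source.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "*");
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600: one code point, two Java chars
        // An astral code point occupies four bytes in UTF-8.
        System.out.println(emoji.getBytes(Charset.forName("UTF-8")).length);
        // The surrogate pair is matched once, as a single code point.
        System.out.println(mask("a" + emoji + "b"));
    }
}
```

On an Oracle or OpenJDK runtime this should print `4` followed by `a*b`: the two chars of the pair are replaced by one `*`, not two.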

    Matching characters in astral planes (code points U+10000 to U+10FFFF) has been an under-documented feature in Java regex.

    This answer mainly deals with Oracle's implementation (reference implementation, which is also used in OpenJDK) for Java version 6 and above.

    Please test the code yourself if you happen to use GNU Classpath or Android, since they use their own implementation.

    Behind the scenes

    Assuming that you are running your regex on Oracle's implementation, your regex

    "([\ud800-\udbff\udc00-\udfff])"
    

    is compiled as such:

    StartS. Start unanchored match (minLength=1)
    java.util.regex.Pattern$GroupHead
    Pattern.union. A ∪ B:
      Pattern.union. A ∪ B:
        Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
        BitClass. Match any of these 1 character(s):
          [U+002D]
      SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
    java.util.regex.Pattern$GroupTail
    java.util.regex.Pattern$LastNode
    Node. Accept match
    

    The character class is parsed as \ud800-\udbff\udc00, -, \udfff. Since \udbff\udc00 forms a valid surrogate pair, it represents the code point U+10FC00.
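
The parse described above can be checked with a short sketch. The class name `HyphenDemo` and the helper `matches` are hypothetical, and the results assume Oracle's or OpenJDK's implementation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class HyphenDemo {
    // The character class from the question, written with string-literal escapes.
    static final Pattern P = Pattern.compile("([\ud800-\udbff\udc00-\udfff])");

    static boolean matches(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("-"));      // the literal hyphen-minus left over in the class
        System.out.println(matches("\ue000")); // a BMP character inside U+D800..U+10FC00
        System.out.println(matches("a"));      // outside every part of the class
    }
}
```

The first two calls are expected to report a match and the third is not, which is why the regex unexpectedly finds a hyphen-minus.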

    Wrong solution

    There is no point in writing:

    "[\ud800-\udbff][\udc00-\udfff]"
    

    Since Oracle's implementation matches by code point, and a valid surrogate pair is converted to a code point before matching, the regex above can't match anything: it searches for 2 consecutive lone surrogates that together form a valid pair, which can never occur.
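
A small sketch of this failure, assuming Oracle's or OpenJDK's implementation (the names `TwoClassDemo` and `findsPair` are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class TwoClassDemo {
    // One class for the high surrogate followed by one class for the low surrogate.
    static final Pattern P = Pattern.compile("[\ud800-\udbff][\udc00-\udfff]");

    static boolean findsPair(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as a valid surrogate pair, so the engine reads it
        // as one code point and the first class never sees a lone high surrogate.
        System.out.println(findsPair("\uD83D\uDE00"));
    }
}
```

The call is expected to report no match, even though the input consists of exactly one high surrogate followed by one low surrogate.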

    Solution

    If you want to match and remove all code points above U+FFFF in the astral planes (formed by a valid surrogate pair), plus the lone surrogates (which can't form a valid surrogate pair), you should write:

    input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
    

    This solution has been tested to work in Java 6 and 7 (Oracle implementation).

    The regex above compiles to:

    StartS. Start unanchored match (minLength=1)
    Pattern.union. A ∪ B:
      Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
      Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
    java.util.regex.Pattern$LastNode
    Node. Accept match
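
The compiled structure is a union of two code point ranges, so the same filter can be sketched without a regex. The class `AstralStrip` and both method names are hypothetical; the loop is an illustration of the two ranges, not the regex engine's actual code.

```java
// Hypothetical helper, for illustration only.
class AstralStrip {
    // The regex from the solution: astral code points plus lone surrogates.
    static String stripWithRegex(String input) {
        return input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
    }

    // The same two ranges, tested by hand on each code point.
    static String stripByCodePoint(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt combines a valid pair and returns a lone surrogate as-is.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            boolean astral = cp >= 0x10000 && cp <= 0x10FFFF;
            boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (!astral && !loneSurrogate) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An astral code point, a lone high surrogate and a lone low surrogate.
        String input = "a\uD83D\uDE00b\uD800c\uDFFF";
        System.out.println(stripWithRegex(input));
        System.out.println(stripByCodePoint(input));
    }
}
```

Both calls should leave only `abc`, since the surrogate pair and the two lone surrogates all fall inside one of the two ranges.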
    

    Note that I am specifying the characters with string-literal Unicode escape sequences, not with the escape sequences of the regex syntax.

    // Only works in Java 7 (fails in Java 6)
    input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "");
    

    Java 6 doesn't recognize surrogate pairs when they are specified with regex syntax, so the regex engine treats \\ud800 as one character and tries to compile the range \\udc00-\\udbff, where it fails. We are lucky that it throws an exception for this input; otherwise, the error would go undetected. Java 7 parses this regex correctly and compiles it to the same structure as above.
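
A minimal sketch of the regex-syntax form, with a guard for runtimes that reject it. The class name `RegexEscapeDemo` is hypothetical, and catching `PatternSyntaxException` assumes that this is the exception type raised on Java 6.

```java
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, for illustration only.
class RegexEscapeDemo {
    static String strip(String input) {
        // Doubled backslashes: the regex engine, not the Java compiler, reads these escapes.
        return input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "");
    }

    public static void main(String[] args) {
        try {
            System.out.println(strip("a\uD83D\uDE00b"));
        } catch (PatternSyntaxException e) {
            // Expected on Java 6, where the escaped pair is not combined
            // into one code point and the range becomes illegal.
            System.out.println("pattern rejected: " + e.getDescription());
        }
    }
}
```

On Java 7 and later the pattern should compile and leave `ab`; on Java 6 the catch branch is the one expected to run.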


    Since Java 7, the syntax \x{h..h} has been available for specifying characters beyond the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in astral planes.

    input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");
    

    This regex also compiles to the same structure as above.
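
The `\x{h..h}` form can be exercised the same way. This is a sketch that assumes a Java 7 or later runtime; the class name `HexEscapeDemo` is hypothetical.

```java
// Hypothetical demo class, for illustration only.
class HexEscapeDemo {
    static String strip(String input) {
        // The astral range uses the regex \x{h..h} syntax; the lone-surrogate
        // range still uses string-literal Unicode escapes.
        return input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");
    }

    public static void main(String[] args) {
        // One astral code point (U+1F600) and one lone high surrogate.
        System.out.println(strip("a\uD83D\uDE00b\uDBFFc"));
    }
}
```

Spelling the astral range as code point values avoids having to work out the surrogate pairs for U+10000 and U+10FFFF by hand.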

     

    Reposted from: http://stackoverflow.com/questions/27820971/why-a-surrogate-java-regexp-finds-hypen-minus

  • Original article: https://www.cnblogs.com/nizuimeiabc1/p/6825681.html