• UTF-8 and UTF-16 Encoding


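    For the history of Unicode, go straight to unicode.org if your English is good; if not, head over to dengyunze's summary, 《关于UTF8,UTF16,UTF32,UTF16-LE,UTF16-BE》 ("On UTF8, UTF16, UTF32, UTF16-LE, UTF16-BE"), which explains it clearly. In short: UTF-8 and UTF-16 can both represent every Unicode character. But since "most" characters (a "most" that Western tech geeks came up with off the top of their heads, of course~) need only 1 byte, spending at least 2 bytes on each of them in UTF-16 is wasteful, so they went with UTF-8. The "minority" characters (Chinese, Japanese and Korean, for example) are still representable in UTF-8; they just take more bytes there (usually 3), whereas UTF-16 stores most of them in 2.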

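    UTF-8 and UTF-16 are both variable-length encodings, so why would Western tech geeks find UTF-16 wasteful? Because English text fits in the ASCII range 0x00 - 0x7F (the rest of Latin-1, 0x80 - 0xFF, already needs 2 bytes in UTF-8). The smallest unit in UTF-8 is 1 byte, while the smallest unit in UTF-16 is 2 bytes, so... (↓ see "code unit size" in the figure below)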
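
    To make the size difference concrete, here is a small sketch that counts the encoded bytes for a few code points using only the JDK. The class name `CodeUnitSizeDemo` and the sample code points are my own choices for illustration, not something from the original post.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitSizeDemo {
    // Number of bytes the code point occupies when encoded as UTF-8.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16; the BE variant is used so that no BOM is prepended.
    static int utf16Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // ASCII letter, Latin-1 letter, CJK ideograph, supplementary-plane emoji
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    cp, utf8Bytes(cp), utf16Bytes(cp));
        }
    }
}
```

    ASCII text is half the size in UTF-8, while a typical CJK character is 3 bytes in UTF-8 against 2 in UTF-16, so which encoding is "wasteful" depends on the text.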

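    Note: in the figure above, UTF-16 and UTF-16LE look the same because the environment that produced it treats plain UTF-16 as little-endian (common on Windows). This is not universal: the Unicode standard says that UTF-16 data without a BOM should be assumed big-endian unless a higher-level protocol says otherwise.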
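
    What plain "UTF-16" means in terms of byte order depends on the tool. As a quick check on one platform, the sketch below dumps the bytes Java produces for U+4E2D under its three UTF-16 charsets. `ByteOrderDemo` and its `hex` helper are names made up for this example.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "4E 2D".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // the single CJK character U+4E2D
        // Java's "UTF-16" encoder writes a big-endian BOM (FE FF), then big-endian data.
        System.out.println("UTF-16   : " + hex(s.getBytes(StandardCharsets.UTF_16)));
        System.out.println("UTF-16BE : " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE : " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

    On the JDK this prints `FE FF 4E 2D`, `4E 2D` and `2D 4E` respectively, so in Java plain UTF-16 is big-endian with a BOM; other tools may pick little-endian.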

    So how does UTF-8 actually represent a character? ↓ See the figure below.

    ↓↓ An example

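    The layout matches the figure two above. A first byte that starts with 0 is a complete one-byte character. Otherwise, the number of leading 1s before the first 0 in the first byte is the total number of bytes in the sequence, so that many minus one continuation bytes follow, each starting with 10. The original UTF-8 design could carry values of up to 31 bits (sequences of up to 6 bytes), but RFC 3629 limits UTF-8 to 4 bytes, which covers the whole Unicode range up to U+10FFFF (21 bits).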
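
    The bit layout described above can be sketched as a hand-written encoder. This is a minimal illustration, assuming the input is a valid code point in U+0000..U+10FFFF that is not a surrogate; `Utf8Sketch`, `encode` and `sequenceLength` are hypothetical names, and real code should simply use the JDK's charset support.

```java
class Utf8Sketch {
    // Encodes one code point as UTF-8 (assumes a valid, non-surrogate code point).
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};                      // 0xxxxxxx
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                      // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))};                   // 10xxxxxx
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                     // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),             // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))};                   // 10xxxxxx
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                     // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),            // 10xxxxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),             // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))};                   // 10xxxxxx
        }
    }

    // Total sequence length announced by a lead byte (its count of leading 1 bits).
    static int sequenceLength(int leadByte) {
        int b = leadByte & 0xFF;
        if ((b & 0x80) == 0x00) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        throw new IllegalArgumentException("not a UTF-8 lead byte: " + b);
    }

    public static void main(String[] args) {
        for (byte b : encode(0x4E2D)) {
            System.out.printf("%02X ", b & 0xFF);               // E4 B8 AD
        }
        System.out.println();
    }
}
```

    For U+4E2D the lead byte is 0xE4 (binary 1110 0100): three leading 1s, so the sequence is three bytes long.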

    The UTF-16 encoding process, for a code point v above 0xFFFF:

    v  = 0x64321
    v′ = v - 0x10000
       = 0x54321
       = 0101 0100 0011 0010 0001
    vh = v′ >> 10
       = 01 0101 0000 // higher 10 bits of v′
    vl = v′ & 0x3FF
       = 11 0010 0001 // lower  10 bits of v′
    w1 = 0xD800 + vh
       = 1101 1000 0000 0000
       +        01 0101 0000
       = 1101 1001 0101 0000
       = 0xD950 // first code unit of UTF-16 encoding
    w2 = 0xDC00 + vl
       = 1101 1100 0000 0000
       +        11 0010 0001
       = 1101 1111 0010 0001
       = 0xDF21 // second code unit of UTF-16 encoding
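
    The same arithmetic can be written as a few lines of Java and checked against the JDK. This is only a sketch of the steps above; `Utf16Sketch`, `toSurrogates` and `fromSurrogates` are names invented for the example.

```java
class Utf16Sketch {
    // Splits a supplementary code point (0x10000..0x10FFFF) into a surrogate pair.
    static char[] toSurrogates(int v) {
        if (v < 0x10000 || v > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int vPrime = v - 0x10000;                       // 20-bit value
        char w1 = (char) (0xD800 + (vPrime >> 10));     // higher 10 bits
        char w2 = (char) (0xDC00 + (vPrime & 0x3FF));   // lower 10 bits
        return new char[] {w1, w2};
    }

    // Inverse operation: rebuilds the code point from a surrogate pair.
    static int fromSurrogates(char w1, char w2) {
        return (((w1 - 0xD800) << 10) | (w2 - 0xDC00)) + 0x10000;
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x64321);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D950 DF21
        // The JDK's own conversion should agree with the hand computation.
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x64321)));
    }
}
```

    `Character.toChars` and `Character.toCodePoint` do the same job in production code; the hand-rolled version is just there to mirror the derivation.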
    

    Attached is a Java version of the conversion between UTF-8 and UTF-16; the code comes from Lucene 3.6.

    /**
    	 * Interprets the given byte array as UTF-8 and converts to UTF-16. The
    	 * {@link CharsRef} will be extended if it doesn't provide enough space to
    	 * hold the worst case of each byte becoming a UTF-16 codepoint.
    	 * <p>
    	 * NOTE: Full characters are read, even if this reads past the length passed
    	 * (and can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is
    	 * passed). Explicit checks for valid UTF-8 are not performed.
    	 */
    	// TODO: broken if chars.offset != 0
    	public static void UTF8toUTF16(byte[] utf8, int offset, int length,
    			CharsRef chars) {
    		int out_offset = chars.offset = 0;
    		final char[] out = chars.chars = ArrayUtil.grow(chars.chars, length);
    		final int limit = offset + length;
    		while (offset < limit) {
    			int b = utf8[offset++] & 0xff;
    			if (b < 0xc0) {
    				assert b < 0x80;
    				out[out_offset++] = (char) b;
    			} else if (b < 0xe0) {
    				out[out_offset++] = (char) (((b & 0x1f) << 6) + (utf8[offset++] & 0x3f));
    			} else if (b < 0xf0) {
    				out[out_offset++] = (char) (((b & 0xf) << 12)
    						+ ((utf8[offset] & 0x3f) << 6) + (utf8[offset + 1] & 0x3f));
    				offset += 2;
    			} else {
    				assert b < 0xf8 : "b=" + b;
    				int ch = ((b & 0x7) << 18) + ((utf8[offset] & 0x3f) << 12)
    						+ ((utf8[offset + 1] & 0x3f) << 6)
    						+ (utf8[offset + 2] & 0x3f);
    				offset += 3;
    				if (ch < UNI_MAX_BMP) {
    					out[out_offset++] = (char) ch;
    				} else {
    					int chHalf = ch - 0x0010000;
    					out[out_offset++] = (char) ((chHalf >> 10) + 0xD800);
    					out[out_offset++] = (char) ((chHalf & HALF_MASK) + 0xDC00);
    				}
    			}
    		}
    		chars.length = out_offset - chars.offset;
    	}
    
     /** Encode characters from a char[] source, starting at
       *  offset for length chars. After encoding, result.offset will always be 0.
       */
     public static void UTF16toUTF8(final char[] source, final int offset, final int length, BytesRef result) {
    
        int upto = 0;
        int i = offset;
        final int end = offset + length;
        byte[] out = result.bytes;
        // Pre-allocate for worst case 4-for-1
        final int maxLen = length * 4;
        if (out.length < maxLen)
          out = result.bytes = new byte[maxLen];
        result.offset = 0;
    
        while(i < end) {
          
          final int code = (int) source[i++];
    
          if (code < 0x80)
            out[upto++] = (byte) code;
          else if (code < 0x800) {
            out[upto++] = (byte) (0xC0 | (code >> 6));
            out[upto++] = (byte)(0x80 | (code & 0x3F));
          } else if (code < 0xD800 || code > 0xDFFF) {
            out[upto++] = (byte)(0xE0 | (code >> 12));
            out[upto++] = (byte)(0x80 | ((code >> 6) & 0x3F));
            out[upto++] = (byte)(0x80 | (code & 0x3F));
          } else {
            // surrogate pair
            // confirm valid high surrogate
            if (code < 0xDC00 && i < end) {
              int utf32 = (int) source[i];
              // confirm valid low surrogate and write pair
              if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) { 
                utf32 = (code << 10) + utf32 + SURROGATE_OFFSET;
                i++;
                out[upto++] = (byte)(0xF0 | (utf32 >> 18));
                out[upto++] = (byte)(0x80 | ((utf32 >> 12) & 0x3F));
                out[upto++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
                out[upto++] = (byte)(0x80 | (utf32 & 0x3F));
                continue;
              }
            }
            // replace unpaired surrogate or out-of-order low surrogate
            // with substitution character
            out[upto++] = (byte) 0xEF;
            out[upto++] = (byte) 0xBF;
            out[upto++] = (byte) 0xBD;
          }
        }
        //assert matches(source, offset, length, out, upto);
        result.length = upto;
      }
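
    The excerpt uses three constants it does not show: `UNI_MAX_BMP`, `HALF_MASK` and `SURROGATE_OFFSET`. The sketch below fills them in with values inferred from how the excerpt uses them (they are my reconstruction, not copied from the Lucene source) and checks the surrogate arithmetic plus a JDK round trip. `ConversionCheck`, `combine` and `roundTrip` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class ConversionCheck {
    // Assumed values, inferred from usage in the excerpt above.
    static final int UNI_MAX_BMP = 0xFFFF;   // last code point of the BMP
    static final int HALF_MASK = 0x3FF;      // lower 10 bits
    // Folds "subtract 0xD800 and 0xDC00, add 0x10000" into a single constant.
    static final int SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

    // Combines a surrogate pair the same way UTF16toUTF8 does in the excerpt.
    static int combine(char high, char low) {
        return (high << 10) + low + SURROGATE_OFFSET;
    }

    // UTF-16 (Java String) -> UTF-8 bytes -> UTF-16 again, using only the JDK.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Surrogate pair from the worked example: D950 DF21 -> 0x64321
        System.out.println(Integer.toHexString(combine('\uD950', '\uDF21')));
        String s = "A\u00E9\u4E2D" + new String(Character.toChars(0x64321));
        System.out.println(roundTrip(s).equals(s));
    }
}
```

    If the Lucene methods are available on your classpath, comparing their output with `String.getBytes(StandardCharsets.UTF_8)` on the same input is a cheap sanity test.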
  • Original post: https://www.cnblogs.com/vivizhyy/p/3394868.html