On emoji and UTF-16 encoding

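Yesterday a colleague on the iOS team ran into a tricky problem: when a text field contains emoji, how do you get the number of characters in it (with each emoji counting as one character)?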

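Let me start with Java, which I have been working with recently. In Java, String's length method works fine for ordinary Chinese and English characters. But if a character's Unicode code point is greater than 0xFFFF, length no longer returns the correct character count: it counts such a character as 2, because it reports UTF-16 code units rather than code points. Java does, of course, already provide a method that solves this: codePointCount.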
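As a small illustration of the difference, here is a minimal sketch. The class name `LengthVsCodePoints` and its two helper methods are made up for this example; only `String.length()` and `String.codePointCount(int, int)` are standard Java API.

```java
final class LengthVsCodePoints {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a well-formed surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "ab" followed by U+1F600 (grinning face), written as its surrogate pair.
        String text = "ab\uD83D\uDE00";
        System.out.println(codeUnits(text));  // prints 4
        System.out.println(codePoints(text)); // prints 3
    }
}
```

Note that this counts code points, not what a user perceives as one character: an emoji built from several code points (for example with a skin-tone modifier) would still count as more than one.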

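Unfortunately, after a long search, I could not find a comparable method in Objective-C. (It seems that after splitting the string into substrings, the length of the resulting array is the exact character count; this still needs to be verified.)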

I am not an iOS developer, so for now I cannot offer a solution in Objective-C. Still, yesterday's digging turned up a few findings worth sharing.

1. Most emoji have Unicode code points greater than 0xFFFF, so they take 4 bytes in UTF-16. Only a small number of emoji have code points below 0xFFFF, and those take 2 bytes in UTF-16.

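2. Whether on Android or iOS, a string read from a text field is handled as a sequence of UTF-16 code units by default. (Byte order, big-endian or little-endian, only becomes relevant once those code units are serialized to bytes.)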

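3. For reference, here are the UTF-16 encoding rules, quoted from RFC 2781, section 2.1 (once these rules are understood, solving the code point count problem yourself on iOS becomes straightforward):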

   1) If U < 0x10000, encode U as a 16-bit unsigned integer and
      terminate.

   2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
      U' must be less than or equal to 0xFFFFF. That is, U' can be
      represented in 20 bits.

   3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
      0xDC00, respectively. These integers each have 10 bits free to
      encode the character value, for a total of 20 bits.

   4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
      bits of W1 and the 10 low-order bits of U' to the 10 low-order
      bits of W2. Terminate.

   Graphically, steps 2 through 4 look like:
   U' = yyyyyyyyyyxxxxxxxxxx
   W1 = 110110yyyyyyyyyy
   W2 = 110111xxxxxxxxxx
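
The steps above can be sketched in code. The following is a minimal illustration in Java, not an iOS solution: the class `Utf16Sketch` and its methods `encode` and `countCodePoints` are hypothetical names introduced here. `encode` follows steps 1 through 4 literally, and `countCodePoints` uses the resulting bit patterns to count code points by skipping every W2 that directly follows a W1.

```java
final class Utf16Sketch {
    // Encodes one code point U into UTF-16 code units, following steps 1-4.
    static char[] encode(int u) {
        if (u < 0 || u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + u);
        }
        if (u < 0x10000) {
            return new char[] { (char) u };            // step 1: one 16-bit unit
        }
        int uPrime = u - 0x10000;                      // step 2: fits in 20 bits
        char w1 = (char) (0xD800 | (uPrime >> 10));    // high 10 bits go into W1
        char w2 = (char) (0xDC00 | (uPrime & 0x3FF));  // low 10 bits go into W2
        return new char[] { w1, w2 };
    }

    // Counts code points in a UTF-16 sequence: every unit counts once,
    // except a W2 (0xDC00-0xDFFF) that directly follows a W1 (0xD800-0xDBFF).
    static int countCodePoints(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean isW2 = c >= 0xDC00 && c <= 0xDFFF;
            boolean followsW1 = i > 0
                    && s.charAt(i - 1) >= 0xD800 && s.charAt(i - 1) <= 0xDBFF;
            if (!(isW2 && followsW1)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600); // grinning face, above 0xFFFF
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // prints D83D DE00
        System.out.println(encode(0x263A).length);              // prints 1 (emoji below 0x10000)
        System.out.println(countCodePoints("ab" + new String(pair))); // prints 3
    }
}
```

The counting loop only needs the length of the string and access to individual 16-bit units, so the same idea should carry over to Objective-C using NSString's `length` and `characterAtIndex:`, although that port is untested here.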

Original post: https://www.cnblogs.com/shenzhigang/p/5015113.html