Handling Chinese characters in Python

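When I first started learning Python, I studied from translated English books and never ran into Chinese encoding problems. Only when I began using Python on real projects did I discover how painful Chinese encoding can be. Here is some of what I have learned. All examples below are Python 2.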

Reading a file stored as UTF-8

1. Suppose there is a file test.txt that contains the text "python学习" and is stored as UTF-8. The string can be read and printed like this:
```
filehandle=open("test.txt","r")
## the file is saved as utf-8 without bom
print filehandle.read().decode("utf-8").encode("gbk")
filehandle.close()
```
In the code above, decode("utf-8") decodes the UTF-8 bytes into a unicode object, and encode("gbk") then converts that object to GBK for output.
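
The two steps can also be tried without a file. The sketch below is only an illustration: the helper name utf8_to_gbk is made up for this example, and the text is written with \u escapes so the result does not depend on how the script itself is saved. It should run on Python 2.6+ as well as Python 3.

```python
def utf8_to_gbk(raw_bytes):
    """Decode UTF-8 bytes to unicode text, then re-encode the text as GBK."""
    text = raw_bytes.decode("utf-8")   # bytes -> unicode
    return text.encode("gbk")          # unicode -> GBK bytes


# u"python\u5b66\u4e60" is the sample text from test.txt, written with escapes.
utf8_bytes = u"python\u5b66\u4e60".encode("utf-8")
gbk_bytes = utf8_to_gbk(utf8_bytes)

# Each Chinese character takes 3 bytes in UTF-8 and 2 bytes in GBK.
print(len(utf8_bytes))
print(len(gbk_bytes))
```

The byte lengths differ even though the text is the same, which is why the bytes must be decoded with the encoding they were actually written in.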

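2. If test.txt is stored as UTF-8 with a BOM, it has to be read differently. This format inserts an invisible BOM (the bytes 0xEF 0xBB 0xBF) at the very start of the file, and the codecs module is needed to detect it. (notepad++ lets you save a file as UTF-8, UTF-8 without BOM, and several other formats.)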
```
import codecs

filehandle=open("test.txt","r")
## the file is saved as utf-8 with bom
content = filehandle.read()
## codecs.BOM_UTF8 is the 3-byte utf-8 BOM; codecs.BOM is the utf-16 one
if content[:3]==codecs.BOM_UTF8:
    content=content[3:]
print content.decode("utf-8")#.encode("gbk")
filehandle.close()
```
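Why is encode("gbk") not needed here? It puzzled me at first. When print receives a unicode object, Python 2 encodes it with the console encoding (sys.stdout.encoding) by itself, so the explicit encode step is optional when printing to a terminal.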
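
As an alternative to checking the first three bytes by hand, the standard "utf-8-sig" codec skips a leading BOM when one is present. The sketch below assumes that approach; the function name read_utf8_text and the temporary demo files are made up for this example. It uses io.open, so it should run on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def read_utf8_text(path):
    """Read a UTF-8 file as unicode text, dropping a leading BOM if present."""
    # "utf-8-sig" is the standard codec that skips an optional UTF-8 BOM.
    with io.open(path, "r", encoding="utf-8-sig") as handle:
        return handle.read()


# Demo: write the same text with and without a BOM, then read both back.
text = u"python\u5b66\u4e60"
for prefix in (b"", b"\xef\xbb\xbf"):
    fd, demo_path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as out:
        out.write(prefix + text.encode("utf-8"))
    assert read_utf8_text(demo_path) == text
    os.remove(demo_path)
```

Both files come back as the same unicode string, so the caller does not need to know whether the editor wrote a BOM.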

Reading a file stored as ANSI

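This case is very simple: no conversion is needed. On Chinese Windows, ANSI means the system code page (CP936, i.e. GBK), which is also what the console expects, so the bytes can be printed as they are.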
```
filehandle=open("test.txt","r")
## the file is saved as ANSI
content = filehandle.read()
print content
filehandle.close()
```
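
Printing the raw bytes only works while the file's code page matches the console's. When the text has to be processed rather than just echoed, it can be decoded explicitly. The sketch below is an illustration under one assumption: that "ANSI" here means CP936, which Python exposes as the "gbk" codec. The function name read_ansi_text and its codepage parameter are made up for this example.

```python
import io
import os
import tempfile


def read_ansi_text(path, codepage="gbk"):
    """Read a file saved in a legacy code page and return unicode text."""
    # Assumption: "gbk" (CP936) is the ANSI code page, as on Chinese Windows.
    with io.open(path, "r", encoding=codepage) as handle:
        return handle.read()


# Demo: write GBK bytes to a temporary file and read them back as unicode.
text = u"python\u5b66\u4e60"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(text.encode("gbk"))
decoded = read_ansi_text(demo_path)
os.remove(demo_path)
```

On a system with a different ANSI code page, the codepage argument would need to change accordingly.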
Hardcoded Chinese in a Python script
```
#!/usr/bin/env python
def main():
    s="python学习"
    print s

if __name__ == '__main__':
    main()
```
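Python 2's default encoding is ASCII (check it with sys.getdefaultencoding()), and a source file without an encoding declaration, like test.py above, is also read as ASCII. The Chinese characters are not ASCII, so Python refuses to run the file until its encoding is declared. The console encoding that print writes to is a separate setting (CP936 on Chinese Windows, see sys.stdout.encoding). Adding the line "# coding=utf-8" at the top of test.py is enough. It is best to save the file with a text editor such as notepad++ so that its real encoding matches the declaration; otherwise the output can be surprising.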
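
Both settings mentioned above can be inspected from a script. This is a small sketch; the function name describe_encodings is made up for this example, and the values it reports depend on the interpreter and the console in use.

```python
import sys


def describe_encodings():
    """Collect the encodings discussed above for the running interpreter."""
    return {
        # Used for implicit str <-> unicode conversions ("ascii" on Python 2).
        "default": sys.getdefaultencoding(),
        # Encoding of the console; may be None when output is redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in sorted(describe_encodings().items()):
    print("%s: %s" % (name, value))
```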

In short, it is best to avoid hardcoded strings in script source files, especially Chinese ones.

Concatenating ordinary strings with Chinese strings
```
# coding=utf-8

def main():
    s="python学习"+u"hello"
    print s


if __name__ == '__main__':
    main()
```
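When the strings are joined with the + operator here, the left operand is a str and the right operand is a unicode. Python converts the left str to unicode before joining it with the right side, and it decodes the str with the default ASCII encoding for that conversion, so the Chinese bytes can raise a UnicodeDecodeError. Possible fixes: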
```
# decode with the encoding the source file is saved in (utf-8 in this example)
s="python学习".decode("utf-8")+u"hello"
```
or
```
s="python学习"+u"hello".encode("utf-8")
```
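
The implicit conversion amounts to calling decode("ascii") on the byte string. The sketch below reproduces that step explicitly, so the failure and the fix can both be observed; the helper name join_as_unicode is made up for this example, and the text uses \u escapes so the file's own encoding does not matter. It should run on Python 2.6+ and Python 3.

```python
def join_as_unicode(raw_bytes, text, encoding="utf-8"):
    """Decode the byte string explicitly, then join two unicode values."""
    return raw_bytes.decode(encoding) + text


utf8_bytes = u"python\u5b66\u4e60".encode("utf-8")

# Decoding as ASCII, which is what the implicit conversion does,
# fails on the bytes of the Chinese characters.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the real encoding succeeds and yields unicode text.
joined = join_as_unicode(utf8_bytes, u"hello")
```

Decoding at the boundary and working with unicode everywhere else avoids relying on the ASCII default at all.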
String behavior consistent with Python 3

One last point: from Python 2.6 onward, the import below makes the ordinary string literals in a file unicode strings, so strings behave the same way as in Python 3.
```
from __future__ import unicode_literals
```
Original post: https://www.cnblogs.com/streakingBird/p/4040247.html
