Checking in Python whether a file is encoded as UTF-8 without a BOM

First, some background:

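1. BOM: Byte Order Mark

　　A BOM is a signature at the start of a file that tells an editor which encoding the file uses, which makes the encoding easier to detect. Editors do not display the BOM, but it is still part of the file's content, so programs that read the file see it as extra output, much like an unexpected blank line.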

　　Byte-order mark Description
　　　　EF BB BF UTF-8
　　　　FF FE UTF-16 aka UCS-2, little endian
　　　　FE FF UTF-16 aka UCS-2, big endian
　　　　FF FE 00 00 UTF-32 aka UCS-4, little endian.
　　　　00 00 FE FF UTF-32 aka UCS-4, big-endian.

　　So for UTF-8, if the file begins with the bytes EF BB BF, we know it has a BOM.
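
The signature table can be turned into a small lookup. This is a minimal sketch, not part of the original article: the helper name sniff_bom and the BOMS list are made up for illustration, while the BOM constants come from the standard codecs module. The longer UTF-32 signatures are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]

def sniff_bom(data):
    """Return the name of the BOM that the bytes `data` start with, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

In practice it is enough to pass the first four bytes of the file, read in binary mode, to sniff_bom.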

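2. Next, look at the UTF-8 encoding itself. UTF-8 is a variable-length encoding: each character takes one to four bytes, and it is backward compatible with ASCII.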

Unicode (U+)	UTF-8
U+00000000 - U+0000007F:	0xxxxxxx
U+00000080 - U+000007FF:	110xxxxx 10xxxxxx
U+00000800 - U+0000FFFF:	1110xxxx 10xxxxxx 10xxxxxx
U+00010000 - U+0010FFFF:	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

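　　In other words, UTF-8 defines a byte layout on top of Unicode code points. The fixed leading bits of each byte tell us whether it starts a 1-, 2-, 3-, or 4-byte sequence or continues one, so checking these bit patterns lets us decide whether the data is UTF-8.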
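
To make the bit patterns concrete, here is a small sketch that classifies a lead byte with the same masks as the table and prints the bits of a few encoded characters. The function name sequence_length and the sample characters are chosen for illustration only.

```python
def sequence_length(lead_byte):
    """Return the sequence length implied by a lead byte (an int 0..255),
    or 0 if the byte cannot start a UTF-8 sequence."""
    if lead_byte & 0x80 == 0x00:   # 0xxxxxxx
        return 1
    if lead_byte & 0xE0 == 0xC0:   # 110xxxxx
        return 2
    if lead_byte & 0xF0 == 0xE0:   # 1110xxxx
        return 3
    if lead_byte & 0xF8 == 0xF0:   # 11110xxx
        return 4
    return 0                       # 10xxxxxx (continuation byte) or 11111xxx

# One sample character from each row of the table.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001f600"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    print("U+%04X" % ord(ch), sequence_length(encoded[0]), bits)
```

Each printed line shows the code point, the length derived from the first byte, and the encoded bytes in binary, so the leading 0, 110, 1110, 11110 and 10 prefixes are visible directly.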

With the basics covered, let's look at the implementation.

#!/usr/bin/env python3
#coding:utf-8
# Written for Python 3: the file is read in binary mode, so each byte is an int.
import sys, codecs

def detectUTF8(file_name):
    state = 0      # number of continuation bytes still expected
    line_num = 0
    # Binary mode gives the raw bytes; text mode would decode them first.
    with open(file_name, 'rb') as file_obj:
        all_lines = file_obj.readlines()
    for line in all_lines:
        line_num += 1
        for byte in line:
            if state == 0:
                if byte & 0x80 == 0x00:    # first row of the table above: 0xxxxxxx
                    state = 0
                elif byte & 0xE0 == 0xC0:  # second row: 110xxxxx
                    state = 1
                elif byte & 0xF0 == 0xE0:  # third row: 1110xxxx
                    state = 2
                elif byte & 0xF8 == 0xF0:  # fourth row: 11110xxx
                    state = 3
                else:
                    print("%s isn't a utf8 file, line: %d" % (file_name, line_num))
                    sys.exit(1)
            else:
                if byte & 0xC0 != 0x80:    # continuation bytes must be 10xxxxxx
                    print("%s isn't a utf8 file, line: %d" % (file_name, line_num))
                    sys.exit(1)
                state -= 1
    if state != 0:  # the file ends in the middle of a multi-byte sequence
        print("%s isn't a utf8 file, truncated sequence in line: %d" % (file_name, line_num))
        sys.exit(1)
    if existBOM(file_name):
        print("%s isn't a standard utf8 file, it includes a BOM header." % file_name)
        sys.exit(1)

def existBOM(file_name):
    with open(file_name, 'rb') as file_obj:
        code = file_obj.read(3)
    return code == codecs.BOM_UTF8  # does the file start with EF BB BF?

if __name__ == "__main__":
    file_name = 'code.txt'
    detectUTF8(file_name)

That's about it. Once you are familiar with the encoding format, the Python implementation is not hard.
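
As a cross-check, the same question can be answered with Python 3's built-in strict UTF-8 decoder instead of a hand-written state machine. This is an alternative technique, not the approach used above, and the function name is_utf8_without_bom is made up for this sketch. The built-in decoder is stricter than a check based only on the bit patterns in the table: it also rejects overlong encodings and encoded surrogates.

```python
import codecs

def is_utf8_without_bom(data):
    """Return True if the bytes `data` are valid UTF-8 and do not start with a BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return False
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

To check a file, read it in binary mode and pass the bytes, for example is_utf8_without_bom(open('code.txt', 'rb').read()).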

PS: Encoding in Python is a real pain, and the behavior differs between versions. Importing other modules can also cause encoding problems...


Original article: https://www.cnblogs.com/ferraborghini/p/4951102.html