Code
import chardet

def toUni(text):
    # Decode a byte string to unicode; return the input unchanged on failure.
    result = text
    try:
        charstyle = chardet.detect(text)
        # print('confidence:', charstyle['confidence'])  # detection confidence
        # The original branched on GB2312 / gbk / utf-8 / other, but every
        # branch made the same call, so a single decode covers all of them.
        result = text.decode(charstyle['encoding'], 'replace')
    except Exception as e:
        # The result variable used to be named `str`, which shadowed the
        # built-in and made str(e) fail inside this handler.
        print('[changeToUni.except] %s' % str(e))
        result = text
    return result
One more thing: this is very time-consuming. Detection on a typical web page takes 1-3 seconds, which is rarely worth the cost.
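One possible way to reduce that cost, offered only as a sketch and not as something measured here: try a strict decode with a couple of likely encodings first, and call chardet only when those fail, on a prefix of the page instead of the whole page. The function name `to_unicode_fast`, the candidate list, the `sample_size` default, and the injectable `detect` parameter are all assumptions made for illustration. Python 3 is assumed.

```python
def to_unicode_fast(data, candidates=("utf-8", "gb18030"),
                    sample_size=4096, detect=None):
    """Decode bytes to text, calling the slow detector only as a last resort.

    Hypothetical helper: the name, defaults and parameters are illustrative.
    """
    # Cheap path: a strict decode either succeeds or raises immediately.
    # gb18030 is a superset of GBK and GB2312, so it covers both.
    for encoding in candidates:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue

    # Slow path: fall back to statistical detection.
    if detect is None:
        import chardet  # imported lazily; only needed on this path
        detect = chardet.detect

    # Assumption: a prefix of the page is enough for a usable guess.
    guess = detect(data[:sample_size]).get("encoding") or "utf-8"
    return data.decode(guess, "replace")
```

The order of `candidates` matters: UTF-8 is strict enough that non-UTF-8 input is usually rejected, while gb18030 accepts many byte sequences and should therefore come later. Pages in other legacy encodings may be decoded incorrectly by the cheap path, so the list should match the sites being crawled.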