为什么会有ISO-8859-1这样的字符集编码
requests会从服务器返回的响应头的 Content-Type 去获取字符集编码,如果content-type有charset字段那么requests才能正确识别编码,否则就使用默认的 ISO-8859-1. 一般那些不规范的页面往往有这样的问题.
equestsutils.py
def get_encoding_from_headers(headers): """Returns encodings from given HTTP Header Dict. :param headers: dictionary to extract encoding from. :rtype: str """ content_type = headers.get('content-type') if not content_type: return None content_type, params = cgi.parse_header(content_type) if 'charset' in params: return params['charset'].strip("'"") if 'text' in content_type: return 'ISO-8859-1'
如何获取正确的编码
requests的返回结果对象里有个apparent_encoding函数, apparent_encoding通过调用chardet.detect()来识别文本编码. 但是需要注意的是,这有些消耗计算资源.
equestsmodels.py
@property def apparent_encoding(self): """The apparent encoding, provided by the chardet library.""" return chardet.detect(self.content)['encoding']
requests的text() 跟 content() 有什么区别?
requests在获取网络资源后,我们可以通过两种模式查看内容。 一个是r.text,另一个是r.content,那他们之间有什么区别呢?
分析requests的源代码发现,r.text返回的是处理过的Unicode型的数据,而使用r.content返回的是bytes型的原始数据。也就是说,r.content相对于r.text来说节省了计算资源,r.content是把内容bytes返回. 而r.text是decode成Unicode. 如果headers没有charset字符集的化,text()会调用chardet来计算字符集,这又是消耗cpu的事情.
通过看requests代码来分析text() content()的区别.
# r.text @property def text(self): """Content of the response, in unicode. If Response.encoding is None, encoding will be guessed using ``chardet``. The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set ``r.encoding`` appropriately before accessing this property. """ # Try charset from content-type content = None encoding = self.encoding if not self.content: return str('') # Fallback to auto-detected encoding. if self.encoding is None: encoding = self.apparent_encoding # Decode unicode from given encoding. try: content = str(self.content, encoding, errors='replace') except (LookupError, TypeError): # A LookupError is raised if the encoding was not found which could # indicate a misspelling or similar mistake. # # A TypeError can be raised if encoding is None # # So we try blindly encoding. content = str(self.content, errors='replace') return content
# Content
@property def content(self): """Content of the response, in bytes.""" if self._content is False: # Read the contents. if self._content_consumed: raise RuntimeError( 'The content for this response was already consumed') if self.status_code == 0 or self.raw is None: self._content = None else: self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes() self._content_consumed = True # don't need to release the connection; that's been handled by urllib3 # since we exhausted the data. return self._content
requests中文乱码解决方法
方法一: 直接encode成utf-8格式.
r.content.decode(r.encoding).encode('utf-8')
r.encoding = 'utf-8'
方法二:如果headers头部没有charset,那么就从html的meta中抽取.