0.
When requests fetches baidu without a User-Agent set, r.headers['Content-Type'] is text/html. With a Chrome User-Agent it is text/html; charset=utf-8.
1. References
What is ISO-8859-1? It is also known as Latin-1, the "Western European" encoding.
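To illustrate why Latin-1 is a convenient but silent default, here is a minimal sketch in plain Python 3 (no requests needed; the variable names are only for illustration). It shows that ISO-8859-1 maps each of the 256 byte values to the code point with the same number, so decoding never fails, while UTF-8 rejects invalid byte sequences.

```python
# Every byte value 0..255 is a valid ISO-8859-1 character whose code point
# equals the byte value, so decoding arbitrary bytes never raises.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode('iso-8859-1')
code_points = [ord(ch) for ch in latin1_text]

# Python treats 'latin-1' as an alias for the same codec.
same_as_alias = latin1_text == all_bytes.decode('latin-1')

# UTF-8, by contrast, rejects byte sequences that are not valid UTF-8.
try:
    all_bytes.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(code_points == list(range(256)), same_as_alias, utf8_failed)
```

Because the decode always succeeds, a wrong ISO-8859-1 guess produces garbled text rather than an exception.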
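Patch:
import requests

# Written for Python 2, as in the referenced article.
def monkey_patch():
    prop = requests.models.Response.content

    def content(self):
        _content = prop.fget(self)
        # Only act when requests fell back to ISO-8859-1.
        if self.encoding == 'ISO-8859-1':
            # Prefer a charset declared inside the page itself.
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                # Otherwise fall back to detection.
                self.encoding = self.apparent_encoding
            # Re-encode the body as UTF-8 and cache it on the response.
            _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
        return _content

    requests.models.Response.content = property(content)

monkey_patch()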
2. Cause
In [291]: r = requests.get('http://cn.python-requests.org/en/latest/')

In [292]: r.headers.get('content-type')
Out[292]: 'text/html; charset=utf-8'

In [293]: r.encoding
Out[293]: 'utf-8'

In [294]: rc = requests.get('http://python3-cookbook.readthedocs.io/zh_CN/latest/index.html')

In [296]: rc.headers.get('content-type')
Out[296]: 'text/html'

In [298]: rc.encoding
Out[298]: 'ISO-8859-1'
The response text comes out garbled:
In [312]: rc.text
Out[312]: u' <!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>
<link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />
<link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95" href="genindex.html"/>
<link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>
<link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>
<link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>
<link rel="next" title

In [313]: rc.content
Out[313]: ' <!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>
<link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />
<link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95" href="genindex.html"/>
<link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>
<link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>
<link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>
<link rel="next" title=
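The output above shows the unicode text carrying the same \xNN values as the raw bytes, which is what decoding UTF-8 bytes as ISO-8859-1 produces. A minimal Python 3 sketch of the effect, using the word from the page title as sample data (the variable names are illustrative only):

```python
# UTF-8 bytes for the word that appears in the page title.
raw = '文档'.encode('utf-8')

# Decoding with the wrong codec does not fail; it yields one character per byte.
garbled = raw.decode('iso-8859-1')

# ISO-8859-1 is a lossless byte <-> code point mapping, so the damage is
# reversible: encode back to the original bytes, then decode as UTF-8.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(len(raw), len(garbled))      # both lengths are the byte count
print(ascii(garbled))              # the escaped form seen in rc.text
print(repaired == '文档')
```

The same round trip, text.encode('iso-8859-1').decode('utf-8'), should in principle recover an already garbled response text, assuming the page really is UTF-8.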
The encoding is judged to be 'ISO-8859-1' when three conditions hold at once: the response headers contain 'content-type', that value has no charset, and it contains 'text'.
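The three conditions can be sketched as a small standalone function. This is a simplified stand-in written for illustration, not requests' own code: guess_encoding is a hypothetical name, and it takes a plain dict with a lowercase key, whereas requests uses a case-insensitive header mapping.

```python
def guess_encoding(headers):
    """Hypothetical stand-in for the header rule; not requests' own code."""
    content_type = headers.get('content-type')
    if not content_type:
        return None                      # condition 1 fails: no Content-Type

    # Split "text/html; charset=utf-8" into the media type and its parameters.
    media_type, _, rest = content_type.partition(';')
    params = {}
    for item in rest.split(';'):
        key, sep, value = item.partition('=')
        if sep:
            params[key.strip().lower()] = value.strip().strip('\'"')

    if 'charset' in params:
        return params['charset']         # condition 2 fails: charset declared

    if 'text' in media_type:
        return 'ISO-8859-1'              # all three conditions hold

    return None                          # e.g. application/json


for value in ('text/html; charset=utf-8', 'text/html', 'application/json'):
    print(value, '->', guess_encoding({'content-type': value}))
```

Only the middle case, text/html with no charset, ends up as ISO-8859-1, which matches the python3-cookbook response above.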
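The referenced article says Python 3 is not affected, but in my tests it is.
C:\Program Files\Anaconda2\Lib\site-packages\requests\utils.py
Added 2018-01-02: with "Content-Type": "application/json", r.encoding is None (in the requests version examined here).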
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
C:\Program Files\Anaconda2\Lib\site-packages\requests\adapters.py
class HTTPAdapter(BaseAdapter):
    def build_response(self, req, resp):
        ...
        # Set encoding.
        response.encoding = get_encoding_from_headers(response.headers)
3. Solution
Apply the patch from the referenced article, or:
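Added 2018-01-02: change
if resp.encoding == 'ISO-8859-1':
to
if resp.encoding == 'ISO-8859-1' and 'ISO-8859-1' not in resp.headers.get('content-type', ''):
so that only the 'ISO-8859-1' that requests falls back to under the protocol default is handled, not one the server declared explicitly.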
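import requests

# resp is a requests.Response obtained earlier (Python 2, as in the rest of this note).
if resp.encoding == 'ISO-8859-1' and 'ISO-8859-1' not in resp.headers.get('content-type', ''):
    # Looks for a charset declared in the body (re.compile(r'<meta.*?charset ...);
    # requests' own source does not call this helper.
    encodings = requests.utils.get_encodings_from_content(resp.content)
    if encodings:
        resp.encoding = encodings[0]
    else:
        # models.py: chardet.detect(self.content)['encoding'], which is computationally expensive.
        # resp.text only does this on its own when self.encoding is None:
        #     encoding = self.apparent_encoding
        resp.encoding = resp.apparent_encoding
    print('ISO-8859-1 changed to %s' % resp.encoding)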
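For Python 3, where the body is bytes, the same decision order can be tried offline with the rough sketch below. It deliberately avoids requests so it runs anywhere. pick_encoding, META_CHARSET, the 4096-byte scan window and the 'utf-8' fallback are all assumptions made for illustration, not part of requests.

```python
import re

# Assumed pattern for a charset declared in the document body.
META_CHARSET = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.IGNORECASE)


def pick_encoding(content_type, body, fallback='utf-8'):
    """Hypothetical helper: choose an encoding for an HTML body (bytes)."""
    # 1. A charset in the Content-Type header wins, even an explicit ISO-8859-1.
    header_match = re.search(r'charset=["\']?([\w-]+)', content_type, re.IGNORECASE)
    if header_match:
        return header_match.group(1)

    # 2. Otherwise look near the top of the body for <meta ... charset=...>.
    meta_match = META_CHARSET.search(body[:4096])
    if meta_match:
        return meta_match.group(1).decode('ascii')

    # 3. Last resort: an assumed default instead of ISO-8859-1.
    return fallback


body = '<html><head><meta charset="utf-8"><title>文档</title></head></html>'.encode('utf-8')
encoding = pick_encoding('text/html', body)
print(encoding, body.decode(encoding) == body.decode('utf-8'))
```

With a real response, one might assign resp.encoding = pick_encoding(resp.headers.get('content-type', ''), resp.content) before reading resp.text. Using a fixed fallback instead of apparent_encoding trades detection accuracy for avoiding the chardet cost mentioned above.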