• requests: a 'Content-Type': 'text/html' header makes the encoding be guessed as 'ISO-8859-1', so Chinese text is decoded incorrectly


    0.

    When requests fetches baidu without a User-Agent set, r.headers['Content-Type'] is text/html. With a Chrome UA the server returns Content-Type: text/html; charset=utf-8.

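    1. References

    Code analysis of the Chinese encoding problem in the Python requests library

    What is ISO-8859-1? It is also known as Latin-1, or "Western European".

    Patch: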

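    import requests

    def monkey_patch():
        # Keep a handle on the original Response.content property.
        prop = requests.models.Response.content

        def content(self):
            _content = prop.fget(self)
            # 'ISO-8859-1' is what requests falls back to for text/* without a charset.
            if self.encoding == 'ISO-8859-1':
                # Python 2 code: _content is a byte str here.
                encodings = requests.utils.get_encodings_from_content(_content)
                if encodings:
                    self.encoding = encodings[0]
                else:
                    self.encoding = self.apparent_encoding
                # Transcode the body to UTF-8 and cache it.
                _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
                self._content = _content
                # The cached body is now UTF-8, so later decoding must use UTF-8 too.
                self.encoding = 'utf8'
            return _content

        requests.models.Response.content = property(content)

    monkey_patch()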
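The patch above works by replacing a property on the class while still calling the original getter through `fget`. The following minimal sketch shows only that mechanism, without requests installed. `FakeResponse` and `install_patch` are hypothetical names for illustration, and the sketch simply assumes the body is UTF-8 instead of detecting the encoding.

```python
class FakeResponse:
    """Hypothetical stand-in for requests.models.Response."""

    def __init__(self, body, encoding):
        self._content = body
        self.encoding = encoding

    @property
    def content(self):
        return self._content


def install_patch(cls):
    # Accessing the attribute on the class returns the property object itself.
    original = cls.content

    def content(self):
        body = original.fget(self)  # call the original getter
        if self.encoding == 'ISO-8859-1':
            # Assumption for this sketch only: the body is really UTF-8.
            self.encoding = 'utf-8'
        return body

    # Rebind the attribute on the class so every instance sees the wrapper.
    cls.content = property(content)


install_patch(FakeResponse)
resp = FakeResponse('文档'.encode('utf-8'), 'ISO-8859-1')
body = resp.content                    # triggers the wrapper, which fixes resp.encoding
decoded = body.decode(resp.encoding)
print(decoded)
```

Note that the encoding is only corrected once `content` has been read, so in this sketch the body is accessed before it is decoded.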

    2. Cause

    In [291]: r = requests.get('http://cn.python-requests.org/en/latest/')
    
    In [292]: r.headers.get('content-type')
    Out[292]: 'text/html; charset=utf-8'
    
    In [293]: r.encoding
    Out[293]: 'utf-8'
    
    
    In [294]: rc = requests.get('http://python3-cookbook.readthedocs.io/zh_CN/latest/index.html')
    
    In [296]: rc.headers.get('content-type')
    Out[296]: 'text/html'
    
    In [298]: rc.encoding
    Out[298]: 'ISO-8859-1'

    The response text is garbled:

    In [312]: rc.text
    Out[312]: u'
    <!DOCTYPE html>
    <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
    <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1.0">
      <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>
        <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />
            <link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"
                  href="genindex.html"/>
            <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>
            <link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>
        <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>
            <link rel="next" title
    
    In [313]: rc.content
    Out[313]: '
    <!DOCTYPE html>
    <!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
    <!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1.0">
      <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</title>
        <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />
            <link rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"
                  href="genindex.html"/>
            <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>
            <link rel="copyright" title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>
        <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>
            <link rel="next" title=
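The two outputs above can be reproduced offline. The following is a minimal sketch in Python 3 that takes the word 文档 from the page title, encodes it as UTF-8, and decodes it the way `rc.text` does when the encoding is 'ISO-8859-1'. The variable names are chosen for illustration only.

```python
original = '文档'                      # the word that appears in the page title
raw = original.encode('utf-8')         # bytes as sent by the server
wrong = raw.decode('iso-8859-1')       # what decoding as ISO-8859-1 produces
right = raw.decode('utf-8')            # what the page actually means

# ISO-8859-1 maps every byte to exactly one code point, so nothing is lost:
# the mojibake can be turned back into bytes and decoded correctly.
repaired = wrong.encode('iso-8859-1').decode('utf-8')

print(raw)            # b'\xe6\x96\x87\xe6\xa1\xa3'
print(len(wrong))     # 6
print(repaired)       # 文档
```

Each UTF-8 byte becomes one Latin-1 character, which is why the two characters of the title turn into six unrelated ones.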

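    Three conditions together make requests settle on 'ISO-8859-1': the response headers contain 'content-type', its value has no charset parameter, and the media type contains 'text'.

    The referenced article says Python 3 is not affected. In my tests it is.

    C:\Program Files\Anaconda2\Lib\site-packages\requests\utils.py

    Added 2018-01-02: with "Content-Type": "application/json", r.encoding is None.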

    def get_encoding_from_headers(headers):
        """Returns encodings from given HTTP Header Dict.
    
        :param headers: dictionary to extract encoding from.
        :rtype: str
        """
    
        content_type = headers.get('content-type')
    
        if not content_type:
            return None
    
        content_type, params = cgi.parse_header(content_type)
    
        if 'charset' in params:
            return params['charset'].strip("'\"")
    
        if 'text' in content_type:
            return 'ISO-8859-1'
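The rule in this excerpt can be tried without requests. The sketch below re-creates it with the standard library's `email.message.Message` in place of `cgi.parse_header`. `guess_encoding` is a hypothetical name, and the plain dict with a lowercase key is an assumption (requests uses a case-insensitive dict). It mirrors only the excerpt shown here, not any other version of the library.

```python
from email.message import Message


def guess_encoding(headers):
    """Mirror the three-way rule from the excerpt above (illustrative only)."""
    content_type = headers.get('content-type')
    if not content_type:
        return None

    msg = Message()
    msg['content-type'] = content_type

    # An explicit charset parameter always wins.
    charset = msg.get_content_charset()
    if charset:
        return charset

    # text/* without a charset falls back to ISO-8859-1.
    if 'text' in msg.get_content_type():
        return 'ISO-8859-1'

    return None


print(guess_encoding({'content-type': 'text/html; charset=utf-8'}))  # utf-8
print(guess_encoding({'content-type': 'text/html'}))                 # ISO-8859-1
print(guess_encoding({'content-type': 'application/json'}))          # None
```

The three calls correspond to the three cases observed in this post: an explicit charset, text/html alone, and application/json.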

    C:\Program Files\Anaconda2\Lib\site-packages\requests\adapters.py

    class HTTPAdapter(BaseAdapter):
        def build_response(self, req, resp):
            # Set encoding.
            response.encoding = get_encoding_from_headers(response.headers)

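    3. Fix

    Apply the patch from the referenced article, or:

    Added 2018-01-02: change the condition if resp.encoding == 'ISO-8859-1': to also require that 'ISO-8859-1' does not appear in the Content-Type header. That way only the 'ISO-8859-1' that requests falls back to per the protocol is handled, not one the server declared explicitly.

        # Only handle the fallback 'ISO-8859-1', not one declared in the header.
        if resp.encoding == 'ISO-8859-1' and 'ISO-8859-1' not in resp.headers.get('content-type', ''):
            # Regex search for <meta ... charset=...>; the requests source does not call this helper itself.
            encodings = requests.utils.get_encodings_from_content(resp.content)
            if encodings:
                resp.encoding = encodings[0]
            else:
                # models.py: chardet.detect(self.content)['encoding'], which is computationally expensive.
                # resp.text does the same: if self.encoding is None: encoding = self.apparent_encoding
                resp.encoding = resp.apparent_encoding
            print('ISO-8859-1 changed to %s' % resp.encoding)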
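The decision logic of this fix can be sketched with the standard library alone, which also makes it runnable on Python 3, where the body is bytes. `resolve_encoding` and `META_CHARSET` are hypothetical names. The regex is a simplified stand-in for `get_encodings_from_content`, and a strict UTF-8 trial decode is swapped in for `apparent_encoding` (chardet), so this is an approximation of the approach rather than the same code.

```python
import re

# Simplified stand-in for requests.utils.get_encodings_from_content, on bytes.
META_CHARSET = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.I)


def resolve_encoding(header_encoding, content_type, body):
    """Return the encoding to use for decoding body (illustrative sketch)."""
    declared = 'iso-8859-1' in content_type.lower()
    # Leave everything alone unless ISO-8859-1 is only the protocol fallback.
    if header_encoding != 'ISO-8859-1' or declared:
        return header_encoding

    # Prefer what the document says about itself.
    match = META_CHARSET.search(body)
    if match:
        return match.group(1).decode('ascii')

    # Assumption: in place of chardet, accept UTF-8 if the body decodes cleanly.
    try:
        body.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return header_encoding


page = '<meta charset="utf-8"><title>文档</title>'.encode('utf-8')
encoding = resolve_encoding('ISO-8859-1', 'text/html', page)
print(encoding, page.decode(encoding))
```

With a real response, the result could be assigned before reading the text, for example `resp.encoding = resolve_encoding(resp.encoding, resp.headers.get('content-type', ''), resp.content)`.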
  • Original article: https://www.cnblogs.com/my8100/p/requests_encoding_bug.html