• Fixing the "'utf-8' codec can't decode" error when urllib.request.urlopen(req).read().decode() parses an HTTP response


    I recently hit a "'utf-8' codec can't decode byte" error when running the code below. The code and the error message are as follows:

    >>> import urllib.request
    >>> def mkhead():
        header = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Encoding':'gzip',
        'Accept-Language':'zh-CN,zh;q=0.9',
        'Connection':'keep-alive',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
            
        return header
    
    >>> def readweb(site):
        header = mkhead()
        try:
            req = urllib.request.Request(url=site,headers=header)
            text = urllib.request.urlopen(req).read().decode()
        except Exception as e:
            print(e)
            return None
        else:return text
    
        
    >>> readweb(r'https://blog.csdn.net/LaoYuanPython')
    'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
    >>>
    

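    At first I assumed the problem was the charset passed to decode(), but trying gbk and other encodings made no difference. The real cause turned out to be the request header 'Accept-Encoding':'gzip', which makes the server return a compressed response body. With that header removed, the same code works: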

    >>> import urllib.request
    >>> def mkhead():
        header = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Language':'zh-CN,zh;q=0.9',
        'Connection':'keep-alive',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
            
        return header
    
    >>> def readweb(site):
        header = mkhead()
        try:
            req = urllib.request.Request(url=site,headers=header)
            text = urllib.request.urlopen(req).read().decode()
        except Exception as e:
            print(e)
            return None
        else:return text
    
        
    >>> readweb(r'https://blog.csdn.net/LaoYuanPython')
    Squeezed text(273 lines)
    >>> readweb(r'https://blog.csdn.net/LaoYuanPython')[0:100]
    '<!DOCTYPE html>
    <html lang="zh-CN">
    <head>
        <meta charset="UTF-8">
        <link rel="canonical" href'
    >>>
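
    The byte 0x8b at position 1 in the error message is itself a clue: a gzip stream always starts with the two magic bytes 0x1f 0x8b. The sketch below is not part of the original post. It shows one way to check for that signature and reproduces the same error offline. The helper name looks_gzipped and the sample payload are assumptions made for illustration.

```python
import gzip

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(raw):
    """Return True if the raw bytes start with the gzip magic number."""
    return raw[:2] == GZIP_MAGIC


# Simulate a compressed response body locally (sample payload, no network).
compressed = gzip.compress("<!DOCTYPE html>".encode("utf-8"))
print(looks_gzipped(compressed))

# Decoding the compressed bytes directly fails: 0x1f is valid ASCII,
# but 0x8b at position 1 is not a valid UTF-8 start byte.
try:
    compressed.decode()
    message = None
except UnicodeDecodeError as e:
    message = str(e)
print(message)
```

    If read() returns bytes that start with this signature, the body is compressed and has to be decompressed before decode() is called, whatever charset is used.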
    

    If you do want to handle compressed responses, see "Section 14.7: Simulating Browser Access in Python with Compressed HTTP Body Transfer".
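
    As a rough illustration of handling a compressed response, here is a minimal sketch that keeps the 'Accept-Encoding':'gzip' header and decompresses the body when the server says it is gzip-encoded. It is not the code from the referenced section. The names decode_body and readweb_gzip are hypothetical, and only gzip is handled (no deflate or br).

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the response body as text, gunzipping it first if needed."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def readweb_gzip(site, headers):
    """Hypothetical variant of readweb() that tolerates gzip responses."""
    req = urllib.request.Request(url=site, headers=headers)
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
        # The Content-Encoding response header tells us whether the body is compressed.
        encoding = resp.headers.get("Content-Encoding")
        # Use the charset declared in Content-Type, assuming utf-8 when none is given.
        charset = resp.headers.get_content_charset() or "utf-8"
    return decode_body(raw, encoding, charset)
```

    With the functions from this post, a call would look like readweb_gzip(r'https://blog.csdn.net/LaoYuanPython', mkhead()). Checking the Content-Encoding response header instead of always decompressing matters because a server is free to ignore Accept-Encoding and send an uncompressed body.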
    LaoYuanPython: learn Python with LaoYuan!
    Blog: https://blog.csdn.net/LaoYuanPython

    LaoYuanPython blog article index: https://blog.csdn.net/LaoYuanPython/article/details/98245036

  • Original article: https://www.cnblogs.com/LaoYuanPython/p/13643576.html