    Crawler Basics: Notes on Learning the urllib Library

    Overview: urllib is a package that bundles several modules for working with URLs:

    • urllib.request — for opening and reading URLs
    • urllib.error — contains the exceptions raised by urllib.request
    • urllib.parse — for parsing URLs
    • urllib.robotparser — for parsing robots.txt files (a minimal sketch follows this list)
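
    Since urllib.robotparser is listed here but never demonstrated later in these notes, here is a minimal sketch of its typical use; the target site is only an example:

    import urllib.robotparser

    # Fetch and parse the site's robots.txt, then ask whether a URL may be crawled
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url('https://www.python.org/robots.txt')
    rp.read()
    print(rp.can_fetch('*', 'https://www.python.org/about/'))  # True or False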

    1 urllib.request

    1.1 Sending requests with urllib.request

    • urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
      • url: the URL to open
      • data: the payload to POST, as bytes; when it is None the request is a GET
      • timeout: timeout in seconds for blocking operations such as the connection attempt

    1.1.1 Sending a GET request

    Basic usage

    import urllib.request
    response = urllib.request.urlopen('http://www.baidu.com')
    # response.read() --> bytes
    print(response.read().decode('utf-8'))
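
    urlopen also works as a context manager, which closes the connection automatically when the block exits; a small sketch of the same GET request in that style:

    import urllib.request

    # The with-statement closes the response when the block exits
    with urllib.request.urlopen('http://www.baidu.com') as response:
        print(response.read().decode('utf-8'))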
    

    1.1.2 Sending a POST request

    # 1. Encode the POST data as bytes
    import urllib.request
    import urllib.parse
    data = bytes(urllib.parse.urlencode({'hello': 'word'}), encoding='utf-8')
    # Encoded result: b'hello=word'
    # 2. Send the request with urllib.request.urlopen()
    response = urllib.request.urlopen('http://www.httpbin.org/post', data=data)
    response._method  # the request method (note: a private attribute)
    'POST'
    response.url      # the requested URL
    'http://www.httpbin.org/post'
    

    1.1.3 Setting a timeout

    import urllib.request
    response = urllib.request.urlopen('http://httpbin.org/get',timeout=1)
    print(response.read())
    
    import socket
    import urllib.request
    import urllib.error
    
    try:
        response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
    except urllib.error.URLError as e:
        if isinstance(e.reason,socket.timeout):
            print('TIME OUT')
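
    Besides timeouts, urllib.error also defines HTTPError for non-2xx status codes; HTTPError is a subclass of URLError, so it must be caught first. A short sketch using httpbin's /status endpoint for illustration:

    import urllib.request
    import urllib.error

    try:
        response = urllib.request.urlopen('http://httpbin.org/status/404')
    except urllib.error.HTTPError as e:
        # HTTPError carries the status code and the reason phrase
        print('HTTP error:', e.code, e.reason)
    except urllib.error.URLError as e:
        print('URL error:', e.reason)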
    

    1.2 The Response object

    1.2.1 Inspecting the response object

    print(dir(response))
    ['__abstractmethods__', 
    '__class__',
    '__del__',
    '__delattr__',
    '__dict__',
    '__dir__',
    '__doc__',
    '__enter__', 
    '__eq__',
    '__exit__',
    '__format__',
    '__ge__',
    '__getattribute__',
    '__gt__',
    '__hash__',
    '__init__',
    '__init_subclass__',
    '__iter__', 
    '__le__',
    '__lt__',
    '__module__',
    '__ne__',
    '__new__',
    '__next__',
    '__reduce__',
    '__reduce_ex__',
    '__repr__',
    '__setattr__', 
    '__sizeof__',
    '__str__',
    '__subclasshook__',
    '_abc_cache',
    '_abc_negative_cache',
    '_abc_negative_cache_version',
    '_abc_registry', 
    '_checkClosed',
    '_checkReadable',
    '_checkSeekable',
    '_checkWritable',
    '_check_close',
    '_close_conn',
    '_get_chunk_left',
    '_method', 
    '_peek_chunked',
    '_read1_chunked',
    '_read_and_discard_trailer',
    '_read_next_chunk_size',
    '_read_status',
    '_readall_chunked', 
    '_readinto_chunked',
    '_safe_read',
    '_safe_readinto',
    'begin',
    'chunk_left',
    'chunked',
    'close',
    'closed',
    'code', 
    'debuglevel',
    'detach',
    'fileno',
    'flush',
    'fp',
    'getcode',
    'getheader',
    'getheaders',
    'geturl',
    'headers', 
    'info',
    'isatty',
    'isclosed',
    'length',
    'msg',
    'peek',
    'read',
    'read1',
    'readable',
    'readinto',
    'readinto1', 
    'readline',
    'readlines',
    'reason',
    'seek',
    'seekable',
    'status',
    'tell',
    'truncate',
    'url',
    'version',
    'will_close', 
    'writable',
    'write',
    'writelines'
    ]
    

    1.2.2 Commonly used attributes and methods

    response.status          # 200
    response.getcode()       # 200
    response.code            # 200 
    response.url             # 'http://www.baidu.com'
    response._method         # 'GET'
    response.getheaders()
    [
        ('Access-Control-Allow-Credentials', 'true'),
        ('Access-Control-Allow-Origin', '*'), 
        ('Content-Type', 'application/json'), 
        ('Date', 'Fri, 14 Jun 2019 02:33:18 GMT'),
        ('Referrer-Policy', 'no-referrer-when-downgrade'), 
        ('Server', 'nginx'),
        ('X-Content-Type-Options', 'nosniff'), 
        ('X-Frame-Options', 'DENY'), 
        ('X-XSS-Protection', '1; mode=block'), 
        ('Content-Length', '226'), 
        ('Connection', 'Close')
    ]
    response.getheader('Server')  # nginx
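
    A practical note: rather than hard-coding utf-8, the charset the server declared can be read from the response headers. A small sketch, falling back to utf-8 when no charset is present:

    import urllib.request

    response = urllib.request.urlopen('http://httpbin.org/get')
    # headers is an http.client.HTTPMessage; get_content_charset() reads the
    # charset from the Content-Type header, if the server declared one
    charset = response.headers.get_content_charset() or 'utf-8'
    print(response.read().decode(charset))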
    

    1.3 Building a Request object

    The constructor signature, from the CPython source:

    class Request:
        def __init__(self, url, data=None, headers={},
                     origin_req_host=None, unverifiable=False,
                     method=None):
                     
    In [3]: url = 'https://www.baidu.com/'
    
    In [4]: req = urllib.request.Request(url=url)
    
    In [5]: dir(req)
    Out[5]:
    ['__class__',
     '__delattr__',
     '__dict__',
     '__dir__',
     '__doc__',
     '__eq__',
     '__format__',
     '__ge__',
     '__getattribute__',
     '__gt__',
     '__hash__',
     '__init__',
     '__init_subclass__',
     '__le__',
     '__lt__',
     '__module__',
     '__ne__',
     '__new__',
     '__reduce__',
     '__reduce_ex__',
     '__repr__',
     '__setattr__',
     '__sizeof__',
     '__str__',
     '__subclasshook__',
     '__weakref__',
     '_data',
     '_full_url',
     '_parse',
     '_tunnel_host',
     'add_header',
     'add_unredirected_header',
     'data',
     'fragment',
     'full_url',
     'get_full_url',
     'get_header',
     'get_method',
     'has_header',
     'has_proxy',
     'header_items',
     'headers',
     'host',
     'origin_req_host',
     'remove_header',
     'selector',
     'set_proxy',
     'type',
     'unredirected_hdrs',
     'unverifiable']
    
     
    
    • data: the POST payload, already encoded to bytes
    • headers: a dict of request header fields

    1.3.1 Basic usage

    import urllib.request
    import urllib.parse
    
    url = 'http://httpbin.org/post'
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    headers['Host'] = 'httpbin.org'
    # data must be bytes
    form = {'name': 'Germey'}   # renamed from `dict` to avoid shadowing the builtin
    data = urllib.parse.urlencode(form).encode('utf-8')
    
    # Instantiate the request object with its parameters
    request = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
    print(request) 
    # <urllib.request.Request object at 0x000002404A9689E8>
    ##############################################################
    # -- The above only builds the request object; nothing has been sent yet --
    ##############################################################
    # Send the request object and get back a response object
    response = urllib.request.urlopen(request)
    print(response)
    # <http.client.HTTPResponse object at 0x000002404AFBC358>
    

    1.3.2 Adding headers

    • Add them when constructing the request
    # Add headers at construction time
    from urllib import request, parse
    url = 'http://httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Host': 'httpbin.org'
    }
    # Build the POST form
    form = {
        'name': 'Germey'
    }
    data = bytes(parse.urlencode(form), encoding='utf8')
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
    
    • Add them with the add_header method
    import urllib.request
    req = urllib.request.Request('http://www.example.com/')
    req.add_header('Referer', 'http://www.python.org/')
    # Customize the default User-Agent header value:
    req.add_header('User-Agent', 'urllib-example/0.1 (Contact: . . .)')
    r = urllib.request.urlopen(req)
    

    1.3.3 Handling cookies

    import http.cookiejar, urllib.request
    
    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    for item in cookie:
        print(item.name+"="+item.value)
    
    # Save cookies to a text file
    import http.cookiejar, urllib.request
    filename = "cookie.txt"
    # Several on-disk formats are available
    ## Format 1: Mozilla cookies.txt
    cookie = http.cookiejar.MozillaCookieJar(filename)
    ## Format 2: libwww-perl (LWP) format -- pick one of the two
    cookie = http.cookiejar.LWPCookieJar(filename)
    
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    # Without this call nothing is actually written to disk
    cookie.save(ignore_discard=True, ignore_expires=True)
    
    # Load the cookies back with the matching class
    import http.cookiejar, urllib.request
    cookie = http.cookiejar.LWPCookieJar()
    cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    

    1.3.4 Setting a proxy

    from urllib import request
    
    url = 'http://httpbin.org/ip'
    proxy = {'http':'218.18.232.26:80','https':'218.18.232.26:80'}
    # Create the proxy handler
    proxies = request.ProxyHandler(proxy)
    # Build an opener that routes through the proxy
    opener = request.build_opener(proxies)
    
    resp = opener.open(url)
    print(resp.read().decode())
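
    If every request in the process should go through the proxy, the opener can be installed as the global default with install_opener; a short sketch (the proxy address is the same illustrative one as above):

    from urllib import request

    proxy = {'http': '218.18.232.26:80'}
    opener = request.build_opener(request.ProxyHandler(proxy))
    # After install_opener, plain request.urlopen() uses the proxy as well
    request.install_opener(opener)
    resp = request.urlopen('http://httpbin.org/ip')
    print(resp.read().decode())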
    
    

    2 URL parsing with urllib.parse

    __all__ = ["urlparse", "urlunparse", "urljoin", "urldefrag",
               "urlsplit", "urlunsplit", "urlencode", "parse_qs",
               "parse_qsl", "quote", "quote_plus", "quote_from_bytes",
               "unquote", "unquote_plus", "unquote_to_bytes",
               "DefragResult", "ParseResult", "SplitResult",
               "DefragResultBytes", "ParseResultBytes", "SplitResultBytes"]
    
    • urlparse
    • urlunparse
    • quote/quote_plus
    • unquote/unquote_plus
    • urljoin
    • urlencode
    • parse_qs/parse_qsl

    2.1 Parsing URLs with urlparse

    In [8]: from urllib.parse import urlparse
    
    In [9]: o = urlparse('https://docs.python.org/3/library/urllib.parse.html')
    '''
    urlparse splits the URL into six components and returns a
    6-item named tuple: scheme, netloc (network location), path,
    params, query, and fragment. The result is shown below.
    '''
    In [10]: o
    Out[10]: ParseResult(scheme='https', netloc='docs.python.org', path='/3/library/urllib.parse.html', params='', query='', fragment='')
    
    In [11]: dir(o)
    Out[11]:
    ['__add__',
     '__class__',
     '__contains__',
     '__delattr__',
     '__dir__',
     '__doc__',
     '__eq__',
     '__format__',
     '__ge__',
     '__getattribute__',
     '__getitem__',
     '__getnewargs__',
     '__gt__',
     '__hash__',
     '__init__',
     '__init_subclass__',
     '__iter__',
     '__le__',
     '__len__',
     '__lt__',
     '__module__',
     '__mul__',
     '__ne__',
     '__new__',
     '__reduce__',
     '__reduce_ex__',
     '__repr__',
     '__rmul__',
     '__setattr__',
     '__sizeof__',
     '__slots__',
     '__str__',
     '__subclasshook__',
     '_asdict',
     '_encoded_counterpart',
     '_fields',
     '_hostinfo',
     '_make',
     '_replace',
     '_source',
     '_userinfo',
     'count',
     'encode',
     'fragment',
     'geturl',
     'hostname',
     'index',
     'netloc',
     'params',
     'password',
     'path',
     'port',
     'query',
     'scheme',
     'username']
    
    In [12]: o.path
    Out[12]: '/3/library/urllib.parse.html'
    
    In [13]: o.scheme
    Out[13]: 'https'
    
    In [14]: o.geturl()
    Out[14]: 'https://docs.python.org/3/library/urllib.parse.html'
    
    url = "https://docs.python.org/3.5/library/urllib.parse.html?highlight=parse#module-urllib.parse"
    result = parse.urlparse(url)
    print(result.query)  # 获取返回结果参数内容
    print(parse.parse_qs(result.query))  # 结果转换成字典
    print(parse.parse_qsl(result.query))  # 结果转换成列表
    
    
    

    2.2 Reassembling URLs with urlunparse

    In [15]: o
    Out[15]: ParseResult(scheme='https', netloc='docs.python.org', path='/3/library/urllib.parse.html', params='', query='', fragment='')
    
    In [16]: from urllib.parse import urlunparse
    
    In [17]: urlunparse(o)
    Out[17]: 'https://docs.python.org/3/library/urllib.parse.html'
    # list(o)
    In [18]: urlunparse(list(o))
    Out[18]: 'https://docs.python.org/3/library/urllib.parse.html'
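
    Because ParseResult is a named tuple, its _replace method (inherited from namedtuple) can swap out a single component before reassembling; a quick sketch continuing the session above:

    In [19]: urlunparse(o._replace(query='highlight=parse'))
    Out[19]: 'https://docs.python.org/3/library/urllib.parse.html?highlight=parse'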
    
    

    2.3 Query-string parsing with parse_qs/parse_qsl

    In [52]: parse_qs('https://i.cnblogs.com/EditPosts.aspx?opt=1')
    Out[52]: {'https://i.cnblogs.com/EditPosts.aspx?opt': ['1']}
    
    In [53]: parse_qsl('https://i.cnblogs.com/EditPosts.aspx?opt=1')
    Out[53]: [('https://i.cnblogs.com/EditPosts.aspx?opt', '1')]
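
    Note that both calls above were given a full URL, so everything before the '=' was folded into the key. parse_qs and parse_qsl expect only the query string; extracting it with urlparse first gives the intended result:

    In [54]: from urllib.parse import urlparse, parse_qs, parse_qsl
    
    In [55]: qs = urlparse('https://i.cnblogs.com/EditPosts.aspx?opt=1').query
    
    In [56]: parse_qs(qs)
    Out[56]: {'opt': ['1']}
    
    In [57]: parse_qsl(qs)
    Out[57]: [('opt', '1')]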
    
    

    2.4 Percent-encoding with quote

    
    Help on function quote in module urllib.parse:
    
    quote(string, safe='/', encoding=None, errors=None)
        quote('abc def') -> 'abc%20def'
    
        Each part of a URL, e.g. the path info, the query, etc., has a
        different set of reserved characters that must be quoted.
    
        RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
        the following reserved characters.
    
        reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                      "$" | ","
    
        Each of these characters is reserved in some component of a URL,
        but not necessarily in all of them.
    
        By default, the quote function is intended for quoting the path
        section of a URL.  Thus, it will not encode '/'.  This character
        is reserved, but in typical usage the quote function is being
        called on a path where the existing slash characters are used as
        reserved characters.
    
        string and safe may be either str or bytes objects. encoding and errors
        must not be specified if string is a bytes object.
    
        The optional encoding and errors parameters specify how to deal with
        non-ASCII characters, as accepted by the str.encode method.
        By default, encoding='utf-8' (characters are encoded with UTF-8), and
        errors='strict' (unsupported characters raise a UnicodeEncodeError).
    
    In [25]: from urllib.parse import quote
    
    In [26]: search = '搜索内容'
    
    In [27]: quote(search)
    Out[27]: '%E6%90%9C%E7%B4%A2%E5%86%85%E5%AE%B9'
    
    

    2.5 quote vs quote_plus, and decoding with unquote/unquote_plus

    
    In [41]: from urllib import parse
    
    In [42]: parse.quote('a&b/c')
    Out[42]: 'a%26b/c'   # the slash is not encoded
    
    In [43]: parse.quote_plus('a&b/c')
    Out[43]: 'a%26b%2Fc' # the slash is encoded as well
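
    Decoding goes the other way: unquote reverses quote, and unquote_plus additionally turns '+' back into a space:

    In [47]: parse.unquote('a%26b%2Fc')
    Out[47]: 'a&b/c'
    
    In [48]: parse.unquote_plus('a+b%26c')
    Out[48]: 'a b&c'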
    
    

    2.6 Encoding query dicts with urlencode

    In [44]: query = {'name': 'Lee', 'age': 19}
    
    In [45]: type(query)
    Out[45]: dict
    
    In [46]:  parse.urlencode(query)
    Out[46]: 'name=Lee&age=19'
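
    When a value is a list, plain urlencode stringifies the whole list; passing doseq=True emits one key=value pair per element instead:

    In [49]: parse.urlencode({'tag': ['a', 'b']})
    Out[49]: 'tag=%5B%27a%27%2C+%27b%27%5D'
    
    In [50]: parse.urlencode({'tag': ['a', 'b']}, doseq=True)
    Out[50]: 'tag=a&tag=b'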
    

    2.6.1 Using the encoded string in a GET request

    >>> import urllib.request
    >>> import urllib.parse
    >>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    >>> url = "http://www.musi-cal.com/cgi-bin/query?%s" % params
    >>> with urllib.request.urlopen(url) as f:
    ...     print(f.read().decode('utf-8'))
    
    

    2.6.2 Using the encoded string in a POST request

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    >>> data = data.encode('ascii')
    >>> with urllib.request.urlopen("http://requestb.in/xrbl82xr", data) as f:
    ...     print(f.read().decode('utf-8'))
    
    

    2.7 Other functions: see the reference docs

    References

    Official tutorial: https://docs.python.org/3/library/urllib.request.html

    CPython source: https://github.com/python/cpython/blob/3.7/Lib/urllib/request.py

    Blogger Coder: https://www.cnblogs.com/zhaof/p/6910871.html

    Blogger Hoptop: https://www.jianshu.com/u/9ea40b5f607a

    Blogger 支付宝: http://www.pianshen.com/article/2667231307/

    脚本之家: https://www.jb51.net/article/128540.htm
