• The urllib library


    urllib is Python's built-in HTTP request library; no extra installation is needed. It contains four modules, the first three of which are the most commonly used:

    1. request: the HTTP request module; it simulates sending a request, needing only a URL plus optional extra parameters to model the whole request process
    2. error: the exception handling module
    3. parse: for encoding, parsing, and joining URLs and their parameters
    4. robotparser: for interpreting the Robots protocol (also known as the crawler protocol / robots protocol / Robots Exclusion Protocol); a usage sketch follows the format sample below.
      A robots.txt file usually sits in a site's root directory and tells crawlers and search engines which pages may be fetched and which may not.
      # Rough robots.txt format
      User-agent: *
      Disallow: /
      Allow: /public/
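
    The robotparser module can check a site's robots.txt directly. A minimal sketch using the standard urllib.robotparser API (the python.org URL is just an illustrative target):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.python.org/robots.txt")   # point at the site's robots.txt
    rp.read()                                         # fetch and parse it
    # can_fetch(useragent, url) -> True if that agent may crawl that URL
    print(rp.can_fetch("*", "https://www.python.org/about/"))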

    Request

    I. Two common ways to make a request with urllib.request.urlopen

    1. Call urllib.request.urlopen(url, data, timeout, ...); data must be a bytes object

    import urllib.parse
    import urllib.request
    
    # Plain GET request, no extra arguments
    response = urllib.request.urlopen("https://www.python.org")
    
    # POST: the form data is passed as a bytes object
    data = bytes(urllib.parse.urlencode({"word": "test"}), encoding="utf8")
    response2 = urllib.request.urlopen("http://httpbin.org/post", data=data, timeout=3)
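    
    urlopen() returns an http.client.HTTPResponse object. Continuing with the response obtained above, a quick sketch of inspecting it (these attribute and method names come from the standard library):
    
    # status code, all response headers, then a single header value
    print(response.status)                 # e.g. 200
    print(response.getheaders())           # list of (name, value) tuples
    print(response.getheader("Server"))    # e.g. nginx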

    2. Build the request with the urllib.request.Request(url, data, headers, method) class; data is a bytes object and headers is a dict

    import urllib.parse
    import urllib.request
    
    url = "http://httpbin.org/post"
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE5.5; Windows NT)',
        'Host': 'httpbin.org'
    }
    params = {
        'name': 'tester'
    }
    data = bytes(urllib.parse.urlencode(params), encoding="utf8")
    request = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
    response = urllib.request.urlopen(request)
    print(response.read().decode("utf-8"))

    II. Handlers: for login, cookies, proxies, and more

    The urllib.request.BaseHandler class is the parent of all handlers.
    HTTPDefaultErrorHandler handles the HTTPError exceptions raised for HTTP error responses.
    HTTPRedirectHandler handles redirects.
    HTTPCookieProcessor handles cookies.
    ProxyHandler sets a proxy; the default is no proxy.
    HTTPPasswordMgr manages passwords, maintaining a table of usernames and passwords.
    HTTPBasicAuthHandler manages authentication; use it when opening a connection that requires authentication.

    The process of completing a request through a proxy (or any handler): build an Opener from the appropriate Handler classes in urllib.request, then call Opener.open(url) to make the request. A concrete proxy sketch follows the Opener note below.

    # Construction steps:
    1. construct a handler
    2. opener = urllib.request.build_opener(handler)
    3. opener.open(url)

     Opener: an instance of the OpenerDirector class. urlopen() is itself a simple Opener that urllib provides; using an Opener directly unlocks deeper, more advanced configuration.
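    
    A minimal sketch of that flow with ProxyHandler; the 127.0.0.1:9743 address is a placeholder assumption for a locally running proxy:
    
    from urllib.error import URLError
    from urllib.request import ProxyHandler, build_opener
    
    # placeholder proxy address -- substitute a proxy you actually control
    proxy_handler = ProxyHandler({
        'http': 'http://127.0.0.1:9743',
        'https': 'https://127.0.0.1:9743'
    })
    opener = build_opener(proxy_handler)
    try:
        response = opener.open('https://www.baidu.com')
        print(response.read().decode('utf-8'))
    except URLError as e:
        print(e.reason)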

    # Example: a test page that requires a password, handled with HTTPBasicAuthHandler
    
    from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
    from urllib.error import URLError
    
    username = "tester"
    password = "testerpw"
    url = "http://km******.test.mararun.com/"
    
    pwMsg = HTTPPasswordMgrWithDefaultRealm()
    pwMsg.add_password(None, url, username, password)
    handlerAuth = HTTPBasicAuthHandler(pwMsg)
    opener = build_opener(handlerAuth)
    
    try:
        response = opener.open(url)
        print(response.read().decode('utf-8'))
    except URLError as e:
        print(e.reason)
    '''
    Cookie handling: create an http.cookiejar.CookieJar object, build a handler with
    HTTPCookieProcessor, build an Opener with build_opener(), then call open() to make the request.
    '''
    
    import http.cookiejar, urllib.request
    
    # Loop over the cookies and print each name=value pair
    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    for item in cookie:
        print(item.name+"="+item.value)
    
    # Save the cookie data to a text file
    filename = "cookies.txt"
    # Two formats are available
    # cookie = http.cookiejar.MozillaCookieJar(filename)
    cookie = http.cookiejar.LWPCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    cookie.save(ignore_discard=True, ignore_expires=True)
    
    # Load the saved cookies and use them, taking the LWPCookieJar format as an example
    cookie = http.cookiejar.LWPCookieJar()
    cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    print(response.status)

    Error

    The URLError class inherits from OSError and is the base class of the error module; every exception raised by the request module can be caught with it. Its reason attribute returns the cause of the error.
    HTTPError is a subclass of URLError dedicated to HTTP request errors. It has three attributes: code (the status code), reason (the cause of the error), and headers (the response headers).

    # Example 1: reason is a string
    from urllib import request, error
    try:
        response = request.urlopen("http://testerror.com/index.html")
    except error.HTTPError as e:
        print(e.code, e.reason, e.headers)
    except error.URLError as e:
        print(e.reason)
    else:
        print("Request Successfully")
    
    
    # Example 2: reason can be an object; on a request timeout it is a socket.timeout instance, so isinstance() can check its type
    import socket
    import urllib.request
    import urllib.error
    try:
        response = urllib.request.urlopen("https://www.baidu.com", timeout=0.01)
    except urllib.error.URLError as e:
        print(type(e.reason))
        if isinstance(e.reason, socket.timeout):
            print("Time Out!")

    Parse: urllib.parse

    1. urlparse() parses a URL into 6 parts and returns a <class 'urllib.parse.ParseResult'>: ParseResult(scheme, netloc, path, params, query, fragment)

    2. urlunparse() joins a URL from parts; it accepts only an iterable of length 6

    3. urlsplit() parses a URL into 5 parts and returns a <class 'urllib.parse.SplitResult'>: SplitResult(scheme, netloc, path, query, fragment); params is not parsed separately but merged into path

    4. urlunsplit() joins a URL from parts; it accepts only an iterable of length 5

    5. urljoin() joins two strings: the scheme, netloc, and path of the base URL are used to fill in whatever the second link is missing

    6. urlencode() serializes parameters: dict -> URL query string

    7. parse_qs() deserializes: URL query string -> dict

    8. parse_qsl() deserializes: URL query string -> list of tuples

    9. quote() converts content to URL-encoded form, preventing garbled output when a URL contains Chinese characters

    10. unquote() decodes URL-encoded content

    # 1. urlparse
    from urllib.parse import urlparse
     
    result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
    print(type(result), result)
    
    '''
    <class 'urllib.parse.ParseResult'>
    ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
    '''
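    
    ParseResult is a named tuple, so the result above can also be read by attribute or by index:
    
    # attribute access and index access return the same fields
    print(result.scheme, result[0])   # http http
    print(result.netloc, result[1])   # www.baidu.com www.baidu.com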
    
    # urlparse: scheme supplies a default protocol; with allow_fragments=False the fragment is parsed into the nearest of path, params, or query.
    # Note that urlparse only recognizes a netloc when it is introduced by "//", so here the host ends up in path.
    result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https', allow_fragments=False)
    
    '''
    ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5#comment', fragment='')
    '''
    # 2. urlunparse()
    
    from urllib.parse import urlunparse
     
    data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
    print(urlunparse(data))
    
    '''
    http://www.baidu.com/index.html;user?a=6#comment
    '''
    # 3. urlsplit()
    from urllib.parse import urlsplit
     
    result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
    print(result)
    
    '''
    SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
    '''
    # 4. urlunsplit()
    
    from urllib.parse import urlunsplit
     
    data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
    print(urlunsplit(data))
    
    '''
    http://www.baidu.com/index.html?a=6#comment
    '''
    # 5. urljoin()
    
    from urllib.parse import urljoin
    
    print(urljoin("http://www.baidu.com/about.html?wd=abc", "http://test/index.php"))
    
    '''
     http://test/index.php
    '''
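    A couple more cases sketching the completion behavior described above (outputs follow from that rule):
    
    # missing path pieces and query strings are resolved against the base URL
    print(urljoin("http://www.baidu.com", "FAQ.html"))
    # http://www.baidu.com/FAQ.html
    print(urljoin("http://www.baidu.com/about.html", "?category=2"))
    # http://www.baidu.com/about.html?category=2
    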
    # 6. urlencode()
    
    from urllib.parse import urlencode
    params = {
        "name": "test",
        "age": 30
    }
    base_url = "http://www.baidu.com?"
    url = base_url + urlencode(params)
    print(url)
    
    '''
    http://www.baidu.com?name=test&age=30
    '''
    # 7. parse_qs
    
    from urllib.parse import parse_qs
    
    query = "name = test&age=22"
    print(parse_qs(query))
    
    '''
     {'name ': [' test'], 'age': ['22']
    '''
    # 8. parse_qsl()
    
    from urllib.parse import parse_qsl
    
    query = "name = test&age=22"
    print(parse_qsl(query))
    
    '''
    [('name ', ' test'), ('age', '22')]
    '''
    # 9. quote()
    
    from urllib.parse import quote
    
    keyword = "测试"
    url = "https://www.baidu.com/s?wd=" + quote(keyword)
    print(url)
    
    '''
    https://www.baidu.com/s?wd=%E6%B5%8B%E8%AF%95
    '''
    # 10. unquote()
    
    from urllib.parse import unquote
    
    url = "https://www.baidu.com/s?wd=%E6%B5%8B%E8%AF%95"
    print(unquote(url))
    
    '''
    https://www.baidu.com/s?wd=测试
    '''

     Reference: 静觅 » Python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice)

  • Original article: https://www.cnblogs.com/belle-ls/p/12206345.html