• Crawler Series, Part 1: Basic Libraries — urllib


    In Python 2 there were two libraries for sending requests, urllib and urllib2; in Python 3 they are unified into a single urllib package.

    urllib contains the following four modules:

    request: the most basic request module, used to construct and send requests
    error: the exception-handling module, which keeps the program from terminating unexpectedly on errors
    parse: a utility module providing many URL-handling methods: splitting, parsing, and joining
    robotparser: identifies a site's robots.txt file to determine which pages may be crawled and which may not

    I. How to use the request module

    1.urlopen()  

    The basic method for constructing an HTTP request.

    # simulate a browser sending a request to the Python official site
    import urllib.request
    response = urllib.request.urlopen("http://www.python.org")
    print(response.read().decode("utf-8"))

    The return value is an HTTPResponse object, whose main methods include read(), readinto(), getheader(name), getheaders(), fileno() and getcode().
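As a quick way to poke at the response interface without touching the network, urllib's default opener also handles data: URLs; the response object returned for one is file-like rather than a full HTTPResponse, but read() and geturl() behave the same way. The URL below is a made-up example:

```python
from urllib.request import urlopen

# A data: URL needs no network; urllib's default opener handles it,
# so we can exercise the response object's read()/geturl() interface.
resp = urlopen("data:text/plain;charset=utf-8,hello")
body = resp.read().decode("utf-8")
print(body)           # hello
print(resp.geturl())  # the data: URL itself
```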

    Main parameters:

    url: the address to request

    data: form data to send with a POST request

    timeout: the request timeout in seconds

    import socket
    from urllib import error, parse, request

    try:
        data = bytes(parse.urlencode({'hello': 'word'}), encoding="utf-8")
        response = request.urlopen("http://httpbin.org/post", data=data, timeout=1)
    except error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print("TIME OUT")

    Output when the timeout is set to 1 second:

    {"args":{},"data":"","files":{},"form":{"hello":"word"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"10","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.6"},"json":null,"origin":"183.203.223.38","url":"http://httpbin.org/post"}

    With the timeout set to 0.01 seconds the exception is caught and TIME OUT is printed.

    2.Request

    When a simple HTTP request is not enough, the Request class can be used to build more complex requests.

    Main parameters:

    url: the URL to request

    data: form data to send; it must be encoded with urlencode

    headers: the request headers

    origin_req_host: the host name or IP address of the requesting party

    unverifiable: whether the request is unverifiable; defaults to False. A request is unverifiable when the user had no opportunity to approve it, for example fetching an image embedded in a page.

    method: the HTTP method to use (e.g. GET, POST)

    from urllib import request, parse

    data = bytes(parse.urlencode({"name": "sunqi"}), encoding='utf-8')
    head = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
            "Host": "httpbin.org"}
    req = request.Request('http://httpbin.org/post', data=data, headers=head, origin_req_host="", method="POST")
    res = request.urlopen(req)
    print(res.read().decode('utf-8'))
    {"args":{},"data":"","files":{},"form":{"name":"sunqi"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"10","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"},"json":null,"origin":"183.203.223.38","url":"http://httpbin.org/post"}

    3. Advanced usage: handlers

    HTTPDefaultErrorHandler: handles HTTP response errors; such errors are raised as HTTPError exceptions

    HTTPRedirectHandler: handles redirects

    HTTPCookieProcessor: handles cookies

    ProxyHandler: sets a proxy; the default is no proxy

    HTTPBasicAuthHandler: manages authentication; use it when opening a link that requires credentials

    HTTPPasswordMgr: manages passwords; it maintains a table of usernames and passwords

    OpenerDirector: provides an open() method whose return value has the same type as urlopen()'s; an opener is built from handlers

    The HTTPPasswordMgrWithDefaultRealm() class creates a password-management object that stores the username and password for HTTP requests. It is mainly used in two scenarios:

      1. Verifying the username and password for proxy authorization (ProxyBasicAuthHandler())
      2. Verifying the username and password of a web client (HTTPBasicAuthHandler())
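The ProxyHandler from the list above can be sketched like this; the proxy address 127.0.0.1:8888 is a placeholder assumption, not a real proxy:

```python
from urllib.request import ProxyHandler, build_opener

# 127.0.0.1:8888 is a hypothetical local proxy address
proxy_handler = ProxyHandler({
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
})
opener = build_opener(proxy_handler)
# opener.open("http://httpbin.org/get") would now route through the proxy
print(type(opener).__name__)  # OpenerDirector
```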

    (1) Crawling a site that requires authentication needs an authentication handler

    from urllib.request import HTTPBasicAuthHandler,HTTPPasswordMgrWithDefaultRealm,build_opener
    from urllib.error import URLError
    
    username = "1904700999"
    password = "1904700999"
    url = "http://localhost:5000/"
    p = HTTPPasswordMgrWithDefaultRealm()  # password-manager object
    p.add_password(None, url, username, password)  # add the credentials to the manager
    auth_handler = HTTPBasicAuthHandler(p)  # build the auth handler from the manager
    opener = build_opener(auth_handler)
    
    try:
        result = opener.open(url)
        print(result.status)  # status is an attribute, not a method
        html = result.read().decode('utf-8')
        print(html)
    
    except URLError as e:
        print(e.reason)

    (2) Handling cookies

    Retrieving cookies

    import http.cookiejar,urllib.request
    
    # print each cookie's name and value
    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    for item in cookie:
        print(item.name + "->" + item.value)

    Saving cookies to a file

    ignore_discard: save cookies even if they are marked to be discarded

    ignore_expires: save cookies even if they have already expired (the file is overwritten if it exists)

    filename = "cookies_LWP.txt"

    cookie = http.cookiejar.LWPCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    cookie.save(ignore_discard=True,ignore_expires=True)
    filename = "cookies_Mozilla.txt"
    
    cookie = http.cookiejar.MozillaCookieJar(filename)  # Mozilla-format cookie file
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    cookie.save(ignore_discard=True,ignore_expires=True)

    Reading cookies from a file

    cookie = http.cookiejar.LWPCookieJar()
    cookie.load("cookies_LWP.txt",ignore_expires=True,ignore_discard=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    print(response.read().decode('utf-8'))
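Loading works the same way for Mozilla-format jars. The self-contained sketch below saves an empty jar to a throwaway file (cookies_demo.txt is a made-up name) and reads it back, so it runs without any prior state:

```python
import http.cookiejar

# save() writes the Mozilla-format header even when the jar is empty
jar = http.cookiejar.MozillaCookieJar("cookies_demo.txt")
jar.save(ignore_discard=True, ignore_expires=True)

# load() recognizes the file by that header
jar2 = http.cookiejar.MozillaCookieJar()
jar2.load("cookies_demo.txt", ignore_discard=True, ignore_expires=True)
print(len(jar2))  # 0 -- the demo jar holds no cookies
```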

    II. How to use the error module

    1.URLError

    The base exception class of the error module; any exception raised by the request module can be caught with it.

    #URLError
    try :
        request.urlopen("http://cuiqingcai.com/index.htm")
    except error.URLError as e:
        print(e.reason)#Not Found

    2.HTTPError

    A subclass of URLError, dedicated to errors in HTTP requests.

    code: the HTTP status code that was returned

    reason: the cause of the error

    headers: the response headers that were returned

    # HTTPError, a subclass of URLError
    try:
        request.urlopen("http://cuiqingcai.com/index.htm")
    except error.HTTPError as e:
        print(e.code)     # status code
        print(e.reason)   # cause of the error
        print(e.headers)  # response headers
    404
    Not Found
    Server: nginx/1.10.3 (Ubuntu)
    Date: Thu, 19 Jul 2018 01:18:42 GMT
    Content-Type: text/html; charset=UTF-8
    Transfer-Encoding: chunked
    Connection: close
    Vary: Cookie
    Expires: Wed, 11 Jan 1984 05:00:00 GMT
    Cache-Control: no-cache, must-revalidate, max-age=0
    Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"
    The run output above shows the 404 status code, the reason, and the response headers.

    3. reason does not always return a string; it may be an object (such as socket.timeout)

    # a reason that is not a string
    try:
        response = request.urlopen("http://www.baidu.com", timeout=0.01)
    except error.HTTPError as e:
        print(e.reason, e.headers, e.code)
    except error.URLError as e:
        print(e.reason)
        print(type(e.reason))  # e.reason is not necessarily a string
        if isinstance(e.reason, socket.timeout):
            print("TIME OUT")

    III. How to use the parse module

    The parse module mainly provides methods for parsing, splitting, and joining URLs.

    urlparse(): identifies and segments a URL, splitting it into six parts: scheme, netloc, path, params, query, fragment

    urlunparse(): constructs a URL; the argument must have exactly six components (the six above)

    urlsplit(): splits a URL without separating out the params part, which is kept together with the path

    urlunsplit(): constructs a URL; the argument must have exactly five components (the five above)

    urljoin(): joins URLs; the base link is the first argument and the new link the second

    urlencode(): serializes a dict of query parameters into a GET query string

    parse_qs(): deserializes the query string of a GET URL back into a dict of parameters

    parse_qsl(): converts the GET query string into a list of parameter tuples

    quote(): encodes content (for example Chinese characters) into URL-encoded form

    unquote(): decodes URL-encoded content back into its original form

    from urllib.parse import urlparse
    
    
    
    # urlparse: parse a link
    result = urlparse("http://www.baidu.com/index.html;user?id=5#comment",allow_fragments=True)
    print(type(result))
    print(result)
    '''
    <class 'urllib.parse.ParseResult'>
    ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
    scheme, netloc, path, params, query, fragment
    standard link format:
    scheme://netloc/path;params?query#fragment
    urlparse(url="", scheme="", allow_fragments=True or False)
    If a scheme argument is given but the URL already starts with a scheme, the URL's own scheme wins;
    the default is used only when the URL has none.
    With allow_fragments=False the fragment is kept together with the query,
    and when there is no query, the fragment is parsed into the path.
    '''
    
    from urllib.parse import urlunparse
    # urlunparse: construct a link
    data = ("http", "www.baidu.com", "index.html", "user", "", "")
    url = urlunparse(data)
    print(url)  # http://www.baidu.com/index.html;user
    '''
    Constructs a link; the input must have exactly six components.
    Both lists and tuples work.
    '''
    
    
    # urlsplit: split a link
    from urllib.parse import urlsplit
    result = urlsplit("http://www.baidu.com/index.html;user?id=5#comment")
    print(result)  # params is not split out separately; it stays with the path

    # urlunsplit: construct a link
    from urllib.parse import urlunsplit
    data = ("http", "www.baidu.com", "index.html", "user", "")  # length must be exactly 5
    url = urlunsplit(data)
    print(url)
    
    
    from urllib.parse import urljoin


    # urljoin: combine two links into a new one; the first is base_url, the second the given url
    # base_url supplies scheme, netloc and path; parts missing from the new url are filled in from it
    print(urljoin("http://www.baidu.com", "eg.html"))
    # no scheme, no netloc: both filled in -> http://www.baidu.com/eg.html
    print(urljoin("http://www.baidu.com", "//www.sun.com/eg.html"))
    # no scheme: scheme filled in -> http://www.sun.com/eg.html
    print(urljoin("http://www.baidu.com", 'http://www.sun.com/index.html'))
    # nothing missing, nothing filled in
    print(urljoin("http://www.baidu.com/index.html", "http:"))
    # http://www.baidu.com/index.html
    print(urljoin("http://www.baidu.com/index.html?wd=123", "http://www.sun.com/index.html"))
    # http://www.sun.com/index.html
    print(urljoin("http://www.baidu.com/index.html?wd=123#comment", "http://www.sun.com/index.php?wd=456#com"))
    # http://www.sun.com/index.php?wd=456#com  params, query and fragment of base_url have no effect
    
    # urlencode: build a GET query string
    from urllib.parse import urlencode
    params = {"name": "sunqi", "password": "123456"}
    url = "http://www.baidu.com?" + urlencode(params)
    print(url)
    # http://www.baidu.com?name=sunqi&password=123456  the dict is serialized into GET parameters
    
    from urllib.parse import parse_qs, parse_qsl
    # deserialize the query string into a dict and into a list of tuples
    result = urlparse("http://www.baidu.com?name=sunqi&password=123456")
    print(result[4])
    dic = parse_qs(result[4])   # to a dict: {'name': ['sunqi'], 'password': ['123456']}
    tu = parse_qsl(result[4])   # to tuples: [('name', 'sunqi'), ('password', '123456')]
    print(dic)
    print(tu)
    
    
    from urllib.parse import quote, unquote
    # URL-encode and decode a string
    keyword = "壁纸"
    url = "http://www.baidu.com/s?wd=" + quote(keyword)
    print(url)           # http://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
    print(unquote(url))  # http://www.baidu.com/s?wd=壁纸

    IV. Analyzing the Robots protocol

    The Robots protocol, also called the crawler or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not.

           It usually takes the form of a text file named robots.txt. If the file exists, a crawler fetches pages according to its rules; if it does not exist, a crawler may visit any directly accessible page.

    Example:

    User-agent:*

    Disallow:/

    Allow:/public/

    Under these rules every crawler may fetch only content under the /public/ directory. User-agent limits which crawlers the rules apply to; * makes them apply to all crawlers. Disallow lists pages that must not be fetched, and Allow lists pages that may be.

    The RobotFileParser() class decides, based on a site's robots.txt file, whether a crawler has permission to fetch a given page. Common methods:

    set_url(): sets the location of the robots.txt file

    read(): fetches the robots.txt file and feeds it to the analyzer

    parse(): parses robots.txt; it takes the lines of the file as its argument

    can_fetch(): takes two arguments, a User-agent and the URL to fetch, and returns a boolean indicating whether that agent may fetch that URL

    modified(): sets the time robots.txt was last fetched and analyzed to the current time
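The methods above can be exercised without any network access by feeding parse() a hand-written rule set; example.com and the paths below are made up for the demonstration:

```python
from urllib.robotparser import RobotFileParser

# A hand-written robots.txt; parse() takes the file's lines
rules = """User-agent: *
Disallow: /private/
Allow: /public/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch("*", "http://example.com/public/page.html"))     # True
print(rp.can_fetch("*", "http://example.com/private/secret.html"))  # False
```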

    from urllib.robotparser import RobotFileParser
    from urllib.request import urlopen

    robot = RobotFileParser()
    result = urlopen('http://www.zhihu.com/robots.txt').read().decode('utf-8')
    print(result)

    # parse() expects the lines of the file, not one long string
    robot.parse(result.splitlines())

    print(robot.can_fetch('*', 'http://www.zhihu.com/question/29173647/answer/437189494'))
    print(robot.can_fetch('*', 'https://www.zhihu.com/signin?next=%2Fexplore'))
    print(robot.can_fetch('*', 'http://www.zhihu.com/inbox/7013224000'))
    response = urlopen('https://www.zhihu.com/inbox/7013224000').read().decode("utf-8")
    filename = "zhihu_inbox.html"
    with open(filename, "w") as f:
        f.write(response)
    User-agent: Googlebot
    Disallow: /login
    Disallow: /logout
    Disallow: /resetpassword
    Disallow: /terms
    Disallow: /search
    Disallow: /notifications
    Disallow: /settings
    Disallow: /inbox
    Disallow: /admin_inbox
    Disallow: /*?guide*
    
    ... (the same Disallow rules repeat for Googlebot-Image, Baiduspider-news, Baiduspider, Baiduspider-image, Sosospider, bingbot, 360Spider, HaosouSpider, yisouspider, YoudaoBot, Sogou Orion spider, Sogou News Spider, Sogou blog, Sogou spider2, Sogou inst spider, Sogou web spider, EasouSpider and MSNBot; the last two additionally set "Request-rate: 1/2" and "Crawl-delay: 10") ...

    User-Agent: *
    Disallow: /
    Contents of Zhihu's robots.txt file
  • Original article: https://www.cnblogs.com/SunQi-Tony/p/9330400.html