Python Crawler Basics


    Today's overview:

    1. Requests and BeautifulSoup
    2. Scraping news from Autohome (汽车之家)
    3. Crawling GitHub and Chouti (抽屉)
    4. Polling and long polling

    I. HTTP Basics

    1. An HTTP GET request has no request body; all of its parameters are carried in the URL's query string
    2. An HTTP POST request carries its content in the request body (see the sketch after this list)
    3. An HTTP message = request headers + request body on the way in, and response headers + response body on the way out
    4. HTTP is stateless: one request gets one response, and the exchange is then over
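
    A quick way to see the difference between points 1 and 2 is httpbin.org, which echoes back whatever it receives. A minimal sketch (the endpoints are real, the payload values are made up):

    import requests

    # GET: parameters end up in the URL's query string
    r = requests.get("http://httpbin.org/get", params={"page": 1})
    print(r.url)             # http://httpbin.org/get?page=1

    # POST: the same data travels in the request body instead
    r = requests.post("http://httpbin.org/post", data={"page": 1})
    print(r.url)             # http://httpbin.org/post  (no query string)
    print(r.json()["form"])  # {'page': '1'}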

    II. Requests

    Requests is an HTTP library for Python released under the Apache2 License. It is a high-level wrapper around Python's built-in urllib modules, which makes issuing network requests from Python far more pleasant; with Requests you can easily do just about anything a browser can.

    1. GET requests

    Example without parameters:

    # GET request without parameters
    import requests
    
    data = requests.get("http://www.sina.com.cn/")
    print(data.url)
    print(data.text)
    

     Example with parameters:

    import requests
    
    payload = {'key1': 'value1', 'key2': 'value2'}
    ret = requests.get("http://httpbin.org/get", params=payload)
    
    print(ret.url)
    print(ret.text)
    

     Sending a GET request to a URL such as https://github.com/timeline.json returns a Response object (named data or ret above) that wraps everything about both the request and the response.
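
     For example, the attributes below are all standard parts of the Response API:

    import requests

    ret = requests.get("http://httpbin.org/get")

    print(ret.status_code)   # HTTP status code, e.g. 200
    print(ret.url)           # final URL after any redirects
    print(ret.headers)       # response headers (case-insensitive dict)
    print(ret.encoding)      # encoding used to decode ret.text
    print(ret.cookies)       # cookies the server set
    print(ret.text)          # body decoded to str
    print(ret.content[:20])  # raw body bytes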

    2. POST requests

    Basic POST example:

    import requests
    
    payload = {'key1': 'value1', 'key2': 'value2'}
    data = requests.post("http://httpbin.org/post", data=payload)
    
    print(data.text)
    

     Example: sending headers along with the data

    #!/usr/bin/python
    # -*- coding:utf-8 -*-
    
    import requests
    import json
    
    url = 'https://api.github.com/some/endpoint'
    payload = {'some': 'data'}
    headers = {'content-type': 'application/json'}
    
    data = requests.post(url, data=json.dumps(payload), headers=headers)
    
    print(data.text)
    print(data.cookies)
    
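     Note: newer versions of requests can do the JSON serialization for you; passing json= sets the body and the Content-Type header in one step. A minimal sketch against the same placeholder endpoint:

    import requests

    url = 'https://api.github.com/some/endpoint'   # placeholder endpoint from the example above
    payload = {'some': 'data'}

    # json= serializes payload and sets Content-Type: application/json automatically
    data = requests.post(url, json=payload)
    print(data.status_code)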

     3. Other request methods

    requests.get(url, params=None, **kwargs)
    requests.post(url, data=None, json=None, **kwargs)
    requests.put(url, data=None, **kwargs)
    requests.head(url, **kwargs)
    requests.delete(url, **kwargs)
    requests.patch(url, data=None, **kwargs)
    requests.options(url, **kwargs)
      
    # all of the methods above are built on top of this one
    requests.request(method, url, **kwargs)
    
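     For example, these two calls are equivalent; the convenience functions just fill in the method name before delegating to requests.request():

    import requests

    r1 = requests.get("http://httpbin.org/get", params={"q": "python"})
    r2 = requests.request("GET", "http://httpbin.org/get", params={"q": "python"})

    print(r1.url == r2.url)  # True - both hit http://httpbin.org/get?q=python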

     4. More parameters

    def request(method, url, **kwargs):
        """Constructs and sends a :class:`Request <Request>`.

        :param method: method for the new :class:`Request` object.
        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
        :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
        :param json: (optional) json data to send in the body of the :class:`Request`.
        :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
        :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
        :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
            ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
            or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
            defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
            to add for the file.
        :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How long to wait for the server to send data
            before giving up, as a float, or a :ref:`(connect timeout, read
            timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
        :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
        :param stream: (optional) if ``False``, the response content will be immediately downloaded.
        :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response

        Usage::

          >>> import requests
          >>> req = requests.request('GET', 'http://httpbin.org/get')
          <Response [200]>
        """

     More documentation for the requests module: http://cn.python-requests.org/zh_CN/latest/
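
     A few of these parameters in action; this is a sketch, and the commented-out proxy address is a placeholder you would replace with your own:

    import requests

    r = requests.get(
        "http://httpbin.org/get",
        params={"q": "python"},                    # query string
        headers={"User-Agent": "my-spider/0.1"},   # custom request headers
        timeout=(3.05, 10),                        # (connect timeout, read timeout)
        allow_redirects=True,                      # follow 3xx responses
        # proxies={"http": "http://127.0.0.1:8888"},  # send traffic through a proxy (placeholder)
        # verify=False,                            # skip SSL certificate verification
        stream=False,                              # download the body immediately
    )
    print(r.status_code, r.elapsed)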

    5. Scraping Autohome (汽车之家) news - no login required

    #!/usr/bin/python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    import requests
    
    # plain HTTP GET, no login needed
    response = requests.get("http://www.autohome.com.cn/news/")
    response.encoding = 'gbk'   # the page is GBK-encoded
    
    soup = BeautifulSoup(response.text, "html.parser")
    tag = soup.find(name="div", attrs={"id":"auto-channel-lazyload-article"})
    li_list = tag.find_all("li")  # a list of Tag objects
    for li in li_list:
        h3 = li.find(name="h3")
        if not h3:
            continue
        print(h3.text, li.find(name="a").get("href"))
    """
    售13.59-18.59万元 别克新款威朗上市 //www.autohome.com.cn/news/201710/908038.html#pvareaid=102624
    售11.99-14.69万元 别克阅朗正式上市 //www.autohome.com.cn/news/201710/908029.html#pvareaid=102624
    售14.49-16.69万元 别克GL6正式上市 //www.autohome.com.cn/news/201710/908024.html#pvareaid=102624
    售10.99-14.39万元 别克新款英朗上市 //www.autohome.com.cn/news/201710/908023.html#pvareaid=102624
    中型SUV/1.6T动力 中华V7申报图曝光 //www.autohome.com.cn/news/201710/908128.html#pvareaid=102624
    拉低门槛 奔驰C级或换装全新1.3T发动机 //www.autohome.com.cn/news/201710/908114.html#pvareaid=102624
    外观造型硬朗 昌河全新SUV申报图曝光 //www.autohome.com.cn/news/201710/908111.html#pvareaid=102624
    将于年内正式投产 捷豹XEL实车曝光 //www.autohome.com.cn/news/201710/908101.html#pvareaid=102624
    与海外版一致 英菲尼迪新款Q50L申报图 //www.autohome.com.cn/news/201710/908108.html#pvareaid=102624
    或11月上市/两种动力 荣威RX3实车到店 //www.autohome.com.cn/news/201710/908106.html#pvareaid=102624
    更年轻 北汽新能源EC180/200推定制套装 //www.autohome.com.cn/news/201710/908107.html#pvareaid=102624
    即将“复活” 别克全新凯越申报图曝光 //www.autohome.com.cn/news/201710/908105.html#pvareaid=102624
    内饰焕然一新 全新牧马人产品手册曝光 //www.autohome.com.cn/news/201710/908102.html#pvareaid=102624
    售16.78-17.98万元 长安CS95荣耀版上市 //www.autohome.com.cn/news/201710/908103.html#pvareaid=102624
    售9.98-18.68万 2018款荣威RX5上市 //www.autohome.com.cn/news/201710/908094.html#pvareaid=102624
    """
    
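     The hard-coded 'gbk' above works because Autohome serves GBK-encoded pages. If you do not know a page's encoding in advance, requests can guess it from the body; a small sketch using response.apparent_encoding:

    import requests

    response = requests.get("http://www.autohome.com.cn/news/")
    # let requests guess the encoding from the content instead of hard-coding 'gbk'
    response.encoding = response.apparent_encoding
    print(response.encoding)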

    III. BeautifulSoup

    BeautifulSoup is a module that takes an HTML or XML string, parses it, and then lets you quickly search for specific elements with the methods it provides, which makes finding elements in HTML or XML straightforward.

    Installing BeautifulSoup on Windows: pip install beautifulsoup4 (the example below also uses the lxml parser, installed with pip install lxml)

    from bs4 import BeautifulSoup
     
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    asdf
        <div class="title">
            <b>The Dormouse's story总共</b>
            <h1>f</h1>
        </div>
    <div class="story">Once upon a time there were three little sisters; and their names were
        <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</div>
    ad<br/>sf
    <p class="story">...</p>
    </body>
    </html>
    """
     
    soup = BeautifulSoup(html_doc, features="lxml")
    # find the first <a> tag
    tag1 = soup.find(name='a')
    # find all <a> tags
    tag2 = soup.find_all(name='a')
    # find the tag with id=link2
    tag3 = soup.select('#link2')
    

    1. name, the tag's name

    # tag = soup.find('a')
    # name = tag.name  # get
    # print(name)
    # tag.name = 'span'  # set
    # print(soup)
    

    2. attrs, the tag's attributes

    # tag = soup.find('a')
    # attrs = tag.attrs    # get
    # print(attrs)
    # tag.attrs = {'ik':123}  # set (replace all attributes)
    # tag.attrs['id'] = 'iiiii'  # set a single attribute
    # print(soup)
    

     3. children, all direct child nodes

    # body = soup.find('body')
    # v = body.children
    

     4. descendants, all descendant nodes (children, grandchildren, and so on)

    # body = soup.find('body')
    # v = body.descendants
    
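     The difference between the two is depth: children yields only the direct children, while descendants walks the whole subtree. A small sketch against the html_doc parsed above (the expected output is shown in the comments):

    from bs4.element import Tag

    body = soup.find('body')

    # children: only the direct children of <body>
    print([c.name for c in body.children if isinstance(c, Tag)])
    # ['div', 'div', 'br', 'p']

    # descendants: every tag nested anywhere under <body>
    print([d.name for d in body.descendants if isinstance(d, Tag)])
    # ['div', 'b', 'h1', 'div', 'a', 'span', 'a', 'a', 'br', 'p']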

    5. clear, empty out everything inside the tag (the tag itself is kept)

    # tag = soup.find('body')
    # tag.clear()
    # print(soup)
    

     6. decompose, recursively remove the tag and all of its contents

    # body = soup.find('body')
    # body.decompose()
    # print(soup)
    

     7. extract, recursively remove the tag and return the removed tag

    # body = soup.find('body')
    # v = body.extract()
    # print(soup)
    

     8. decode, convert to a string (including the current tag); decode_contents (excluding the current tag)

    # body = soup.find('body')
    # v = body.decode()
    # v = body.decode_contents()
    # print(v)
    

     9. encode, convert to bytes (including the current tag); encode_contents (excluding the current tag)

    # body = soup.find('body')
    # v = body.encode()
    # v = body.encode_contents()
    # print(v)
    

     10. find, get the first matching tag

    # tag = soup.find('a')
    # print(tag)
    # tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
    # tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
    # print(tag)
    

     11. find_all, get all matching tags

    # tags = soup.find_all('a')
    # print(tags)
    
    # tags = soup.find_all('a', limit=1)
    # print(tags)
    
    # tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
    # # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
    # print(tags)
    
    
    # ####### lists #######
    # v = soup.find_all(name=['a','div'])
    # print(v)
    
    # v = soup.find_all(class_=['sister0', 'sister'])
    # print(v)
    
    # v = soup.find_all(text=['Tillie'])
    # print(v, type(v[0]))
    
    
    # v = soup.find_all(id=['link1','link2'])
    # print(v)
    
    # v = soup.find_all(href=['link1','link2'])
    # print(v)
    
    # ####### regular expressions #######
    import re
    # rep = re.compile('p')
    # rep = re.compile('^p')
    # v = soup.find_all(name=rep)
    # print(v)
    
    # rep = re.compile('sister.*')
    # v = soup.find_all(class_=rep)
    # print(v)
    
    # rep = re.compile('http://www.oldboy.com/static/.*')
    # v = soup.find_all(href=rep)
    # print(v)
    
    # ####### filtering with a function #######
    # def func(tag):
    #     return tag.has_attr('class') and tag.has_attr('id')
    # v = soup.find_all(name=func)
    # print(v)
    
    
    # ## get, read a tag attribute
    # tag = soup.find('a')
    # v = tag.get('id')
    # print(v)

    12. has_attr, check whether the tag has a given attribute

    # tag = soup.find('a')
    # v = tag.has_attr('id')
    # print(v)
    

     13. get_text, get the text inside the tag

    # tag = soup.find('a')
    # v = tag.get_text()
    # print(v)
    

     14. index, get the index position of a tag within another tag

    # tag = soup.find('body')
    # v = tag.index(tag.find('div'))
    # print(v)
     
    # tag = soup.find('body')
    # for i,v in enumerate(tag):
    #     print(i,v)
    

     15. is_empty_element, whether the tag is an empty (void) or self-closing element, i.e. one of: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

    # tag = soup.find('br')
    # v = tag.is_empty_element
    # print(v)
    

     16. Navigating to related tags of the current tag

    # soup.next
    # soup.next_element
    # soup.next_elements
    # soup.next_sibling
    # soup.next_siblings
     
    #
    # tag.previous
    # tag.previous_element
    # tag.previous_elements
    # tag.previous_sibling
    # tag.previous_siblings
     
    #
    # tag.parent
    # tag.parents
    

     17. Searching for tags related to a given tag

    # tag.find_next(...)
    # tag.find_all_next(...)
    # tag.find_next_sibling(...)
    # tag.find_next_siblings(...)
     
    # tag.find_previous(...)
    # tag.find_all_previous(...)
    # tag.find_previous_sibling(...)
    # tag.find_previous_siblings(...)
     
    # tag.find_parent(...)
    # tag.find_parents(...)
     
    # these take the same arguments as find_all
    

     18. select, select_one: CSS selectors

    soup.select("title")
    
    soup.select("p:nth-of-type(3)")
    
    soup.select("body a")
    
    soup.select("html head title")
    
    tag = soup.select("span,a")
    
    soup.select("head > title")
    
    soup.select("p > a")
    
    soup.select("p > a:nth-of-type(2)")
    
    soup.select("p > #link1")
    
    soup.select("body > a")
    
    soup.select("#link1 ~ .sister")
    
    soup.select("#link1 + .sister")
    
    soup.select(".sister")
    
    soup.select("[class~=sister]")
    
    soup.select("#link1")
    
    soup.select("a#link2")
    
    soup.select('a[href]')
    
    soup.select('a[href="http://example.com/elsie"]')
    
    soup.select('a[href^="http://example.com/"]')
    
    soup.select('a[href$="tillie"]')
    
    soup.select('a[href*=".com/el"]')
    
    
    from bs4.element import Tag
    
    def default_candidate_generator(tag):
        for child in tag.descendants:
            if not isinstance(child, Tag):
                continue
            if not child.has_attr('href'):
                continue
            yield child
    
    tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
    print(type(tags), tags)
    
    from bs4.element import Tag
    def default_candidate_generator(tag):
        for child in tag.descendants:
            if not isinstance(child, Tag):
                continue
            if not child.has_attr('href'):
                continue
            yield child
    
    tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
    print(type(tags), tags)

    19. Tag content

    # tag = soup.find('span')
    # print(tag.string)          # get
    # tag.string = 'new content' # set
    # print(soup)
     
    # tag = soup.find('body')
    # print(tag.string)
    # tag.string = 'xxx'
    # print(soup)
     
    # tag = soup.find('body')
    # v = tag.stripped_strings  # generator over the text of every tag inside, recursively
    # print(v)
    

     20. append, append a tag at the end of the current tag's contents

    # tag = soup.find('body')
    # tag.append(soup.find('a'))
    # print(soup)
    #
    # from bs4.element import Tag
    # obj = Tag(name='i',attrs={'id': 'it'})
    # obj.string = '我是一个新来的'
    # tag = soup.find('body')
    # tag.append(obj)
    # print(soup)
    

     21. insert, insert a tag at a given position inside the current tag

    # from bs4.element import Tag
    # obj = Tag(name='i', attrs={'id': 'it'})
    # obj.string = '我是一个新来的'
    # tag = soup.find('body')
    # tag.insert(2, obj)
    # print(soup)
    

     22. insert_after, insert_before: insert after or before the current tag

    # from bs4.element import Tag
    # obj = Tag(name='i', attrs={'id': 'it'})
    # obj.string = '我是一个新来的'
    # tag = soup.find('body')
    # # tag.insert_before(obj)
    # tag.insert_after(obj)
    # print(soup)
    

    23. replace_with, replace the current tag with the given tag

    # from bs4.element import Tag
    # obj = Tag(name='i', attrs={'id': 'it'})
    # obj.string = '我是一个新来的'
    # tag = soup.find('div')
    # tag.replace_with(obj)
    # print(soup)
    

     24. Creating relationships between tags

    # tag = soup.find('div')
    # a = soup.find('a')
    # tag.setup(previous_sibling=a)
    # print(tag.previous_sibling)
    

     25. wrap, wrap the current tag inside the given tag

    # from bs4.element import Tag
    # obj1 = Tag(name='div', attrs={'id': 'it'})
    # obj1.string = '我是一个新来的'
    #
    # tag = soup.find('a')
    # v = tag.wrap(obj1)
    # print(soup)
     
    # tag = soup.find('a')
    # v = tag.wrap(soup.find('p'))
    # print(soup)
    

     26. unwrap, remove the current tag but keep what it wraps

    # tag = soup.find('a')
    # v = tag.unwrap()
    # print(soup)
    

     IV. Crawling GitHub and the Chouti news page

    Automated GitHub login

    #!/usr/bin/python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    import requests
    
    # 1. Fetch the login page to get the CSRF token and the initial cookie
    r1 = requests.get(url='https://github.com/login')
    s1 = BeautifulSoup(r1.text,'html.parser')
    val = s1.find(attrs={'name':'authenticity_token'}).get('value')
    # the server hands back a cookie on this first GET
    r1_cookie_dict = r1.cookies.get_dict()
    
    # 2. Send the user credentials
    r2 = requests.post(
        url='https://github.com/session',
        data={
            'commit':'Sign in',
            'utf8':'✓',
            'authenticity_token':val,
            'login':'xxx',
            'password':'xxx'
        },
        cookies = r1_cookie_dict
    )
    
    r2_cookie_dict = r2.cookies.get_dict()
    
    print(r1_cookie_dict)
    print(r2_cookie_dict)
    
    all_cookies = {}
    
    all_cookies.update(r1_cookie_dict)
    all_cookies.update(r2_cookie_dict)
    
    # 3. For GitHub, the cookies returned after the authenticated POST are enough
    
    r3 = requests.get('https://github.com/settings/emails',cookies=r2_cookie_dict)
    print(r3.text)
    

     Log in to Chouti (抽屉) and upvote a post automatically

    #!/usr/bin/python
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    import requests
    
    r1 = requests.get(url='http://dig.chouti.com/')
    r1_cookies_dict = r1.cookies.get_dict()
    
    r2 = requests.post(
        url='http://dig.chouti.com/login',
        data={
            'phone':'xxx',
            'password':'xxx',
            'oneMonth':1
        },
        cookies = r1_cookies_dict
    )
    r2_cookies_dict = r2.cookies.get_dict()
    
    print(r1_cookies_dict)
    print(r2_cookies_dict)
    
    all_cookies = {}
    
    all_cookies.update(r1_cookies_dict)
    all_cookies.update(r2_cookies_dict)
    
    
    r3 = requests.post('http://dig.chouti.com/link/vote?linksId=14708906',cookies=r1_cookies_dict)
    print(r3.text)
    

     Note: some login pages do not hand out the session cookie at login time. You have to GET the page once first to receive the cookie, and the login request merely authorizes that cookie; in that case you keep sending the cookie from the initial GET and do not need the cookie from the login response (which is why the vote request above uses r1_cookies_dict).
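
     requests can also track cookies for you: a requests.Session object stores the cookies from every response and sends them on later requests, so the manual get_dict()/update() bookkeeping above disappears. A minimal sketch of the Chouti flow using a session (the credentials and vote URL are the same placeholders as above):

    import requests

    session = requests.Session()

    # the first GET stores the anonymous cookie inside the session
    session.get('http://dig.chouti.com/')

    # the login POST authorizes that cookie; the session keeps sending it
    session.post(
        'http://dig.chouti.com/login',
        data={'phone': 'xxx', 'password': 'xxx', 'oneMonth': 1},
    )

    # later requests reuse the authorized cookie automatically
    r = session.post('http://dig.chouti.com/link/vote?linksId=14708906')
    print(r.text)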

    V. Polling and long polling

    1. Polling: the client sends Ajax requests to the server on a timer; the server returns a response immediately and closes the connection.
      Pros: the backend is easy to write.
      Cons: most of the requests are useless, wasting bandwidth and server resources.
      Typical use: small applications.

    2. Long polling: the client sends an Ajax request; the server holds the connection until a new message arrives, then returns the response and closes the connection. After handling the response, the client immediately sends a new request. The server also sets a timeout: when it fires, the server closes the connection and the client reconnects so the server can hold the new request again (see the client-side sketch after this section).
      Pros: no frequent requests while there are no messages.
      Cons: holding connections open consumes server resources.
      Examples: WebQQ, Hi web version, Facebook IM.

      In addition, long connections and socket connections are distinct from the above:

        1. Long connection: embed a hidden iframe in the page and point its src at a long-lived request; the server can then push data to the client continuously.
          Pros: messages arrive instantly and no useless requests are sent.
          Cons: keeping a long connection open adds server overhead.
          Example: Gmail chat.
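
     On the client side, long polling as described in point 2 is just an ordinary request loop with a generous timeout: the request blocks until the server answers or the timeout fires, and the loop then immediately re-issues the request so the server can hold it again. A minimal sketch; the /messages URL is a placeholder, not a real endpoint:

    import requests

    def long_poll(url):
        while True:
            try:
                # the server holds this request open until a new message arrives
                r = requests.get(url, timeout=60)
                if r.status_code == 200:
                    print("new message:", r.text)
            except requests.exceptions.Timeout:
                # no message within the timeout window - just reconnect
                pass

    # long_poll("http://example.com/messages")  # placeholder endpoint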
