• Crawler basics, and the usage and fundamental principles of the Scrapy framework


    Web crawlers

    I. Asynchronous IO

    Thread: the smallest unit of work that the CPU schedules.

    For IO-bound work (IO requests), multithreading is the better choice; for CPU-bound work, multiprocessing is best, because an IO request spends its time waiting rather than using the CPU.

    Custom thread pool

    Process: a process has a main thread by default, can contain multiple threads, and those threads share the process's resources.

    Custom process pool

    Coroutine: a single thread inside a process services multiple tasks; also called a micro-thread (pseudo-thread).

    GIL: specific to CPython; it locks the threads within a process so that only one thread can be scheduled on the CPU at any given moment.
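
    To make the IO-bound vs CPU-bound point concrete, here is a minimal sketch (assuming a pure-Python busy loop as the CPU-bound task) that times ThreadPoolExecutor against ProcessPoolExecutor; under the GIL the thread version gains nothing, while processes can use several cores:

    # Sketch: the GIL's effect on CPU-bound work (threads vs processes).
    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


    def cpu_task(n):
        # pure-Python loop: holds the GIL the entire time it runs
        total = 0
        for i in range(n):
            total += i * i
        return total


    def timed(executor_cls):
        start = time.time()
        with executor_cls(4) as ex:
            list(ex.map(cpu_task, [2_000_000] * 4))
        return time.time() - start


    if __name__ == '__main__':  # guard required on Windows for process pools
        print('threads:   %.2fs' % timed(ThreadPoolExecutor))
        print('processes: %.2fs' % timed(ProcessPoolExecutor))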

    # Author:wylkjj
    # Date:2020/2/24
    # -*- coding:utf-8 -*-
    import requests
    # thread pool
    from concurrent.futures import ThreadPoolExecutor
    # process pool
    from concurrent.futures import ProcessPoolExecutor
    
    
    def async_url(url):
        try:
            response = requests.get(url)
        except Exception as e:
            print('request failed:', url, e)
            return
        print('got result:', response.url, len(response.content))
    
    
    url_list = [
        'http://www.baidu.com',
        'http://www.chouti.com',
        'http://www.bing.com',
        'http://www.google.com',
    ]
    # thread pool: five worker threads; threads suit IO-bound requests
    # the GIL only restricts CPU execution and is released while a thread blocks on IO
    pool = ThreadPoolExecutor(5)
    # process pool: five worker processes; heavier, better suited to CPU-bound work
    pools = ProcessPoolExecutor(5)
    
    for url in url_list:
        print('start request:', url)
        pool.submit(async_url, url)
    
    pool.shutdown(wait=True)
    
    # callback: future = pool.submit(...); future.add_done_callback(callback_func)
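
    The callback noted above attaches to the Future returned by pool.submit(); a minimal sketch with a hypothetical on_done callback:

    # Sketch: attaching a done-callback to each submitted task.
    import requests
    from concurrent.futures import ThreadPoolExecutor


    def fetch(url):
        return requests.get(url)


    def on_done(future):
        # runs once fetch() has finished for this task
        response = future.result()
        print('callback got:', response.url, response.status_code)


    pool = ThreadPoolExecutor(5)
    for url in ['http://www.baidu.com', 'http://www.bing.com']:
        future = pool.submit(fetch, url)
        future.add_done_callback(on_done)
    pool.shutdown(wait=True)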
    
    

    Asynchronous IO modules:

    import asyncio. Limitation: it only offers TCP-level primitives (and sleep); it does not ship an HTTP client.

    Event loop: get_event_loop()

    @asyncio.coroutine and yield from must be used together; this is the fixed (pre-async/await) syntax.

    Asynchronous IO options:

    • asyncio + aiohttp; asyncio + requests
    • gevent + requests: combining the two gave rise to the grequests library
    • twisted
    • tornado: asynchronous, non-blocking IO

    # Author:wylkjj
    # Date:2020/2/24
    # -*- coding:utf-8 -*-
    # asyncio example
    import asyncio
    
    
    @asyncio.coroutine
    def func1():
        print('before...func1......')
        yield from asyncio.sleep(5)
        print('end...func1......')
    
    
    tasks = [func1(), func1()]
    loop = asyncio.get_event_loop()  # event loop
    loop.run_until_complete(asyncio.gather(*tasks))  # pass the tasks in as a list
    loop.close()
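
    For reference, the same example in the async/await syntax that replaced @asyncio.coroutine / yield from in Python 3.5+ (a sketch; asyncio.run requires Python 3.7+):

    # Sketch: the same example with async/await.
    import asyncio


    async def func1():
        print('before...func1......')
        await asyncio.sleep(5)
        print('end...func1......')


    async def main():
        await asyncio.gather(func1(), func1())


    asyncio.run(main())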
    
    # Author:wylkjj
    # Date:2020/2/25
    # -*- coding:utf-8 -*-
    import asyncio
    
    
    @asyncio.coroutine
    def fetch_async(host, url='/'):
        print(host, url)
        reader, writer = yield from asyncio.open_connection(host, 80)
    
        request_header_content = """GET %s HTTP/1.0
    Host: %s
    
    """ % (url, host,)
        request_header_content = bytes(request_header_content, encoding='utf-8')
    
        writer.write(request_header_content)
        yield from writer.drain()
        text = yield from reader.read()
        print(host, url, str(text, encoding='utf-8'))
        writer.close()
    
    tasks = [
        fetch_async('www.cnblogs.com', '/eric/'),
        fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
    ]
    
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()
    
    # Author:wylkjj
    # Date:2020/2/25
    # -*- coding:utf-8 -*-
    # HTTP requests with aiohttp + asyncio
    # Note: aiohttp.request() used with yield from like this is the legacy
    # (pre-1.0) aiohttp API; current aiohttp versions use ClientSession
    # with async/await (see the sketch below).
    import aiohttp
    import asyncio
    
    
    @asyncio.coroutine
    def fetch_async(url):
        print(url)
        response = yield from aiohttp.request('GET', url)
        # data = yield from response.read()
        # print(url, data)
        print(url, response)
        response.close()
    
    
    tasks = [
        fetch_async('http://www.cnblogs.com/eric/'),
        fetch_async('http://www.baidu.com'),
    ]
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()
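
    Current aiohttp releases expect ClientSession with async/await instead of the legacy call above; a minimal sketch of the equivalent:

    # Sketch: aiohttp 3.x style with ClientSession and async/await.
    import asyncio
    
    import aiohttp
    
    
    async def fetch_async(url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                body = await response.read()
                print(url, response.status, len(body))
    
    
    async def main():
        await asyncio.gather(
            fetch_async('http://www.cnblogs.com/eric/'),
            fetch_async('http://www.baidu.com'),
        )
    
    
    asyncio.run(main())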
     
    
    # Author:wylkjj
    # Date:2020/2/25
    # -*- coding:utf-8 -*-
    # asyncio can also drive HTTP by running blocking requests calls in a thread-pool executor
    import asyncio
    import requests
    
    
    @asyncio.coroutine
    def fetch_async(func, *args):
        print(args)
        # get the event loop and hand the blocking call to its default thread-pool executor
        loop = asyncio.get_event_loop()
        future = loop.run_in_executor(None, func, *args)
        response = yield from future
        print(response.url, response.content)
    
    
    tasks = [
        fetch_async(requests.get, 'http://www.cnblogs.com/eric/'),
        fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
    ]
    
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(asyncio.gather(*tasks))
    loop.close()
    
    
    
    # Author:wylkjj
    # Date:2020/2/25
    # -*- coding:utf-8 -*-
    import gevent
    from gevent import monkey
    monkey.patch_all()
    
    import requests
    
    
    def fetch_async(method, url, req_kwargs):
        print(method, url, req_kwargs)
        response = requests.request(method=method, url=url, **req_kwargs)
        print(response.url, response.content)
    
    
    # ##### send the requests #####
    gevent.joinall([
        gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
        gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
        gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
    ])
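
    grequests, mentioned earlier as the combination of gevent and requests, packages this same pattern; a minimal sketch (assuming grequests is installed):

    # Sketch: the same concurrent requests via grequests.
    import grequests
    
    request_list = [
        grequests.get('https://www.python.org/'),
        grequests.get('https://github.com/'),
    ]
    response_list = grequests.map(request_list)  # sends all requests concurrently
    print(response_list)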
    
    # pip3 install twisted
    # If pip cannot build Twisted on Windows:
    #   a. pip3 install wheel
    #   b. download a Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    #   c. cd into the download directory and run, for example:
    #      pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
    
    
    from twisted.web.client import getPage
    from twisted.internet import reactor
    
    REV_COUNTER = 0
    REQ_COUNTER = 0
    
    def callback(contents):
        print(contents,)
    
        global REV_COUNTER
        REV_COUNTER += 1
        if REV_COUNTER == REQ_COUNTER:
            reactor.stop()
    
    
    url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
    REQ_COUNTER = len(url_list)
    for url in url_list:
        print(url)
        deferred = getPage(bytes(url, encoding='utf8'))
        deferred.addCallback(callback)
    reactor.run()
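
    Tornado, listed earlier as the asynchronous non-blocking option, can do the same through its AsyncHTTPClient; a minimal sketch in Tornado's coroutine style (assuming Tornado 5+):

    # Sketch: concurrent fetches with Tornado's AsyncHTTPClient.
    from tornado import gen, ioloop
    from tornado.httpclient import AsyncHTTPClient
    
    
    @gen.coroutine
    def fetch_async(url):
        http_client = AsyncHTTPClient()
        response = yield http_client.fetch(url)
        print(url, len(response.body))
    
    
    @gen.coroutine
    def run():
        # yielding a list waits for all the futures in it
        yield [fetch_async('http://www.bing.com'), fetch_async('http://www.baidu.com')]
    
    
    ioloop.IOLoop.current().run_sync(run)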
    

    import socket: provides the standard BSD Sockets API, giving access to the full set of low-level operating-system socket methods.

    How the Tornado framework works

    Custom asynchronous IO:
    built on sockets with setblocking(False),
    plus IO multiplexing (which is itself still synchronous IO):
    while True:
        r, w, e = select.select([...], [...], [...], 1)
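
    A minimal sketch of that idea (assuming plain HTTP on port 80): non-blocking sockets driven by a select() loop, which is essentially what Tornado's IOLoop is built on:

    # Sketch: a hand-rolled async HTTP client using non-blocking sockets + select.
    import select
    import socket
    
    
    def fetch_all(hosts):
        unsent = {}    # socket -> host, request not yet written
        pending = {}   # socket -> host, still reading the response
        for host in hosts:
            s = socket.socket()
            s.setblocking(False)              # never block on connect/recv
            try:
                s.connect((host, 80))
            except BlockingIOError:           # expected for a non-blocking connect
                pass
            unsent[s] = host
            pending[s] = host
    
        while pending:
            r, w, e = select.select(list(pending), list(unsent), [], 1)
            for s in w:                       # writable: connected, send the request
                host = unsent.pop(s)
                s.sendall(('GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % host).encode())
            for s in r:                       # readable: response data or EOF
                data = s.recv(8096)
                if data:
                    print(pending[s], 'received', len(data), 'bytes')
                else:                         # empty read: server closed the connection
                    s.close()
                    del pending[s]
    
    
    fetch_all(['www.baidu.com', 'www.bing.com'])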

    For more detail on IO, see this post on the event-driven IO model: https://www.cnblogs.com/wylshkjj/p/10896994.html

    II. The Scrapy framework

    Installing Scrapy

    Linux:
    pip3 install scrapy

    Windows:
    1. pip3 install wheel
       Install Twisted (the wheel name below only shows the naming format; use the build that matches your Python version and platform):
       a. Download Twisted-19.1.0-cp37-cp37m-win_amd64.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
       b. cd into the directory containing the file
       c. pip3 install Twisted-19.1.0-cp37-cp37m-win_amd64.whl
    2. pip3 install scrapy (this can conflict with an already-installed urllib3 module; if present, uninstall it first)
    3. On Windows, Scrapy also depends on pywin32: https://sourceforge.net/projects/pywin32/files/

    Creating and running a project

    1. How to use Scrapy
    2. Create a new project: scrapy startproject scy (run this in the directory where you want the project; scy is the project name)
    3. Create a spider: scrapy genspider example example.com (cd into the project directory first; example is the spider file name, example.com is the site to crawl)
    4. Run a spider: scrapy crawl chouti (chouti is the spider to run)
    5. Suppress the log output: scrapy crawl chouti --nolog (runs the chouti spider without printing its crawl log)
    6. List the spider templates: scrapy genspider --list (shows the four templates: basic, crawl, csvfeed, xmlfeed)
    7. robots.txt permission: to stop the spider from obeying robots.txt restrictions, set ROBOTSTXT_OBEY = False in the project's settings file
    8. project_name/
      • scrapy.cfg  the project's main configuration file
      • project_name/
        • __init__.py
        • items.py  data-storage templates for structured data, similar to Django's Model
        • pipelines.py  data-processing behaviour, e.g. persisting the structured data
        • settings.py  the real configuration file: recursion depth, concurrency, download delay, etc.
        • spiders/  spider directory, e.g. create files here and write the crawl rules
          • __init__.py
          • spider1.py
          • spider2.py
    9. Note: spiders are created from the command line, and both the project and individual spider files are also run from the command line
    # Partial project code: crawling images from the Umei gallery
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request
    from bs4 import BeautifulSoup
    
    
    class UmeiSpider(scrapy.Spider):
        name = 'umei'
        allowed_domains = ['umei.cc']
        start_urls = ['https://www.umei.cc/meinvtupian/meinvxiezhen/1.htm']
        visited_set = set()
    
        def parse(self, response):
            self.visited_set.add(response.url)  # record the pages already crawled
            # 1. Scrape all the gallery images on the current page:
            #    grab the <a> tags whose class is TypeBigPics
            main_page = BeautifulSoup(response.text, "html.parser")
            item_list = main_page.find_all("a", attrs={'class': 'TypeBigPics'})
            for item in item_list:
                img_list = item.find_all("img")
                print(img_list)
    
            # 2. Collect the pagination links: https://www.umei.cc/meinvtupian/meinvxiezhen/(d+).htm
            page_list = main_page.find_all("div", attrs={'class': 'NewPages'})
            a_urls = 'https://www.umei.cc/meinvtupian/meinvxiezhen/'
            a_list = page_list[0].find_all("a")
            a_href = set()
            for a in a_list:
                a = a.get('href')
                if a:
                    a_href.add(a_urls + a)
            for i in a_href:
                if i not in self.visited_set:
                    obj = Request(url=i, method='GET', callback=self.parse)
                    yield obj
                    print("obj:", obj)
    
  • Original post: https://www.cnblogs.com/wylshkjj/p/12365770.html