• [Python Web Scraping] A First Look at the Scrapy Framework


    A First Look at the Scrapy Framework

    I. How Scrapy Works

    Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way.

    Today, however, Scrapy is used far more broadly: for data mining, monitoring, and automated testing, as well as for consuming data returned by APIs (such as Amazon Associates Web Services) or building general-purpose crawlers.

    Scrapy is built on top of Twisted, a popular event-driven Python networking framework, so it achieves concurrency with non-blocking (i.e. asynchronous) code. The overall architecture looks roughly as follows.

    Scrapy consists mainly of the following components:

    • Engine (Scrapy Engine)
      Handles the data flow across the whole system and triggers events; it is the core of the framework.
    • Scheduler
      Accepts requests sent over by the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses, or links, to crawl): it decides which URL to fetch next and removes duplicate URLs.
    • Downloader
      Downloads page content and hands it back to the spiders. (The downloader is built on Twisted, an efficient asynchronous model.)
    • Spiders
      Spiders do the real work: they extract the information you need, the so-called Items, from specific pages. You can also extract links from a page and let Scrapy go on to crawl the next one.
    • Item Pipeline
      Processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and discarding unwanted data. Once a page has been parsed by a spider, the results are sent to the item pipeline and pass through several processing steps in a fixed order.
    • Downloader Middlewares
      A layer between the engine and the downloader that processes the requests and responses passing between them.
    • Spider Middlewares
      A layer between the engine and the spiders that processes the spiders' response input and request output.
    • Scheduler Middlewares
      A layer between the engine and the scheduler that processes the requests and responses passing between them.

    The Scrapy workflow is roughly as follows:

    1. The engine takes a link (URL) from the scheduler for the next crawl
    2. The engine wraps the URL in a Request and passes it to the downloader
    3. The downloader fetches the resource and wraps it in a Response
    4. A spider parses the Response
    5. If an Item is parsed out, it is handed to the item pipeline for further processing
    6. If a link (URL) is parsed out, it is handed to the scheduler to wait its turn to be crawled
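
    To make the cycle concrete, here is a minimal sketch of a spider written against quotes.toscrape.com (the site used later in this post with scrapy parse); the XPath expressions are assumptions about that page's markup. Each callback either yields items, which go to the item pipeline (step 5), or yields new Requests, which go back to the scheduler (step 6).

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # step 5: anything yielded as a dict (or Item) is sent to the item pipeline
            for quote in response.xpath('//div[@class="quote"]'):
                yield {'text': quote.xpath('.//span[@class="text"]/text()').extract_first()}
            # step 6: anything yielded as a Request goes back to the scheduler to be crawled next
            next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)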

    Installing Scrapy

    #Windows
        1. pip3 install wheel   # enables installing packages from .whl files; wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs
        2. pip3 install lxml
        3. pip3 install pyopenssl
        4. download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
           pip3 install pywin32
        5. download the Twisted wheel: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
        6. pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
        7. pip3 install scrapy
      
    #Linux
        1. pip3 install scrapy
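
    To confirm the installation, a quick check like the following works (a sketch; the version printed is whatever you installed, e.g. 1.5.0 in the terminal capture further below):

    import scrapy
    print(scrapy.__version__)      # e.g. '1.5.0'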

    II. Basic Usage

    Basic commands

    1. scrapy startproject <project name>
       - creates a project skeleton in the current directory (similar to Django)
     
    2. scrapy genspider [-t template] <name> <domain>
       - creates a spider
       e.g.:
          scrapy genspider -t basic oldboy oldboy.com
          scrapy genspider -t xmlfeed autohome autohome.com.cn
       PS:
          list the available templates:    scrapy genspider -l
          show a template's contents:      scrapy genspider -d <template name>
     
    3. scrapy list
       - lists the spiders in the project
     
    4. scrapy crawl [spider name]
       - runs a single spider

    The command-line tool in detail

    #1 Getting help
        scrapy -h
        scrapy <command> -h
    
    #2 There are two kinds of commands: Project-only commands must be run inside a project folder, Global commands can be run from anywhere
        Global commands:
            startproject #create a project
            genspider    #create a spider
            settings     #when run inside a project directory, prints that project's settings
            runspider    #run a standalone python spider file without creating a project
            shell        #scrapy shell <url>  interactive debugging, e.g. to check whether a selector rule is correct
            fetch        #fetch a single page independently of any project; handy for inspecting the request/response headers
            view         #download the page and open it in a browser, which makes it easy to tell which data is loaded by ajax
            version      #scrapy version shows Scrapy's version; scrapy version -v also shows the versions of its dependencies
        Project-only commands:
            crawl        #run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in the settings
            check        #check the project's spiders for errors
            list         #list the spiders contained in the project
            edit         #open a spider in an editor; rarely used
            parse        #scrapy parse <url> --callback <callback>  verifies that our callback works correctly
            bench        #scrapy bench  quick benchmark / stress test
    
    #3 Official documentation
        https://docs.scrapy.org/en/latest/topics/commands.html
    Global commands work from any folder, with no project required; the walkthrough below runs the commands against a project.
    1. scrapy startproject Myproject   # create a project
       cd Myproject                    # change into the project folder
       
    2. scrapy genspider baidu www.baidu.com   # create a spider; baidu is the spider name, used to identify it
    # a default start url is generated from the domain you give
    
    3. scrapy settings --get BOT_NAME   # read a value from the settings
    
    # global: 4. scrapy runspider baidu.py   # run a standalone spider file
    
    5. scrapy runspider AMAZON\spiders\amazon.py   # run the spider file directly
       inside the project: scrapy crawl amazon     # give the spider name so Scrapy can locate and run it
        # robots.txt is the anti-crawling convention: a file on the target site that states what may and may not be crawled
        # whether ignoring it is legal differs from country to country, which is why the convention exists
        # the default is ROBOTSTXT_OBEY = True
        # change it to ROBOTSTXT_OBEY = False if the crawler should not honor robots.txt
        
    6. scrapy shell https://www.baidu.com   # send a request straight to the target site and drop into an interactive shell
         response
         response.status
         response.body
         view(response)
         
    7. scrapy view https://www.taobao.com   # if the page opens with content missing, the missing parts are loaded by ajax; a quick way to spot that
    
    8. scrapy version      # show the scrapy version
    
    9. scrapy version -v   # show the versions of scrapy's dependencies
    
    10. scrapy fetch --nolog http://www.logou.com    # fetch and print the response body
    
    11. scrapy fetch --nolog --headers http://www.logou.com  # print the request and response headers
        (venv3_spider) E:\twisted\scrapy框架\AMAZON>scrapy fetch --nolog --headers http://www.logou.com
        > Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
        > Accept-Language: en
        > User-Agent: Scrapy/1.5.0 (+https://scrapy.org)
        > Accept-Encoding: gzip,deflate
        >
        < Content-Type: text/html; charset=UTF-8
        < Date: Tue, 23 Jan 2018 15:51:32 GMT
        < Server: Apache
        > marks request headers
        < marks response headers
        
    12. scrapy shell http://www.logou.com   # send a request straight to the target site
    
    13. scrapy check    # check the spiders for errors
    
    14. scrapy list     # list all spider names
    
    15. scrapy parse http://quotes.toscrape.com/ --callback parse   # verify that the callback runs successfully
    
    16. scrapy bench    # benchmark / stress test
    Example usage

    III. Project Structure and a Brief Look at Spiders

    What each file is for:

    • scrapy.cfg   the project's top-level configuration (the settings that actually matter for crawling live in settings.py)
    • items.py     templates for the data you store, used to structure it, like Django's Model
    • pipelines    data-processing behaviour, e.g. persisting the structured data
    • settings.py  configuration, e.g. crawl ("recursion") depth, concurrency, download delay
    • spiders      the spider directory: create files here and write the crawling rules in them

    Note: spider files are usually named after the domain of the site they crawl.
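
    For reference, a freshly generated project (assuming the project name day96 used below; newer Scrapy versions also add a middlewares.py) has roughly this layout:

    day96/
        scrapy.cfg            # top-level project configuration
        day96/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                chouti.py     # spiders created with scrapy genspider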

    day96\spiders\chouti.py

    # -*- coding: utf-8 -*-
    import scrapy
    import sys
    import io
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')     # work around Windows console encoding issues
    from scrapy.selector import Selector,HtmlXPathSelector
    
    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'     # spider name
        allowed_domains = ['chouti.com']    # allowed domains
        start_urls = [
            'http://dig.chouti.com/',       # the actual start url
        ]
    
        def parse(self, response):
            hxs = Selector(response=response).xpath('//a')
            for i in hxs:
                print(i)
    spider code

    A first try

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request
    from scrapy.selector import Selector, HtmlXPathSelector
    from ..items import ChoutiItem
    from scrapy.http.cookies import CookieJar
    
    
    class ChoutiSpider(scrapy.Spider):
        name = "chouti"
        allowed_domains = ["chouti.com",]
        start_urls = ['http://dig.chouti.com/']
    
        cookie_dict = None
        def parse(self, response):
            cookie_obj = CookieJar()
            cookie_obj.extract_cookies(response,response.request)
            self.cookie_dict = cookie_obj._cookies
            # log in with username/password plus the cookies captured above
            yield Request(
                url="http://dig.chouti.com/login",
                method='POST',
                body = "phone=8615990000000&password=xxxxxx&oneMonth=1",
                headers={'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8"},
                cookies=cookie_obj._cookies,
                callback=self.check_login
            )
    
        def check_login(self,response):
            print(response, '=== login succeeded')
            yield Request(url="http://dig.chouti.com/",callback=self.good)
    
    
        def good(self,response):
            id_list = Selector(response=response).xpath('//div[@share-linkid]/@share-linkid').extract()
            for nid in id_list:
                url = "http://dig.chouti.com/link/vote?linksId=%s" % nid
                print(nid,url)
                yield Request(
                    url=url,
                    method="POST",
                    cookies=self.cookie_dict,
                    callback=self.show
                )
    
            # page_urls = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
            # for page in page_urls:
            #     url = "http://dig.chouti.com%s" % page
            #     yield Request(url=url,callback=self.good)
    
    
        def show(self,response):
            print("点赞成功--->",response)

    From a cmd terminal, change into the project directory and run this spider with:

    scrapy crawl chouti            # run and print the full log
    scrapy crawl chouti --nolog    # run without printing the log

    Selectors for processing HTML ---> XPath

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    from scrapy.selector import Selector, HtmlXPathSelector
    from scrapy.http import HtmlResponse
    html = """<!DOCTYPE html>
    <html>
        <head lang="en">
            <meta charset="UTF-8">
            <title></title>
        </head>
        <body>
            <ul>
                <li class="item-"><a id='i1' href="link.html">first item</a></li>
                <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
                <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
            </ul>
            <div><a href="llink2.html">second item</a></div>
        </body>
    </html>
    """
    response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')
    # hxs = HtmlXPathSelector(response)
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[2]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[@id]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[@id="i1"]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
    # print(hxs)
    # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
    # print(hxs)
    # hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
    # print(hxs)
    # hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
    # print(hxs)
     
    # ul_list = Selector(response=response).xpath('//body/ul/li')
    # for item in ul_list:
    #     v = item.xpath('./a/span')
    #     # or
    #     # v = item.xpath('a/span')
    #     # or
    #     # v = item.xpath('*/a/span')
    #     print(v)
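
    As a quick check, a few of the selectors above can be appended to the script uncommented; for this HTML the expected results are shown in the trailing comments (a sketch):

    links = Selector(response=response).xpath('//a[@id]')       # every a tag that has an id attribute
    print(links.xpath('@href').extract())                       # ['link.html', 'llink.html']
    print(Selector(response=response).xpath('//body/ul/li/a/@href').extract_first())        # 'link.html'
    print(Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract())  # ['first item', 'first item']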

    ## day97 Scrapy review
    
    cmd commands:
        create the project folder -    scrapy startproject [project name]
        change into the project folder -    cd [project name]
        create a spider -    scrapy genspider [spider name] [www.example-domain.com]
    
    Writing the code:
        class ChoutiSpider(scrapy.Spider):
            name = 'chouti'
            allowed_domains = ['dig.chouti.com']
            start_urls = ['https://dig.chouti.com/',]
        
        <a> name must not be omitted
        <b> allowed_domains = ['dig.chouti.com']  the allowed domains: only the chouti site will be crawled; several domains can be listed
        <c> start_urls  the urls the crawl starts from
        <d> start_requests can be overridden to control how the initial requests are built
            def start_requests(self):
                for url in self.start_urls:
                    yield Request(url,callback=self.parse)
        <e> the response object
            response.url
            response.text
            response.body
            response.meta['depth']        # current crawl depth (capped by DEPTH_LIMIT)
        
        <f> extracting data
            Selector(response=response).xpath('//div')
            Selector(response=response).xpath('//div[@id="username"]')
            Selector(response=response).xpath('//div[starts-with(@id, "us")]')        # div tags whose id attribute starts with 'us'
            Selector(response=response).xpath('//div[re:test(@id, "^us")]')        # the same match, written as a regex
            Selector(response=response).xpath('//a[@id]')        # every a tag that has an id attribute
            Selector(response=response).xpath('//a/@id')        # the id attribute of every a tag
            Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')    # two conditions: href equal to link.html and id equal to i1
            Selector(response=response).xpath('//a[contains(@href, "link")]')    # every a tag whose href contains 'link'
            Selector(response=response).xpath('//div/a')
            
            li_list = Selector(response=response).xpath('//div//li')
            for li in li_list:
                li.xpath('./a/@href').extract()        # search starting from the current li tag
                li.xpath('.//a/text()').extract()        # returns a list
                li.xpath('.//a/text()').extract_first()        # first element of the list (or None)
                li.xpath('.//a/text()').extract()[0]        # first element of the list
        
        <g> yield Request(url='',callback='')
            yield Request(url='',callback=self.parse)
            # Request must be imported first: from scrapy.http.request import Request
        
        <h> yield Item(url='',title='',href='',)        (a short sketch follows after <i> below)
        
        <i> into the pipelines
            class Day96Pipeline(object):
                def process_item(self, item, spider):
                    if(spider.name =="chouti"):
                        """
                        when the spider is chouti, persist the data to a json file
                        """
                        tpl = "%s\n%s\n\n" % (item["href"], item["title"])
                        f = open("new.json","a+",encoding="utf8")
                        f.write(tpl)
                        f.close()
        # remember to register the pipeline in settings.py
        ITEM_PIPELINES = {
           'day96.pipelines.Day96Pipeline': 300,
           # 'day96.pipelines.Day95Pipeline': 400,        # items pass through the pipelines in ascending order of this value (lower runs first)
           # 'day96.pipelines.Day97Pipeline': 200,        # so 200 would run before 300, and 400 after it
        }
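        
        A short sketch tying <h> and <i> together (it assumes the ChoutiItem fields title and href, as defined in items.py later in this post):
            # items.py
            import scrapy
            
            class ChoutiItem(scrapy.Item):
                title = scrapy.Field()
                href = scrapy.Field()
            
            # in the spider's parse(), each extracted entry is wrapped in an item:
            #     yield ChoutiItem(title=text, href=link)
            # every yielded item is then handed to Day96Pipeline.process_item() as `item`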
            
    
    
    Today's topics:
        1. URL deduplication
            1. In settings.py, set DUPEFILTER_CLASS = "day97.duplication.RepeatFilter"
            2. Use the built-in filter as a reference (from scrapy.dupefilters import RFPDupeFilter, as imported in chouti.py) and write your own dedup class
            3. Implement the RepeatFilter class in duplication.py
                class RepeatFilter(object):
                    def __init__(self):
                        """
                        2. object initialization
                        """
                        self.visited_set = set()        # the visited urls could just as well go into a cache, a database, etc.

                    @classmethod
                    def from_settings(cls, settings):
                        """
                        1. create the object
                        :param settings:
                        :return:
                        """
                        return cls()

                    def request_seen(self, request):
                        """
                        4. check whether the url has already been visited
                        :param request:
                        :return:
                        """
                        if request.url in self.visited_set:
                            return True
                        self.visited_set.add(request.url)
                        return False

                    def open(self):  # can return deferred
                        """
                        3. crawling starts
                        :return:
                        """
                        print('open>>>')
                        pass

                    def close(self, reason):  # can return a deferred
                        """
                        5. crawling stops
                        :param reason:
                        :return:
                        """
                        print('close<<<')
                        pass

                    def log(self, request, spider):  # log that a request has been filtered
                        pass


                    # @classmethod
                    # from_settings()
                    # is called via the classmethod decorator:
                    # RepeatFilter.from_settings()    # which in turn runs __init__() to initialize the object
            
            4. configure settings
                DUPEFILTER_CLASS = "day97.duplication.RepeatFilter"

                # DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"   # the default
                # from scrapy.dupefilters import RFPDupeFilter    # open RFPDupeFilter to see how the default filter is written
        
        2. More on pipelines
            class Day97Pipeline(object):
                def __init__(self, v):
                    self.value = v

                @classmethod
                def from_crawler(cls, crawler):
                    """
                    called once at startup to create the pipeline object
                    :param crawler:
                    :return:
                    """
                    val = crawler.settings.getint('MMMM')       # read MMMM from settings.py
                    return cls(val)
    
                def open_spider(self, spider):
                    """
                    called when the spider starts running
                    :param spider:
                    :return:
                    """
                    print("spider starting >>>")
                    self.f = open("new.json", "a+", encoding="utf8")  # a database connection could be opened here instead
    
                def process_item(self, item, spider):
                    if (spider.name == "chouti"):
                        """
                        when the spider is chouti, persist the data to a json file
                        """
                        tpl = "%s\n%s\n\n" % (item["href"], item["title"])
                        self.f.write(tpl)
                        # return item to hand it on to the next pipeline
                        return item
                        # or raise DropItem() to discard the item instead
                        # raise DropItem()
    
                    elif (spider.name == "taobao"):
                        """
                            when the spider is taobao, persist the data to a database
                        """
                        pass
                    elif (spider.name == "cnblogs"):
                        """
                            when the spider is cnblogs, persist the data for display on an html page
                        """
                        pass
                    else:
                        pass
    
                def close_spider(self, spider):
                    """
                    called when the spider closes
                    :param spider:
                    :return:
                    """
                    print("spider stopped <<<")
                    self.f.close()
                    
            ## Notes
            # setting names in settings.py must be UPPERCASE, otherwise crawler.settings.getint('MMMM') will not find them
            # in process_item(), raising DropItem stops the item here; returning it hands it on to the later pipelines
            # the spider argument tells you which spider produced the item
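            
            For example (a sketch; MMMM is just the arbitrary setting name used above):
                # settings.py -- custom setting names must be UPPERCASE or Scrapy will not load them
                MMMM = 1234
                ITEM_PIPELINES = {
                    'day97.pipelines.Day97Pipeline': 300,
                }
                # then, inside Day97Pipeline.from_crawler:
                #   crawler.settings.getint('MMMM')   -> 1234
                #   getint() on a name that was never defined returns 0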
        
        3. Cookies
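            In short, the cookie handling used in the chouti example boils down to the sketch below (a condensed version of that spider, with the credentials masked and a made-up spider name):
            
            import scrapy
            from scrapy.http import Request
            from scrapy.http.cookies import CookieJar
            
            class LoginSpider(scrapy.Spider):
                name = 'login_demo'                      # hypothetical name, for illustration only
                start_urls = ['http://dig.chouti.com/']
            
                def parse(self, response):
                    cookie_jar = CookieJar()
                    cookie_jar.extract_cookies(response, response.request)   # harvest the cookies set by the first response
                    self.cookie_dict = cookie_jar._cookies                   # keep them for the follow-up requests
                    yield Request(
                        url='http://dig.chouti.com/login',
                        method='POST',
                        body='phone=86xxxxxxxxxxx&password=xxxxxx&oneMonth=1',
                        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
                        cookies=self.cookie_dict,                            # send the captured cookies back with the login
                        callback=self.check_login,
                    )
            
                def check_login(self, response):
                    print('login response:', response.status)
            
            # Note: with COOKIES_ENABLED = True (the default) Scrapy's cookies middleware also
            # tracks session cookies between requests automatically.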
        
        
        4. Extensions
            settings.py
                    EXTENSIONS = {
                       # 'scrapy.extensions.telnet.TelnetConsole': None,
                       'day97.extentions.Myextend': 300,
                    }
            
            from scrapy import signals
    
            class Myextend:
                def __init__(self, crawler):
                    self.crawler = crawler
                    # attach our handlers to Scrapy's hooks
                    # put more formally: register callbacks for the given signals
                    crawler.signals.connect(self.start,signals.engine_started)
                    crawler.signals.connect(self.close,signals.spider_closed)
    
                @classmethod
                def from_crawler(cls, crawler):
                    return cls(crawler)
    
                def start(self):
                    print("signals.engine_started...")
    
                def close(self):
                    print("signals.spider_closed...")
    
    
        5. The settings file
            (see the fully annotated settings.py in the case study below)
        
        
    Scrapy fundamentals review notes

    Case study: upvoting posts on chouti.com (no longer works; chouti has since upgraded its anti-scraping measures)

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request
    from scrapy.selector import Selector, HtmlXPathSelector
    from ..items import ChoutiItem
    
    
    from scrapy.http.cookies import CookieJar
    class ChoutiSpider(scrapy.Spider):
        name = "chouti"
        allowed_domains = ["chouti.com",]
        start_urls = ['http://dig.chouti.com/']
    
        cookie_dict = None
        def parse(self, response):
            cookie_obj = CookieJar()
            cookie_obj.extract_cookies(response,response.request)
            self.cookie_dict = cookie_obj._cookies
            # log in with username/password plus the cookies captured above
            yield Request(
                url="http://dig.chouti.com/login",
                method='POST',
                body = "phone=8615990000000&password=xxxxxx&oneMonth=1",
                headers={'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8"},
                cookies=cookie_obj._cookies,
                callback=self.check_login
            )
    
        def check_login(self,response):
            print(response, '=== login succeeded')
            yield Request(url="http://dig.chouti.com/",callback=self.good)
    
    
        def good(self,response):
            id_list = Selector(response=response).xpath('//div[@share-linkid]/@share-linkid').extract()
            for nid in id_list:
                url = "http://dig.chouti.com/link/vote?linksId=%s" % nid
                print(nid,url)
                yield Request(
                    url=url,
                    method="POST",
                    cookies=self.cookie_dict,
                    callback=self.show
                )
    
            # page_urls = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
            # for page in page_urls:
            #     url = "http://dig.chouti.com%s" % page
            #     yield Request(url=url,callback=self.good)
    
    
        def show(self,response):
            print("点赞成功--->",response)
    spiders\chouti.py
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class ChoutiItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title  = scrapy.Field()
        href  = scrapy.Field()
    items.py
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    from scrapy.exceptions import DropItem
    
    class Day97Pipeline(object):
        def __init__(self, v):
            self.value = v
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            初始化时候,用于创建pipeline对象
            :param crawler:
            :return:
            """
            val = crawler.settings.getint('MMMM')       #获取settings.py 中的MMMM
            return cls(val)
    
        def open_spider(self, spider):
            """
            爬虫开始执行时调用
            :param spider:
            :return:
            """
            # print("开始爬虫》》》")
            self.f = open("new.json", "a+", encoding="utf8")  # 也可以链接上数据库
    
        def process_item(self, item, spider):
            if (spider.name == "chouti"):
                """
                when the spider is chouti, persist the data to a json file
                """
                tpl = "%s\n%s\n\n" % (item["href"], item["title"])
                self.f.write(tpl)
                # return item to hand it on to the next pipeline
                return item
                # or raise DropItem() to discard the item instead
                # raise DropItem()
    
            elif (spider.name == "taobao"):
                """
                    when the spider is taobao, persist the data to a database
                """
                pass
            elif (spider.name == "cnblogs"):
                """
                    when the spider is cnblogs, persist the data for display on an html page
                """
                pass
            else:
                pass
    
        def close_spider(self, spider):
            """
            爬虫关闭时调用
            :param spider:
            :return:
            """
            # print("停止爬虫《《《")
            self.f.close()
    
    
    # class Day96Pipeline(object):
    #     def __init__(self, v):
    #         self.value = v
    #
    #     @classmethod
    #     def from_crawler(cls, crawler):
    #         """
    #         called once at startup to create the pipeline object
    #         :param crawler:
    #         :return:
    #         """
    #         val = crawler.settings.getint('MMMM')       # read MMMM from settings.py
    #         return cls(val)
    #
    #     def open_spider(self, spider):
    #         """
    #         called when the spider starts running
    #         :param spider:
    #         :return:
    #         """
    #         print("spider starting >>>")
    #         self.f = open("new.json", "a+", encoding="utf8")  # a database connection could be opened here instead
    #
    #     def process_item(self, item, spider):
    #         if (spider.name == "chouti"):
    #             """
    #             when the spider is chouti, persist the data to a json file
    #             """
    #             tpl = "%s\n%s\n\n" % (item["href"], item["title"])
    #             self.f.write(tpl)
    #
    #         elif (spider.name == "taobao"):
    #             """
    #                 when the spider is taobao, persist the data to a database
    #             """
    #             pass
    #         elif (spider.name == "cnblogs"):
    #             """
    #                 when the spider is cnblogs, persist the data for display on an html page
    #             """
    #             pass
    #         else:
    #             pass
    #
    #     def close_spider(self, spider):
    #         """
    #         called when the spider closes
    #         :param spider:
    #         :return:
    #         """
    #         print("spider stopped <<<")
    #         self.f.close()
    pipelines.py
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for day97 project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'day97'
    
    SPIDER_MODULES = ['day97.spiders']
    NEWSPIDER_MODULE = 'day97.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent        # i.e. spoof a browser User-Agent
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3472.3 Safari/537.36"
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False                  # do not obey robots.txt
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32       # number of concurrent requests
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    # DOWNLOAD_DELAY = 2        # wait 2 seconds between downloads
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16        # at most 16 concurrent requests per domain
    #CONCURRENT_REQUESTS_PER_IP = 16        # at most 16 concurrent requests per IP
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = True        # let the cookies middleware collect and resend cookies for you (enabled by default)
    # COOKIES_ENABLED = False     # set to False to disable cookie handling (COOKIES_DEBUG = True would log cookies while debugging)
    
    # Disable Telnet Console (enabled by default)
    # TELNETCONSOLE_ENABLED = True
    
    # Override the default request headers:     # default request headers; headers can also be passed per request in yield Request()
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares      # middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'day97.middlewares.Day97SpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'day97.middlewares.Day97DownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    EXTENSIONS = {
       # 'scrapy.extensions.telnet.TelnetConsole': None,
       'day97.extentions.Myextend': 300,
    }
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'day97.pipelines.Day97Pipeline': 300,
       # 'day97.pipelines.Day96Pipeline': 200,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)                 # smart throttling of the download delay (a fixed delay on every request is easy for the server to recognize as a bot)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)           # HTTP cache
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    DEPTH_LIMIT = 1         # maximum crawl depth
    # DEPTH_PRIORITY = 0      # 0 = depth-first (the default), 1 = breadth-first; only 0 or 1 is used here
    
    
    DUPEFILTER_CLASS = "day97.duplication.RepeatFilter"
    settings.py
    class RepeatFilter(object):
    
        def __init__(self):
            """
            2. object initialization
            """
            self.visited_set = set()        # the visited urls could just as well go into a cache, a database, etc.
    
        @classmethod
        def from_settings(cls, settings):
            """
            1. create the object
            :param settings:
            :return:
            """
            return cls()
    
        def request_seen(self, request):
            """
            4. check whether the url has already been visited
            :param request:
            :return:
            """
            if request.url in self.visited_set:
                return True
            self.visited_set.add(request.url)
            return False
    
        def open(self):  # can return deferred
            """
            3. crawling starts
            :return:
            """
            # print('open>>>')
            pass
    
        def close(self, reason):  # can return a deferred
            """
            5. crawling stops
            :param reason:
            :return:
            """
            # print('close<<<')
            pass
    
        def log(self, request, spider):  # log that a request has been filtered
            pass
    
    
    # @classmethod
    # from_settings()
    # is called via the classmethod decorator:
    # RepeatFilter.from_settings()    # which in turn runs __init__() to initialize the object
    duplication.py
    from scrapy import signals
    
    class Myextend:
        def __init__(self, crawler):
            self.crawler = crawler
            # attach our handlers to Scrapy's hooks
            # put more formally: register callbacks for the given signals
            crawler.signals.connect(self.start,signals.engine_started)
            crawler.signals.connect(self.close,signals.spider_closed)
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)
    
        def start(self):
            print("signals.engine_started...")
    
        def close(self):
            print("signals.spider_closed...")
    extentions.py

    Note: set DEPTH_LIMIT = 1 in settings.py to control how many levels of 'recursion' (link-following depth) the crawl is allowed to go.
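
    A small sketch of how the depth limit plays out inside a spider (the spider name here is made up; the 'depth' key in response.meta is filled in by Scrapy's DepthMiddleware once links start being followed):

    # settings.py: DEPTH_LIMIT = 1
    import scrapy

    class DepthDemoSpider(scrapy.Spider):
        name = 'depth_demo'                      # hypothetical name, for illustration only
        start_urls = ['http://dig.chouti.com/']

        def parse(self, response):
            # responses to the start urls have no 'depth' yet, so default it to 0
            print(response.url, 'depth =', response.meta.get('depth', 0))
            for href in response.xpath('//a/@href').extract():
                # requests that would exceed DEPTH_LIMIT are silently dropped by DepthMiddleware
                yield scrapy.Request(response.urljoin(href), callback=self.parse)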

    Structured processing

    In a simple spider, persistence can be handled directly in the parse method. When you want to do more with the data, use Scrapy's items to give it a uniform structure and then hand everything to the pipelines for processing.

    import scrapy
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http.request import Request
    from scrapy.http.cookies import CookieJar
    from scrapy import FormRequest
    
    
    class XiaoHuarSpider(scrapy.Spider):
        # the spider's name; the crawl command uses it to start this spider
        name = "xiaohuar"
        # allowed domains
        allowed_domains = ["xiaohuar.com"]
    
        start_urls = [
            "http://www.xiaohuar.com/list-1-1.html",
        ]
        # custom_settings = {
        #     'ITEM_PIPELINES':{
        #         'spider1.pipelines.JsonPipeline': 100
        #     }
        # }
        has_request_set = {}
    
        def parse(self, response):
            # analyze the page
            # find the content on the page that matches our rules (the photos) and save it
            # find all the a tags, then follow them and repeat, level by level
    
            hxs = HtmlXPathSelector(response)
    
            items = hxs.select('//div[@class="item_list infinite_scroll"]/div')
            for item in items:
                src = item.select('.//div[@class="img"]/a/img/@src').extract_first()
                name = item.select('.//div[@class="img"]/span/text()').extract_first()
                school = item.select('.//div[@class="img"]/div[@class="btns"]/a/text()').extract_first()
                url = "http://www.xiaohuar.com%s" % src
                from ..items import XiaoHuarItem
                obj = XiaoHuarItem(name=name, school=school, url=url)
                yield obj
    
            urls = hxs.select('//a[re:test(@href, "http://www.xiaohuar.com/list-1-\d+\.html")]/@href').extract()
            for url in urls:
                key = self.md5(url)
                if key in self.has_request_set:
                    pass
                else:
                    self.has_request_set[key] = url
                    req = Request(url=url,method='GET',callback=self.parse)
                    yield req
    
        @staticmethod
        def md5(val):
            import hashlib
            ha = hashlib.md5()
            ha.update(bytes(val, encoding='utf-8'))
            key = ha.hexdigest()
            return key
    spiders/xiaohuar.py
    import scrapy
    
    
    class XiaoHuarItem(scrapy.Item):
        name = scrapy.Field()
        school = scrapy.Field()
        url = scrapy.Field()
    items.py
    import json
    import os
    import requests
    
    
    class JsonPipeline(object):
        def __init__(self):
            self.file = open('xiaohua.txt', 'w')
    
        def process_item(self, item, spider):
            v = json.dumps(dict(item), ensure_ascii=False)
            self.file.write(v)
            self.file.write('\n')
            self.file.flush()
            return item
    
    
    class FilePipeline(object):
        def __init__(self):
            if not os.path.exists('imgs'):
                os.makedirs('imgs')
    
        def process_item(self, item, spider):
            response = requests.get(item['url'], stream=True)
            file_name = '%s_%s.jpg' % (item['name'], item['school'])
            with open(os.path.join('imgs', file_name), mode='wb') as f:
                f.write(response.content)
            return item
    pipelines.py
    ITEM_PIPELINES = {
       'spider1.pipelines.JsonPipeline': 100,
       'spider1.pipelines.FilePipeline': 300,
    }
    # The integer value after each pipeline determines the order in which items pass through them, from the lowest number to the highest; these values are conventionally kept in the 0-1000 range.
    settings.py

    Pipelines can also define their own start-up and shut-down hooks, as shown below:

    from scrapy.exceptions import DropItem
    
    class CustomPipeline(object):
        def __init__(self,v):
            self.value = v
    
        def process_item(self, item, spider):
            # process the item and persist it
    
            # returning the item lets the later pipelines keep processing it
            return item
    
            # raising DropItem() discards the item so later pipelines never see it
            # raise DropItem()
    
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            called once at startup to create the pipeline object
            :param crawler: 
            :return: 
            """
            val = crawler.settings.getint('MMMM')
            return cls(val)
    
        def open_spider(self,spider):
            """
            called when the spider starts running
            :param spider: 
            :return: 
            """
            print('000000')
    
        def close_spider(self,spider):
            """
            called when the spider closes
            :param spider: 
            :return: 
            """
            print('111111')
    pipelines.py

    Inside process_item you can branch on spider.name to store each spider's data differently.
