• 13. CrawlSpider-based spiders


    1. Introduction to CrawlSpider

    Scrapy provides two commonly used spider base classes: Spider and CrawlSpider.
    This example uses the CrawlSpider class.

    CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines rules (Rule) that provide a convenient mechanism for following links, making it better suited to extracting links from the crawled pages and continuing the crawl from them.

    Command to create a project:

    scrapy startproject baidu

    Create a spider from the crawl template:

    scrapy genspider -t crawl baidu 'tieba.baidu.com'
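
    For reference, the generated spider skeleton looks roughly like the following (the exact placeholder comments vary slightly between Scrapy versions):

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class BaiduSpider(CrawlSpider):
        name = 'baidu'
        allowed_domains = ['tieba.baidu.com']
        start_urls = ['http://tieba.baidu.com/']

        rules = (
            # Placeholder rule from the template; replace allow/callback with real values
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            item = {}
            return item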

    CrawlSpider inherits from Spider. Besides the inherited attributes (name, allowed_domains), it provides new attributes and methods:

    LinkExtractors

    class scrapy.linkextractors.LinkExtractor
    The purpose of link extractors is simple: to extract links.
    Each LinkExtractor has a single public method, extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects.
    A link extractor is instantiated once, and its extract_links method is then called multiple times, once per response, to extract links.
    Main parameters (a usage sketch follows the list):
                allow: URLs matching the given regular expression(s) are extracted; if empty, everything matches.
                deny: URLs matching this regular expression (or list of regular expressions) are never extracted; it takes precedence over allow.
                allow_domains: only links in these domains are extracted.
                deny_domains: links in these domains are never extracted.
                restrict_xpaths: XPath expressions that, together with allow, restrict the regions of the page from which links are extracted.
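
    The quickest way to see what a given LinkExtractor picks up is the Scrapy shell. A minimal sketch (the URL is simply the list page used later in this example):

    # Run inside: scrapy shell "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1"
    from scrapy.linkextractors import LinkExtractor

    # Keep only article links on wxapp-union.com
    le = LinkExtractor(allow=r'article-.+\.html', allow_domains=['wxapp-union.com'])
    for link in le.extract_links(response):  # extract_links returns a list of scrapy.link.Link objects
        print(link.url, link.text)
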
    rules

    rules contains one or more Rule objects; each Rule defines a specific action for crawling the site. If several rules match the same link, the first rule in the order defined in this collection is used.

    Parameters (a sketch follows the list):
    link_extractor: a LinkExtractor object that defines which links to extract.

    callback: called for each link extracted by link_extractor; the value names a method on the spider, and that callback receives a response as its first argument.
        Note: when writing CrawlSpider rules, avoid using parse as a callback. CrawlSpider uses the parse method to implement its own logic, so overriding parse will break the spider.
    follow: a boolean specifying whether links extracted from responses matched by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
    process_links: names a function on the spider that is called with the list of links extracted by link_extractor; mainly used to filter links.
    process_request: names a function on the spider that is called for every request extracted by this rule; used to filter requests.
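
    A minimal sketch of a CrawlSpider using process_links (the site, rules, and filtering condition are made up purely for illustration):

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class ExampleSpider(CrawlSpider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        rules = (
            # List pages: no callback, so follow defaults to True
            Rule(LinkExtractor(allow=r'list\.php'), process_links='drop_logout_links'),
            # Detail pages: callback given, so follow defaults to False
            Rule(LinkExtractor(allow=r'detail-\d+\.html'), callback='parse_detail'),
        )

        def drop_logout_links(self, links):
            # Illustrative filter: discard any extracted link whose URL contains "logout"
            return [link for link in links if 'logout' not in link.url]

        def parse_detail(self, response):
            yield {'url': response.url}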
    

    2. Building the example

    a. Start a project

    scrapy startproject wxapp

    b. Create the spider from the crawl template

    scrapy genspider -t crawl wxapp_spider "wxapp-union.com"
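
    After these two commands, the project layout should look roughly like this:

    wxapp/
    ├── scrapy.cfg
    └── wxapp/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            ├── __init__.py
            └── wxapp_spider.py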

    c. settings.py (the changes from the generated defaults: ROBOTSTXT_OBEY = False, DOWNLOAD_DELAY = 2, a browser User-Agent in DEFAULT_REQUEST_HEADERS, and the WxappPipeline enabled in ITEM_PIPELINES)

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for wxapp project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'wxapp'
    
    SPIDER_MODULES = ['wxapp.spiders']
    NEWSPIDER_MODULE = 'wxapp.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'wxapp (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 2
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'wxapp.middlewares.WxappSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'wxapp.pipelines.WxappPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    d. wxapp_spider.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from wxapp.items import WxappItem
    
    class WxappSpiderSpider(CrawlSpider):
        name = 'wxapp_spider'
        allowed_domains = ['wxapp-union.com']
        start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']
    
        rules = (
            # List (pagination) pages: follow them, no callback needed
            Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
            # Article detail pages: parse them, but do not follow links found on them
            Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_detail', follow=False),
        )
    
        def parse_detail(self, response):
            title = response.xpath('//h1[@class="ph"]/text()').get()
            author_p = response.xpath('//p[@class="authors"]')
            author = author_p.xpath('.//a/text()').get()
            pub_time = author_p.xpath('.//span/text()').get()
            content = response.xpath('//td[@id="article_content"]//text()').getall()
            item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
            yield item
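
    If the XPath expressions above stop matching (the site's markup may have changed since this was written), they are easiest to check in the Scrapy shell. A sketch, with any article detail page opened in the shell:

    # scrapy shell "<URL of any wxapp-union.com article page>"
    response.xpath('//h1[@class="ph"]/text()').get()                 # title
    response.xpath('//p[@class="authors"]//a/text()').get()          # author
    response.xpath('//p[@class="authors"]//span/text()').get()       # publish time
    response.xpath('//td[@id="article_content"]//text()').getall()   # body text fragments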

    e. items.py

    import scrapy
    
    
    class WxappItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        author = scrapy.Field()
        pub_time = scrapy.Field()
        content = scrapy.Field()

    f. pipelines.py

    from scrapy.exporters import JsonLinesItemExporter
    
    class WxappPipeline(object):
        def __init__(self):
            # One JSON object per line; opened in binary mode because the exporter writes bytes
            self.fp = open("wxapp.json", "wb")
            self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

        def close_spider(self, spider):
            self.fp.close()
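
    If only a JSON-lines file is needed, Scrapy's built-in feed export gives roughly equivalent output without a custom pipeline (depending on the Scrapy version, non-ASCII characters may be written as escape sequences unless FEED_EXPORT_ENCODING is set to "utf-8"):

    scrapy crawl wxapp_spider -o wxapp.jl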

    g. Create start.py under the wxapp directory:

    from scrapy import cmdline

    # Equivalent to running "scrapy crawl wxapp_spider" on the command line
    cmdline.execute("scrapy crawl wxapp_spider".split())

    Run start.py from within the project (so Scrapy can find scrapy.cfg) to launch the crawl.



