• Scrapy framework: general-purpose crawling with CrawlSpider


    Step 01: Create the crawler project

    scrapy startproject quotes
    

    Step 02: Generate the spider from the crawl template

    scrapy genspider -t crawl quotes quotes.toscrape.com
    

    Step 03: Configure the spider file quotes.py

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class Quotes(CrawlSpider):
        # Spider name, passed to "scrapy crawl"
        name = "get_quotes"
        # Links outside this domain are not followed
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        # Crawl rules
        rules = (
            # Quote listing pages: handle with parse_quotes and keep
            # following the links found on those pages
            Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_quotes', follow=True),
            # Author detail pages: handle with parse_author to extract data
            Rule(LinkExtractor(allow=r'/author/\w+'), callback='parse_author'),
        )

        # Extract every quote on a listing page
        def parse_quotes(self, response):
            for quote in response.css(".quote"):
                yield {'content': quote.css('.text::text').extract_first(),
                       'author': quote.css('.author::text').extract_first(),
                       'tags': quote.css('.tag::text').extract()
                       }

        # Extract the author's details from an author page
        def parse_author(self, response):
            name = response.css('.author-title::text').extract_first()
            author_born_date = response.css('.author-born-date::text').extract_first()
            author_born_location = response.css('.author-born-location::text').extract_first()
            author_description = response.css('.author-description::text').extract_first()

            yield {'name': name,
                   'author_born_date': author_born_date,
                   'author_born_location': author_born_location,
                   'author_description': author_description
                   }
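
The `allow` arguments of the two rules are regular expressions that are searched for anywhere in each extracted URL, so the `\d` and `\w` escapes matter. The sketch below uses only Python's `re` module to show which pattern a few sample URLs would hit. The helper `classify` and the sample URLs are illustrative assumptions, not part of Scrapy or of the original spider. It checks the page pattern first, mirroring the order of the rules.

```python
import re

# Patterns in the form they would be passed to LinkExtractor(allow=...).
PAGE = re.compile(r'/page/\d+')
AUTHOR = re.compile(r'/author/\w+')

# Sample URLs shaped like those on quotes.toscrape.com (illustrative values).
URLS = [
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/author/Albert-Einstein',
    'http://quotes.toscrape.com/tag/love/',
]


def classify(url):
    """Return 'page', 'author', or None depending on which pattern matches."""
    if PAGE.search(url):
        return 'page'
    if AUTHOR.search(url):
        return 'author'
    return None


for url in URLS:
    print(url, '->', classify(url))

# Without the backslash, 'd+' means one or more literal 'd' characters,
# so a real pagination link is not matched.
print(re.search(r'/page/d+', 'http://quotes.toscrape.com/page/2/') is not None)
```

Because the match is a search rather than a full match, `/author/\w+` still hits `/author/Albert-Einstein` even though `\w` does not cover the hyphen.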
    

    Step 04: Run the spider, using the value of its name attribute

    scrapy crawl get_quotes
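
The spider emits two kinds of items, quotes and authors, into the same stream. If the crawl is exported as JSON Lines, for example with `scrapy crawl get_quotes -o items.jl` (the file name `items.jl` is an assumption), the two kinds can be separated afterwards. A minimal sketch using only the standard library follows. The helper `split_items` and the sample lines are hypothetical. It relies on quote items carrying a `content` key and author items carrying a `name` key, as in the callbacks above.

```python
import json


def split_items(lines):
    """Separate quote items from author items in JSON Lines text."""
    quotes, authors = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        # Quote items have a 'content' key; author items have 'name'.
        if 'content' in item:
            quotes.append(item)
        elif 'name' in item:
            authors.append(item)
    return quotes, authors


# Hypothetical sample lines in the shape the two callbacks produce.
sample = [
    '{"content": "A quote.", "author": "Someone", "tags": ["life"]}',
    '{"name": "Someone", "author_description": "A short biography."}',
]
quotes, authors = split_items(sample)
print(len(quotes), len(authors))
```

With a real export, passing an open file handle works as well, for example `split_items(open('items.jl', encoding='utf-8'))`.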
    
  • Original article: https://www.cnblogs.com/hankleo/p/11872497.html