• Scrapy Persistence (items + pipelines)


    I. Saving scraped data with items

    items.py

    import scrapy
    
    
    class QuoteItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        text = scrapy.Field()
        author = scrapy.Field()
        tags = scrapy.Field()

    quote.py

    # -*- coding: utf-8 -*-
    import scrapy
    from toscrapy.items import QuoteItem
    
    
    class QuoteSpider(scrapy.Spider):
        name = 'quote'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']
        """
        Key points
            1. text() gets the text of a tag
            2. @attr gets the value of an attribute
            3. extract() returns all matches; extract_first() returns the first one
            4. response.urljoin joins a relative URL with the base URL
            5. scrapy.Request(url=_next, callback=self.parse) schedules the next request with a callback
        """
        def parse(self, response):
            # print(response.text)
            quotes = response.xpath('//div[@class="col-md-8"]/div[@class="quote"]')
            # print(quotes)
            for quote in quotes:
                # print('=' * 20)
                # print(quote)
                item = QuoteItem()
                # extract_first() returns the first match
                text = quote.xpath('.//span[@class="text"]/text()').extract_first()
                # print(text)
                item['text'] = text
                author = quote.xpath('.//span/small[@class="author"]/text()').extract_first()
                # print(author)
                item['author'] = author
                # extract() returns all matches (here, the hrefs of the tag links)
                tags = quote.xpath('.//div[@class="tags"]/a[@class="tag"]/@href').extract()
                item['tags'] = tags
                # print(tags)
                yield item
            # print('>' * 40)
            next_url = response.xpath('//div[@class="col-md-8"]/nav/ul[@class="pager"]/li[@class="next"]/a/@href').extract_first()
            # print(next_url)
            # follow the next page only when there is one
            if next_url:
                # join the relative link with the base URL
                _next = response.urljoin(next_url)
                # print(_next)
                # callback: the method that parses the next response
                yield scrapy.Request(url=_next, callback=self.parse)

    Alternatively, yield QuoteItem() directly, as shown in the sketch below.
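
    A minimal sketch of that shorter form, using the same XPath expressions as above; the item fields are filled via keyword arguments instead of one-by-one assignment:

    # inside QuoteSpider.parse, the loop body can be compressed to:
    for quote in quotes:
        yield QuoteItem(
            text=quote.xpath('.//span[@class="text"]/text()').extract_first(),
            author=quote.xpath('.//span/small[@class="author"]/text()').extract_first(),
            tags=quote.xpath('.//div[@class="tags"]/a[@class="tag"]/@href').extract(),
        )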

    Commands to export the scraped items to a file

    scrapy crawl quote -o quotes.json
    scrapy crawl quote -o quotes.jsonlines
    or
    scrapy crawl quote -o quotes.jl
    # outputs one JSON object per line, one line per item

    Supported output types include quotes.xml, quotes.jl, quotes.csv, etc.
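
    The export can also be configured once in settings.py instead of passing -o on every run; a minimal sketch, assuming a Scrapy version recent enough to support the FEEDS setting (older releases use FEED_URI and FEED_FORMAT instead):

    # settings.py (sketch): write one JSON object per line to quotes.jl on every crawl
    FEEDS = {
        'quotes.jl': {
            'format': 'jsonlines',
            'encoding': 'utf8',
        },
    }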

    II. pipelines

    1. Core idea:

    Each time the spider yields an item object -> the process_item method in pipelines runs once (the pipelines must first be enabled in the settings file) -> the data is stored in a database or written to a file

    2. settings.py

    ITEM_PIPELINES = {
       # the value is the priority of the pipeline class (0-1000); lower values run first
       'toscrapy.pipelines.ToscrapyPipeline': 300,
    }

    Registering the pipeline in settings.py is a prerequisite for it to run.

    3. pipelines

    a. Default

    class ToscrapyPipeline(object):
    
        def process_item(self, item, spider):
            """
            :param item:    the item object yielded by the spider
            :param spider:  the spider object
            :return:
            """
            # print('='*20, item)
            return item

    b. Other methods

    Method called when the spider starts

        def open_spider(self, spider):
            """
            Called when the spider is opened
            :param spider:
            :return:
            """
            pass

    Method called when the spider finishes

        def close_spider(self, spider):
            """
            Called when the spider is closed
            :param spider:
            :return:
            """
            pass

    The from_crawler method

    Role: instantiates the pipeline class during initialization

    Purpose: keep the storage path in the settings file rather than hard-coding it in the pipeline

        @classmethod
        def from_crawler(cls, crawler):
            """
            Creates the pipeline instance during initialization
            :param crawler:
            :return:
            """
            # crawler.settings gives access to the full project settings
            path = crawler.settings.get('FILE_PATH')
            # instantiate the pipeline object
            return cls(path)
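
    For this to work, the key being read must exist in settings.py; a minimal sketch, where FILE_PATH and its value are names chosen for this tutorial, not built-in Scrapy settings:

    # settings.py (sketch): custom key read by from_crawler above
    FILE_PATH = 'quotes.txt'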

    c. How the pipeline class methods are called

    Scrapy first checks whether the pipeline class defines a from_crawler method:

    if it does: obj = PipelineClass.from_crawler(crawler)

    if it does not: obj = PipelineClass()

    When the spider starts, the open_spider method is called.

    When the spider yields an item object, the process_item method is called.

    When the spider finishes, the close_spider method is called.
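
    The same dispatch, written out as a plain-Python sketch (a simplification of what Scrapy actually does internally; build_pipeline is an illustrative name, not a Scrapy API):

    # simplified sketch of how Scrapy builds and drives a pipeline
    def build_pipeline(pipeline_cls, crawler):
        if hasattr(pipeline_cls, 'from_crawler'):
            # lets the pipeline read crawler.settings before it is created
            return pipeline_cls.from_crawler(crawler)
        return pipeline_cls()

    # after that: open_spider() once, process_item() per yielded item, close_spider() once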

    d. Serialization (writing items to a file)

    pipelines.py

    class ToscrapyPipeline(object):
    
        def __init__(self, path):
            self.f = None
            self.path = path
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            Creates the pipeline instance during initialization
            :param crawler:
            :return:
            """
            # crawler.settings gives access to the full project settings
            path = crawler.settings.get('FILE_PATH')
            # instantiate the pipeline object
            return cls(path)
    
        def open_spider(self, spider):
            """
            Called when the spider is opened
            :param spider:
            :return:
            """
            self.f = open(file=self.path, mode='a', encoding="utf-8")
    
        def process_item(self, item, spider):
            """
            Called each time the spider yields an item object
            :param item:    the item object
            :param spider:  the spider object
            :return:
            """
            # print('='*20, item)
            self.f.write(item['text'] + '\n')
            return item
    
        def close_spider(self, spider):
            """
            Called when the spider finishes
            :param spider:
            :return:
            """
            self.f.close()

    4. Multiple pipeline classes

    pipelines.py can contain several classes, for example one that saves the data to a database and one that writes it to a file.

    a. The execution order is determined by the priority values in settings.py. The methods of the different pipeline classes run interleaved, but in that fixed order (roughly, ordered asynchronous calls), as the trace below shows; a matching settings sketch follows the trace.

    file from_crawl
    db from_crawl
    file open_spider
    db open_spider
    file process_item
    db process_item
    file process_item
    db process_item
    file process_item
    db process_item
    db close_spider
    file close_spider
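
    A settings.py sketch that would produce an order like the one above, assuming both classes live in toscrapy/pipelines.py (DBPipeline is the second class used in the listings below):

    # settings.py (sketch): the lower priority value runs first
    ITEM_PIPELINES = {
        'toscrapy.pipelines.ToscrapyPipeline': 300,   # "file" pipeline, runs first
        'toscrapy.pipelines.DBPipeline': 400,         # "db" pipeline, runs second
    }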

    b. What return item in process_item is for

    It passes the item to the process_item method of the next pipeline class.

    1) No item returned

    The next class receives None as its item:

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class ToscrapyPipeline(object):
    
        def __init__(self, path):
            self.f = None
            self.path = path
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            Creates the pipeline instance during initialization
            :param crawler:
            :return:
            """
            # crawler.settings gives access to the full project settings
            path = crawler.settings.get('FILE_PATH')
            # instantiate the pipeline object
            print('file from_crawl')
            return cls(path)
    
        def open_spider(self, spider):
            """
            Called when the spider is opened
            :param spider:
            :return:
            """
            print('file open_spider')
            self.f = open(file=self.path, mode='a', encoding="utf-8")
    
        def process_item(self, item, spider):
            """
            Called each time the spider yields an item object
            :param item:    the item object
            :param spider:  the spider object
            :return:
            """
            print('file process_item')
            # self.f.write(item['text'] + '\n')
            # return item  (intentionally omitted, so the next pipeline receives None)
    
        def close_spider(self, spider):
            """
            Called when the spider finishes
            :param spider:
            :return:
            """
            print('file close_spider')
            self.f.close()
    
    
    class DBPipeline(object):
    
        def __init__(self, path):
            self.f = None
            self.path = path
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            Creates the pipeline instance during initialization
            :param crawler:
            :return:
            """
            # crawler.settings gives access to the full project settings
            path = crawler.settings.get('DB_PATH')
            # instantiate the pipeline object
            print('db from_crawl')
            return cls(path)
    
        def open_spider(self, spider):
            """
            Called when the spider is opened
            :param spider:
            :return:
            """
            print('db open_spider')
            self.f = open(file=self.path, mode='a', encoding="utf-8")
    
        def process_item(self, item, spider):
            """
            Called each time the spider yields an item object
            :param item:    the item object (None here, because the previous pipeline returned nothing)
            :param spider:  the spider object
            :return:
            """
            print('db process_item value is {}'.format(item))
            # self.f.write(item['text'] + '\n')
            return item
    
        def close_spider(self, spider):
            """
            Called when the spider finishes
            :param spider:
            :return:
            """
            print('db close_spider')
            self.f.close()

    2) DropItem

    Raising DropItem prevents the process_item methods of the remaining pipeline classes from running.

    Import:

    from scrapy.exceptions import DropItem

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    from scrapy.exceptions import DropItem
    
    
    
    class ToscrapyPipeline(object):
    
        def __init__(self, path):
            self.f = None
            self.path = path
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            Creates the pipeline instance during initialization
            :param crawler:
            :return:
            """
            # crawler.settings gives access to the full project settings
            path = crawler.settings.get('FILE_PATH')
            # instantiate the pipeline object
            print('file from_crawl')
            return cls(path)
    
        def open_spider(self, spider):
            """
            Called when the spider is opened
            :param spider:
            :return:
            """
            print('file open_spider')
            self.f = open(file=self.path, mode='a', encoding="utf-8")
    
        def process_item(self, item, spider):
            """
            Called each time the spider yields an item object
            :param item:    the item object
            :param spider:  the spider object
            :return:
            """
            print('file process_item')
            # self.f.write(item['text'] + '\n')
            # return item
            # raising DropItem stops the item from reaching the remaining pipelines
            raise DropItem()
    
        def close_spider(self, spider):
            """
            Called when the spider finishes
            :param spider:
            :return:
            """
            print('file close_spider')
            self.f.close()
    
    
    class DBPipeline(object):
    
        def __init__(self, path):
            self.f = None
            self.path = path
    
        @classmethod
        def from_crawler(cls, crawler):
            """
            Creates the pipeline instance during initialization
            :param crawler:
            :return:
            """
            # crawler.settings gives access to the full project settings
            path = crawler.settings.get('DB_PATH')
            # instantiate the pipeline object
            print('db from_crawl')
            return cls(path)
    
        def open_spider(self, spider):
            """
            Called when the spider is opened
            :param spider:
            :return:
            """
            print('db open_spider')
            self.f = open(file=self.path, mode='a', encoding="utf-8")
    
        def process_item(self, item, spider):
            """
            Called each time the spider yields an item object (never reached here,
            because the previous pipeline raises DropItem)
            :param item:    the item object
            :param spider:  the spider object
            :return:
            """
            print('db process_item value is {}'.format(item))
            # self.f.write(item['text'] + '\n')
            return item
    
        def close_spider(self, spider):
            """
            Called when the spider finishes
            :param spider:
            :return:
            """
            print('db close_spider')
            self.f.close()

    5. The role of the spider parameter

    Background: the classes and methods in pipelines.py are shared by every spider in the project.

    Use case: if a method should run for only one spider, use the spider parameter to check which spider is active.

    Note: spider is the spider object itself; its name attribute holds the name defined in the spider class, so compare spider.name rather than spider (see the corrected check below and the sketch that follows it).

        def open_spider(self, spider):
            """
            Called when the spider is opened
            :param spider:
            :return:
            """
            # spider is the spider object; filter on its name attribute
            if spider.name == 'quote':
                print('file open_spider')
                self.f = open(file=self.path, mode='a', encoding="utf-8")
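
    The same filter works in process_item; a brief sketch, assuming this pipeline should only handle items produced by the quote spider and pass everything else through unchanged:

        def process_item(self, item, spider):
            # only items produced by the 'quote' spider are written; others pass through
            if spider.name == 'quote':
                self.f.write(item['text'] + '\n')
            return item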

    For persisting the data to Redis, see:

    https://www.cnblogs.com/wanglan/p/10826678.html
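
    As a rough idea of what such a pipeline can look like, here is a minimal sketch using the redis-py package; the key name 'quote:items' is a placeholder, and REDIS_HOST / REDIS_PORT are custom settings you would add yourself:

    import json
    import redis   # pip install redis


    class RedisPipeline(object):

        def __init__(self, host, port):
            self.host = host
            self.port = port
            self.conn = None

        @classmethod
        def from_crawler(cls, crawler):
            # custom settings, e.g. REDIS_HOST = 'localhost', REDIS_PORT = 6379
            return cls(crawler.settings.get('REDIS_HOST', 'localhost'),
                       crawler.settings.getint('REDIS_PORT', 6379))

        def open_spider(self, spider):
            self.conn = redis.Redis(host=self.host, port=self.port)

        def process_item(self, item, spider):
            # push each item onto a Redis list as a JSON string
            self.conn.rpush('quote:items', json.dumps(dict(item)))
            return item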

    For using MongoDB, see:

    https://blog.csdn.net/qq_41020281/article/details/79459604
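
    A comparable MongoDB sketch using pymongo; the URI, database name, and 'quotes' collection are placeholders, and MONGO_URI / MONGO_DB are custom settings in this example:

    import pymongo   # pip install pymongo


    class MongoPipeline(object):

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
            self.client = None
            self.db = None

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
                       crawler.settings.get('MONGO_DB', 'toscrapy'))

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def process_item(self, item, spider):
            # one document per item
            self.db['quotes'].insert_one(dict(item))
            return item

        def close_spider(self, spider):
            self.client.close()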
