• Scrapy: Item Pipeline


    Official documentation: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

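      To activate a pipeline, you register it under ITEM_PIPELINES in the project settings. Pipelines configured there, however, apply to every spider in the project. If the project runs many spiders, item handling becomes awkward: you can use the spider argument of process_item(self, item, spider) to tell which spider an item came from, but that leaves per-spider branching scattered through the pipeline code. A better approach is to set the custom_settings attribute on the spider class, so that each spider gets its own pipelines.

      The example below also shows how custom_settings is used and what it is for.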

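    import scrapy

    class XiaohuaSpider(scrapy.Spider):
        name = 'xiaohua'
        # Per-spider settings; these override the project-wide values.
        custom_settings = {
            'ITEM_PIPELINES': {
                # 300 is the pipeline's order; lower values run first.
                'TB.pipelines.TBMongoPipeline': 300,
            }
        }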
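
      For contrast, the single shared pipeline that the text calls redundant might look like the sketch below. The second spider name 'meinv' and the tagging logic are hypothetical, and SimpleNamespace objects stand in for real spiders so that the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


class SharedPipeline:
    """One pipeline registered project-wide that branches on spider.name."""

    def process_item(self, item, spider):
        # Every spider's items arrive here, so each spider needs its own branch.
        if spider.name == 'xiaohua':
            item['source'] = 'xiaohua'
        elif spider.name == 'meinv':  # hypothetical second spider
            item['source'] = 'meinv'
        # Items from any other spider pass through unchanged.
        return item


# Stand-ins for real scrapy.Spider instances; only .name is needed here.
xiaohua = SimpleNamespace(name='xiaohua')
other = SimpleNamespace(name='other')

pipeline = SharedPipeline()
tagged = pipeline.process_item({'id': 1}, xiaohua)
untouched = pipeline.process_item({'id': 2}, other)
```

      Each new spider adds another branch to this one class, which is the bookkeeping that per-spider custom_settings avoids.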
    

    I  Methods

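      1 process_item(self, item, spider)

      This method is called for every item pipeline component. It must return an item object, return a Deferred, or raise a DropItem exception. Dropped items are not processed by later pipeline components.

      2 open_spider(self, spider)

      This method is called when the spider is opened.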

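      3 close_spider(self, spider)

      This method is called when the spider is closed.

      4 from_crawler(cls, crawler)

      If present, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline.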
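
      The order in which these four hooks fire can be illustrated with a small sketch. It is not taken from the original post: JsonLinesPipeline, the JSONLINES_PATH setting, and run_lifecycle are hypothetical names, and SimpleNamespace objects stand in for the real Crawler and Spider so that the snippet runs without Scrapy.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class JsonLinesPipeline:
    """Writes each item as one JSON line; the file path comes from settings."""

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        # JSONLINES_PATH is a hypothetical setting name, not a Scrapy built-in.
        return cls(path=crawler.settings.get('JSONLINES_PATH', 'items.jl'))

    def open_spider(self, spider):
        # Acquire the resource once, when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Release the resource when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item  # hand the item on to the next pipeline component


def run_lifecycle(items, path):
    """Call the hooks in the order Scrapy would, using stand-in objects."""
    crawler = SimpleNamespace(settings={'JSONLINES_PATH': path})
    spider = SimpleNamespace(name='demo')
    pipeline = JsonLinesPipeline.from_crawler(crawler)
    pipeline.open_spider(spider)
    try:
        returned = [pipeline.process_item(item, spider) for item in items]
    finally:
        pipeline.close_spider(spider)
    return returned


demo_path = os.path.join(tempfile.mkdtemp(), 'items.jl')
result = run_lifecycle([{'id': 1}, {'id': 2}], demo_path)
```

      In a real project Scrapy makes these calls itself; the driver function only makes the sequence visible.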

    II  Item pipeline examples

      1 Write items to MongoDB

    import pymongo
    
    class MongoPipeline:
    
        # MongoDB collection that receives every scraped item.
        collection_name = 'scrapy_items'
    
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
    
        @classmethod
        def from_crawler(cls, crawler):
            # Read the connection details from the project settings.
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
            )
    
        def open_spider(self, spider):
            # Connect once when the spider starts.
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
    
        def close_spider(self, spider):
            # Close the connection when the spider finishes.
            self.client.close()
    
        def process_item(self, item, spider):
            # Store the item, then pass it on to the next pipeline component.
            self.db[self.collection_name].insert_one(dict(item))
            return item
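
      The pipeline reads MONGO_URI and MONGO_DATABASE from the settings, so a matching settings.py could look like the sketch below. The URI assumes a MongoDB instance on the local machine, and 'myproject' is a hypothetical package name; adjust both to your project.

```python
# settings.py (sketch)

# Connection details read by MongoPipeline.from_crawler.
MONGO_URI = 'mongodb://localhost:27017'  # assumed local MongoDB instance
MONGO_DATABASE = 'items'                 # same value as the pipeline's fallback

ITEM_PIPELINES = {
    # 'myproject' is a hypothetical project package name.
    'myproject.pipelines.MongoPipeline': 300,
}
```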

      2 Duplicates filter

    from scrapy.exceptions import DropItem
    
    class DuplicatesPipeline:
    
        def __init__(self):
            # IDs of the items this pipeline instance has already passed on.
            self.ids_seen = set()
    
        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                # Raising DropItem stops the item from reaching later pipelines.
                raise DropItem("Duplicate item found: %s" % item)
            self.ids_seen.add(item['id'])
            return item
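
      To see the filter's effect, the sketch below passes three items, two of which share an id, through the pipeline. The helper split_duplicates and the stand-in spider are hypothetical, and the fallback DropItem class exists only so that the snippet runs where Scrapy is not installed.

```python
from types import SimpleNamespace

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item


def split_duplicates(items):
    """Return (kept, dropped) after passing items through the filter."""
    pipeline = DuplicatesPipeline()
    spider = SimpleNamespace(name='demo')  # stand-in for a real spider
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item, spider))
        except DropItem:
            dropped.append(item)
    return kept, dropped


kept, dropped = split_duplicates([{'id': 1}, {'id': 2}, {'id': 1}])
```

      The set of seen ids lives on the pipeline instance, so it only remembers items processed by that instance during the current run.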

      

  • Original post: https://www.cnblogs.com/654321cc/p/8877079.html