• Running Scrapy from a script


    Documentation: https://www.osgeo.cn/scrapy/topics/practices.html

    1. scrapy.crawler.CrawlerProcess

      Scrapy is built on top of the Twisted asynchronous networking framework, so it has to run inside the Twisted reactor.

      You can use the scrapy.crawler.CrawlerProcess class to run your spider. It starts a Twisted reactor for you and configures logging and shutdown handlers. This is the class that all Scrapy commands use.

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    process = CrawlerProcess(get_project_settings())
    
    # 'followall' is the name of one of the spiders of the project.
    process.crawl('followall', domain='scrapinghub.com')
    process.start() # the script will block here until the crawling is finished

      Use get_project_settings to get a Settings instance populated with your project settings.
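      If your script does not live inside a Scrapy project, CrawlerProcess also accepts a plain settings dict. Below is a minimal sketch of that approach; the QuotesSpider class, the target URL and the FEEDS output file are illustrative assumptions, not part of the original example.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class QuotesSpider(scrapy.Spider):
        # Hypothetical spider defined directly in the script for illustration.
        name = 'quotes'
        start_urls = ['https://quotes.toscrape.com/']
    
        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').get()}
    
    # Settings passed as a dict instead of get_project_settings().
    process = CrawlerProcess(settings={
        'LOG_LEVEL': 'INFO',
        'FEEDS': {'quotes.json': {'format': 'json'}},  # assumed output path
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl is finished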

    2. scrapy.crawler.CrawlerRunner

      With this class, you must run the reactor explicitly yourself after scheduling your spiders. It is recommended to use CrawlerRunner instead of CrawlerProcess if your application already uses Twisted and you want to run Scrapy in the same reactor.

    from twisted.internet import reactor
    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider(scrapy.Spider):
        # Your spider definition
        ...
    
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    
    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until the crawling is finished

    3. Running multiple spiders in the same process

      By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through the internal API, as the examples below show.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start() # the script will block here until all crawling jobs are finished
    Same example, using CrawlerRunner instead:

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run() # the script will block here until all crawling jobs are finished

    Running the spiders sequentially by chaining the deferreds:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    
    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()
    
    crawl()
    reactor.run() # the script will block here until the last crawl call is finished
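
    The @defer.inlineCallbacks decorator above is syntactic sugar over explicit Deferred chaining. As a rough equivalent sketch (reusing the MySpider1 and MySpider2 classes defined above, and not taken from the Scrapy documentation), the same sequencing can be written with addCallback:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    configure_logging()
    runner = CrawlerRunner()
    
    # runner.crawl() returns a Deferred; returning another Deferred from a
    # callback makes the chain wait for it before firing the next step.
    d = runner.crawl(MySpider1)
    d.addCallback(lambda _: runner.crawl(MySpider2))
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run()  # blocks here until both crawls have finished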
  • Original article: https://www.cnblogs.com/Mint-diary/p/14507583.html