• Scrapy Tutorial (11): Starting Spiders via the API


    Besides the scrapy crawl spider command, Scrapy also lets you start a spider from a script through its API.

    Scrapy is built on the Twisted asynchronous networking library, so the crawl has to run inside the Twisted reactor.

    Two APIs can run spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.

    scrapy.crawler.CrawlerProcess

    This class starts the twisted.reactor internally, configures logging, and shuts the reactor down automatically when crawling finishes; it is the class used by every Scrapy command.

    Example: running a single spider

    import scrapy


    class QiushispiderSpider(scrapy.Spider):
        name = 'qiushiSpider'
        # allowed_domains = ['qiushibaike.com']
        start_urls = ['https://tianqi.2345.com/']

        def start_requests(self):
            # issue the first request explicitly; parse() handles the response
            return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

        def parse(self, response):
            print('proxy simida')


    if __name__ == '__main__':
        from scrapy.crawler import CrawlerProcess
        process = CrawlerProcess()
        process.crawl(QiushispiderSpider)    # the spider class; the spider name also works, see below
        process.start()                      # blocks until the crawl is finished

    The argument to process.crawl() can be either the spider class QiushispiderSpider or the spider name 'qiushiSpider'; the name form only works when the process was created with settings that tell Scrapy where to find the spider (see "Loading the project settings" below).

    Started this way, the crawl does not use the project's settings.py, which the log confirms:

    2019-05-27 14:39:57 [scrapy.crawler] INFO: Overridden settings: {}
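
    If you only need a couple of options rather than the whole project configuration, CrawlerProcess also accepts an ad-hoc settings dict. A minimal sketch; the option values below are only examples, not part of the original post:

    from scrapy.crawler import CrawlerProcess

    # pass individual settings directly instead of relying on a settings.py file
    process = CrawlerProcess(settings={
        'LOG_LEVEL': 'INFO',        # example value
        'DOWNLOAD_DELAY': 1,        # example value
    })
    process.crawl(QiushispiderSpider)
    process.start()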

    Loading the project settings

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # get_project_settings() reads the project's settings.py; run the script from
    # inside the project (where scrapy.cfg lives) so the settings module is found
    process = CrawlerProcess(get_project_settings())
    process.crawl(QiushispiderSpider)
    process.start()
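
    With the project settings loaded, crawl() can also look the spider up by its name string, the alternative mentioned above. A short sketch, assuming the script lives inside the same Scrapy project:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl('qiushiSpider')    # resolved via SPIDER_MODULES from the project settings
    process.start()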

    Running multiple spiders

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        ...
    
    class MySpider2(scrapy.Spider):
        ...
    
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()
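
    crawl() also forwards any extra positional and keyword arguments to the spider's constructor, the same way scrapy crawl -a does on the command line. A hedged sketch; the category argument and the URL are made up for illustration:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class ArgSpider(scrapy.Spider):
        name = 'argspider'

        def __init__(self, category=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # "category" is a hypothetical spider argument, only for illustration
            self.start_urls = [f'https://example.com/{category}']

        def parse(self, response):
            self.logger.info('got %s', response.url)

    process = CrawlerProcess()
    process.crawl(ArgSpider, category='books')   # extra kwargs reach ArgSpider.__init__
    process.start()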

    scrapy.crawler.CrawlerRunner

    1. Gives you finer control over the crawling process than CrawlerProcess.

    2. You start and stop the twisted.reactor explicitly yourself.

    3. You must attach callbacks to the Deferred returned by CrawlerRunner.crawl(), e.g. to stop the reactor once crawling ends.

    Example: running a single spider

    import scrapy


    class QiushispiderSpider(scrapy.Spider):
        name = 'qiushiSpider'
        # allowed_domains = ['qiushibaike.com']
        start_urls = ['https://tianqi.2345.com/']

        def start_requests(self):
            return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

        def parse(self, response):
            print('proxy simida')


    if __name__ == '__main__':
        # test CrawlerRunner
        from twisted.internet import reactor
        from scrapy.crawler import CrawlerRunner
        from scrapy.utils.log import configure_logging
        from scrapy.utils.project import get_project_settings

        configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
        runner = CrawlerRunner(get_project_settings())

        d = runner.crawl(QiushispiderSpider)
        d.addBoth(lambda _: reactor.stop())   # stop the reactor whether the crawl succeeds or fails
        reactor.run()                         # the script will block here until the crawling is finished

    configure_logging() sets up Scrapy's log output, here with a custom LOG_FORMAT.

    addBoth() attaches a callback that stops the Twisted reactor once the crawl is finished, whether it succeeded or failed.
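
    If you want to react to success and failure separately rather than only stopping the reactor, the same Deferred also takes addCallback/addErrback. A minimal sketch, reusing the QiushispiderSpider defined above:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(QiushispiderSpider)
    d.addCallback(lambda _: print('crawl finished'))               # runs if the crawl completed
    d.addErrback(lambda failure: print('crawl failed:', failure))  # runs if it raised an error
    d.addBoth(lambda _: reactor.stop())                            # stop the reactor either way
    reactor.run()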

    Running multiple spiders

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        ...
    
    class MySpider2(scrapy.Spider):
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()    # Deferred that fires once every scheduled crawl has finished
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until all crawling jobs are finished

    The crawls can also be chained with an inlineCallbacks generator so the spiders run one after another:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider1(scrapy.Spider):
        ...

    class MySpider2(scrapy.Spider):
        ...

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # each yield waits for the previous crawl to finish before starting the next
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()

    crawl()
    reactor.run()   # the script will block here until the last crawl call is finished

    References:

    https://blog.csdn.net/weixin_33857230/article/details/89571872
