Python.Scrapy.14-scrapy-source-code-analysis-part-4


    Scrapy Source Code Analysis Series - 4: The scrapy.commands Subpackage

    The scrapy.commands subpackage defines the subcommands used by the scrapy command: bench, check, crawl, deploy, edit, fetch, genspider, list, parse, runspider, settings, shell, startproject, version, view. Every subcommand module defines a Command class that inherits from ScrapyCommand.
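
    As a reference point, every such module follows the same pattern. The sketch below is purely illustrative (the greeting command is hypothetical, not part of Scrapy); note that in the 0.2x series analysed here the base class lives in scrapy.command, while newer releases expose it as scrapy.commands.ScrapyCommand:

    # A minimal, hypothetical command module (e.g. myproject/commands/hello.py),
    # shown only to illustrate the Command/ScrapyCommand pattern described above.
    from scrapy.commands import ScrapyCommand   # 0.2x series: from scrapy.command import ScrapyCommand


    class Command(ScrapyCommand):

        requires_project = False   # True for commands that only make sense inside a project

        def syntax(self):
            return "[options]"

        def short_desc(self):
            return "Print a greeting (illustrative example only)"

        def run(self, args, opts):
            print("hello from a custom scrapy subcommand")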

    Let's start with the crawl subcommand, which is used to start a spider.

    1. crawl.py

    The part to focus on is the run(self, args, opts) method:

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        crawler = self.crawler_process.create_crawler()          # A
        spider = crawler.spiders.create(spname, **opts.spargs)   # B
        crawler.crawl(spider)                                     # C
        self.crawler_process.start()                              # D
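
    The spname / opts.spargs pair comes straight from the command line: the spider name is the single positional argument, and each -a NAME=VALUE option is collected into the opts.spargs dict that is forwarded to the spider constructor at step B. A hypothetical invocation, mapped onto run()'s inputs:

    # Hypothetical illustration (spider name and argument are made up):
    #
    #     scrapy crawl myspider -a category=books
    #
    # inside run() this corresponds to:
    args = ["myspider"]                  # -> spname = "myspider"
    opts_spargs = {"category": "books"}  # -> opts.spargs, built from the -a options
    # so step B effectively calls MySpider(category="books")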

    So where is this run() method called from? Recall the discussion of _run_print_help() in section "1.2 cmdline.py command.py" of Python.Scrapy.11-scrapy-source-code-analysis-part-1.
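
    In short: execute() attaches a CrawlerProcess to the command object (cmd.crawler_process, used at steps A and D below) and then calls _run_print_help(parser, _run_command, cmd, args, opts). A paraphrased sketch of those two helpers (see part 1 for the exact listing; the profiling branches of the real _run_command are omitted):

    import sys

    from scrapy.exceptions import UsageError


    def _run_command(cmd, args, opts):
        # execute() passes this as the callable, so cmd.run(args, opts) is the
        # point where the crawl command shown above is actually entered.
        cmd.run(args, opts)


    def _run_print_help(parser, func, *a, **kw):
        # Run the command and convert a UsageError into an optparse error
        # message and/or the command's help text.
        try:
            func(*a, **kw)
        except UsageError as e:
            if str(e):
                parser.error(str(e))
            if e.print_help:
                parser.print_help()
            sys.exit(2)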

    A: Create a Crawler object. While the Crawler object is being constructed, its instance attribute spiders (a SpiderManager) is created as well, as shown below:

    class Crawler(object):

        def __init__(self, settings):
            self.configured = False
            self.settings = settings
            self.signals = SignalManager(self)
            self.stats = load_object(settings['STATS_CLASS'])(self)
            self._start_requests = lambda: ()
            self._spider = None
            # TODO: move SpiderManager to CrawlerProcess
            spman_cls = load_object(self.settings['SPIDER_MANAGER_CLASS'])
            self.spiders = spman_cls.from_crawler(self)  # self.spiders is a SpiderManager

    Each Crawler object owns a single SpiderManager object, and that SpiderManager manages multiple Spiders.
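
    Concretely, the SpiderManager is little more than a name-to-class registry built from the SPIDER_MODULES setting, and create() instantiates the class registered under the requested name. A simplified sketch (paraphrased, not the verbatim source; the helpers are from scrapy.utils):

    from scrapy.utils.misc import walk_modules
    from scrapy.utils.spider import iter_spider_classes


    class SpiderManager(object):
        """Simplified sketch of the spider registry behind crawler.spiders."""

        def __init__(self, spider_modules):
            self.spider_modules = spider_modules
            self._spiders = {}
            # Walk every module listed in SPIDER_MODULES and index the spider
            # classes found there by their `name` attribute.
            for name in self.spider_modules:
                for module in walk_modules(name):
                    for spcls in iter_spider_classes(module):
                        self._spiders[spcls.name] = spcls

        @classmethod
        def from_crawler(cls, crawler):
            # This is what spman_cls.from_crawler(self) resolves to in Crawler.__init__.
            return cls(crawler.settings.getlist('SPIDER_MODULES'))

        def create(self, spider_name, **spider_kwargs):
            # Step B: look up the class registered under `spider_name` and
            # instantiate it with the -a arguments collected by the command.
            try:
                spcls = self._spiders[spider_name]
            except KeyError:
                raise KeyError("Spider not found: %s" % spider_name)
            return spcls(**spider_kwargs)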

    B: Obtain the Spider object via crawler.spiders.create(), i.e. the SpiderManager described above.

    C: Hook the Spider object up to the Crawler object.
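
    In this version of the code Crawler.crawl() does not start anything yet; it merely records the spider and defers producing its start requests until the engine opens it. Roughly (paraphrased):

    class Crawler(object):
        # ... constructor shown above ...

        def crawl(self, spider, requests=None):
            # Paraphrased sketch: remember the spider (step C) and the callable
            # that will produce its start requests once the engine opens it.
            assert self._spider is None, 'Spider already attached'
            self._spider = spider
            spider.set_crawler(self)
            if requests is None:
                self._start_requests = spider.start_requests
            else:
                self._start_requests = lambda: requests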

    D: The start() method of class CrawlerProcess:

    def start(self):
        if self.start_crawling():
            self.start_reactor()

    def start_crawling(self):
        log.scrapy_info(self.settings)
        return self._start_crawler() is not None

    def start_reactor(self):
        if self.settings.getbool('DNSCACHE_ENABLED'):
            reactor.installResolver(CachingThreadedResolver(reactor))
        reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
        reactor.run(installSignalHandlers=False)  # blocking call

    def _start_crawler(self):
        if not self.crawlers or self.stopping:
            return

        name, crawler = self.crawlers.popitem()
        self._active_crawler = crawler
        sflo = log.start_from_crawler(crawler)
        crawler.configure()
        crawler.install()
        crawler.signals.connect(crawler.uninstall, signals.engine_stopped)
        if sflo:
            crawler.signals.connect(sflo.stop, signals.engine_stopped)
        crawler.signals.connect(self._check_done, signals.engine_stopped)
        crawler.start()  # calls Crawler.start()
        return name, crawler

    The start() method of class Crawler:

    @defer.inlineCallbacks
    def start(self):
        yield defer.maybeDeferred(self.configure)
        if self._spider:
            # this is where the Crawler hands the spider over to the ExecutionEngine
            yield self.engine.open_spider(self._spider, self._start_requests())
        yield defer.maybeDeferred(self.engine.start)
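
    The self.engine used here is created in Crawler.configure(), which start() yields on first. A trimmed, paraphrased sketch of configure() (log-formatter and other setup omitted):

    # (method of class Crawler, paraphrased)
    def configure(self):
        if self.configured:
            return
        self.configured = True
        self.extensions = ExtensionManager.from_crawler(self)     # extension manager
        self.engine = ExecutionEngine(self, self._spider_closed)  # the engine driven by start()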

    The ExecutionEngine class itself will be covered in the analysis of the scrapy.core subpackage.

    2. startproject.py

    3. How subcommands are loaded

    The execute() method in cmdline.py contains the following lines:

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)

     _get_commands_dict():

    def _get_commands_dict(settings, inproject):
        cmds = _get_commands_from_module('scrapy.commands', inproject)
        cmds.update(_get_commands_from_entry_points(inproject))
        cmds_module = settings['COMMANDS_MODULE']
        if cmds_module:
            cmds.update(_get_commands_from_module(cmds_module, inproject))
        return cmds
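
    Note the COMMANDS_MODULE branch: it is the hook that lets a project ship its own subcommands. Pointing the setting at a package of command modules (each defining a Command class like the sketch at the top of this post) merges them into the same dictionary. For example, with a hypothetical myproject/commands/ package:

    # myproject/settings.py  (hypothetical project layout)
    # Every module under myproject.commands that defines a ScrapyCommand
    # subclass named Command becomes available as `scrapy <modulename>`.
    COMMANDS_MODULE = 'myproject.commands'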

    _get_commands_from_module(): 

    def _get_commands_from_module(module, inproject):
        d = {}
        for cmd in _iter_command_classes(module):
            if inproject or not cmd.requires_project:
                cmdname = cmd.__module__.split('.')[-1]
                d[cmdname] = cmd()
        return d
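
    The discovery itself happens in _iter_command_classes(), which walks every module under the given package and yields the ScrapyCommand subclasses defined there; the command name is then derived from the module name, as seen above. A paraphrased sketch:

    import inspect

    from scrapy.commands import ScrapyCommand   # 0.2x series: from scrapy.command import ScrapyCommand
    from scrapy.utils.misc import walk_modules


    def _iter_command_classes(module_name):
        # Paraphrased sketch: walk every module under `module_name` and yield
        # the ScrapyCommand subclasses defined in that very module, so each
        # commands/<name>.py contributes its Command class.
        for module in walk_modules(module_name):
            for obj in vars(module).values():
                if (inspect.isclass(obj)
                        and issubclass(obj, ScrapyCommand)
                        and obj.__module__ == module.__name__):
                    yield obj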

    To Be Continued

    Next up: the settings-related logic, in Python.Scrapy.15-scrapy-source-code-analysis-part-5.
