Python.Scrapy.11-scrapy-source-code-analysis-part-1


    Scrapy Source Code Analysis Series - 1: spider, spidermanager, crawler, cmdline, command

    The source code version analyzed is 0.24.6; URL: https://github.com/DiamondStudio/scrapy/blob/0.24.6

    As the Scrapy source tree on GitHub shows, the subpackages are:

    commands, contracts, contrib, contrib_exp, core, http, selector, settings, templates, tests, utils, xlib

    and the top-level modules are:

    _monkeypatches.py, cmdline.py, conf.py, conftest.py, crawler.py, dupefilter.py, exceptions.py,
    extension.py, interface.py, item.py, link.py, linkextractor.py, log.py, logformatter.py, mail.py,
    middleware.py, project.py, resolver.py, responsetypes.py, shell.py, signalmanager.py, signals.py,
    spider.py, spidermanager.py, squeue.py, stats.py, statscol.py, telnet.py, webservice.py

    Let's start the analysis with the most important modules.

    0. Third-party libraries and frameworks that Scrapy depends on

    twisted

    1. Modules: spider, spidermanager, crawler, cmdline, command

    1.1 spider.py spidermanager.py crawler.py

    spider.py defines the spider base class, BaseSpider. Each spider instance can hold only one crawler attribute. So what does a crawler actually provide?
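    Before moving on to the crawler, here is a minimal spider sketch for orientation. The class name, spider name and URL are hypothetical, and the import reflects the 0.24-era module layout (scrapy.spider also exposes the newer Spider name), so treat the details as assumptions:

    from scrapy.spider import BaseSpider


    class MySpider(BaseSpider):
        name = 'myspider'                      # SpiderManager registers the class under this name
        start_urls = ['http://example.com/']   # requests for these URLs are generated at startup

        def parse(self, response):
            # default callback for responses to start_urls
            self.log("visited %s" % response.url)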

    crawler.py defines the classes Crawler and CrawlerProcess.

    Class Crawler depends on SignalManager, ExtensionManager and ExecutionEngine, as well as the settings STATS_CLASS, SPIDER_MANAGER_CLASS and LOG_FORMATTER.

    Class CrawlerProcess runs multiple Crawlers sequentially inside a single process and starts the crawling. It depends on twisted.internet.reactor and twisted.internet.defer. This class comes up again in cmdline.py in section 1.2.
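    To make the division of labor concrete, here is a rough sketch of driving a crawl from a script through CrawlerProcess. The helper names (create_crawler(), crawler.spiders.create()) reflect the 0.24-era API as recalled, and 'myspider' is a hypothetical spider name, so treat the details as assumptions rather than a verified recipe:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())   # load the project's settings
    crawler = process.create_crawler()                 # a Crawler wired up from STATS_CLASS, SPIDER_MANAGER_CLASS, ...
    spider = crawler.spiders.create('myspider')        # look up and instantiate the spider by name
    crawler.crawl(spider)
    process.start()                                    # runs the twisted reactor until the crawl finishes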

    spidermanager.py defines the class SpiderManager, which creates and manages all of the website-specific spiders:

    # excerpt from scrapy/spidermanager.py (0.24.6); imports from the module shown for completeness
    from zope.interface import implements

    from scrapy.interfaces import ISpiderManager
    from scrapy.utils.misc import walk_modules
    from scrapy.utils.spider import iter_spider_classes


    class SpiderManager(object):

        implements(ISpiderManager)

        def __init__(self, spider_modules):
            self.spider_modules = spider_modules
            self._spiders = {}                       # maps spider name -> spider class
            for name in self.spider_modules:
                for module in walk_modules(name):    # import each listed module and its submodules
                    self._load_spiders(module)

        def _load_spiders(self, module):
            for spcls in iter_spider_classes(module):
                self._spiders[spcls.name] = spcls
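    A small usage sketch of the class above, assuming a hypothetical package myproject.spiders that contains spider classes with unique name attributes. In practice the manager is built for you from the SPIDER_MANAGER_CLASS setting, so this is only for illustration:

    from scrapy.spidermanager import SpiderManager

    manager = SpiderManager(['myproject.spiders'])   # walks the package and registers every spider class
    spider_cls = manager._spiders['myspider']        # the name -> class map built in __init__ above
    spider = spider_cls()                            # 'myproject.spiders' and 'myspider' are hypothetical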

    1.2 cmdline.py command.py

    cmdline.py defines one public function: execute(argv=None, settings=None).

    The execute function is the entry method of the scrapy command-line tool, as shown below:

    XiaoKL$ cat `which scrapy`
    #!/usr/bin/python

    # -*- coding: utf-8 -*-
    import re
    import sys

    from scrapy.cmdline import execute

    if __name__ == '__main__':
        sys.argv[0] = re.sub(r'(-script.pyw|.exe)?$', '', sys.argv[0])
        sys.exit(execute())

    So this function is a good entry point for reading the Scrapy source. Here is execute():

    def execute(argv=None, settings=None):
        if argv is None:
            argv = sys.argv

        # --- backwards compatibility for scrapy.conf.settings singleton ---
        if settings is None and 'scrapy.conf' in sys.modules:
            from scrapy import conf
            if hasattr(conf, 'settings'):
                settings = conf.settings
        # ------------------------------------------------------------------

        if settings is None:
            settings = get_project_settings()
        check_deprecated_settings(settings)

        # --- backwards compatibility for scrapy.conf.settings singleton ---
        import warnings
        from scrapy.exceptions import ScrapyDeprecationWarning
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", ScrapyDeprecationWarning)
            from scrapy import conf
            conf.settings = settings
        # ------------------------------------------------------------------

        inproject = inside_project()
        cmds = _get_commands_dict(settings, inproject)
        cmdname = _pop_command_name(argv)
        parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
            conflict_handler='resolve')
        if not cmdname:
            _print_commands(settings, inproject)
            sys.exit(0)
        elif cmdname not in cmds:
            _print_unknown_command(settings, cmdname, inproject)
            sys.exit(2)

        cmd = cmds[cmdname]
        parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
        parser.description = cmd.long_desc()
        settings.setdict(cmd.default_settings, priority='command')
        cmd.settings = settings
        cmd.add_options(parser)
        opts, args = parser.parse_args(args=argv[1:])
        _run_print_help(parser, cmd.process_options, args, opts)

        cmd.crawler_process = CrawlerProcess(settings)
        _run_print_help(parser, _run_command, cmd, args, opts)
        sys.exit(cmd.exitcode)

    execute() mainly does the following: it loads the available scrapy command modules, parses the command line, obtains the settings, and creates a CrawlerProcess object.

    The CrawlerProcess object, the settings and the parsed command-line arguments are all assigned to the ScrapyCommand (or subclass) object.
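    Because execute() accepts an explicit argv, the same code path that the console script follows can be exercised directly from Python. A minimal sketch ('myspider' is a hypothetical spider name, and the call must run inside a Scrapy project for get_project_settings() to find anything):

    from scrapy.cmdline import execute

    # Equivalent to running `scrapy crawl myspider` inside a project directory;
    # argv[0] is only the program name. Note that execute() ends with sys.exit(),
    # so it does not return to the caller.
    execute(['scrapy', 'crawl', 'myspider'])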

    Naturally, the next module to look at is the one that defines the class ScrapyCommand: command.py.

    The subclasses of ScrapyCommand are defined in the subpackage scrapy.commands.

    The _run_print_help() function ultimately calls cmd.run() to execute the command, as follows:

    def _run_print_help(parser, func, *a, **kw):
        try:
            func(*a, **kw)
        except UsageError as e:
            if str(e):
                parser.error(str(e))
            if e.print_help:
                parser.print_help()
            sys.exit(2)

    Here func is the _run_command argument, whose implementation essentially just calls cmd.run():

    def _run_command(cmd, args, opts):
        if opts.profile or opts.lsprof:
            _run_command_profiled(cmd, args, opts)
        else:
            cmd.run(args, opts)

    This decoupled cmdline/commands design is worth borrowing in our own designs; a stripped-down sketch of the pattern follows.
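    The pattern boils down to a few moving parts: a registry that maps command names to command objects, a shared option parser, and a base class whose run() each command overrides. The sketch below is not Scrapy code, just a self-contained illustration of that shape:

    import optparse
    import sys


    class Command(object):
        """Base class: subclasses describe their options and implement run()."""

        def syntax(self):
            return ""

        def add_options(self, parser):
            pass

        def run(self, args, opts):
            raise NotImplementedError


    class HelloCommand(Command):

        def syntax(self):
            return "<name>"

        def run(self, args, opts):
            print("hello %s" % (args[0] if args else "world"))


    COMMANDS = {'hello': HelloCommand()}   # the registry: command name -> command object


    def execute(argv):
        cmdname = argv[1] if len(argv) > 1 else None
        if cmdname not in COMMANDS:
            sys.exit("unknown command: %r" % cmdname)
        cmd = COMMANDS[cmdname]
        parser = optparse.OptionParser(usage="tool %s %s" % (cmdname, cmd.syntax()))
        cmd.add_options(parser)
        opts, args = parser.parse_args(argv[2:])
        cmd.run(args, opts)


    if __name__ == '__main__':
        execute(sys.argv)   # e.g. `python tool.py hello scrapy` prints "hello scrapy"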

    command.py defines the class ScrapyCommand, which is the base class for all Scrapy commands. Here is a quick look at the interface/methods that ScrapyCommand provides:

    class ScrapyCommand(object):

        requires_project = False
        crawler_process = None

        # default settings to be used for this command instead of global defaults
        default_settings = {}

        exitcode = 0

        def __init__(self):
            self.settings = None  # set in scrapy.cmdline

        def set_crawler(self, crawler):
            assert not hasattr(self, '_crawler'), "crawler already set"
            self._crawler = crawler

        @property
        def crawler(self):
            warnings.warn("Command's default `crawler` is deprecated and will be removed. "
                "Use `create_crawler` method to instatiate crawlers.",
                ScrapyDeprecationWarning)

            if not hasattr(self, '_crawler'):
                crawler = self.crawler_process.create_crawler()

                old_start = crawler.start
                self.crawler_process.started = False

                def wrapped_start():
                    if self.crawler_process.started:
                        old_start()
                    else:
                        self.crawler_process.started = True
                        self.crawler_process.start()

                crawler.start = wrapped_start

                self.set_crawler(crawler)

            return self._crawler

        # The bodies of the remaining methods are omitted in this excerpt;
        # the docstrings summarize what each one is responsible for.

        def syntax(self):
            """Command syntax (arguments and options) shown in the usage line."""

        def short_desc(self):
            """One-line description listed by the bare `scrapy` command."""

        def long_desc(self):
            """Longer description used for the command's help output."""

        def help(self):
            """Full help text shown by `scrapy <command> -h`."""

        def add_options(self, parser):
            """Populate the option parser with this command's options."""

        def process_options(self, args, opts):
            """Apply the parsed command-line options to the command's settings."""

        def run(self, args, opts):
            """Do the command's actual work; must be overridden by subclasses."""

    Class attributes of ScrapyCommand:

    requires_project: whether the command must be run inside a Scrapy project
    crawler_process: the CrawlerProcess object, assigned in the execute() function of cmdline.py

    Methods of ScrapyCommand worth focusing on:

    def crawler(self): lazily creates the Crawler object.
    def run(self, args, opts): must be overridden by subclasses (see the sketch below).
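    As a rough illustration of how a subclass fits into the machinery above, here is a hypothetical custom command that a project could expose through the COMMANDS_MODULE setting. Everything here is made up for illustration, and the crawler.spiders.list() helper reflects the 0.24-era spider manager API as recalled, so treat it as an assumption:

    from scrapy.command import ScrapyCommand


    class Command(ScrapyCommand):

        requires_project = True                      # only meaningful inside a project
        default_settings = {'LOG_ENABLED': False}    # merged with priority='command' by execute()

        def syntax(self):
            return "[options]"

        def short_desc(self):
            return "Print the names of all spiders known to the project"

        def run(self, args, opts):
            # self.crawler_process and self.settings were assigned by execute()
            crawler = self.crawler_process.create_crawler()
            for name in sorted(crawler.spiders.list()):
                print(name)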
     

    Next we will look at a concrete ScrapyCommand subclass in detail (see Python.Scrapy.14-scrapy-source-code-analysis-part-4).

    To Be Continued

    The modules analyzed next are signals.py, signalmanager.py, project.py and conf.py, in Python.Scrapy.12-scrapy-source-code-analysis-part-2.
