• The Scrapy framework


    Scrapy

    Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs such as data mining, information processing, and archiving historical data.

    Scrapy covers almost everything needed to crawl a website and is a highly extensible framework; in the crawler world, Scrapy is roughly what Django is to the web.

    Scrapy uses the Twisted asynchronous networking library to handle network communication. The main components and the data flow between them are outlined below.

    Scrapy consists mainly of the following components:

    • Engine
      Handles the data flow across the whole system and triggers events (the core of the framework).
    • Scheduler
      Accepts requests sent over by the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses, or links, of the pages to crawl): it decides which URL is fetched next and removes duplicate URLs.
    • Downloader
      Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model).
    • Spiders
      Spiders do the main work: they extract the information you need from specific pages, i.e. the items (Item). They can also extract links so that Scrapy goes on to crawl the next page.
    • Item Pipeline
      Processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and cleaning out unwanted information. Once a page has been parsed by a spider, its items are sent to the item pipeline and pass through several components in a fixed order.
    • Downloader Middlewares
      Hooks that sit between the Scrapy engine and the downloader; they mainly process the requests and responses exchanged between the engine and the downloader.
    • Spider Middlewares
      Hooks between the Scrapy engine and the spiders; they mainly process the responses going into the spiders and the requests coming out of them.
    • Scheduler Middlewares
      Middleware between the Scrapy engine and the scheduler; it handles the requests and responses sent from the engine to the scheduler.

    The Scrapy run flow is roughly as follows (a minimal spider that exercises this flow is sketched after the list):

      1. The engine takes a link (URL) from the scheduler for the next crawl.
      2. The engine wraps the URL in a Request and passes it to the downloader.
      3. The downloader fetches the resource and wraps it in a Response.
      4. The spider parses the Response.
      5. If items are parsed out, they are handed to the item pipeline for further processing.
      6. If links (URLs) are parsed out, they are handed to the scheduler and wait to be crawled.
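
    As a concrete illustration of the flow, here is a minimal spider sketch: parse() yields a dict (handed to the item pipeline, step 5) and a Request (handed back to the scheduler, step 6). The spider name, site, and XPath selectors are assumptions made for illustration only.

    import scrapy


    class FlowDemoSpider(scrapy.Spider):
        name = 'flow_demo'
        start_urls = ['https://dig.chouti.com/']

        def parse(self, response):
            # step 5: anything item-like that is yielded goes to the item pipelines
            for a in response.xpath('//div[@id="content-list"]//a'):
                yield {'title': a.xpath('./text()').extract_first(),
                       'href': a.xpath('./@href').extract_first()}

            # step 6: yielded Requests go back to the scheduler and are crawled later
            next_page = response.xpath('//a[@class="next"]/@href').extract_first()
            if next_page:
                yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)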

    1. Installing Scrapy

    Windows
          a. pip3 install wheel
          b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
          c. cd into the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
          d. pip3 install scrapy
          e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/

    2. Command-line usage

    1. scrapy startproject <project_name>
       - Create a project in the current directory (similar to Django).

    2. scrapy genspider [-t template] <name> <domain>
       - Create a spider inside the project.
       For example:
          scrapy genspider -t basic oldboy oldboy.com
          scrapy genspider -t xmlfeed autohome autohome.com.cn
       PS:
          List the available templates: scrapy genspider -l
          Dump a template:              scrapy genspider -d <template_name>

    3. scrapy list
       - List the spiders in the project.

    4. scrapy crawl <spider_name>
       - Run a single spider.

    3. Spiders

    The spiders directory holds the spiders themselves; this is where the crawling logic is written.

    name = 'chouti'

    # allowed_domains filters the domains that are crawled: with the OffsiteMiddleware
    # enabled (it is enabled by default), requests to domains outside this list are dropped.
    allowed_domains = ['chouti.com']

    # Start URLs: when the engine asks the scheduler for requests, this list is iterated over.
    start_urls = ['http://dig.chouti.com']

    # The start requests can be customised by overriding start_requests.
    # Either form works, because Scrapy turns the return value into an iterator internally:
    #
    #     def start_requests(self):
    #         for url in self.start_urls:
    #             yield Request(url=url, callback=self.parse)
    #
    #     def start_requests(self):
    #         req_list = []
    #         for url in self.start_urls:
    #             req_list.append(Request(url=url, callback=self.parse))
    #         return req_list
    Start URL configuration

    Selectors

    In Chrome you can right-click an element and choose Copy → Copy XPath to get the element's XPath.

    # There are two ways to run the selectors:
    # Option 1: call xpath() on the response directly
    #     response.xpath('//div[@id="content-list"]/div[@class="item"]')

    # Option 2: build a selector object explicitly
    #     from scrapy.selector import HtmlXPathSelector
    #     hxs = HtmlXPathSelector(response=response)
    #     div = hxs.xpath('//div[@id="content-list"]/div[@class="item"]')

    XPath search rules:
    # //   matches descendants at any depth
    # /    matches direct children
    # .//  matches descendants of the current node
    #             //a
    #             //div/a
    #             //a[re:test(@id, "id+")]
    #
    #             items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
    #             for item in items:
    #                 item.xpath('.//div')
    #
    #             text = row.xpath('.//div[@class="part1"]//a[1]/text()').extract_first()
    #             # text is the element's text content
    #             href = row.xpath('.//div[@class="part1"]//a[1]/@href').extract_first()
    #             # href is the attribute value
    #
    #         Extracting results:
    #             selector objects: xpath('/html/body/ul/li/a/@href')
    #             list of strings:  xpath('/html/body/ul/li/a/@href').extract()
    #             single value:     xpath('//body/ul/li/a/@href').extract_first()
    Selectors
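
    To try these rules outside a running spider, a Selector can be built directly from an HTML string. A minimal, self-contained sketch (the HTML snippet and variable names are made up for illustration):

    from scrapy.selector import Selector

    html = '''
    <div id="content-list">
        <div class="item"><div class="part1"><a href="/link/1">First</a></div></div>
        <div class="item"><div class="part1"><a href="/link/2">Second</a></div></div>
    </div>
    '''

    sel = Selector(text=html)
    for row in sel.xpath('//div[@id="content-list"]/div[@class="item"]'):
        # .// keeps the search inside the current row
        text = row.xpath('.//div[@class="part1"]//a[1]/text()').extract_first()
        href = row.xpath('.//div[@class="part1"]//a[1]/@href').extract_first()
        print(text, href)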

    4. Items

    Items define the data storage template used to structure the scraped data, much like Django's Model.

    class XianglongItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        href = scrapy.Field()
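
    A sketch of how a spider could populate and yield this item from parse(); the import path and the XPath expressions follow the earlier examples and are assumptions about the project layout and page structure:

    import scrapy
    from xianglong.items import XianglongItem


    class XianglongSpider(scrapy.Spider):
        name = 'xianglong'
        start_urls = ['http://dig.chouti.com/']

        def parse(self, response):
            for row in response.xpath('//div[@id="content-list"]/div[@class="item"]'):
                item = XianglongItem()
                item['title'] = row.xpath('.//div[@class="part1"]//a[1]/text()').extract_first()
                item['href'] = row.xpath('.//div[@class="part1"]//a[1]/@href').extract_first()
                yield item  # handed on to the item pipelines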

    5. Pipelines

    Pipelines define the data-processing behaviour, typically persisting the structured data.

    #             class FilePipeline(object):
    #                 def process_item(self, item, spider):
    #                     print('write to file', item['href'])
    #
    #                     # pass the item on to the next pipeline class
    #                     return item
    #
    #                 def open_spider(self, spider):
    #                     """
    #                     Called when the spider starts.
    #                     :param spider:
    #                     :return:
    #                     """
    #                     print('open file')
    #
    #                 def close_spider(self, spider):
    #                     """
    #                     Called when the spider closes.
    #                     :param spider:
    #                     :return:
    #                     """
    #                     print('close file')
    Basic usage
    To drop an item so that later pipelines never receive it, raise DropItem instead of returning the item:
    #                 from scrapy.exceptions import DropItem
    #                 class FilePipeline(object):
    #
    #                     def process_item(self, item, spider):
    #                         print('write to file', item['href'])
    #
    #                         # return item
    #                         raise DropItem()
    DropItem
    #         - Read the relevant values from the settings file, then do the pipeline work
    #             class FilePipeline(object):
    #                 def __init__(self,path):
    #                     self.path = path
    #                     self.f = None
    #
    #                 @classmethod
    #                 def from_crawler(cls, crawler):
    #                     """
    #                     Called at initialisation time to create the pipeline object.
    #                     :param crawler:
    #                     :return:
    #                     """
    #                     # read the output path from the settings file
    #                     path = crawler.settings.get('XL_FILE_PATH')
    #                     return cls(path)
    #
    #                 def process_item(self, item, spider):
    #                     self.f.write(item['href'] + '\n')
    #                     return item
    #
    #                 def open_spider(self, spider):
    #                     """
    #                     Called when the spider starts.
    #                     :param spider:
    #                     :return:
    #                     """
    #                     self.f = open(self.path,'w')
    #
    #                 def close_spider(self, spider):
    #                     """
    #                     Called when the spider closes.
    #                     :param spider:
    #                     :return:
    #                     """
    #                     self.f.close()            
    The five pipeline methods
    ### settings.py ###
    ITEM_PIPELINES = {
       'xianglong.pipelines.FilePipeline': 300,
       'xianglong.pipelines.DBPipeline': 301,
    }
    # the lower the number, the higher the priority
    
    class FilePipeline(object):
    
        def process_item(self, item, spider):
            '''
            :param item:
            :param spider:
            :return:
            '''
            print('write to file')
            # the return value is handed to the next pipeline's process_item
            return item
            # alternatively: raise DropItem(), so that later pipelines never see the item
        def open_spider(self, spider):
            '''
            :param spider: called when the spider starts
            :return:
            '''
    
            print('open file')
        def close_spider(self, spider):
            '''
            :param spider: called when the spider closes
            :return:
            '''
            print('close file')
    
    class DBPipeline(object):
        def process_item(self, item, spider):
            print('write to database')
            return item
    
        def open_spider(self, spider):
            print('open database')
    
        def close_spider(self, spider):
            print('close database')
    
    # Execution order: FilePipeline.open_spider ---> DBPipeline.open_spider,
    # then FilePipeline.process_item ---> DBPipeline.process_item,
    # and finally FilePipeline.close_spider ---> DBPipeline.close_spider.
    Multiple pipelines
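
    Putting these pieces together, here is a minimal sketch of a single pipeline that reads its output path from the settings and drops duplicates before later pipelines run; the setting name XL_FILE_PATH follows the earlier example, and the dedupe-by-href rule is an assumption made for illustration:

    from scrapy.exceptions import DropItem


    class UniqueHrefFilePipeline(object):
        def __init__(self, path):
            self.path = path
            self.f = None
            self.seen = set()

        @classmethod
        def from_crawler(cls, crawler):
            # read the output path from settings.py
            return cls(crawler.settings.get('XL_FILE_PATH'))

        def open_spider(self, spider):
            self.f = open(self.path, 'w')

        def process_item(self, item, spider):
            if item['href'] in self.seen:
                raise DropItem('duplicate href')  # later pipelines never see this item
            self.seen.add(item['href'])
            self.f.write(item['href'] + '\n')
            return item  # pass the item on to the next pipeline

        def close_spider(self, spider):
            self.f.close()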

    6. POST requests / request headers / cookies

    Example: log in to Chouti automatically and upvote a post.
    #
    #         POST + request headers:
    #             from scrapy.http import Request
    #             req = Request(
    #                 url='http://dig.chouti.com/login',
    #                 method='POST',
    #                 body='phone=8613121758648&password=woshiniba&oneMonth=1',
    #                 headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    #                 cookies={},
    #                 callback=self.parse_check_login,
    #             )
    POST + request headers
    #         Cookies:
    #             Manual handling:
    #               from scrapy.http.cookies import CookieJar
    #                 cookie_dict = {}
    #                 cookie_jar = CookieJar()
    #                 cookie_jar.extract_cookies(response, response.request)
    #                 for k, v in cookie_jar._cookies.items():
    #                     for i, j in v.items():
    #                         for m, n in j.items():
    #                             cookie_dict[m] = n.value
    #
    #                 req = Request(
    #                     url='http://dig.chouti.com/login',
    #                     method='POST',
    #                     headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    #                     body='phone=8615131255089&password=pppppppp&oneMonth=1',
    #                     cookies=cookie_dict,  # carry the cookies manually
    #                     callback=self.check_login
    #                 )
    #                 yield req
    Manual cookies
    #             Automatic handling: ****meta={'cookiejar': True}****
    #                 class ChoutiSpider(scrapy.Spider):
    #                     name = 'chouti'
    #                     allowed_domains = ['chouti.com']
    #                     start_urls = ['http://dig.chouti.com/',]
    #
    #                     def start_requests(self):
    #                         for url in self.start_urls:
    #                             yield Request(url=url,callback=self.parse_index,meta={'cookiejar':True})
    #
    #                     def parse_index(self,response):
    #                         req = Request(
    #                             url='http://dig.chouti.com/login',
    #                             method='POST',
    #                             headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    #                             body='phone=8613121758648&password=woshiniba&oneMonth=1',
    #                             callback=self.parse_check_login,
    #                             meta={'cookiejar': True}
    #                         )
    #                         yield req
    #
    #                     def parse_check_login(self,response):
    #                         # print(response.text)
    #                         yield Request(
    #                             url='https://dig.chouti.com/link/vote?linksId=19440976',
    #                             method='POST',
    #                             callback=self.parse_show_result,
    #                             meta={'cookiejar': True}
    #                         )
    #
    #                     def parse_show_result(self,response):
    #                         print(response.text)
    #
    #         The settings file controls whether cookie handling is enabled:
    #             # Disable cookies (enabled by default)
    #             # COOKIES_ENABLED = False
    Automatic cookies
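
    As an alternative to hand-building the POST body, Scrapy's FormRequest encodes the form data and sets the Content-Type header for you. A minimal sketch, assuming the same phone/password/oneMonth form fields as in the example above (the credentials are placeholders):

    import scrapy
    from scrapy.http import FormRequest


    class ChoutiLoginSpider(scrapy.Spider):
        name = 'chouti_login'
        start_urls = ['https://dig.chouti.com/']

        def parse(self, response):
            # FormRequest URL-encodes formdata and sets the Content-Type header for us
            yield FormRequest(
                url='https://dig.chouti.com/login',
                formdata={'phone': '86131xxxxxxx', 'password': 'xxxx', 'oneMonth': '1'},
                callback=self.parse_check_login,
                meta={'cookiejar': True},  # let the cookies middleware track the session
            )

        def parse_check_login(self, response):
            print(response.text)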

    7. Deduplication

    URLs that have already been visited are not visited again, which avoids wasting resources.

    #     Dedup rules
    #         Setting:
    #             DUPEFILTER_CLASS = 'xianglong.dupe.MyDupeFilter'
    #         The filter class:
    #             class MyDupeFilter(BaseDupeFilter):
    #                 def __init__(self):
    #                     self.record = set()
    #
    #                 @classmethod
    #                 def from_settings(cls, settings):
    #                     return cls()
    #
    #                 def request_seen(self, request):
    #                     if request.url in self.record:
    #                         print('already visited', request.url)
    #                         return True
    #                     self.record.add(request.url)
    #
    #                 def open(self):  # can return deferred
    #                     pass
    #
    #                 def close(self, reason):  # can return a deferred
    #                     pass
    #
    #         Problem: building a unique identifier (fingerprint) for a request
    #
    #
    #             http://www.oldboyedu.com?id=1&age=2
    #             http://www.oldboyedu.com?age=2&id=1
    #
    #             from scrapy.utils.request import request_fingerprint
    #             from scrapy.http import Request
    #
    #
    #             u1 = Request(url='http://www.oldboyedu.com?id=1&age=2')
    #             u2 = Request(url='http://www.oldboyedu.com?age=2&id=1')
    #
    #             result1 = request_fingerprint(u1)
    #             result2 = request_fingerprint(u2)
    #             print(result1,result2)
    #             # result1 == result2: both URLs produce the same fingerprint, because the
    #             # canonicalised request is hashed into a fixed-length string internally,
    #             # so a char column can be used when indexing it.
    
    
    #         Question: does the visit record have to go into a database? [use a Redis set]
    #             The visit record can be kept in Redis.
    #
    #
    #         Extra: where does dont_filter actually take effect?
    #             from scrapy.core.scheduler import Scheduler
    #              def enqueue_request(self, request):
    #                 # request.dont_filter=False:
    #                     # self.df.request_seen(request):
    #                     #   - True, already visited -> dropped
    #                     #   - False, not visited yet -> enqueued
    #                 # request.dont_filter=True: every request is enqueued
    #                 if not request.dont_filter and self.df.request_seen(request):
    #                     self.df.log(request, self.spider)
    #                     return False
    #                 # past the check: push the request into the scheduler queue
    #                 dqok = self._dqpush(request)
                    
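    Following the note above about keeping the visit record in Redis, here is a minimal sketch of a Redis-backed dupe filter. It assumes the redis Python package and a Redis server on localhost, and the key name is made up for illustration:

    import redis
    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint


    class RedisDupeFilter(BaseDupeFilter):
        """Keeps seen request fingerprints in a Redis set, so dedup survives restarts."""

        def __init__(self):
            self.conn = redis.Redis(host='127.0.0.1', port=6379)
            self.key = 'scrapy:seen_fingerprints'

        @classmethod
        def from_settings(cls, settings):
            return cls()

        def request_seen(self, request):
            fp = request_fingerprint(request)
            # sadd returns 0 when the member was already in the set
            return self.conn.sadd(self.key, fp) == 0

    # settings.py:
    # DUPEFILTER_CLASS = 'xianglong.dupe.RedisDupeFilter'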

    8. Middleware

    1. Downloader middleware

    What are downloader middlewares for?
    They let you customise the request before each download and the response after it, e.g. the user-agent, a proxy, or cookies.

    Proxy example:

    #     Downloader middleware
    #         Question: how do you add a proxy in Scrapy?
    #         Solutions:
    #             Option 1: use the built-in proxy support
    #                 # -*- coding: utf-8 -*-
    #                 import os
    #                 import scrapy
    #                 from scrapy.http import Request
    #
    #                 class ChoutiSpider(scrapy.Spider):
    #                     name = 'chouti'
    #                     allowed_domains = ['chouti.com']
    #                     start_urls = ['https://dig.chouti.com/']
    #
    #                     def start_requests(self):
    #                         os.environ['HTTP_PROXY'] = "http://192.168.11.11"
    #
    #                         for url in self.start_urls:
    #                             yield Request(url=url,callback=self.parse)
    #
    #                     def parse(self, response):
    #                         print(response)
    #
    #             Option 2: a custom downloader middleware
    #                 import random
    #                 import base64
    #                 import six
    #                 def to_bytes(text, encoding=None, errors='strict'):
    #                     """Return the binary representation of `text`. If `text`
    #                     is already a bytes object, return it as-is."""
    #                     if isinstance(text, bytes):
    #                         return text
    #                     if not isinstance(text, six.string_types):
    #                         raise TypeError('to_bytes must receive a unicode, str or bytes '
    #                                         'object, got %s' % type(text).__name__)
    #                     if encoding is None:
    #                         encoding = 'utf-8'
    #                     return text.encode(encoding, errors)
    #
    #                 class MyProxyDownloaderMiddleware(object):
    #                     def process_request(self, request, spider):
    #                         proxy_list = [
    #                             {'ip_port': '111.11.228.75:80', 'user_pass': 'xxx:123'},
    #                             {'ip_port': '120.198.243.22:80', 'user_pass': ''},
    #                             {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
    #                             {'ip_port': '101.71.27.120:80', 'user_pass': ''},
    #                             {'ip_port': '122.96.59.104:80', 'user_pass': ''},
    #                             {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
    #                         ]
    #                         proxy = random.choice(proxy_list)
    #                         request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
    #                         if proxy['user_pass']:
    #                             # base64.encodestring was removed in Python 3.9; b64encode is the replacement
    #                             encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
    #                             request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
    #
    #
    #
    #                 Setting:
    #                     DOWNLOADER_MIDDLEWARES = {
    #                        # 'xiaohan.middlewares.MyProxyDownloaderMiddleware': 543,
    #                     }
    #
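
    Besides proxies, the same process_request hook is often used to rotate the User-Agent header. A minimal sketch (the middleware name and the agent strings are illustrative):

    import random


    class RandomUserAgentMiddleware(object):
        """Downloader middleware that picks a random User-Agent for every outgoing request."""

        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        ]

        def process_request(self, request, spider):
            request.headers['User-Agent'] = random.choice(self.user_agents)
            # returning None lets the request continue through the remaining middlewares

    # settings.py:
    # DOWNLOADER_MIDDLEWARES = {
    #     'xiaohan.middlewares.RandomUserAgentMiddleware': 544,
    # }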

    2. Spider middleware

    Spider middleware
    #         Implementation:
    #             middlewares.py
    #                 class XiaohanSpiderMiddleware(object):
    #                     # Not all methods need to be defined. If a method is not defined,
    #                     # scrapy acts as if the spider middleware does not modify the
    #                     # passed objects.
    #                     def __init__(self):
    #                         pass
    #                     @classmethod
    #                     def from_crawler(cls, crawler):
    #                         # This method is used by Scrapy to create your spiders.
    #                         s = cls()
    #                         return s
    #
    #                     # Called after each download completes, before the parse callback runs.
    #                     def process_spider_input(self, response, spider):
    #                         # Called for each response that goes through the spider
    #                         # middleware and into the spider.
    #
    #                         # Should return None or raise an exception.
    #                         print('process_spider_input',response)
    #                         return None
    #
    #                     def process_spider_output(self, response, result, spider):
    #                         # Called with the results returned from the Spider, after
    #                         # it has processed the response.
    #
    #                         # Must return an iterable of Request, dict or Item objects.
    #                         print('process_spider_output',response)
    #                         for i in result:
    #                             yield i
    #
    #                     def process_spider_exception(self, response, exception, spider):
    #                         # Called when a spider or process_spider_input() method
    #                         # (from other spider middleware) raises an exception.
    #
    #                         # Should return either None or an iterable of Response, dict
    #                         # or Item objects.
    #                         pass
    #
    #                     # Triggered when the spider starts and start_requests runs for the first time (only once).
    #                     def process_start_requests(self, start_requests, spider):
    #                         # Called with the start requests of the spider, and works
    #                         # similarly to the process_spider_output() method, except
    #                         # that it doesn’t have a response associated.
    #
    #                         # Must return only requests (not items).
    #
    #                         print('process_start_requests')
    #                         for r in start_requests:
    #                             yield r
    #
    #         Enable it:
    #             SPIDER_MIDDLEWARES = {
    #                'xiaohan.middlewares.XiaohanSpiderMiddleware': 543,
    #             }
    #
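
    A small concrete use of process_spider_output: a spider middleware sketch that drops scraped items missing an 'href' field and lets everything else pass through (the middleware name and the href rule are assumptions for illustration):

    import scrapy


    class RequireHrefMiddleware(object):
        """Spider middleware: filter out items with no 'href'; pass requests through unchanged."""

        def process_spider_output(self, response, result, spider):
            for obj in result:
                # result is a mix of Request objects and items; only filter the items
                if isinstance(obj, (dict, scrapy.Item)) and not obj.get('href'):
                    spider.logger.debug('dropping item without href from %s', response.url)
                    continue
                yield obj

    # settings.py:
    # SPIDER_MIDDLEWARES = {
    #     'xiaohan.middlewares.RequireHrefMiddleware': 544,
    # }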

    9. Custom extensions

    engine_started = object()
    engine_stopped = object()
    spider_opened = object()
    spider_idle = object()
    spider_closed = object()
    spider_error = object()
    request_scheduled = object()
    request_dropped = object()
    response_received = object()
    response_downloaded = object()
    item_scraped = object()
    item_dropped = object()
    Available signals

    Writing an extension:

    #     Extensions and signals
    #         A plain extension:
    #             extends.py
    #                 class MyExtension(object):
    #                     def __init__(self):
    #                         pass
    #
    #                     @classmethod
    #                     def from_crawler(cls, crawler):
    #                         obj = cls()
    #                         return obj
    #             Setting:
    #                 EXTENSIONS = {
    #                     'xiaohan.extends.MyExtension':500,
    #                 }
    #
    #         Extension + signals:
    #             extends.py
    #                 from scrapy import signals
    #
    #
    #                 class MyExtension(object):
    #                     def __init__(self):
    #                         pass
    #
    #                     @classmethod
    #                     def from_crawler(cls, crawler):
    #                         obj = cls()
    #                         # when the spider opens, all handlers connected to spider_opened are triggered: xxxxxxxxxxx1
    #                         crawler.signals.connect(obj.xxxxxxxxxxx1, signal=signals.spider_opened)
    #                         # when the spider closes, all handlers connected to spider_closed are triggered: uuuuuuuuuu
    #                         crawler.signals.connect(obj.uuuuuuuuuu, signal=signals.spider_closed)
    #                         return obj
    #
    #                     def xxxxxxxxxxx1(self, spider):
    #                         print('open')
    #
    #                     def uuuuuuuuuu(self, spider):
    #                         print('close')
    #
    #             Setting:
    #             EXTENSIONS = {
    #                 'xiaohan.extends.MyExtension':500,
    #             }
    #
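
    Building on the signal list above, here is a minimal extension sketch that counts scraped items and logs the total when the spider closes (the class and module names are illustrative):

    from scrapy import signals


    class ItemCountExtension(object):
        """Counts scraped items via the item_scraped signal and reports on spider_closed."""

        def __init__(self):
            self.count = 0

        @classmethod
        def from_crawler(cls, crawler):
            obj = cls()
            crawler.signals.connect(obj.item_scraped, signal=signals.item_scraped)
            crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
            return obj

        def item_scraped(self, item, response, spider):
            self.count += 1

        def spider_closed(self, spider, reason):
            spider.logger.info('scraped %d items (%s)', self.count, reason)

    # settings.py:
    # EXTENSIONS = {
    #     'xiaohan.extends.ItemCountExtension': 500,
    # }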

    10. The settings file explained

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for step8_king project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     http://doc.scrapy.org/en/latest/topics/settings.html
    #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    
    # 1. Bot name
    BOT_NAME = 'step8_king'
    
    # 2. Spider module paths
    SPIDER_MODULES = ['step8_king.spiders']
    NEWSPIDER_MODULE = 'step8_king.spiders'
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # 3. Client User-Agent header
    # USER_AGENT = 'step8_king (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    # 4. Whether to obey robots.txt
    # ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    # 5. Maximum number of concurrent requests
    # CONCURRENT_REQUESTS = 4
    
    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    # 6. Download delay (seconds)
    # DOWNLOAD_DELAY = 2
    
    
    # The download delay setting will honor only one of:
    # 7. Concurrency per domain; the download delay is also applied per domain
    # CONCURRENT_REQUESTS_PER_DOMAIN = 2
    # Concurrency per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the download delay is applied per IP instead
    # CONCURRENT_REQUESTS_PER_IP = 3
    
    # Disable cookies (enabled by default)
    # 8. Whether cookie handling (the cookiejar) is enabled
    # COOKIES_ENABLED = True
    # COOKIES_DEBUG = True
    
    # Disable Telnet Console (enabled by default)
    # 9. The Telnet console lets you inspect and control the running crawler
    #    connect with: telnet <ip> <port>, then issue commands
    # TELNETCONSOLE_ENABLED = True
    # TELNETCONSOLE_HOST = '127.0.0.1'
    # TELNETCONSOLE_PORT = [6023,]
    
    
    # 10. Default request headers
    # Override the default request headers:
    # DEFAULT_REQUEST_HEADERS = {
    #     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #     'Accept-Language': 'en',
    # }
    
    
    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    # 11. Item pipelines that process scraped items
    # ITEM_PIPELINES = {
    #    'step8_king.pipelines.JsonPipeline': 700,
    #    'step8_king.pipelines.FilePipeline': 500,
    # }
    
    
    
    # 12. Custom extensions, driven by signals
    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    # EXTENSIONS = {
    #     # 'step8_king.extensions.MyExtension': 500,
    # }
    
    
    # 13. Maximum crawl depth (the current depth is available via request.meta); 0 means unlimited
    # DEPTH_LIMIT = 3
    
    # 14. Crawl order: 0 = depth-first (LIFO, the default); 1 = breadth-first (FIFO)
    
    # LIFO, depth-first
    # DEPTH_PRIORITY = 0
    # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
    # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
    # FIFO, breadth-first
    
    # DEPTH_PRIORITY = 1
    # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
    # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
    
    # 15. Scheduler class
    # SCHEDULER = 'scrapy.core.scheduler.Scheduler'
    # from scrapy.core.scheduler import Scheduler
    
    
    # 16. URL dedup filter
    # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'
    
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    
    """
    17. 自动限速算法
        from scrapy.contrib.throttle import AutoThrottle
        自动限速设置
        1. 获取最小延迟 DOWNLOAD_DELAY
        2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY
        3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY
        4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间
        5. 用于计算的... AUTOTHROTTLE_TARGET_CONCURRENCY
        target_delay = latency / self.target_concurrency
        new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间
        new_delay = max(target_delay, new_delay)
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
        slot.delay = new_delay
    """
    
    # enable AutoThrottle
    # AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    # initial download delay
    # AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    # maximum download delay
    # AUTOTHROTTLE_MAX_DELAY = 10
    # The average number of requests Scrapy should be sending in parallel to each remote server
    # target average number of parallel requests per remote server
    # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    
    # Enable showing throttling stats for every response received:
    # whether to show throttling stats for every response
    # AUTOTHROTTLE_DEBUG = True
    
    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    
    
    """
    18. 启用缓存
        目的用于将已经发送的请求或相应缓存下来,以便以后使用
        
        from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
        from scrapy.extensions.httpcache import DummyPolicy
        from scrapy.extensions.httpcache import FilesystemCacheStorage
    """
    # whether HTTP caching is enabled
    # HTTPCACHE_ENABLED = True
    
    # cache policy: cache every request; subsequent identical requests are served from the cache
    # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
    # cache policy: follow HTTP response headers such as Cache-Control and Last-Modified
    # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
    
    # cache expiration time (seconds; 0 = never expire)
    # HTTPCACHE_EXPIRATION_SECS = 0
    
    # cache directory
    # HTTPCACHE_DIR = 'httpcache'
    
    # HTTP status codes that are never cached
    # HTTPCACHE_IGNORE_HTTP_CODES = []
    
    # cache storage backend
    # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
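
    Most of these options can also be overridden for a single spider through the custom_settings class attribute, instead of editing the project-wide settings.py. A minimal sketch with illustrative values:

    import scrapy


    class PoliteSpider(scrapy.Spider):
        name = 'polite'
        start_urls = ['https://dig.chouti.com/']

        # per-spider overrides; these take precedence over the project settings.py
        custom_settings = {
            'DOWNLOAD_DELAY': 2,
            'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
            'AUTOTHROTTLE_ENABLED': True,
        }

        def parse(self, response):
            print(response.status, response.url)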

    11. Custom commands

    • Create a directory at the same level as spiders, e.g. commands
    • Inside it, create a crawlall.py file (this file name becomes the command name)
    • In settings.py add: COMMANDS_MODULE = '<project name>.<directory name>'
    • Run the command from the project directory: scrapy crawlall
    The crawlall command:
    #         from scrapy.commands import ScrapyCommand
    #         from scrapy.utils.project import get_project_settings
    #
    #
    #         class Command(ScrapyCommand):
    #             requires_project = True
    #
    #             def syntax(self):
    #                 return '[options]'
    #
    #             def short_desc(self):
    #                 return 'Runs all of the spiders'
    #
    #             def run(self, args, opts):
    #                 spider_list = self.crawler_process.spiders.list()
    #                 for name in spider_list:
    #                     self.crawler_process.crawl(name, **opts.__dict__)
    #                 self.crawler_process.start()
    #
    #
    #             PS: what the run() method ultimately does (from the source):
    #                 def run(self, args, opts):
    #                     from scrapy.crawler import CrawlerProcess
    #                     CrawlerProcess.crawl
    #                     CrawlerProcess.start
    #                     """
    #                     self.crawler_process keeps the running crawls in: _active = {d,}
    #                     """
    #                     self.crawler_process.crawl('chouti',**opts.__dict__)
    #                     self.crawler_process.crawl('cnblogs',**opts.__dict__)
    #                     #
    #                     self.crawler_process.start()
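
    The same "run everything" behaviour is also possible from a plain script using CrawlerProcess, without defining a command at all. A minimal sketch; it assumes it is run from the project root so that the project settings are picked up:

    # run_all.py, next to scrapy.cfg
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    process = CrawlerProcess(get_project_settings())
    for name in process.spider_loader.list():
        process.crawl(name)
    process.start()  # blocks until every crawl has finished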

     

     Reference: http://www.cnblogs.com/wupeiqi/articles/6229292.html

  • Original post: https://www.cnblogs.com/chenxuming/p/9274748.html