• Notes from debugging Scrapy


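    1. Fixing "Memory Error" when crawling a large number of pages: keep the pending URLs in a queue of your own, and hand new requests to Scrapy only when the spider goes idle. For example: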

    from scrapy import Request, Spider, signals
    
    
    class ExampleSpider(Spider):
        name = "example"
        start_urls = ['http://www.example.com/']
    
        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(ExampleSpider, cls).from_crawler(crawler, *args, **kwargs)
            # connect the handler to the spider_idle signal through the crawler
            # (scrapy.xlib.pydispatch is not available in current Scrapy releases)
            crawler.signals.connect(spider.queue_more_requests,
                                    signal=signals.spider_idle)
            return spider
    
        def queue_more_requests(self, spider):
            # this function runs every time the spider has finished processing
            # all requests/items (i.e. it is idle)
    
            # get the next urls from your database/file
            # (get_urls_from_somewhere is a placeholder you implement yourself)
            urls = self.get_urls_from_somewhere()
    
            # if there are no more urls to process, do nothing and the
            # spider will finally close
            if not urls:
                return
    
            # iterate through the urls, create a request for each, then send
            # them back to the engine; this gets the spider out of its idle state
            for url in urls:
                # equivalent to the old make_requests_from_url(url) helper
                req = Request(url, dont_filter=True)
                # older Scrapy releases expect engine.crawl(req, spider) here
                self.crawler.engine.crawl(req)
    
        def parse(self, response):
            pass

    More info on the spider_idle signal: http://doc.scrapy.org/en/latest/topics/signals.html#spider-idle

    More info on debugging memory leaks: http://doc.scrapy.org/en/latest/topics/leaks.html
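
    The get_urls_from_somewhere call in the spider is left undefined. Here is a minimal sketch of one way to back it, assuming the pending URLs sit in a plain text file with one URL per line. The class name UrlBatchReader, the batch_size parameter, and the file layout are illustrative assumptions, not part of Scrapy.

```python
import itertools


class UrlBatchReader(object):
    """Hypothetical helper: hands out URLs from a text file in small batches."""

    def __init__(self, path, batch_size=100):
        self.batch_size = batch_size
        self._handle = open(path)
        # lazily strip each line and skip blanks, so the whole file is
        # never loaded into memory at once
        self._urls = (line.strip() for line in self._handle if line.strip())

    def next_batch(self):
        # returns up to batch_size URLs; an empty list means the file is exhausted
        batch = list(itertools.islice(self._urls, self.batch_size))
        if not batch:
            self._handle.close()
        return batch
```

    Inside the spider, get_urls_from_somewhere could then simply return the reader's next_batch(), so only one batch of requests is queued at a time.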

    P.S. There is also a way to cap the crawl depth (apparently via settings.py?); still to be looked into.
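
    For the depth cap, Scrapy does have a DEPTH_LIMIT setting. A minimal settings.py fragment, with an arbitrary example value rather than a recommendation:

```python
# settings.py (fragment)

# Maximum link depth to follow from the start URLs; 0 (the default) means no limit.
# The value 3 is an arbitrary example.
DEPTH_LIMIT = 3
```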

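    2. If the requested URL does not exist (404), the callback never receives a response object, so the spider appears to do nothing. By default Scrapy's HttpError spider middleware drops non-2xx responses before they reach the callback.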
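
    To have 404 responses delivered to the callback anyway, one option is Scrapy's HTTPERROR_ALLOWED_CODES setting. The sketch below pairs it with describe_response, a hypothetical helper named only for illustration; it relies on nothing more than the response's status and url attributes.

```python
# settings.py (fragment): let 404 responses reach spider callbacks
HTTPERROR_ALLOWED_CODES = [404]


# spider module: hypothetical helper to call from parse()
def describe_response(response):
    """Return a short log line that tells missing pages apart from normal ones."""
    if response.status == 404:
        return "missing page: %s" % response.url
    return "ok (%d): %s" % (response.status, response.url)
```

    The same effect can be scoped to a single spider with its handle_httpstatus_list attribute instead of a project-wide setting.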

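    3. Encoding issues

    In pubmed_spider.py:

    import sys
    reload(sys)
    # Python 2 only: the interpreter's default encoding is ASCII
    # (Python 3 has no sys.setdefaultencoding and already defaults to UTF-8)
    sys.setdefaultencoding("utf-8")

    This makes sure the scraped data is handled as UTF-8.

    In pipeline.py, file = codecs.open('/%s.txt' % (item['name']), mode='w', encoding='utf-8') writes the data out as UTF-8 (this needs import codecs at the top of the file).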
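
    A minimal sketch of the same write step as a small function, assuming each item is a dict with 'name' and 'text' keys. The 'text' key, the save_item name, and the out_dir parameter are assumptions made for this example. io.open accepts an encoding argument on both Python 2 and Python 3, so the text is written as UTF-8 without changing the interpreter default; on Python 2 the text must be a unicode object.

```python
import io
import os


def save_item(item, out_dir):
    """Write item['text'] to <out_dir>/<name>.txt as UTF-8 and return the path."""
    path = os.path.join(out_dir, '%s.txt' % item['name'])
    # the encoding is explicit, so the platform or interpreter default does not matter
    with io.open(path, mode='w', encoding='utf-8') as handle:
        handle.write(item['text'])
    return path
```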

  • Original article: https://www.cnblogs.com/zhouliyan/p/5970665.html