• Integrating the Scrapy framework with HTTP


    If you simply call a Scrapy spider from inside Flask, you may run into errors like these:

    ValueError: signal only works in main thread
    # or
    twisted.internet.error.ReactorNotRestartable

    There are several ways to solve this.

    1 Use a Python subprocess (subprocess)

    First, make sure the project directory structure looks something like this:

    > tree -L 1                                                                                                                                                              
    
    ├── dirbot
    ├── README.rst
    ├── scrapy.cfg
    ├── server.py
    └── setup.py

    Then start the spider in a new process:

    # server.py
    import subprocess
    
    from flask import Flask
    app = Flask(__name__)
    
    @app.route('/')
    def hello_world():
        """
        Run the spider in another process and store the items in a file. Simply issue the command:
    
        > scrapy crawl dmoz -o "output.json"
    
        Wait for this command to finish, then read output.json and return it to the client.
        
        """
        spider_name = "dmoz"
        subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
        with open("output.json") as items_file:
            return items_file.read()
    
    if __name__ == '__main__':
        app.run(debug=True)
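
    Note that "scrapy crawl ... -o output.json" appends to an existing file, so a second request would pile new items on top of the old ones. Below is a minimal variation of the handler (a sketch, not part of the original answer) that writes each run to its own temporary file; the -O flag used here overwrites instead of appending and requires Scrapy >= 2.4:

    # server.py (variant): one temporary output file per request
    import os
    import subprocess
    import tempfile
    
    from flask import Flask
    app = Flask(__name__)
    
    @app.route('/')
    def hello_world():
        # create an empty temporary file to hold this run's items
        fd, path = tempfile.mkstemp(suffix=".json")
        os.close(fd)
        try:
            # -O (Scrapy >= 2.4) overwrites the file instead of appending;
            # on older Scrapy, delete the file first and use -o instead
            subprocess.check_output(['scrapy', 'crawl', 'dmoz', '-O', path])
            with open(path) as items_file:
                return items_file.read()
        finally:
            os.remove(path)
    
    if __name__ == '__main__':
        app.run(debug=True)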


    2 Use Twisted-Klein + Scrapy

    The code is as follows:

    # server.py
    import json
    
    from klein import route, run
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    
    from dirbot.spiders.dmoz import DmozSpider
    
    
    class MyCrawlerRunner(CrawlerRunner):
        """
        Crawler object that collects items and returns output after finishing crawl.
        """
        def crawl(self, crawler_or_spidercls, *args, **kwargs):
            # keep all items scraped
            self.items = []
    
            # create crawler (Same as in base CrawlerProcess)
            crawler = self.create_crawler(crawler_or_spidercls)
    
            # handle each item scraped
            crawler.signals.connect(self.item_scraped, signals.item_scraped)
    
            # create a Deferred that launches the crawl
            dfd = self._crawl(crawler, *args, **kwargs)
    
            # add a callback - when the crawl is done, call return_items
            dfd.addCallback(self.return_items)
            return dfd
    
        def item_scraped(self, item, response, spider):
            self.items.append(item)
    
        def return_items(self, result):
            return self.items
    
    
    def return_spider_output(output):
        """
        :param output: items scraped by CrawlerRunner
        :return: json with list of items
        """
        # this just turns items into dictionaries
        # you may want to use Scrapy JSON serializer here
        return json.dumps([dict(item) for item in output])
    
    
    @route("/")
    def schedule(request):
        runner = MyCrawlerRunner()
        spider = DmozSpider()
        deferred = runner.crawl(spider)
        deferred.addCallback(return_spider_output)
        return deferred
    
    
    run("localhost", 8080)
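
    The route handler returns the Deferred produced by runner.crawl(). Klein accepts a Deferred as a handler's return value and waits for it to fire before writing the HTTP response, so the client receives the JSON-serialized items once the crawl has finished, without blocking the Twisted reactor.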

    3 Use ScrapyRT

    Install ScrapyRT, then start it:

    > scrapyrt 
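
    scrapyrt must be started from the Scrapy project directory (the one containing scrapy.cfg); it then exposes the project's spiders over an HTTP API, by default on port 9080. Below is a minimal client sketch; the spider name and start URL are just the ones from the example project above, and the port assumes ScrapyRT's default:

    # client.py: call ScrapyRT's /crawl.json endpoint and print the scraped items
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen
    
    params = urlencode({
        "spider_name": "dmoz",           # spider registered in the project (from the example)
        "url": "http://www.dmoz.org/",   # start URL handed to the spider
    })
    with urlopen("http://localhost:9080/crawl.json?" + params) as resp:
        data = json.load(resp)
    
    # ScrapyRT returns the scraped items under the "items" key of the JSON response
    print(data["items"])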

    Source: https://stackoverflow.com/questions/36384286/how-to-integrate-flask-scrapy

  • Original post: https://www.cnblogs.com/Im-Victor/p/15473986.html