• Python crawler: using Scrapy (Part 1)


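      This post covers installing and using Scrapy, a Python crawler framework. Mediocrity is like a stain on a white shirt: once it sets in it never washes out, and nothing can undo it.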

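    Installing and using Scrapy

    My machine runs 64-bit Windows 10 with Python 3.6.3. What follows is the installation and a first Scrapy example.

    1. Preparing to install Scrapy

    Run the following command: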

    pip install scrapy

    Because Microsoft Visual C++ 14.0 was not installed on my machine, the following error appeared:

    building 'twisted.test.raiser' extension
        error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

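    There are two ways to fix this. One is to install Microsoft Visual C++ Build Tools, which is a fairly large download, so I did not go that route. The other is to install a precompiled build of Twisted: https://www.lfd.uci.edu/~gohlke/pythonlibs hosts precompiled Python libraries, including the Twisted library that Scrapy needs. In the file name, cp36 means Python 3.6 and amd64 means 64-bit.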
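    To pick the right file, it helps to compare the tags in the wheel's file name with the interpreter you are running. The following is a minimal sketch, not part of the original setup: the helper names parse_wheel_name and interpreter_tags are made up for illustration, and the "cp" prefix assumes the standard CPython interpreter.

```python
import struct
import sys


def parse_wheel_name(filename):
    """Split a wheel file name into its dash-separated fields."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    # Layout: name-version-python_tag-abi_tag-platform_tag, with an
    # optional build tag between the version and the python tag.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel file name: " + filename)
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }


def interpreter_tags():
    """Return the cpXY tag and pointer width (32 or 64) of this interpreter."""
    # Assumption: CPython, hence the "cp" prefix.
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    bits = struct.calcsize("P") * 8
    return python_tag, bits


info = parse_wheel_name("Twisted-17.9.0-cp36-cp36m-win_amd64.whl")
print(info["python_tag"], info["platform_tag"])  # cp36 win_amd64
print(interpreter_tags())
```

    If the first value printed by interpreter_tags() matches the wheel's python_tag and the bit width matches the platform tag (amd64 for 64-bit), the file should be the right one for your environment.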

    After downloading the file, run the following command to install Twisted:

    pip install D:\360Download\Twisted-17.9.0-cp36-cp36m-win_amd64.whl

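    Finally, run pip install scrapy again; this time the installation succeeds.

    A whl file is essentially a zip archive containing .py files along with compiled .pyd files. This lets you install a package without a build environment, as long as you pick the file that matches your Python environment.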
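    Since a whl file is a zip archive, Python's standard zipfile module can open it. The sketch below builds a tiny stand-in archive so that it runs without downloading anything; the helper list_archive and the "demo" file and member names are invented for illustration and do not come from a real package.

```python
import os
import tempfile
import zipfile


def list_archive(path):
    """Return the member names of a zip-format archive such as a .whl file."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Build a small stand-in archive with a .whl extension.
# The names below are hypothetical, chosen only for this example.
tmp_dir = tempfile.mkdtemp()
demo_whl = os.path.join(tmp_dir, "demo-0.1-py3-none-any.whl")
with zipfile.ZipFile(demo_whl, "w") as archive:
    archive.writestr("demo/__init__.py", "VALUE = 1\n")
    archive.writestr("demo-0.1.dist-info/METADATA", "Name: demo\nVersion: 0.1\n")

members = list_archive(demo_whl)
print(members)
```

    Pointing list_archive at the downloaded Twisted wheel should list its .py and .pyd members in the same way.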

    2. Running a first Scrapy example

    Create a Python file named quotes_spider.py with the following content:

    import scrapy
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/tag/humor/',
        ]
    
        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.xpath('span/small/text()').extract_first(),
                }
    
            next_page = response.css('li.next a::attr("href")').extract_first()
            if next_page is not None:
                yield response.follow(next_page, self.parse)

    Run this command in the directory that contains the file:

    scrapy runspider quotes_spider.py -o quotes.json

    This produces the following error:

        import win32api
    ModuleNotFoundError: No module named 'win32api'

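    The win32api module needs to be installed; it is available at https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/. Choose the installer there that matches your Python version and architecture.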
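    Both errors so far came from a missing dependency, so it can save time to check up front which modules Python can find. This is a small sketch using only the standard library; the helper name is_installed is made up for illustration, and it is only meant for top-level module names.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Modules mentioned in this post; "MISSING" means pip or an installer is needed.
for name in ("twisted", "win32api", "scrapy"):
    status = "ok" if is_installed(name) else "MISSING"
    print("{:10s} {}".format(name, status))
```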

    Once it is installed, run scrapy runspider quotes_spider.py -o quotes.json again. The quotes.json file is now generated, with the following content:

    [
    {"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
    {"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
    {"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
    {"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
    {"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
    {"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
    {"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
    {"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
    {"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
    {"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
    {"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
    {"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
    ]
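    The curly quotation marks in the scraped text are stored in the file as JSON escape sequences (\u201c, \u201d, \u2019), because the exporter escapes non-ASCII characters by default. Any JSON parser turns them back into the real characters. A minimal sketch with Python's standard json module, using one record copied from the output:

```python
import json

# One record in the same shape as the quotes.json output.
raw = '[{"text": "\\u201cA day without sunshine is like, you know, night.\\u201d", "author": "Steve Martin"}]'

quotes = json.loads(raw)
text = quotes[0]["text"]
print(text)  # the escapes are decoded into curly quotes

# ensure_ascii=False writes the characters themselves instead of \uXXXX escapes.
readable = json.dumps(quotes, ensure_ascii=False)
print(readable)
```

    If you would rather have readable characters in the file itself, Scrapy has a FEED_EXPORT_ENCODING setting; setting it to 'utf-8' should make the exporter write them unescaped.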

  • Original article: https://www.cnblogs.com/huhx/p/baseusepythonscrapy1.html