这里我们介绍一下python的分布式爬虫框架scrapy的安装以及使用。平庸这东西犹如白衬衣上的污痕,一旦染上便永远洗不掉,无可挽回。
scrapy的安装使用
我的电脑环境是win10,64位的。python版本是3.6.3。以下是安装以及学习scrapy的第一个案例。
一、scrapy的安装准备
直接运行以下命令
pip install scrapy
由于我的电脑上面没有安装Microsoft Visual C++ 14.0。会出现如下的错误。
building 'twisted.test.raiser' extension error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
解决方案有两种,一种是安装Microsoft Visual C++ Build Tools。这个比较大,这里我没有使用这种方式。可以直接安装网上已经编译好的twisted版本。可以在https://www.lfd.uci.edu/~gohlke/pythonlibs上找到已经编译好的python库。我们找到scrapy需要的twisted库。cp36表示python版本3.6,amd64表示64位。
下载安装之后,运行以下命令安装Twisted。
pip install D:360DownloadTwisted-17.9.0-cp36-cp36m-win_amd64.whl
最后再运行 pip install scrapy可以成功安装。
whl格式本质上是一个压缩包,里面包含了py文件,以及经过编译的pyd文件。使得可以在不具备编译环境的情况下,选择合适自己的python环境进行安装。
二、运行scrapy的第一个案例
创建python文件quotes_spider.py,内容如下
import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = [ 'http://quotes.toscrape.com/tag/humor/', ] def parse(self, response): for quote in response.css('div.quote'): yield { 'text': quote.css('span.text::text').extract_first(), 'author': quote.xpath('span/small/text()').extract_first(), } next_page = response.css('li.next a::attr("href")').extract_first() if next_page is not None: yield response.follow(next_page, self.parse)
在相应的目录下运行命令
scrapy runspider quotes_spider.py -o quotes.json
以上会出现以下的错误:
import win32api ModuleNotFoundError: No module named 'win32api'
需要安装win32api,地址https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/。这里我们选择安装.
安装完之后,重新运行scrapy runspider quotes_spider.py -o quotes.json,可以看到成功的生成quotes.json文件。内容如下
[ {"text": "u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.u201d", "author": "Jane Austen"}, {"text": "u201cA day without sunshine is like, you know, night.u201d", "author": "Steve Martin"}, {"text": "u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.u201d", "author": "Garrison Keillor"}, {"text": "u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.u201d", "author": "Jim Henson"}, {"text": "u201cAll you need is love. But a little chocolate now and then doesn't hurt.u201d", "author": "Charles M. Schulz"}, {"text": "u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.u201d", "author": "Suzanne Collins"}, {"text": "u201cSome people never go crazy. What truly horrible lives they must lead.u201d", "author": "Charles Bukowski"}, {"text": "u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.u201d", "author": "Terry Pratchett"}, {"text": "u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!u201d", "author": "Dr. Seuss"}, {"text": "u201cThe reason I talk to myself is because Iu2019m the only one whose answers I accept.u201d", "author": "George Carlin"}, {"text": "u201cI am free of all prejudice. I hate everyone equally. u201d", "author": "W.C. Fields"}, {"text": "u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.u201d", "author": "Jane Austen"} ]
友情链接
- scrapy的官方文档:https://docs.scrapy.org/en/latest/
- 编译好的python文件地址:https://www.lfd.uci.edu/~gohlke/pythonlibs