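Scrapy now supports Python 3.
In PyCharm, it can be installed from the menu File > Default Settings > Project Interpreter by searching for the scrapy package;
it can also be installed with pip: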
$ pip install scrapy==1.1.0rc1
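In Scrapy, each Item object represents one page of a website. Different items can be defined, with fields such as url, content, header, and image.
First, create a Scrapy project in the current directory: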
$ scrapy startproject wikiSpider
This creates a project folder named wikiSpider, which contains items.py, settings.py, a spiders folder, and other files;
in the spiders folder, create a new file named articleSpider.py:
from scrapy import Spider
from wikiSpider.items import Article


class ArticleSpider(Spider):
    # Name used on the command line: scrapy crawl article
    name = 'article'
    # Requests to other domains are filtered out
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['http://en.wikipedia.org/wiki/Main_Page',
                  'http://en.wikipedia.org/wiki/Python_%28programming_language%29']

    def parse(self, response):
        # Called once for each downloaded page
        item = Article()
        # Text of the first <h1> element on the page
        title = response.xpath('//h1/text()')[0].extract()
        print('title is :' + title)
        item['title'] = title
        return item
Change items.py to:
from scrapy import Item, Field


class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
Also change the log level in settings.py so that the output is easier to read:
LOG_LEVEL = 'ERROR'
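Scrapy's logging is built on Python's standard logging module, so the effect of an ERROR threshold can be illustrated without Scrapy. The following is only a sketch of the filtering behaviour, not of Scrapy's internals; the logger name and the messages are made up for illustration.

```python
import io
import logging

# Collect log output in memory so it can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))

logger = logging.getLogger('log_level_demo')  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.ERROR)  # same threshold as LOG_LEVEL = 'ERROR'

logger.debug('a debug message')   # below the threshold, dropped
logger.info('an info message')    # below the threshold, dropped
logger.error('an error message')  # at the threshold, kept

captured = stream.getvalue()
print(captured, end='')  # prints "ERROR: an error message"
```

Text written with print() does not go through the logging system, which is why the spider's own title lines remain visible after the log level is raised.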
Then run the following from the main wikiSpider project directory:
$ scrapy crawl article
The debugging output then appears:
title is :Main Page
title is :Python (programming language)