After installing Scrapy, create a new project:
scrapy startproject tutorial
This creates a tutorial directory with the following files:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These are basically:
- scrapy.cfg: the project configuration file
- tutorial/: the project’s python module, you’ll later import your code from here.
- tutorial/items.py: the project’s items file.
- tutorial/pipelines.py: the project’s pipelines file.
- tutorial/settings.py: the project’s settings file.
- tutorial/spiders/: a directory where you’ll later put your spiders.
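For reference, scrapy.cfg is just a small INI file that points Scrapy at the project's settings module. Its generated contents typically look roughly like the sketch below (the exact file may differ slightly between Scrapy versions, e.g. an extra [deploy] section):

[settings]
default = tutorial.settings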
Defining our Item
Items are containers that will be loaded with the scraped data; they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.
They are declared by creating a scrapy.item.Item class and defining its attributes as scrapy.item.Field objects, like you would in an ORM (don’t worry if you’re not familiar with ORMs, you will see that this is an easy task).
We begin by modeling the item that we will use to hold the site data obtained from dmoz.org. As we want to capture the name, URL and description of the sites, we define a field for each of these three attributes. To do that, we edit items.py, found in the tutorial directory. Our Item class looks like this:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
This may seem complicated at first, but defining the item allows you to use other handy components of Scrapy that need to know what your item looks like.
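To see the protection against undeclared fields in action, here is a minimal sketch; it assumes the DmozItem class above has been saved in tutorial/items.py and is importable:

from tutorial.items import DmozItem

item = DmozItem()
item['title'] = 'Text Processing in Python'   # declared field, fine
item['link'] = 'http://gnosis.cx/TPiP/'       # declared field, fine

# Assigning a field that was never declared raises a KeyError,
# which catches typos such as 'titel' early:
try:
    item['titel'] = 'oops'
except KeyError as e:
    print 'undeclared field:', e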
Our first Spider
Spiders are user-written classes used to scrape information from a domain (or group of domains).
They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.
To create a Spider, you must subclass scrapy.spider.BaseSpider, and define the three main, mandatory, attributes:
- name: identifies the Spider. It must be unique, that is, you can’t set the same name for different Spiders.
- start_urls: a list of URLs where the Spider will begin to crawl from. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.
- parse(): a method of the spider, which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument. This method is responsible for parsing the response data and extracting scraped data (as scraped items) and more URLs to follow.
The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).
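As a hedged sketch of what such a parse() method can look like once extraction is added (the XPaths and the follow-every-link rule below are made up for illustration, and the actual first spider shown next is much simpler):

import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

from tutorial.items import DmozItem

class FollowSketchSpider(BaseSpider):
    name = "follow_sketch"   # illustrative name, not part of the tutorial
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Return scraped data: one item per listed site (placeholder XPaths).
        for site in hxs.select('//ul/li'):
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            yield item
        # Return more URLs to follow, re-using parse() as the callback.
        for href in hxs.select('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href), callback=self.parse)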
This is the code for our first Spider; save it in a file named dmoz_spider.py under the tutorial/spiders directory:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
name   # the spider's identifier; this attribute must be unique
Crawling
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl dmoz
The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:
2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)
Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).
But more interesting, as our parse method instructs, two files have been created in the project's top-level directory: Books and Resources, with the content of both URLs.
What just happened under the hood?
Scrapy creates scrapy.http.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.
These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse()method.
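A rough, simplified way to picture this (not the actual internals) is a spider that builds those Requests explicitly by overriding start_requests(); the spider name and URL list below are only for illustration:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class ExplicitStartSpider(BaseSpider):
    name = "explicit_start"   # hypothetical spider

    def start_requests(self):
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        ]
        # One Request per start URL, with parse() as its callback; the scheduler
        # executes them and feeds the resulting Responses back to parse().
        return [Request(url, callback=self.parse) for url in urls]

    def parse(self, response):
        self.log("Got response from %s" % response.url)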
Extracting Items
Introduction to Selectors
There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms see the XPath selectors documentation.
Here are some examples of XPath expressions and their meanings:
- /html/head/title: selects the <title> element, inside the <head> element of a HTML document
- /html/head/title/text(): selects the text inside the aforementioned <title> element.
- //td: selects all the <td> elements
- //div[@class="mine"]: selects all div elements which contain an attribute class="mine"
These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.
For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours, HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). In order to use them you must instantiate the desired class with a Response object.
You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated to the root node, or the entire document.
Selectors have three methods (click on the method to see the complete API documentation).
- select(): returns a list of selectors, each of them representing the nodes selected by the XPath expression given as argument.
- extract(): returns a unicode string with the data selected by the XPath selector.
- re(): returns a list of unicode strings extracted by applying the regular expression given as argument.
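Before moving to the shell, here is a small self-contained sketch that exercises the XPath expressions and selector methods above against a made-up HTML snippet (the document body and URL are invented for illustration, and the commented outputs are approximate):

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

body = """<html><head><title>Open Directory - Python: Books</title></head>
<body>
  <table><tr><td>a cell</td></tr></table>
  <div class="mine">my div</div>
</body></html>"""

response = HtmlResponse(url="http://www.example.com/", body=body)
hxs = HtmlXPathSelector(response)

print hxs.select('/html/head/title/text()').extract()      # [u'Open Directory - Python: Books']
print hxs.select('//td/text()').extract()                   # [u'a cell']
print hxs.select('//div[@class="mine"]/text()').extract()   # [u'my div']
print hxs.select('//title/text()').re('(\w+):')             # [u'Python']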
Trying Selectors in the Shell
To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.
To start a shell, you must go to the project’s top level directory and run:
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
The output looks similar to this:
[ ... Scrapy log here ... ]

[s] Available Scrapy objects:
[s]   2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s]   hxs        <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s]   item       Item()
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   spider     <BaseSpider 'default' at 0x1b6c2d0>
[s]   xxs        <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s]   shelp()            Print this help
[s]   fetch(req_or_url)  Fetch a new request or URL and update shell objects
[s]   view(response)     View response in a browser

In [1]:
After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.
The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable) with this response. So let’s try them:
- scrapy shell url   # url is the URL of the page you want to extract from
- hxs = HtmlXPathSelector(response)
- self.title = hxs.select('//title/text()').extract()[0].strip().replace(' ', '_')
- sites = hxs.select('//ul/li/div/a/img/@src').extract()
In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]

In [2]: hxs.select('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: hxs.select('//title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]

In [4]: hxs.select('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: hxs.select('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
Extracting the data
Now, let’s try to extract some real information from those pages.
You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this an easier task, you can use some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.
After inspecting the page source, you’ll find that the sites' information is inside a <ul> element; in fact, the second <ul> element.
So we can select each <li> element belonging to the sites list with this code:
hxs.select('//ul/li')
And from them, the sites descriptions:

hxs.select('//ul/li/text()').extract()

The sites titles:

hxs.select('//ul/li/a/text()').extract()

And the sites links:

hxs.select('//ul/li/a/@href').extract()
As we said before, each select() call returns a list of selectors, so we can concatenate further select() calls to dig deeper into a node. We are going to use that property here, so:
sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc
Let’s add this code to our spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
Now try crawling the dmoz.org domain again and you’ll see sites being printed in your output, run:
scrapy crawl dmoz
Using our item
Item objects are custom Python dicts; you can access the values of their fields (attributes of the class we defined earlier) using the standard dict syntax like:
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
Spiders are expected to return their scraped data inside Item objects. So, in order to return the data we’ve scraped so far, the final code for our Spider would be like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
I ran into an error with the above. My file layout is now as follows:

tutorial/
    tutorial/
        spiders/
            __init__.py
            dmoz_spider.py
        __init__.py
        items.py
        pipelines.py
        settings.py

At first I used

from tutorial.items import DmozItem

but PyDev reported an error saying that tutorial has no items module.
Changing it to from tutorial.tutorial.items import DmozItem satisfied PyDev,
but then running scrapy crawl dmoz from the command line failed, complaining that tutorial has no module tutorial.items... wait, that there is no tutorial.tutorial.items module.
Changing it back to from tutorial.items import DmozItem works.
Now doing a crawl on the dmoz.org domain yields DmozItem objects:
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]'],
      'link': [u'http://gnosis.cx/TPiP/'],
      'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR] '],
      'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
      'title': [u'XML Processing with Python']}
Storing the scraped data
The simplest way to store the scraped data is by using the Feed exports, with the following command:
scrapy crawl dmoz -o items.json -t json
That will generate an items.json file containing all scraped items, serialized in JSON.
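As a quick sanity check, and assuming the command above was run from the project's top-level directory so items.json sits there, the exported feed can be read back with the standard json module:

import json

with open('items.json') as f:
    items = json.load(f)

print len(items), 'items scraped'
print items[0]['title']   # each field is a list, as in the log output above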
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. Though you don’t need to implement any item pipeline if you just want to store the scraped items.
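For illustration, a minimal item pipeline might look like the sketch below; the drop-empty-title rule and the class name are made up, and only the process_item() hook is required:

# tutorial/pipelines.py
from scrapy.exceptions import DropItem

class DropEmptyTitlePipeline(object):
    def process_item(self, item, spider):
        # Discard items whose 'title' field came back empty.
        if not item.get('title'):
            raise DropItem("missing title in %s" % item)
        return item

To turn it on, register the class in tutorial/settings.py; Scrapy 0.24 uses a dict such as ITEM_PIPELINES = {'tutorial.pipelines.DropEmptyTitlePipeline': 300}, while older releases used a plain list of class paths.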
Next steps
This tutorial covers only the basics of Scrapy, but there’s a lot of other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.
Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.
A note on versions: Scrapy 0.24 made some changes to the item definition.

Previously it was:
from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    link = Field()
    desc = Field()
Now it is:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
I made similar changes elsewhere in the code.
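For completeness, a sketch of the same spider written against the newer API; scrapy.Spider and the response.xpath() shortcut are the 0.24-era equivalents of BaseSpider and HtmlXPathSelector (treat exact version boundaries as approximate):

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # The response object exposes selectors directly via response.xpath().
        for site in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            yield item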