Writing a Crawler with Scrapy

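Yesterday I wrote a simple crawler in Python that grabbed the images on a page.
A crawler for real use, though, has to cope with much messier conditions and needs to be smarter, and there is no point reinventing the wheel.
Crawling has long been one of Python's strong suits, so I expected plenty of mature third-party frameworks. After some searching on Baidu I chose Scrapy as the platform for building a more complex crawler.
I won't go into downloading and installing Scrapy. At the time of writing it only supported Python 2.x, which was frustrating, so I installed Python 2.7.
After installing, I worked through two pages of the documentation, "Scrapy Tutorial" and "Scrapy at a glance".
The concepts and steps are summarized briefly below: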
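Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also fetch data returned by APIs (such as Amazon Associates Web Services) or serve as a general-purpose web crawler. Its uses include data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture looks roughly like this:
[Figure: Scrapy Architecture]
Scrapy consists mainly of the following components: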
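- Engine: handles the data flow across the whole system and triggers events.
- Scheduler: accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again.
- Downloader: downloads web page content and returns it to the spiders.
- Spiders: do the main work; they define the parsing rules for a particular domain or set of pages.
- Item pipeline: processes the items that spiders extract from pages; its main tasks are cleaning, validating, and storing data. After a page is parsed by a spider, its items are sent to the item pipeline and processed by several components in a specific order.
- Downloader middlewares: a hook framework between the Scrapy engine and the downloader, mainly handling the requests and responses that pass between them.
- Spider middlewares: a hook framework between the Scrapy engine and the spiders, handling the spiders' input (responses) and output (requests).
- Scheduler middlewares: middleware between the Scrapy engine and the scheduler, handling the requests and responses sent from the engine to the scheduler.
Scrapy makes collecting data from the web very convenient: it already does a great deal of the work, so we don't have to build it all ourselves.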
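To make the division of labour concrete, here is a toy model of the data flow in plain Python 3. It is only a sketch of the idea, not Scrapy's implementation: the `FAKE_WEB` dictionary, the `example.test` URLs, and every function name are made up for illustration, and everything runs synchronously rather than on Twisted.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page body, outgoing links).
FAKE_WEB = {
    "http://example.test/": ("  index  ", ["http://example.test/a", "http://example.test/b"]),
    "http://example.test/a": ("page a", ["http://example.test/b"]),
    "http://example.test/b": ("page b", []),
}


class Scheduler:
    """Queues requests from the engine and hands them back on demand."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, url):
        if url not in self._seen:  # skip URLs that were already scheduled
            self._seen.add(url)
            self._queue.append(url)

    def next_request(self):
        return self._queue.popleft() if self._queue else None


def download(url):
    """Downloader: return the body and links for a URL."""
    return FAKE_WEB[url]


def spider_parse(url, body, links):
    """Spider: yield one scraped item (a dict) and the URLs to follow (strings)."""
    yield {"url": url, "content": body}
    for link in links:
        yield link


def item_pipeline(item, store):
    """Item pipeline: clean the item, then store it."""
    item["content"] = item["content"].strip()
    store.append(item)


def run_engine(start_url):
    """Engine: move requests, responses, and items between the components."""
    scheduler, store = Scheduler(), []
    scheduler.enqueue(start_url)
    while True:
        url = scheduler.next_request()
        if url is None:
            break
        body, links = download(url)
        for result in spider_parse(url, body, links):
            if isinstance(result, dict):
                item_pipeline(result, store)
            else:
                scheduler.enqueue(result)
    return store


items = run_engine("http://example.test/")
print([item["url"] for item in items])
```

Each page is visited once even though two pages link to `/b`, because the scheduler remembers what it has already queued.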
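Main steps:
```
1. Create a Scrapy project
2. Define the Items to extract
3. Write a spider that crawls the site and extracts Items
4. Write an Item Pipeline to store the extracted Items (i.e. the data)
```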
1. scrapy startproject bbsDmoz
2. Edit the items.py file in the bbsDmoz directory and model the item on the data you want to collect from the BBS site:
    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    from scrapy.item import Item, Field


    class BbsDmozItem(Item):
        # one Field per piece of data extracted from a post
        url = Field()      # address of the post
        forum = Field()    # board the post belongs to
        poster = Field()   # author of the post
        content = Field()  # body text of the post
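Declaring the fields up front matters because an Item behaves like a dictionary that only accepts the declared keys, so a mistyped field name fails loudly instead of silently creating a new key. The class below is a plain-Python stand-in that mimics that behaviour so the idea can be tried without Scrapy installed; `SketchItem` is a hypothetical name, not part of Scrapy.

```python
class SketchItem(dict):
    """Plain-Python stand-in for a Scrapy Item: a dict limited to declared fields."""

    fields = ("url", "forum", "poster", "content")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


post = SketchItem()
post["poster"] = "alice"        # declared field: accepted
try:
    post["posetr"] = "alice"    # typo in the field name: rejected
    typo_accepted = True
except KeyError:
    typo_accepted = False

print(dict(post), typo_accepted)
```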
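3. Create a Spider and save it under bbsDmoz/spiders. It must subclass scrapy.Spider.
There are many ways to extract data from a web page. Scrapy uses a mechanism based on XPath and CSS expressions: Scrapy Selectors.
Here are some example XPath expressions and what they mean: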
- /html/head/title: selects the <title> element inside the <head> of the HTML document
- /html/head/title/text(): selects the text of that <title> element
- //td: selects all <td> elements
- //div[@class="mine"]: selects all div elements that have the attribute class="mine"
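The same four selections can be tried out with nothing but the standard library. This is only an approximation: `xml.etree.ElementTree` supports a subset of XPath, needs well-formed markup, uses paths relative to the root element, and exposes text through `.text` rather than `text()`. The sample page is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
PAGE = """
<html>
  <head><title>Board list</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title  ->  path relative to the <html> root
title = root.find("./head/title")

# /html/head/title/text()  ->  ElementTree exposes the text as .text
title_text = title.text

# //td  ->  .//td
cells = [td.text for td in root.findall(".//td")]

# //div[@class="mine"]  ->  .//div[@class='mine']
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print(title_text, cells, mine)
```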
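A Selector has four basic methods:
- xpath(): takes an XPath expression and returns a selector list of all nodes matching it.
- css(): takes a CSS expression and returns a selector list of all nodes matching it.
- extract(): serializes the matched nodes to unicode strings and returns them as a list.
- re(): extracts data using the given regular expression and returns a list of unicode strings.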
4. Write the spider code.
Below is our first spider, saved as forumSpider.py in the bbsDmoz/spiders directory:
    # -*- coding: utf-8 -*-
    '''
    bbsSpider, Created on Oct, 2014
    #version: 1.0
    #author: chenqx @http://chenqx.github.com
    See more: http://doc.scrapy.org/en/latest/index.html
    '''
    # Import paths as of Scrapy 1.0; older releases kept these under scrapy.contrib.*
    from scrapy.http import Request
    from scrapy.spiders import CrawlSpider
    from scrapy.loader import ItemLoader
    from scrapy.linkextractors import LinkExtractor
    from bbsDmoz.items import BbsDmozItem


    class forumSpider(CrawlSpider):
        # name used by "scrapy crawl"
        name = 'bbsSpider'
        allowed_domains = ['bbs.sjtu.edu.cn']
        start_urls = ['https://bbs.sjtu.edu.cn/bbsall']
        # regular expressions for the three kinds of links to follow
        link_extractor = {
            'page': LinkExtractor(allow=r'/bbsdoc,board,\w+\.html$'),
            'page_down': LinkExtractor(allow=r'/bbsdoc,board,\w+,page,\d+\.html$'),
            'content': LinkExtractor(allow=r'/bbscon,board,\w+,file,M\.\d+\.A\.html$'),
        }
        # XPath queries for the fields of a post
        _x_query = {
            'page_content': '//pre/text()[2]',
            'poster': '//pre/a/text()',
            'forum': '//center/text()[2]',
        }

        def parse(self, response):
            # board list -> first page of each board
            for link in self.link_extractor['page'].extract_links(response):
                yield Request(url=link.url, callback=self.parse_page)

        def parse_page(self, response):
            # board page -> further pages of the same board
            for link in self.link_extractor['page_down'].extract_links(response):
                yield Request(url=link.url, callback=self.parse_page)

            # board page -> individual posts
            for link in self.link_extractor['content'].extract_links(response):
                yield Request(url=link.url, callback=self.parse_content)

        def parse_content(self, response):
            # post page -> one item
            bbsItem_loader = ItemLoader(item=BbsDmozItem(), response=response)
            url = str(response.url)
            bbsItem_loader.add_value('url', url)
            bbsItem_loader.add_xpath('forum', self._x_query['forum'])
            bbsItem_loader.add_xpath('poster', self._x_query['poster'])
            bbsItem_loader.add_xpath('content', self._x_query['page_content'])
            return bbsItem_loader.load_item()
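The `allow` patterns are regular expressions, so they can be checked with plain `re` before running a crawl. The patterns below are written with `\w`, `\d`, and escaped dots, which is the form a regular expression needs to match such URLs. The sample URLs are hypothetical, shaped to fit the patterns rather than copied from the real site.

```python
import re

PATTERNS = {
    "page": r"/bbsdoc,board,\w+\.html$",
    "page_down": r"/bbsdoc,board,\w+,page,\d+\.html$",
    "content": r"/bbscon,board,\w+,file,M\.\d+\.A\.html$",
}

# Hypothetical URLs shaped like the patterns; not taken from the real site.
SAMPLES = [
    "https://bbs.sjtu.edu.cn/bbsdoc,board,Python.html",
    "https://bbs.sjtu.edu.cn/bbsdoc,board,Python,page,12.html",
    "https://bbs.sjtu.edu.cn/bbscon,board,Python,file,M.1412345678.A.html",
    "https://bbs.sjtu.edu.cn/bbsall",
]


def classify(url):
    """Return the names of all patterns that match the URL."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, url)]


for url in SAMPLES:
    print(url, classify(url))
```

Each sample matches at most one pattern, which is what lets the three callbacks stay separate: board pages, follow-up pages, and posts never overlap.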
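5. Write the Item Pipeline
Once an Item has been collected in the spider, it is passed to the Item Pipeline, where several components process it in a defined order.
Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and processed no further.
Typical uses of item pipelines:
- Cleaning HTML data
- Validating scraped data (checking that items contain certain fields)
- Checking for duplicates (and dropping them)
- Saving the results, for example to a database or to XML, JSON, or other files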
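The contract is small enough to sketch without Scrapy: each component has a `process_item` method that either returns the item or raises an exception to drop it, and components run in ascending order of their configured number. In the sketch below, `DropItem` is a local stand-in for `scrapy.exceptions.DropItem`, and the pipeline classes, the `run_pipelines` helper, and the sample items are all hypothetical.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, so the sketch runs without Scrapy."""


class ValidatePipeline:
    """Drop items that lack a required field."""

    def process_item(self, item, spider):
        if not item.get("content"):
            raise DropItem("missing content")
        return item


class DedupPipeline:
    """Drop items whose URL has already been seen."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item["url"] in self.seen:
            raise DropItem("duplicate url")
        self.seen.add(item["url"])
        return item


def run_pipelines(items, pipelines):
    """Pass each item through the pipelines in ascending order value."""
    ordered = [p for p, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]
    kept = []
    for item in items:
        try:
            for pipeline in ordered:
                item = pipeline.process_item(item, spider=None)
        except DropItem:
            continue  # a dropped item goes no further
        kept.append(item)
    return kept


scraped = [
    {"url": "u1", "content": "hello"},
    {"url": "u1", "content": "hello again"},  # duplicate URL
    {"url": "u2", "content": ""},             # fails validation
    {"url": "u3", "content": "world"},
]
kept = run_pipelines(scraped, {ValidatePipeline(): 100, DedupPipeline(): 200})
print([item["url"] for item in kept])
```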
This example writes the items the spider collects from the BBS to the bbsData.xml file.
    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    from scrapy import signals
    from scrapy.exporters import XmlItemExporter  # scrapy.contrib.exporter before Scrapy 1.0


    class XmlWritePipeline(object):
        def __init__(self):
            pass

        @classmethod
        def from_crawler(cls, crawler):
            # connect the pipeline to the spider open/close signals
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            # open the output file and write the XML header
            self.file = open('bbsData.xml', 'wb')
            self.exporter = XmlItemExporter(self.file)
            self.exporter.start_exporting()

        def spider_closed(self, spider):
            # write the XML footer and release the file
            self.exporter.finish_exporting()
            self.file.close()
            # process the crawled data, define and call dataProcess function
            # dataProcess('bbsData.xml', 'text.txt')

        def process_item(self, item, spider):
            # append one item to the XML file and pass it on
            self.exporter.export_item(item)
            return item
6. Edit settings.py to hook the Item Pipeline up to the spider
    # -*- coding: utf-8 -*-
    # Scrapy settings for bbsDmoz project
    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    # http://doc.scrapy.org/en/latest/topics/settings.html
    BOT_NAME = 'bbsDmoz'
    CONCURRENT_REQUESTS = 200
    LOG_LEVEL = 'INFO'
    COOKIES_ENABLED = True
    RETRY_ENABLED = True
    SPIDER_MODULES = ['bbsDmoz.spiders']
    NEWSPIDER_MODULE = 'bbsDmoz.spiders'
    # JOBDIR = 'jobdir'
    ITEM_PIPELINES = {
        'bbsDmoz.pipelines.XmlWritePipeline': 1000,
    }
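7. All done. Time to run it and see.
Go to the project root directory bbsDmoz/ and run the following command to start the spider:
    scrapy crawl bbsSpider
The console rapidly prints the BBS content, as \u1234-style escapes of course, while bbsData.xml keeps growing; opening it in Sublime shows the Chinese text.
Ctrl+C interrupts the run. Oddly, once I interrupted it, the XML file could no longer be read.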
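One plausible explanation, offered as an assumption rather than something verified in this post: if the process is killed before `spider_closed` runs (for example by pressing Ctrl+C a second time, which forces an unclean shutdown), `finish_exporting()` never writes the closing root tag and the file is left as malformed XML. The standard-library check below shows the effect on a tiny made-up export with the same general shape, a root `<items>` element wrapping one `<item>` per post.

```python
import xml.etree.ElementTree as ET

# Hypothetical export: a root <items> element wrapping one <item> per post.
COMPLETE = "<items><item><poster>alice</poster></item></items>"
# The same data, cut off before the closing root tag was written.
TRUNCATED = "<items><item><poster>alice</poster></item>"


def is_well_formed(text):
    """Return True if the text parses as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


complete_ok = is_well_formed(COMPLETE)
truncated_ok = is_well_formed(TRUNCATED)
print(complete_ok, truncated_ok)
```

If that is the cause, pressing Ctrl+C only once and waiting for the spider to shut down should leave a readable file.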
Original post: https://www.cnblogs.com/javajava/p/4798840.html
