网络爬虫（4）-Scrapy增量爬虫

1.增量爬虫概念

通过爬虫程序监测某网站数据更新的情况，以便可以爬取到该网站更新出的新数据。

2.增量爬虫方法

在发送请求之前判断这个URL是不是之前爬取过
在解析内容后判断这部分内容是不是之前爬取过

写入存储介质时判断内容是不是已经在介质中存在

- 分析：不难发现，其实增量爬取的核心是去重，至于去重的操作在哪个步骤起作用，只能说各有利弊。在我看来，前两种思路需要根据实际情况取一个（也可能都用）。第一种思路适合不断有新页面出现的网站，比如说小说的新章节，每天的最新新闻等等；第二种思路则适合页面内容会更新的网站。第三个思路是相当于是最后的一道防线。这样做可以最大程度上达到去重的目的。
去重方法
- 将爬取过程中产生的url进行存储，存储在redis的set中。当下次进行数据爬取时，首先对即将要发起的请求对应的url在存储的url的set中做判断，如果存在则不进行请求，否则才进行请求。
- 对爬取到的网页内容进行唯一标识的制定，然后将该唯一表示存储至redis的set中。当下次爬取到网页数据的时候，在进行持久化存储之前，首先可以先判断该数据的唯一标识在redis的set中是否存在，在决定是否进行持久化存储。

3.爬虫案例

目标：爬取4567tv网站中所有的电影详情数据

1）爬虫文件

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from redis import Redis
from incrementPro.items import IncrementproItem
class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index7-d+.html'), callback='parse_item', follow=True),
    )
    #创建redis链接对象
    conn = Redis(host='127.0.0.1',port=6379)
    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            #获取详情页的url
            detail_url = 'http://www.4567tv.tv'+li.xpath('./a/@href').extract_first()
            #将详情页的url存入redis的set中
            ex = self.conn.sadd('urls',detail_url)
            if ex == 1:
                print('该url没有被爬取过，可以进行数据的爬取')
                yield scrapy.Request(url=detail_url,callback=self.parst_detail)
            else:
                print('数据还没有更新，暂无新数据可爬取！')

    #解析详情页中的电影名称和类型，进行持久化存储
    def parst_detail(self,response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract()
        item['kind'] = ''.join(item['kind'])
        yield item

2）管道文件

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from redis import Redis
class IncrementproPipeline(object):
    conn = None
    def open_spider(self,spider):
        self.conn = Redis(host='127.0.0.1',port=6379)
    def process_item(self, item, spider):
        dic = {
            'name':item['name'],
            'kind':item['kind']
        }
        print(dic)
        self.conn.lpush('movieData',dic)
        return item

相关阅读:
vs2013中，自定义mvc 添加视图脚手架
 照猫画虎owin oauth for qq and sina
Microsoft.AspNet.Identity 的简单使用
 autofac使用Common Serivce Locator跟随wcf，mvc，web api的实例控制
 在iis搭建nuget server时遇到405 method not allow
mongodb int型id 自增
 使用T4Scaffolding 创建自己的代码生成
 vs2013 sn key
spring mvc与mybatis收集到博客
 spring的依赖注入DI(IOC)
原文地址：https://www.cnblogs.com/Iceredtea/p/11285574.html