Introduction
This post introduces CrawlSpider. Compared with the plain Spider, CrawlSpider is better suited to crawling pages in bulk.
CrawlSpider
- CrawlSpider is suited to crawling a site's pages in bulk. Compared with the Spider class, CrawlSpider relies on rules to extract links; defining a set of rules gives it a mechanism for following links and traversing the site.
- CrawlSpider's strength is that it automatically crawls every link that matches the rules and keeps following them deeper.
Whole-site crawling
Workflow
- Create a new project
- cd into the project
- Create the spider file: scrapy genspider -t crawl spiderName <domain> (CrawlSpider inherits from scrapy.Spider)
- Link extractor: LinkExtractor
  - Extracts links from the page according to the specified rule.
  - The extraction rule is given by the allow('regular expression') argument of its constructor.
- Rule parser: Rule
  - Sends requests for the links collected by the link extractor and parses the responses with the specified callback; a minimal skeleton combining LinkExtractor and Rule is sketched below.
  - Rule(link, callback='parse_item', follow=True)
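For orientation, here is a minimal sketch of roughly what the crawl template generated by scrapy genspider -t crawl produces; the spider name, domain, allow pattern, and XPath below are placeholders rather than part of this project:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):        # placeholder name
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # Each Rule pairs a LinkExtractor with a callback; follow=True keeps
    # applying the extractors to pages reached through the extracted links.
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['name'] = response.xpath('//div[@id="name"]/text()').get()
        return item
```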
Ways to send requests in Scrapy (a sketch of the manual case follows this list):
- start_urls
- yield scrapy.Request()
- a link extractor (via a Rule)
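A quick sketch of the second option, manually yielding a request from a callback; the spider name and URLs here are purely illustrative:

```python
import scrapy


class ManualRequestSpider(scrapy.Spider):
    name = 'manual_example'                 # illustrative name
    start_urls = ['http://example.com/']    # one request is sent per start URL

    def parse(self, response):
        # queue a follow-up request explicitly with scrapy.Request
        yield scrapy.Request(url='http://example.com/detail/1',
                             callback=self.parse_detail)

    def parse_detail(self, response):
        # handle the follow-up response here
        pass
```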
Example
Use the CrawlSpider template to batch-crawl the subject, status, and detail content of the complaint posts on the Sunshine Hotline government-inquiry platform (阳光热线问政平台).
URL: http://wz.sun0769.com/html/top/reply.shtml
① Define the spider
scrapy genspider -t crawl sun
creates the spider. In that spider file:
- Define a LinkExtractor that picks up the page-number URLs on each listing page.
- Define a Rule, passing it the LinkExtractor and a callback. If follow is True, the rule keeps being applied to the page source of every page reached through the links the LinkExtractor extracts.
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunPro.items import Item1, Item2


class SunSpider(CrawlSpider):
    name = 'sun'
    start_urls = ['http://wz.sun0769.com/html/top/reply.shtml']

    # Extractors for pagination links, the first-page link, and detail-page links
    link = LinkExtractor(allow=r'page=\d+')
    link_1 = LinkExtractor(allow=r'page=$')
    link_detail = LinkExtractor(allow=r'question/\d+/\d+\.shtml')

    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link_1, callback='parse_item'),
        Rule(link_detail, callback='parse_detail'),
    )

    def parse_item(self, response):
        # Listing page: one row per complaint post
        tr_list = response.xpath('/html/body/div[8]/table[2]//tr')
        for tr in tr_list:
            title = tr.xpath('./td[3]/a[1]/text()').extract_first()
            status = tr.xpath('./td[4]/span/text()').extract_first()
            num = tr.xpath('./td[1]/text()').extract_first()
            item = Item2()
            item['title'] = title
            item['status'] = status
            item['num'] = num
            yield item

    def parse_detail(self, response):
        # Detail page: full complaint content plus its post number
        content = response.xpath('/html/body/div[9]/table[2]//tr[1]/td//text()').extract()
        if content:
            content = ''.join(content)
        num = response.xpath('/html/body/div[9]/table[1]//tr/td[2]/span[2]').extract_first().split(':')[-1].replace(r'</span>', '')
        item = Item1()
        item['content'] = content
        item['num'] = num
        yield item
```
② Define the Item classes
- The two pagination Rules collect all the page-number URLs; the rows parsed from those pages map to Item2.
- The other Rule collects all the detail-page URLs; the content parsed from them maps to Item1.
```python
import scrapy


class Item1(scrapy.Item):
    content = scrapy.Field()
    num = scrapy.Field()


class Item2(scrapy.Item):
    title = scrapy.Field()
    status = scrapy.Field()
    num = scrapy.Field()
```
③ Define the pipeline
Define a pipeline for persistent storage:
- Open the database connection in open_spider.
- Close the connection in close_spider.
- Perform the database inserts in process_item.
One issue encountered: the listing row and the detail page of the same post arrive as separate items, so the pipeline looks the post up by num and either inserts or updates it:
```python
import pymysql


class SunproPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        # Open the MySQL connection once when the spider starts
        self.conn = pymysql.Connection(host='127.0.0.1', user='root', password="2296",
                                       database='spider', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        if item.__class__.__name__ == 'Item1':
            # Detail-page item: store the content, keyed by num
            content = item['content']
            num = item['num']
            query_sql = 'select * from sun where num = %s'
            self.cursor.execute(query_sql, (num,))
            ret = self.cursor.fetchall()
            if ret:
                update_sql = 'update sun set content = %s where num = %s'
                try:
                    self.cursor.execute(update_sql, (content, num))
                    self.conn.commit()
                except Exception as e:
                    print(e)
                    self.conn.rollback()
            else:
                insert_sql = 'insert into sun(num, content) values (%s, %s)'
                try:
                    self.cursor.execute(insert_sql, (num, content))
                    self.conn.commit()
                except Exception as e:
                    print(e)
                    self.conn.rollback()
        else:
            # Listing-page item: store the title and status, keyed by num
            title = item['title']
            status = item['status']
            num = item['num']
            query_sql = 'select * from sun where num = %s'
            self.cursor.execute(query_sql, (num,))
            ret = self.cursor.fetchall()
            if ret:
                update_sql = 'update sun set title = %s, status = %s where num = %s'
                try:
                    self.cursor.execute(update_sql, (title, status, num))
                    self.conn.commit()
                except Exception as e:
                    print(e)
                    self.conn.rollback()
            else:
                insert_sql = 'insert into sun(num, title, status) values (%s, %s, %s)'
                try:
                    self.cursor.execute(insert_sql, (num, title, status))
                    self.conn.commit()
                except Exception as e:
                    print(e)
                    self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```
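For the pipeline to run, it also has to be enabled in the project's settings.py. A typical entry might look like the following; the sunPro module path is assumed from the import in the spider above, and the priority 300 is an arbitrary choice:

```python
# settings.py -- enable the pipeline (module path assumed from the project name sunPro)
ITEM_PIPELINES = {
    'sunPro.pipelines.SunproPipeline': 300,
}
```

The crawl can then be started from the project directory with scrapy crawl sun.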