- Goal: use Scrapy to scrape the full text of the novel 盗墓笔记 (Daomu Biji).
- Create the project:
- scrapy startproject books
- cd books
- scrapy genspider dmbj www.cread.com/chapter/811400395/69162457.html (genspider requires a domain argument; passing the full chapter URL is what produces the over-specific allowed_domains in the first version below, corrected later)
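For reference, startproject and genspider leave you with the standard Scrapy layout, roughly like this (the exact file list varies slightly across Scrapy versions):

```
books/
├── scrapy.cfg
└── books/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── dmbj.py   # the spider generated by genspider
```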
- Write the parse method:
```python
# -*- coding: utf-8 -*-
import scrapy


class DmbjSpider(scrapy.Spider):
    name = 'dmbj'
    allowed_domains = ['www.cread.com/chapter/811400395/69162457.html']
    start_urls = ['http://www.cread.com/chapter/811400395/69162457.html/']

    def parse(self, response):
        # Chapter title and body, located by inspecting the page source.
        title = response.xpath('//h1/text()').extract_first()
        content = response.xpath('//div[@class="chapter_con"]/text()').extract_first()
        # Write each chapter to its own <title>.txt; utf-8 so the
        # Chinese text is saved correctly (Python 3 open()).
        with open('{}.txt'.format(title), 'w', encoding='utf-8') as f:
            f.write(content)
```
Inspect the page source, extract the title and chapter text with XPath, then write them to a .txt file.
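The XPath expressions are easiest to verify interactively before putting them into the spider; a quick session with scrapy shell (the output naturally depends on the live page):

```
$ scrapy shell 'http://www.cread.com/chapter/811400395/69162457.html'
>>> response.xpath('//h1/text()').extract_first()
>>> response.xpath('//div[@class="chapter_con"]/text()').extract_first()
```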
- Follow-up crawling: with a single page working, the next step is to crawl the whole novel.
  - First, analyze the page:
  - Single-page scraping is done; to crawl the next chapter we need the next chapter's URL.
  - At the bottom of the page there is a "下一章" (next chapter) button; the value of its href attribute is all we need.
  - Note that the href value is a relative URL, so it has to be joined into a complete absolute URL.
  - response.urljoin(your_relative_url) does exactly that (see the short sketch after this list).
  - With the absolute URL of the next page in hand, use scrapy.Request to crawl it.
  - allowed_domains must also be changed to "www.cread.com", otherwise Scrapy's offsite filtering would drop the follow-up requests.
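response.urljoin simply resolves the relative href against response.url (it delegates to the standard library's urljoin). A minimal illustration; the next-chapter path below is made up for the example:

```python
from urllib.parse import urljoin

base = 'http://www.cread.com/chapter/811400395/69162457.html'
relative = '/chapter/811400395/69162458.html'  # hypothetical "下一章" href
print(urljoin(base, relative))
# -> http://www.cread.com/chapter/811400395/69162458.html
```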
```python
# -*- coding: utf-8 -*-
import scrapy


class DmbjSpider(scrapy.Spider):
    name = 'dmbj'
    allowed_domains = ['www.cread.com']
    start_urls = ['http://www.cread.com/chapter/811400395/69162457.html/']

    def parse(self, response):
        title = response.xpath('//h1/text()').extract_first()
        content = response.xpath('//div[@class="chapter_con"]/text()').extract_first()
        with open('{}.txt'.format(title), 'w', encoding='utf-8') as f:
            f.write(content)
        # Follow the "下一章" (next chapter) button. Its href is relative,
        # so build the absolute URL first; the returned Request is
        # scheduled with parse() as its default callback.
        next_url = response.xpath('//a[@id="go_next"]/@href').extract_first()
        url = response.urljoin(next_url)
        return scrapy.Request(url)
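```

One edge case the code above leaves open: on the final chapter there is no go_next link, so next_url is None and urljoin falls back to the current URL (which the dupefilter then discards). A slightly tidier ending, as a drop-in replacement for the last three lines of parse:

```python
        next_url = response.xpath('//a[@id="go_next"]/@href').extract_first()
        if next_url:  # the last chapter has no "下一章" link to follow
            return scrapy.Request(response.urljoin(next_url))
```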
- Finally, run scrapy crawl dmbj to start the crawl; each chapter ends up in its own .txt file in the working directory.