This tutorial crawls http://quotes.toscrape.com/, a deliberately simple site built for scraping practice.
1. Create a project with Scrapy
```shell
scrapy startproject myscrapy1
scrapy genspider quotes quotes.toscrape.com
```
2. Edit items.py and quotes.py
items.py defines the container for the scraped data; an Item is used much like a dict.
```python
import scrapy


class Myscrapy1Item(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```
Only these three fields need to be extracted from the page source.
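The "used like a dict" behavior can be illustrated with a minimal stdlib-only sketch (not Scrapy's actual implementation): an Item accepts only the keys declared as fields on the class.

```python
# Minimal sketch (stdlib only) of how a scrapy.Item behaves: it works like
# a dict, but assigning to a key that was not declared as a field raises
# KeyError. SketchItem is a hypothetical stand-in, not Scrapy code.

class SketchItem(dict):
    fields = ('text', 'author', 'tags')  # mirrors the Field() declarations

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)


item = SketchItem()
item['text'] = 'A quote'
item['author'] = 'Somebody'
# item['price'] = 1  # would raise KeyError: not a declared field
```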
quotes.py
Its parse method handles the responses for the start_urls: it extracts the data and generates follow-up requests for the next pages.
```python
# -*- coding: utf-8 -*-
import scrapy

from myscrapy1.items import Myscrapy1Item


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = Myscrapy1Item()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        # Follow the pagination link to crawl the remaining pages
        next_page = response.css('.pager .next a::attr("href")').extract_first()
        if next_page:
            url = response.urljoin(next_page)  # build an absolute URL
            # Follow-up requests are constructed with scrapy.Request
            yield scrapy.Request(url=url, callback=self.parse)
```
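`response.urljoin` resolves the relative pagination href (such as `/page/2/`) against the current page's URL. Its behavior matches the standard library's `urljoin`, which can be sketched directly:

```python
from urllib.parse import urljoin

# response.urljoin(href) behaves like urljoin(response.url, href):
# a relative or root-relative href is resolved against the page URL.
first = urljoin('http://quotes.toscrape.com/', '/page/2/')
print(first)  # http://quotes.toscrape.com/page/2/

# Resolving from a deeper page still yields the right absolute URL
second = urljoin('http://quotes.toscrape.com/page/2/', '/page/3/')
print(second)  # http://quotes.toscrape.com/page/3/
```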
3. Save the data to MongoDB, and truncate the quote text to 50 characters, replacing the rest with an ellipsis. This requires changes to settings.py and pipelines.py.
settings.py
```python
# -*- coding: utf-8 -*-
BOT_NAME = 'myscrapy1'

SPIDER_MODULES = ['myscrapy1.spiders']
NEWSPIDER_MODULE = 'myscrapy1.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Enable the two custom pipeline classes; lower numbers run first
ITEM_PIPELINES = {
    'myscrapy1.pipelines.TextPipeline': 300,
    'myscrapy1.pipelines.MongoPipeline': 400,
}

MONGO_URI = 'localhost'
MONGO_DB = 'myscrapy1'
```
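The ITEM_PIPELINES priorities determine execution order: pipelines run in ascending numeric order, and each one receives the item returned by the previous. A stdlib-only sketch (with hypothetical pipeline functions standing in for the real classes):

```python
# Sketch of ITEM_PIPELINES ordering: lower priority numbers run first,
# and each pipeline receives the item the previous one returned.
# text_pipeline and mongo_pipeline are hypothetical stand-ins.

calls = []

def text_pipeline(item):   # priority 300
    calls.append('TextPipeline')
    return item

def mongo_pipeline(item):  # priority 400
    calls.append('MongoPipeline')
    return item

priorities = {text_pipeline: 300, mongo_pipeline: 400}

item = {'text': 'example'}
for pipeline in sorted(priorities, key=priorities.get):
    item = pipeline(item)

print(calls)  # TextPipeline runs before MongoPipeline
```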
pipelines.py
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


class TextPipeline(object):
    """Truncate text longer than 50 characters, appending an ellipsis."""

    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # Drop the item when the text field is missing
            raise DropItem('Missing Text')


class MongoPipeline(object):
    """Save items to MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    # Pull the connection settings from settings.py
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # The item class name ('Myscrapy1Item') is used as the collection name
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```
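The truncation rule in TextPipeline can be exercised on its own, without Scrapy. A small sketch (the `truncate` helper is hypothetical; the pipeline applies the same logic inside `process_item`):

```python
# Standalone sketch of TextPipeline's truncation rule: text longer than
# the limit is cut at 50 characters, trailing whitespace stripped, and
# '...' appended; shorter text passes through unchanged.

def truncate(text, limit=50):
    if len(text) > limit:
        return text[:limit].rstrip() + '...'
    return text


short_text = 'Brevity is the soul of wit.'
long_text = ('The quick brown fox jumps over the lazy dog '
             'near the river bank.')

print(truncate(short_text))  # unchanged
print(truncate(long_text))   # cut to 50 chars plus '...'
```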
4. A few commonly used commands
1. Create a new project:
   `scrapy startproject test1`
2. Generate a spider file:
   `scrapy genspider baidu www.baidu.com`
   `scrapy genspider -l` lists the available spider templates. For example, to generate a spider from the `crawl` template:
   `scrapy genspider -t crawl zhihu www.zhihu.com`
3. Run a spider:
   `scrapy crawl zhihu`
4. Check the project for errors:
   `scrapy check`
5. List the names of all spiders in the project:
   `scrapy list`
6. Crawl and save the scraped items to a file:
   `scrapy crawl zhihu -o zhihu.json`
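The `-o` feed export writes the yielded items as a JSON array, one object per item. A sketch of reading such an export back (the sample data below is assumed, shaped like this tutorial's item fields):

```python
import json

# Parse a feed export like the one produced by
# `scrapy crawl quotes -o quotes.json`. The sample mimics the shape of
# this tutorial's items; the quote content here is made up.
sample = '''[
  {"text": "Quote one...", "author": "Albert Einstein",
   "tags": ["change", "world"]}
]'''

items = json.loads(sample)
print(items[0]['author'])
print(items[0]['tags'])
```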