Basic Scrapy usage

First, in a shell, create a new project:

scrapy startproject tutorial

Then enter the project directory and generate a spider:

cd tutorial

scrapy genspider quotes quotes.toscrape.com

Inspecting the original page shows that each quote carries three pieces of data:

text

author

tags

Then edit items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuoteItem(scrapy.Item):
    # One field per piece of data found on the page
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Change quotes.py to:

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in an element with class "quote"
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        # Follow the "next" link; the last page has none, so stop there
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
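
The pagination step in parse turns the relative href of the "next" link into an absolute URL before requesting it. As a rough sketch of that resolution, the standard library's urljoin can stand in for response.urljoin (which, as far as I know, resolves against the response's own URL in much the same way). The helper name next_page_url and the sample hrefs are assumptions made for illustration only.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Return the absolute URL for a 'next' link, or None when there is none."""
    if href is None:
        # No "next" link was found, e.g. on the last page
        return None
    return urljoin(page_url, href)


# Hypothetical hrefs, shaped like the ones the pager selector would return
print(next_page_url('http://quotes.toscrape.com/', '/page/2/'))
print(next_page_url('http://quotes.toscrape.com/page/2/', '/page/3/'))
print(next_page_url('http://quotes.toscrape.com/page/10/', None))
```

Returning None when no link is found makes the stopping condition explicit, so the spider can simply skip the follow-up request on the last page.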

Then, in the shell, cd into the spiders directory (any directory inside the project also works) and run:

scrapy crawl quotes -o quotes.csv

This runs the spider and writes the scraped items to a CSV file.

For more complex operations, such as saving the results to a MongoDB database or keeping only the useful data, you will need pipelines.py.

The Item Pipeline is the project's processing pipeline: once an Item has been generated, it is automatically passed to the pipelines for processing.

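Pipelines are commonly used to:

1. Clean HTML data.

2. Validate the scraped data and check the scraped fields.

3. Check for duplicates and drop repeated items.

4. Save the scraped results to a database.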
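
As a minimal sketch of the validation and de-duplication steps listed above, a pipeline class in pipelines.py could look roughly like this. The class name CleanAndDedupePipeline, the rule that an item without text is invalid, and the stripping of curly quotation marks are all assumptions for illustration, not part of the original project. The fallback DropItem class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch also runs where Scrapy is not installed
    class DropItem(Exception):
        pass


class CleanAndDedupePipeline:
    """Hypothetical pipeline: validate, clean, and de-duplicate quote items."""

    def __init__(self):
        # (text, author) pairs already seen during this crawl
        self.seen = set()

    def process_item(self, item, spider):
        text = item.get('text')
        if not text:
            # Validation: an item without text is discarded
            raise DropItem('missing text')
        # Cleaning: trim whitespace and surrounding curly quotation marks
        text = text.strip().strip('\u201c\u201d')
        key = (text, item.get('author'))
        if key in self.seen:
            # De-duplication: discard items seen before
            raise DropItem('duplicate quote')
        self.seen.add(key)
        item['text'] = text
        return item


# Quick check with a plain dict standing in for a QuoteItem
pipeline = CleanAndDedupePipeline()
sample = {'text': ' \u201cHello\u201d ', 'author': 'Someone', 'tags': []}
print(pipeline.process_item(sample, spider=None))
```

To have Scrapy call a pipeline like this, it would be registered in settings.py under ITEM_PIPELINES, for example with the key 'tutorial.pipelines.CleanAndDedupePipeline' and a priority such as 300.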

Original article: https://www.cnblogs.com/zj0724/p/9124756.html