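Getting Started with Python Scrapy Crawlers

1. Target URL: http://quotes.toscrape.com

The goal is to scrape the quote text and the author from every page and save them to a JSON file.

The code is shown below.

Tools used: Scrapy, XPath selectors, json, and codecs (for encoding).

Spider code: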

import scrapy

# Assumes the default Scrapy project layout, where items.py sits one package above spiders/.
from ..items import QuoteItem


class ScrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['toscrape.com']

    start_urls = [
        'http://quotes.toscrape.com'
    ]

    def parse(self, response):
        # Each quote on the page lives in its own <div class="quote">.
        for quote in response.xpath('//div[@class="quote"]'):
            item = QuoteItem()
            item['text'] = quote.xpath('./span[@class="text"]/text()').get()
            item['author'] = quote.xpath('./span/small[@class="author"]/text()').get()
            yield item

        # Follow the "next" link; the last page has none, so get() returns None there.
        next_page = response.xpath('//nav/ul/li[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, self.parse)
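
The same selection logic can be tried offline without Scrapy. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors and urllib.parse.urljoin in place of response.follow. The SAMPLE markup and the extract function are made up for illustration; the markup only imitates the structure that the XPath expressions above expect, and ElementTree supports only a subset of XPath (no /text() or /@href steps), so the text and attribute are read through .text and .get().

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed fragment that mimics the page structure.
SAMPLE = """
<html><body>
<div class="quote">
  <span class="text">Sample quote one.</span>
  <span>by <small class="author">Author A</small></span>
</div>
<div class="quote">
  <span class="text">Sample quote two.</span>
  <span>by <small class="author">Author B</small></span>
</div>
<nav><ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul></nav>
</body></html>
"""


def extract(page_url, html):
    """Return (list of quote dicts, absolute URL of the next page or None)."""
    root = ET.fromstring(html.strip())
    quotes = []
    for quote in root.findall('.//div[@class="quote"]'):
        quotes.append({
            "text": quote.find('./span[@class="text"]').text,
            "author": quote.find('./span/small[@class="author"]').text,
        })
    # The href is relative, so it is resolved against the current page URL,
    # which is what response.follow does in the spider.
    link = root.find('.//nav/ul/li[@class="next"]/a')
    next_url = urljoin(page_url, link.get("href")) if link is not None else None
    return quotes, next_url


if __name__ == "__main__":
    quotes, next_url = extract("http://quotes.toscrape.com", SAMPLE)
    for q in quotes:
        print(q["author"], "-", q["text"])
    print("next page:", next_url)
```

Real pages are rarely well-formed XML, so this only illustrates the selection logic; in the spider itself, Scrapy's response.xpath handles ordinary HTML.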

Define the item fields in items.py:

import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()    # the quote text
    author = scrapy.Field()  # the author's name

Define the pipeline in pipelines.py, which saves the items to quotes.json:

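import json
import codecs


class QuotePipeline(object):
    def __init__(self):
        # codecs.open encodes everything written to the file as UTF-8.
        self.file = codecs.open('quotes.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line; ensure_ascii=False keeps
        # non-ASCII characters readable instead of escaping them.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()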
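
Scrapy only runs pipelines that are listed in the ITEM_PIPELINES setting, so the pipeline also has to be enabled in settings.py. A minimal sketch follows, assuming the project package is called tutorial; that name is hypothetical, since the project name is not given here, and should be replaced with your own.

```python
# settings.py
# "tutorial" is a hypothetical project package name; use your project's name.
ITEM_PIPELINES = {
    # The integer sets the run order among pipelines (lower runs first, range 0-1000).
    "tutorial.pipelines.QuotePipeline": 300,
}
```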

Then run the spider from the project directory:

scrapy crawl toscrape

The scraped data is written to quotes.json, one JSON object per line.
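
Because the pipeline writes one JSON object per line, the file as a whole is not a single JSON document and cannot be parsed with one json.load call. A minimal sketch of reading it back line by line is shown here; the helper names to_json_line and parse_json_lines are made up for illustration, and no real crawl output is reproduced.

```python
import json
import os


def to_json_line(item):
    """Serialize one item the same way the pipeline does."""
    return json.dumps(item, ensure_ascii=False) + "\n"


def parse_json_lines(lines):
    """Parse an iterable of JSON lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Only reads the file if a crawl has already produced it.
    if os.path.exists("quotes.json"):
        with open("quotes.json", encoding="utf-8") as f:
            for quote in parse_json_lines(f):
                print(quote["author"], quote["text"])
```

With ensure_ascii=False, non-ASCII characters in the quote text are stored as-is instead of as \uXXXX escapes, which is why the file has to be opened as UTF-8 when reading it back.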

Original article: https://www.cnblogs.com/roger-jc/p/12011384.html