• Crawling my own CSDN blog with Scrapy


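          I recently started working with Scrapy, the open-source crawling framework. After reading some documentation and other people's technical blog posts, I tried imitating them by crawling my own blog.

          First, create the project: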

    scrapy startproject myblog

          Writing items.py:

          I plan to scrape each article's title, its link, and the number of times it has been read.

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class MyBlogItem(scrapy.Item):
        article_name = scrapy.Field()       # article title
        article_url = scrapy.Field()        # article link
        article_readcount = scrapy.Field()  # number of reads

          Writing pipelines.py:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import json


    class MyBlogPipeline(object):
        def __init__(self):
            # Python 3: open in text mode so Chinese text is stored as UTF-8.
            self.file = open('myblog_data.json', mode='w', encoding='utf-8')

        def process_item(self, item, spider):
            # ensure_ascii=False keeps Chinese characters instead of \uXXXX escapes.
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            self.file.write(line)
            return item

        def close_spider(self, spider):
            # Called by Scrapy when the spider finishes.
            self.file.close()

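          Scrapy returns scraped Chinese text as Unicode strings, and json.dumps escapes non-ASCII characters by default, which raises the question of how to get readable UTF-8 text in the output file. The process_item code above solves this problem reasonably well.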
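
          To make the escaping issue concrete, here is a small sketch that uses only the standard library and assumes Python 3. The sample item is made up for illustration. By default json.dumps writes non-ASCII characters as \uXXXX escapes, while ensure_ascii=False keeps the characters themselves, so a file opened with encoding='utf-8' ends up with readable text.

```python
import json

# Hypothetical item, for illustration only.
item = {"article_name": ["爬虫"], "article_readcount": ["47"]}

escaped = json.dumps(item)                       # non-ASCII written as \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # characters kept as-is

print(escaped)
print(readable)

# Both forms decode back to the same data.
assert json.loads(escaped) == json.loads(readable) == item
```

          Either form is valid JSON; the difference only matters when a person reads the file directly.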

          Writing settings.py:

    # -*- coding: utf-8 -*-

    # Scrapy settings for myblog project
    #
    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    #
    #     http://doc.scrapy.org/en/latest/topics/settings.html
    #

    BOT_NAME = 'myblog'

    SPIDER_MODULES = ['myblog.spiders']
    NEWSPIDER_MODULE = 'myblog.spiders'

    COOKIES_ENABLED = False

    ITEM_PIPELINES = {
        'myblog.pipelines.MyBlogPipeline': 300,
    }

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'myblog (+http://www.yourdomain.com)'

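          Here COOKIES_ENABLED is set to False, so that sites which rely on cookies to identify visitors cannot trace the crawler, which helps avoid being banned.

          ITEM_PIPELINES is a dictionary that sets which pipelines are enabled. Each key is the import path of a pipeline class and each value determines its order: pipelines run from lower to higher values, which by convention lie in the range 0-1000.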
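
          As a rough illustration of the ordering rule, the sketch below sorts the dictionary keys by their values. This only mimics the documented behaviour (lower values run first) and is not Scrapy's own code; the second pipeline name, DedupPipeline, is hypothetical and does not exist in this project.

```python
ITEM_PIPELINES = {
    'myblog.pipelines.MyBlogPipeline': 300,
    'myblog.pipelines.DedupPipeline': 100,  # hypothetical, for illustration only
}

# Lower values run first, so sort the class paths by their value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for position, path in enumerate(run_order, start=1):
    print(position, path, ITEM_PIPELINES[path])
```

          With these values the hypothetical DedupPipeline would see each item before MyBlogPipeline writes it to the file.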

          Writing the spider:

    #!/usr/bin/env python
    # __author__ = 'root'
    # Written for Python 3 and a recent Scrapy release.

    import re

    from scrapy import Request, Spider

    from myblog.items import MyBlogItem


    class MyBlogSpider(Spider):
        name = "myblog"
        download_delay = 1
        allowed_domains = ["blog.csdn.net"]
        start_urls = [
            "http://blog.csdn.net/bnxf00000/article/details/2785136"
        ]

        def parse(self, response):
            item = MyBlogItem()
            article_name = response.xpath('//div[@id="article_details"]/div/h1/span/a/text()').extract()
            article_readcount = response.xpath('//div[@id="article_details"]/div[2]/span[@class="link_view"]/text()').extract()

            # Keep only the leading digits of each read-count string.
            templist = []
            for temp in article_readcount:
                result = re.match(r'(\d+)', temp)
                if result:
                    templist.append(result.group(0))

            item['article_name'] = article_name
            item['article_url'] = response.url
            item['article_readcount'] = templist
            yield item

            # Follow the "next article" link, if there is one.
            urls = response.xpath('//li[@class="next_article"]/a/@href').extract()
            for url in urls:
                # urljoin resolves the relative href against the current page URL.
                yield Request(response.urljoin(url), callback=self.parse)

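          The idea is to parse each page for the link to the "next article" and return a Request object for it, so the crawl moves on to the next article and continues until there are none left.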
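
          The same follow-the-next-link idea can be sketched without Scrapy or any network access. Everything in the block below is an assumption made for illustration: the PAGES dictionary stands in for the real site, the host blog.example.com is a placeholder, and the read-count text format is a guess at what the page contains.

```python
import re
from urllib.parse import urljoin

# Hypothetical in-memory "site": path -> (read-count text, next-article href or None).
PAGES = {
    "/u/article/details/1": ("47人阅读", "/u/article/details/2"),
    "/u/article/details/2": ("120人阅读", "/u/article/details/3"),
    "/u/article/details/3": ("8人阅读", None),
}
BASE = "http://blog.example.com"  # placeholder host


def crawl(start_path):
    """Follow next-article links until a page has none."""
    results = []
    path = start_path
    while path is not None:
        readcount_text, next_href = PAGES[path]
        # Same extraction as in the spider: leading digits only.
        match = re.match(r"(\d+)", readcount_text)
        count = match.group(1) if match else None
        results.append((urljoin(BASE, path), count))
        path = next_href
    return results


records = crawl("/u/article/details/1")
for url, count in records:
    print(url, count)
```

          In the real spider Scrapy does the looping: each yielded Request is scheduled, downloaded, and passed back to parse, so no explicit while loop is needed.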

          Run it:

    scrapy crawl myblog

          Partial results: (figure omitted)
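
          Since the pipeline writes one JSON object per line, the output can be read back line by line. The sketch below assumes Python 3; the sample records and the example.com URLs are invented, and they are written to a temporary file first so that the snippet runs on its own.

```python
import json
import os
import tempfile

# Invented sample records with the same fields as MyBlogItem.
sample = [
    {"article_name": ["First post"],
     "article_url": "http://blog.example.com/u/article/details/1",
     "article_readcount": ["47"]},
    {"article_name": ["第二篇"],
     "article_url": "http://blog.example.com/u/article/details/2",
     "article_readcount": ["120"]},
]

# Write the samples the way the pipeline does: one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "myblog_data.json")
with open(path, "w", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_items(filename):
    """Parse a file containing one JSON object per line."""
    with open(filename, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


items = load_items(path)
total_reads = sum(int(c) for it in items for c in it["article_readcount"])
print(len(items), total_reads)
```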

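          This is my first crawler. I followed other people's code and explanations closely and added the read-count handling myself. Next I plan to read and study the Scrapy source code.

          Reference: http://blog.csdn.net/u012150179/article/details/34486677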

  • Original article: https://www.cnblogs.com/hiccup/p/4475631.html