• Using MongoDB as the storage backend for a Scrapy novel crawler


    I. Background: while learning MongoDB, I decided to convert the Scrapy novel crawler that previously used MySQL for storage so that it stores its data in MongoDB instead.

    II. Process:

    1. Install MongoDB

    (1) Configure the yum repo

    (python) [root@DL ~]# vi /etc/yum.repos.d/mongodb-org-4.0.repo

    [mongodb-org]
    name=MongoDB Repository
    baseurl=http://mirrors.aliyun.com/mongodb/yum/redhat/7Server/mongodb-org/4.0/x86_64/
    gpgcheck=0
    enabled=1

    (2) Install with yum

    (python) [root@DL ~]# yum -y install mongodb-org

    (3) Start the mongod service

    (python) [root@DL ~]# systemctl start mongod

    (4) Enter the MongoDB shell

    (python) [root@DL ~]# mongo
    MongoDB shell version v4.0.20

    ...

    To enable free monitoring, run the following command: db.enableFreeMonitoring()
    To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
    ---
    >

    (5) Install the pymongo module

    (python) [root@DL ~]# pip install pymongo
    Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
    Collecting pymongo
      Downloading https://pypi.tuna.tsinghua.edu.cn/packages/13/d0/819074b92295149e1c677836d72def88f90814d1efa02199370d8a70f7af/pymongo-3.11.0-cp38-cp38-manylinux2014_x86_64.whl (530kB)
         |████████████████████████████████| 532kB 833kB/s
    Installing collected packages: pymongo
    Successfully installed pymongo-3.11.0
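
    A quick way to confirm that pymongo can actually reach the server before wiring it into Scrapy — a minimal sketch, assuming mongod is still running on the default localhost:27017 that the pipeline below also uses:

    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)      # same connection parameters the pipeline uses
    print(client.server_info()['version'])        # server version, e.g. 4.0.20
    print(client.list_database_names())           # databases visible to this connection
    client.close()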

    2. Modify pipelines.py

    (python) [root@localhost xbiquge_w]# vi xbiquge/pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import os
    import time
    from twisted.enterprise import adbapi    # leftover from the MySQL version, no longer used
    from pymongo import MongoClient

    class XbiqugePipeline(object):
        conn = MongoClient('localhost', 27017)
        db = conn.novels    # connection object for the "novels" database
        #name_novel = ''

        # class initialization
        #def __init__(self):

        # called when the spider starts
        #def open_spider(self, spider):
            #return

        # empty the novel's collection before a new crawl
        def clearcollection(self, name_collection):
            myset = self.db[name_collection]
            myset.delete_many({})    # drop every document; equivalent to the deprecated remove()

        def process_item(self, item, spider):
            #if self.name_novel == '':
            self.name_novel = item['name']
            self.url_firstchapter = item['url_firstchapter']
            self.name_txt = item['name_txt']

            self.db[self.name_novel].insert_one(dict(item))    # insert the item as a document into the collection named after the novel
            return item

        # read the chapter contents back from the database and write them to a txt file
        def content2txt(self, dbname, firsturl, txtname):
            myset = self.db[dbname]
            record_num = myset.count_documents({})    # number of chapters stored
            print(record_num)
            counts = record_num
            url_c = firsturl
            start_time = time.time()    # start time of the export
            f = open(txtname + ".txt", mode='w', encoding='utf-8')    # open <novel name>.txt for writing
            for i in range(counts):
                record_m = myset.find({"url": url_c}, {"content": 1, "by": 1, "_id": 0})
                record_content_c2a0 = ''
                for item_content in record_m:
                    record_content_c2a0 = item_content["content"]    # chapter content
                #record_content = record_content_c2a0.replace(u'\xa0', u'')    # strip the special character \xc2\xa0
                record_content = record_content_c2a0
                #print(record_content)
                f.write('\n')
                f.write(record_content + '\n')
                f.write('\n\n')
                url_ct = myset.find({"url": url_c}, {"next_page": 1, "by": 1, "_id": 0})    # query holding the next chapter's link
                for item_url in url_ct:
                    url_c = item_url["next_page"]    # next chapter's URL, used in the next iteration
            f.close()
            print(time.time() - start_time)
            print(txtname + ".txt has been generated!")
            return

        # when the spider closes, call content2txt to generate the txt file
        def close_spider(self, spider):
            self.content2txt(self.name_novel, self.url_firstchapter, self.name_txt)
            return
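
    The header comment is a reminder worth acting on: Scrapy only calls process_item() and close_spider() if the pipeline is registered in ITEM_PIPELINES. A sketch of the settings.py entry, assuming the default project layout (module name xbiquge); the priority value 300 is an arbitrary choice:

    # xbiquge/settings.py
    ITEM_PIPELINES = {
        'xbiquge.pipelines.XbiqugePipeline': 300,
    }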

    3. Modify the spider

    (python) [root@localhost xbiquge_w]# vi xbiquge/spiders/sancun.py

    # -*- coding: utf-8 -*-
    import scrapy
    from xbiquge.items import XbiqugeItem
    from xbiquge.pipelines import XbiqugePipeline

    class SancunSpider(scrapy.Spider):
        name = 'sancun'
        allowed_domains = ['www.xbiquge.la']
        #start_urls = ['http://www.xbiquge.la/10/10489/']
        url_ori = "http://www.xbiquge.la"
        url_firstchapter = "http://www.xbiquge.la/10/10489/4534454.html"
        name_txt = "./novels/三寸人间"

        pipeline = XbiqugePipeline()
        pipeline.clearcollection(name)    # empty the novel's collection; a MongoDB collection is the counterpart of a MySQL table
        item = XbiqugeItem()
        item['id'] = 0    # new id field to make querying easier
        item['name'] = name
        item['url_firstchapter'] = url_firstchapter
        item['name_txt'] = name_txt

        def start_requests(self):
            start_urls = ['http://www.xbiquge.la/10/10489/']
            for url in start_urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            dl = response.css('#list dl dd')    # chapter link entries on the index page
            for dd in dl:
                self.url_c = self.url_ori + dd.css('a::attr(href)').extract()[0]    # build each chapter's full URL
                #print(self.url_c)
                #yield scrapy.Request(self.url_c, callback=self.parse_c, dont_filter=True)
                yield scrapy.Request(self.url_c, callback=self.parse_c)    # parse_c extracts the chapter URL, previous/next page links and the chapter content
                #print(self.url_c)

        def parse_c(self, response):
            #item = XbiqugeItem()
            #item['name'] = self.name
            #item['url_firstchapter'] = self.url_firstchapter
            #item['name_txt'] = self.name_txt
            self.item['id'] += 1
            self.item['url'] = response.url
            self.item['preview_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[1]
            self.item['next_page'] = self.url_ori + response.css('div .bottem1 a::attr(href)').extract()[3]
            title = response.css('.con_top::text').extract()[4]
            contents = response.css('#content::text').extract()
            text = ''
            for content in contents:
                text = text + content
            #print(text)
            self.item['content'] = title + "\n" + text.replace('\15', '\n')    # combine title and body; '\15' is the octal escape for ^M (carriage return), replaced with a newline
            yield self.item    # hand the item to the pipeline
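
    After running the crawl with "scrapy crawl sancun", the stored chapters can be spot-checked from a Python shell. A minimal sketch, assuming the crawl has finished; 'sancun' is the collection name because the pipeline uses item['name'] as the collection, and count_documents()/find_one() are standard pymongo 3.x calls:

    from pymongo import MongoClient

    db = MongoClient('localhost', 27017).novels
    print(db['sancun'].count_documents({}))                                  # number of chapter documents stored
    print(db['sancun'].find_one({}, {'url': 1, 'next_page': 1, '_id': 0}))   # one document's link fields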

    4. Modify items.py

    (python) [root@DL xbiquge_w]# vi xbiquge/items.py

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class XbiqugeItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        id = scrapy.Field()
        name = scrapy.Field()
        url_firstchapter = scrapy.Field()
        name_txt = scrapy.Field()
        url = scrapy.Field()
        preview_page = scrapy.Field()
        next_page = scrapy.Field()
        content = scrapy.Field()
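
    Because the pipeline stores dict(item) directly, every document in the collection carries exactly these eight fields. Roughly what one stored chapter looks like (a sketch with shortened placeholder values, not actual crawl output):

    {
        "id": 1,
        "name": "sancun",
        "url_firstchapter": "http://www.xbiquge.la/10/10489/4534454.html",
        "name_txt": "./novels/三寸人间",
        "url": "http://www.xbiquge.la/10/10489/4534454.html",
        "preview_page": "http://www.xbiquge.la/10/10489/...",
        "next_page": "http://www.xbiquge.la/10/10489/...",
        "content": "<chapter title>\n<chapter text>"
    }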

    III. Summary

    Using MongoDB as the crawler's storage is noticeably more concise than MySQL: each item is inserted as a document with a single call, and no table schema has to be created or altered when the item's fields change.
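
    The difference shows up mainly in the write path. With MongoDB the collection is created on the first insert and accepts whatever fields the item happens to carry, so adding the new id field needed no database-side schema change. A rough contrast (the MySQL side is only an illustrative sketch of what a MySQL-backed pipeline typically needs; the cursor, table and column names are hypothetical):

    # MongoDB: one call, no schema to prepare (this is the pipeline's actual insert)
    self.db[self.name_novel].insert_one(dict(item))

    # MySQL sketch: the table must already exist with matching columns, and every new
    # item field means an ALTER TABLE plus an updated INSERT statement
    cursor.execute(
        "INSERT INTO sancun (id, url, preview_page, next_page, content) "
        "VALUES (%s, %s, %s, %s, %s)",
        (item['id'], item['url'], item['preview_page'], item['next_page'], item['content'])
    )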
