• Saving scraped Douban movie data with MongoDB


    1. Create the crawler project douban

      scrapy startproject douban
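
      For reference, scrapy startproject generates roughly the following layout (it may vary slightly across Scrapy versions):

      douban/
          scrapy.cfg
          douban/
              __init__.py
              items.py
              pipelines.py
              settings.py
              spiders/
                  __init__.py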
      
      
    2. Set up items.py to define the fields to be stored

      # -*- coding: utf-8 -*-
      
      import scrapy
      
      
      class DoubanItem(scrapy.Item):

          # movie title
          title = scrapy.Field()
          # description line (director, cast, year, genre)
          content = scrapy.Field()
          # rating score
          rating_num = scrapy.Field()
          # one-line quote
          quote = scrapy.Field()
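
      A scrapy.Item behaves like a dict with a fixed set of allowed keys, which is why the pipeline in step 4 can simply call dict(item). A quick interactive check (the value is illustrative):

      >>> from douban.items import DoubanItem
      >>> item = DoubanItem()
      >>> item['title'] = ['The Shawshank Redemption']
      >>> dict(item)
      {'title': ['The Shawshank Redemption']}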
      
      
    3. Set up the spider file doubanmovies.py

      # -*- coding: utf-8 -*-
      import scrapy
      from douban.items import DoubanItem
      
      
      class DoubanmoviesSpider(scrapy.Spider):
          name = 'doubanmovies'
          allowed_domains = ['movie.douban.com']
          offset = 0
          url = 'https://movie.douban.com/top250?start='
          start_urls = [url + str(offset)]
      
          def parse(self, response):
              info = response.xpath("//div[@class='info']")
              for each in info:
                  # create a fresh item per movie so already-yielded items are not mutated
                  item = DoubanItem()
                  item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()
                  item['content'] = each.xpath(".//div[@class='bd']/p[1]/text()").extract()
                  item['rating_num'] = each.xpath(".//span[@class='rating_num']/text()").extract()
                  item['quote'] = each.xpath(".//span[@class='inq']/text()").extract()
                  yield item

              # the Top 250 list spans pages start=0, 25, ..., 225
              self.offset += 25
              if self.offset < 250:
                  yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
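
      The XPath expressions above can be sanity-checked in scrapy shell before running the full crawl. Douban tends to reject requests carrying Scrapy's default User-Agent, so one may need to be passed explicitly (the UA string here is only an example):

      scrapy shell -s USER_AGENT='Mozilla/5.0' 'https://movie.douban.com/top250?start=0'
      >>> response.xpath("//div[@class='info']//span[@class='title'][1]/text()").extract_first()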
      
      
    4. Set up the pipeline file to save the scraped data into MongoDB. This is the key part.

      # -*- coding: utf-8 -*-
      
      from scrapy.utils.project import get_project_settings
      import pymongo

      settings = get_project_settings()


      class DoubanPipeline(object):
          def __init__(self):
              self.host = settings['MONGODB_HOST']
              self.port = settings['MONGODB_PORT']

          def process_item(self, item, spider):
              # create a MongoDB client; this example reads the host and port
              # from settings.py, but they could also be written inline
              self.client = pymongo.MongoClient(self.host, self.port)

              # create (or reuse) the database douban
              self.mydb = self.client['douban']

              # create (or reuse) the collection doubanmovies inside douban
              self.mysheet = self.mydb['doubanmovies']

              # convert the dict-like Item into a plain Python dict
              content = dict(item)

              # insert the record into the collection
              self.mysheet.insert_one(content)

              return item
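
      Creating a new MongoClient for every item works but is wasteful. A minimal alternative sketch, using Scrapy's standard from_crawler/open_spider/close_spider hooks with the same setting names, opens one connection for the whole crawl:

      import pymongo


      class DoubanPipeline(object):
          @classmethod
          def from_crawler(cls, crawler):
              # read the connection parameters from the crawler's settings
              return cls(
                  host=crawler.settings.get('MONGODB_HOST', '127.0.0.1'),
                  port=crawler.settings.getint('MONGODB_PORT', 27017),
              )

          def __init__(self, host, port):
              self.host = host
              self.port = port

          def open_spider(self, spider):
              # one client and one collection handle for the whole crawl
              self.client = pymongo.MongoClient(self.host, self.port)
              self.sheet = self.client['douban']['doubanmovies']

          def close_spider(self, spider):
              self.client.close()

          def process_item(self, item, spider):
              self.sheet.insert_one(dict(item))
              return item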
      
      
    5. Set up settings.py

      # -*- coding: utf-8 -*-

      BOT_NAME = 'douban'

      SPIDER_MODULES = ['douban.spiders']
      NEWSPIDER_MODULE = 'douban.spiders'

      USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'

      # Configure a delay for requests to the same website (default: 0)
      # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      #CONCURRENT_REQUESTS_PER_DOMAIN = 16
      #CONCURRENT_REQUESTS_PER_IP = 16

      # Disable cookies (enabled by default)
      COOKIES_ENABLED = False

      # Configure item pipelines
      # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
      ITEM_PIPELINES = {
         'douban.pipelines.DoubanPipeline': 300,
      }

      # MongoDB connection settings
      MONGODB_HOST = '127.0.0.1'
      MONGODB_PORT = 27017
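
      Before running the crawl, it can help to confirm that MongoDB is actually reachable at the host and port configured above; a quick check with pymongo (the printed version string is illustrative):

      import pymongo

      client = pymongo.MongoClient('127.0.0.1', 27017)
      # server_info() raises ServerSelectionTimeoutError if the server is unreachable
      print(client.server_info()['version'])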
      
      
    6. Test from the terminal

      scrapy crawl doubanmovies
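
      If the crawl succeeds, the stored documents can be inspected in the mongo shell (the database and collection names come from the pipeline above; the count of 250 is what a full run should yield, actual output may differ):

      $ mongo
      > use douban
      > db.doubanmovies.find().count()
      250
      > db.doubanmovies.findOne()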
      
      

