• Scrapy study notes 2: data persistence


    Recap: crawl the xiaohua (campus beauty) site and persist the scraped data.

    Data persistence

      -- Workflow:

        1: parse the data

        2: define an Item class

        3: store the parsed data in an instantiated item object

        4: submit the item (yield item)

        5: a pipeline receives the item and performs the I/O

        6: enable the pipeline in settings

      -- Notes:

        - Storing the same data on several platforms:

          -: one pipeline class in the pipelines file is responsible for persisting the item to one platform

          -: set each pipeline class's priority in the settings file (the lower the number, the earlier it runs)

          -: returning item from process_item hands the item on to the next pipeline class to be executed (see the pipelines code below)

    Crawling a whole site:

       - send the requests manually

        -: define a generic url template

        -: where do the manual requests go? in the parse method

        -: yield scrapy.Request(url, callback) (a sketch follows this list)
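
        A minimal sketch of manual pagination inside the spider class, assuming a hypothetical page-number url template and page bound (the real pattern must be read off the target site):

    # class attributes of the spider; template and bound are hypothetical
    url_template = 'http://www.521609.com/daxuemeinv/list3%d.html'
    page = 2

    def parse(self, response):
        # ... parse and yield the items of the current page ...
        if self.page <= 5:
            new_url = self.url_template % self.page
            self.page += 1
            # send the next request manually; parse is reused as the callback
            yield scrapy.Request(url=new_url, callback=self.parse)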

    POST requests

       - override start_requests(self)

       - send a POST manually: yield scrapy.FormRequest(url, callback, formdata) (a sketch follows this list)

       - COOKIES_ENABLED = False (settings switch for Scrapy's automatic cookie handling)
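
       A minimal sketch of a POST spider; the endpoint url and the form field are hypothetical placeholders:

    import scrapy

    class PostDemoSpider(scrapy.Spider):
        name = 'postdemo'
        # hypothetical POST endpoint
        start_urls = ['http://www.example.com/post']

        # start_requests sends GET requests for start_urls by default;
        # override it to send POST requests instead
        def start_requests(self):
            for url in self.start_urls:
                data = {'kw': 'value'}  # hypothetical form data (values must be strings)
                yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

        def parse(self, response):
            print(response.text)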

    Log level and request parameter passing

       - LOG_LEVEL = 'ERROR'  (only log errors)

       - LOG_FILE = 'path'    (write the log to that file instead of the console)

    When request parameter passing applies:

       - the data to crawl and parse is not all on one page (e.g. a list page plus detail pages)

       - pass a meta dict to the request method; the dict is handed to the callback

        - the callback reads it via response.meta['key'] (a sketch follows this list)
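
       A minimal sketch of meta passing, assuming hypothetical list/detail pages and the XiaohuaItem defined further below:

    # list page callback: fill part of the item, then hand it to the
    # detail page callback through the meta dict
    def parse(self, response):
        for li in response.xpath('//ul/li'):
            item = XiaohuaItem()
            item['title'] = li.xpath('./a/text()').extract_first()
            detail_url = li.xpath('./a/@href').extract_first()
            yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                 meta={'item': item})

    # detail page callback: recover the item from response.meta and finish it
    def parse_detail(self, response):
        item = response.meta['item']
        item['img_url'] = response.xpath('//img/@src').extract_first()
        yield item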

    Steps:

      Step 1: create the project

        scrapy startproject xiaohua
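
        The command generates roughly this layout (the spider module is added by genspider in the next step):

    xiaohua/
        scrapy.cfg            # deploy configuration
        xiaohua/
            __init__.py
            items.py          # Item definitions
            middlewares.py
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/
                __init__.py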

      Step 2: cd into the project directory and generate a spider

        scrapy genspider xiaohuaspider www.xxx.com

      Step 3: write the code

        1: preliminaries:

          comment out the allowed_domains restriction in the spider to avoid unnecessary errors (off-site requests being filtered)

        2: in settings.py set ROBOTSTXT_OBEY = False (do not obey the robots.txt protocol) and replace the default UA header with a real browser User-Agent

        3: write the spider

        4: code:

          spider code

    # -*- coding: utf-8 -*-
    import scrapy
    from xiaohua.items import XiaohuaItem
    
    
    class XiaohuaspiderSpider(scrapy.Spider):
        name = 'xiaohuaspider'
        # allowed_domains is commented out; otherwise requests to other
        # domains (e.g. image downloads) would be filtered
        # allowed_domains = ['www.xxx.com']
        # start url
        start_urls = ['http://www.521609.com/daxuemeinv/']
    
        def parse(self, response):
            li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')
            for li in li_list:
                # parse out the data
                title = li.xpath('./a[2]/text()|./a[2]/b/text()').extract_first()
                img_url = 'http://www.521609.com' + li.xpath('./a[1]/img/@src').extract_first()
                # instantiate an item object
                item = XiaohuaItem()
                item['title'] = title
                item['img_url'] = img_url
                # submit the item to the pipelines
                yield item
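
       With the spider written, it is run from the project root (the optional -o flag also dumps the items to a feed file):

        scrapy crawl xiaohuaspider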

           

       items code (this part defines the field names that receive the data) -----> instantiated in the spider

        

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class XiaohuaItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        img_url = scrapy.Field()
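
      A short usage note: an Item behaves like a dict, but only the declared fields are accepted; assigning an undeclared key raises KeyError:

    item = XiaohuaItem()
    item['title'] = 'demo'   # fine: title is a declared field
    # item['author'] = 'x'   # would raise KeyError: author is not declared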

    pipelines code

      

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    # A pipeline persists the parsed data to one platform (file, database, ...).
    # Override the lifecycle methods to write the items to a local file.
    import pymysql
    
    
    class XiaohuaPipeline(object):
        fp = None
    
        # called exactly once, when the spider starts: open the file here
        # instead of in process_item, so the file is not reopened per item
        def open_spider(self, spider):
            print('spider opened')
            self.fp = open('./xiaohua.txt', 'w', encoding='utf-8')
    
        # does the actual persistence; receives every item the spider
        # submits (called once per item)
        def process_item(self, item, spider):
            # unpack the parsed content
            title = item['title']
            img_url = item['img_url']
            # write the content to the file
            self.fp.write(title + ':' + img_url + '\n')
            # return item so the next pipeline in line receives it
            return item
    
        # called exactly once, when the spider closes: release the file handle
        def close_spider(self, spider):
            print('spider closed')
            self.fp.close()
    
    
    class MySqlPipeline(object):
        conn = None    # database connection
        cursor = None  # cursor used to execute statements
    
        def open_spider(self, spider):
            self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='', db='xiaohua')
            print(self.conn)
            print('start writing to the database')
    
        def process_item(self, item, spider):
            self.cursor = self.conn.cursor()  # get a cursor
            try:
                self.cursor.execute('insert into demo1 values ("%s","%s")' % (item['title'], item['img_url']))
                self.conn.commit()  # commit the transaction
            except Exception as e:
                print(e)
                self.conn.rollback()  # roll back the transaction on error
            return item
    
        def close_spider(self, spider):
            self.cursor.close()  # close the cursor
            self.conn.close()    # close the connection
            print('database write finished')
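
      A design note on the insert above: building SQL with Python string formatting breaks on quotes in the data and is open to SQL injection. pymysql supports parameterized queries; a safer sketch (the two-column demo1 schema itself is assumed):

    # safer variant: pymysql escapes the values itself
    self.cursor.execute(
        'insert into demo1 values (%s, %s)',
        (item['title'], item['img_url'])
    )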

    settings code:

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for xiaohua project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'xiaohua'
    
    SPIDER_MODULES = ['xiaohua.spiders']
    NEWSPIDER_MODULE = 'xiaohua.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'xiaohua.middlewares.XiaohuaSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'xiaohua.middlewares.XiaohuaDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    # Enable the pipelines; items are only persisted when a pipeline is enabled.
    # The lower the number, the earlier the pipeline runs (file first, then MySQL).
    ITEM_PIPELINES = {
       'xiaohua.pipelines.XiaohuaPipeline': 300,
       'xiaohua.pipelines.MySqlPipeline': 301,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

        5: code walkthrough: see the inline comments in the code above

  • Original post: https://www.cnblogs.com/baili-luoyun/p/10929046.html