• Scraping images from qiumeimei.com with the Scrapy framework


    1. Create the project

      scrapy startproject qiumeimei

    2. Generate the spider file qiumei.py

      cd qiumeimei

      scrapy genspider qiumei www.qiumeimei.com
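
      These two commands generate roughly the following project layout (a sketch of Scrapy's default template; auxiliary files such as __init__.py and middlewares.py are omitted):

    qiumeimei/
        scrapy.cfg
        qiumeimei/
            items.py
            pipelines.py
            settings.py
            spiders/
                qiumei.py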

    3. Since only images need to be downloaded, first define the field in items.py

      

    import scrapy
    
    class QiumeimeiItem(scrapy.Item):
        # URL of the image to download
        img_path = scrapy.Field()
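
      For reference, the stock ImagesPipeline looks for image_urls / images fields on an item (the names can be changed with the IMAGES_URLS_FIELD / IMAGES_RESULT_FIELD settings). Because step 5 subclasses the pipeline and overrides get_media_requests, a single custom field is enough here. A sketch of the conventional item layout, in case you preferred to use the built-in pipeline without subclassing (ConventionalImageItem is a hypothetical name, for illustration only):

    import scrapy

    class ConventionalImageItem(scrapy.Item):
        image_urls = scrapy.Field()   # list of image URLs for the built-in ImagesPipeline to fetch
        images = scrapy.Field()       # filled in by the pipeline with the download results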
    

    4. Write the spider file qiumei.py

      

    # -*- coding: utf-8 -*-
    import scrapy
    
    from qiumeimei.items import QiumeimeiItem


    class QiumeiSpider(scrapy.Spider):
        name = 'qiumei'
        # allowed_domains = ['www.qiumeimei.com']
        start_urls = ['http://www.qiumeimei.com/image']

        def parse(self, response):
            # the real image URL sits in the lazy-load attribute, not in src
            img_url = response.css('.main>p>img::attr(data-lazy-src)').extract()
            for url in img_url:
                item = QiumeimeiItem()
                item['img_path'] = url
                yield item

            # follow the "next page" link until there is none
            next_url = response.css('.pagination a.next::attr(href)').extract_first()
            if next_url:
                yield scrapy.Request(url=next_url, callback=self.parse)
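
      Before running the full crawl, the CSS selectors can be checked interactively in Scrapy's shell (a quick sanity check; the site's markup may have changed since this post was written):

    scrapy shell 'http://www.qiumeimei.com/image'
    >>> response.css('.main>p>img::attr(data-lazy-src)').extract()          # list of image URLs
    >>> response.css('.pagination a.next::attr(href)').extract_first()      # next-page link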
    

    5. Pipeline file pipelines.py. All images are stored in a single folder; the storage path is defined in settings.py (see step 6 below):

    import os
    from datetime import datetime

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline
    from qiumeimei.settings import IMAGES_STORE as images_store


    class QiumeimeiPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # schedule the image URL for download
            yield scrapy.Request(url=item['img_path'])

        def item_completed(self, results, item, info):
            # path where Scrapy stored the file, relative to IMAGES_STORE
            old_name_list = [x['path'] for ok, x in results if ok]
            old_name = images_store + old_name_list[0]

            # build a timestamp-based file name, keeping the original extension
            img_path = item['img_path']
            img_type = img_path.split('.')[-1]
            img_name = datetime.now().strftime('%Y%m%d%H%M%S%f')

            # all images end up in a single folder (IMAGES_STORE)
            path = images_store + img_name + '.' + img_type
            print(path + ' downloaded...')
            os.rename(old_name, path)

            return item
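
      As an alternative to renaming the file in item_completed, ImagesPipeline also lets you override file_path to control the stored filename directly; whatever file_path returns is saved relative to IMAGES_STORE, so no os.rename step is needed. A minimal sketch, assuming Scrapy 2.4+ (where file_path receives the item as a keyword argument); TimestampImagePipeline is an illustrative name, not part of the original post:

    import os
    from datetime import datetime

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline


    class TimestampImagePipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # same as above: schedule the single image URL for download
            yield scrapy.Request(url=item['img_path'])

        def file_path(self, request, response=None, info=None, *, item=None):
            # name the stored file with a timestamp, keeping the original extension
            ext = os.path.splitext(request.url)[1] or '.jpg'
            return datetime.now().strftime('%Y%m%d%H%M%S%f') + ext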
    

    6. Settings file settings.py

    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # image storage directory; created automatically if it does not exist
    IMAGES_STORE = './images/'
    
    # enable the custom image pipeline
    ITEM_PIPELINES = {
       'qiumeimei.pipelines.QiumeimeiPipeline': 300,
    }
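
      With the pipeline enabled, run the crawl from the project root (the directory that contains scrapy.cfg):

      scrapy crawl qiumei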
    

      Success: the images are downloaded into the ./images/ folder.

  • Original article: https://www.cnblogs.com/wshr210/p/11359977.html