• Scrapy image pipeline learning notes


    To use Scrapy, first install it.

    These notes use a Python 3.6 (conda) environment.

    On Windows, activate the environment:

    activate python36 

    On macOS:

    mac@macdeMacBook-Pro:~$     source activate python36
    (python36) mac@macdeMacBook-Pro:~$  

    Install Scrapy:

    (python36) mac@macdeMacBook-Pro:~$     pip install scrapy
    (python36) mac@macdeMacBook-Pro:~$     scrapy --version
    Scrapy 1.8.0 - no active project
    
    Usage:
      scrapy <command> [options] [args]
    
    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
    
      [ more ]      More commands available when run from project directory
    
    Use "scrapy <command> -h" to see more info about a command
    (python36) mac@macdeMacBook-Pro:~$     scrapy startproject images
    New Scrapy project 'images', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /Users/mac/images
    
    You can start your first spider with:
        cd images
        scrapy genspider example example.com
    
    (python36) mac@macdeMacBook-Pro:~$     cd images
    (python36) mac@macdeMacBook-Pro:~/images$     scrapy genspider -t crawl pexels www.pexels.com
    Created spider 'pexels' using template 'crawl' in module:
      images.spiders.pexels
    (python36) mac@macdeMacBook-Pro:~/images$  

    In settings.py, disable robots.txt compliance:

    ROBOTSTXT_OBEY = False
    

    Analyze the URL patterns of the target site www.pexels.com:

    https://www.pexels.com/photo/man-using-black-camera-3136161/

    https://www.pexels.com/video/beach-waves-and-sunset-855633/

    https://www.pexels.com/photo/white-vehicle-2569855/

    https://www.pexels.com/photo/monochrome-photo-of-city-during-daytime-3074526/

    This gives the crawl rule:

    rules = (
        Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=True),
    )
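As a quick sanity check, the photo URLs above match this pattern while the video URL does not (the unescaped dots in the pattern match any character, which is harmless here):

```python
import re

# Same allow pattern as in the Rule above
pattern = re.compile(r'^https://www.pexels.com/photo/.*/$')

assert pattern.match('https://www.pexels.com/photo/white-vehicle-2569855/')
assert pattern.match('https://www.pexels.com/photo/man-using-black-camera-3136161/')
assert not pattern.match('https://www.pexels.com/video/beach-waves-and-sunset-855633/')
print('pattern ok')
```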


    The image pipeline needs two fields defined on the item:
    class ImagesItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        image_urls = scrapy.Field()
        images = scrapy.Field()

    image_urls holds the scraped image URLs; the spider must fill it in.

    images is filled in by ImagesPipeline with the download results, which can be used to verify each download. Printing the item inside the spider shows it empty, because the pipeline only populates it after the item has left the spider.
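For reference, once ImagesPipeline has run, the images field contains one dict per downloaded file with url, path and checksum keys (the values below are invented for illustration, not real hashes):

```python
# Illustrative shape of one entry in item['images'] after ImagesPipeline runs.
# The path and checksum values here are made up for the example.
entry = {
    'url': 'https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg',
    'path': 'full/0123456789abcdef0123456789abcdef01234567.jpg',
    'checksum': '0123456789abcdef0123456789abcdef',
}
assert {'url', 'path', 'checksum'} <= set(entry)
print('fields ok')
```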

    In pexels.py, import the item class and build an item object:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    from images.items import ImagesItem
    
    class PexelsSpider(CrawlSpider):
        name = 'pexels'
        allowed_domains = ['www.pexels.com']
        start_urls = ['https://www.pexels.com/']
    
        rules = (
            Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=False),
        )
    
        def parse_item(self, response):
            item = ImagesItem()
            item['image_urls'] = response.xpath('//img[contains(@src,"photos")]/@src').extract()
            print(item['image_urls'])
            return item
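The XPath above keeps only <img> tags whose src contains "photos" (the pattern Pexels uses for its photo CDN URLs). A rough stdlib re-implementation of the same filter, for illustration only; the real spider uses Scrapy selectors:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect img/@src values containing 'photos', mimicking the spider's XPath."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src', '')
            if 'photos' in src:
                self.urls.append(src)

p = ImgSrcCollector()
p.feed('<img src="https://images.pexels.com/photos/1/a.jpeg"><img src="/assets/logo.png">')
print(p.urls)  # only the photos/ URL survives
```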
    

     Enable the image pipeline in settings.py and set the storage path:

    ITEM_PIPELINES = {
       #'images.pipelines.ImagesPipeline': 300,
        'scrapy.pipelines.images.ImagesPipeline': 1
    }
    
    
    
    IMAGES_STORE = '/www/crawl'
    # Which item field holds the URLs to download ('image_urls' is already the default)
    IMAGES_URLS_FIELD = 'image_urls'

    Run the spider:

    scrapy crawl pexels --nolog

    The images are downloaded.

    However, the downloaded images are not full resolution; the query string appended to each URL has to be stripped first.

    In settings.py, enable the custom pipeline and give it a higher priority (a lower number runs earlier), so it rewrites the URLs before the built-in pipeline downloads them:

    ITEM_PIPELINES = {
        'images.pipelines.ImagesPipeline': 1,
        'scrapy.pipelines.images.ImagesPipeline': 2
    }
    

    In the pipeline file, strip the query string from each URL:

    class ImagesPipeline(object):
        def process_item(self, item, spider):
            # Keep only the part before '?' so the CDN serves the original-size image
            item['image_urls'] = [url.split('?')[0] for url in item['image_urls']]
            return item
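The same stripping logic as a standalone helper (the query string in the sample URL is a guess at Pexels' resize parameters):

```python
def strip_query(url):
    """Drop everything after '?' so the original-size image is fetched."""
    return url.split('?', 1)[0]

full = strip_query('https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg?auto=compress&cs=tinysrgb&h=350')
print(full)  # https://images.pexels.com/photos/2569855/pexels-photo-2569855.jpeg
```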

    Now the full-size images are downloaded. Note that the image pipeline still re-encodes every image as JPEG, which compresses it; to keep the untouched originals (which are much larger), use the files pipeline instead.

    If you prefer not to download the images and instead just store the image URLs in MySQL, see:

    https://www.cnblogs.com/php-linux/p/11792393.html

    Image pipeline: filter out images below a minimum width and height:

    IMAGES_MIN_HEIGHT = 800

    IMAGES_MIN_WIDTH = 600

    IMAGES_EXPIRES = 90  # days; images fetched within this window are not re-downloaded

    Generate thumbnails:

    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (600, 600),
    }
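With thumbnails enabled, each image is stored in three sizes under IMAGES_STORE; the file name is a hash of the image URL (the layout below assumes Scrapy's default naming scheme):

```
/www/crawl/full/<hash>.jpg          # original download
/www/crawl/thumbs/small/<hash>.jpg  # 50x50
/www/crawl/thumbs/big/<hash>.jpg    # 600x600
```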

  • Original article: https://www.cnblogs.com/brady-wang/p/11795582.html