To use Scrapy you first need to install it.
The Python environment used here is 3.6.
On Windows, activate the Python 3.6 environment:
activate python36
On Mac:
mac@macdeMacBook-Pro:~$ source activate python36
(python36) mac@macdeMacBook-Pro:~$
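If the python36 environment does not exist yet, it can be created with conda first (assuming Anaconda is installed, which the paths later in this post suggest):

conda create -n python36 python=3.6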
Install Scrapy:
(python36) mac@macdeMacBook-Pro:~$ pip install scrapy
(python36) mac@macdeMacBook-Pro:~$ scrapy --version
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

(python36) mac@macdeMacBook-Pro:~$ scrapy startproject images
New Scrapy project 'images', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/mac/images

You can start your first spider with:
    cd images
    scrapy genspider example example.com
(python36) mac@macdeMacBook-Pro:~$ cd images
(python36) mac@macdeMacBook-Pro:~/images$ scrapy genspider -t crawl pexels www.pexels.com
Created spider 'pexels' using template 'crawl' in module:
  images.spiders.pexels
(python36) mac@macdeMacBook-Pro:~/images$
In settings.py, disable robots.txt compliance:
ROBOTSTXT_OBEY = False
Analyze the URL patterns of the target site www.pexels.com:
https://www.pexels.com/photo/man-using-black-camera-3136161/
https://www.pexels.com/video/beach-waves-and-sunset-855633/
https://www.pexels.com/photo/white-vehicle-2569855/
https://www.pexels.com/photo/monochrome-photo-of-city-during-daytime-3074526/
From these, the rule to crawl with is:
rules = (
Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=True),
)
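As a quick sanity check (just a sketch, not part of the project code), the pattern can be tested against the sample URLs above with Python's re module; the photo URLs should match and the video URL should not:

import re

pattern = r'^https://www.pexels.com/photo/.*/$'
urls = [
    'https://www.pexels.com/photo/man-using-black-camera-3136161/',  # photo page: should match
    'https://www.pexels.com/video/beach-waves-and-sunset-855633/',   # video page: should not match
]
for url in urls:
    print(url, bool(re.match(pattern, url)))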
For the images pipeline, two fields need to be defined on the item:
# items.py
import scrapy


class ImagesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
image_urls holds the scraped image URLs and has to be filled in by the spider.
images is used to record and verify the downloaded images, but when I printed the item I didn't seem to see this field (it is only populated by the images pipeline after the downloads finish, so it is still empty at the point the spider prints the item).
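To actually see what ends up in images, one option (a sketch only; LoggingImagesPipeline is a hypothetical name, not part of this post's project) is to subclass the built-in images pipeline and print the download results after they complete:

from scrapy.pipelines.images import ImagesPipeline as ScrapyImagesPipeline


class LoggingImagesPipeline(ScrapyImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; on success the info
        # dict holds the image 'url', its 'path' under IMAGES_STORE and a 'checksum'
        item = super().item_completed(results, item, info)
        print(item['images'])
        return item

Such a class would be registered in ITEM_PIPELINES instead of scrapy.pipelines.images.ImagesPipeline.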
In pexels.py, import the item and create an instance of it:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from images.items import ImagesItem


class PexelsSpider(CrawlSpider):
    name = 'pexels'
    allowed_domains = ['www.pexels.com']
    start_urls = ['http://www.pexels.com/']

    rules = (
        Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = ImagesItem()
        item['image_urls'] = response.xpath('//img[contains(@src,"photos")]/@src').extract()
        print(item['image_urls'])
        return item
In settings.py, enable the images pipeline and set the storage path:
ITEM_PIPELINES = {
    # 'images.pipelines.ImagesPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = '/www/crawl'  # directory the images are downloaded to
# which item field holds the URLs to download
IMAGES_URLS_FIELD = 'image_urls'
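Note that 'image_urls' is already the default value of IMAGES_URLS_FIELD, so that last line only matters if the item uses a differently named field.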
Start the spider:
scrapy crawl pexels --nolog
The images have indeed been downloaded.
However, the downloaded images are not the high-resolution versions; the suffix (the query string) on the image URLs needs to be handled.
In settings.py, enable the project's own pipeline as well and give it a higher priority (a smaller number runs earlier):
ITEM_PIPELINES = {
    'images.pipelines.ImagesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2
}
In the pipeline file, strip the suffix from the URLs:
class ImagesPipeline(object):
    def process_item(self, item, spider):
        # drop everything after '?' so the full-size image URL is requested
        tmp = item['image_urls']
        item['image_urls'] = []
        for i in tmp:
            if '?' in i:
                item['image_urls'].append(i.split('?')[0])
            else:
                item['image_urls'].append(i)
        return item
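An equivalent way to drop the query string (just an alternative sketch, same behavior) is to use urllib.parse instead of splitting on '?':

from urllib.parse import urlsplit, urlunsplit


class ImagesPipeline(object):
    def process_item(self, item, spider):
        # keep scheme, host and path; drop query string and fragment
        item['image_urls'] = [
            urlunsplit(urlsplit(u)._replace(query='', fragment=''))
            for u in item['image_urls']
        ]
        return item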
Now the final downloads are the large images. The images pipeline still compresses the images by default, though, so only the files pipeline downloads the completely untouched originals, and those are very large.
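If the untouched originals are really needed, the files pipeline can be used instead. A minimal sketch following Scrapy's files-pipeline conventions (the OriginalsItem name and the FILES_STORE path below are made up for illustration, not part of this post's project):

# items.py
import scrapy


class OriginalsItem(scrapy.Item):
    file_urls = scrapy.Field()  # URLs to download, read by FilesPipeline
    files = scrapy.Field()      # filled in by FilesPipeline after download

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/www/crawl_originals'  # hypothetical download directory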
If you don't want to download the images and only want to store the image URLs in MySQL, see:
https://www.cnblogs.com/php-linux/p/11792393.html
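For reference, a minimal sketch of such a pipeline using pymysql (the connection parameters, the crawl database and the images(url) table are assumptions, not taken from the linked post); it would be enabled in ITEM_PIPELINES like the pipelines above:

import pymysql


class MysqlUrlPipeline(object):
    def open_spider(self, spider):
        # assumes a local MySQL instance with an existing table images(url VARCHAR(255))
        self.conn = pymysql.connect(host='localhost', user='root', password='',
                                    db='crawl', charset='utf8mb4')

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            for url in item.get('image_urls', []):
                cursor.execute('INSERT INTO images (url) VALUES (%s)', (url,))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()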
Images pipeline: configure a minimum width and height; images smaller than this are skipped:
IMAGES_MIN_HEIGHT=800
IMAGES_MIN_WIDTH=600
IMAGES_EXPIRES = 90  # in days; images already downloaded within this period are not downloaded again
Generate thumbnails:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (600, 600),
}
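With these settings the images pipeline stores the full-size files under <IMAGES_STORE>/full/ and the thumbnails under <IMAGES_STORE>/thumbs/small/ and <IMAGES_STORE>/thumbs/big/, named by the SHA1 hash of the image URL.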