To use Scrapy you first need to install it.
The Python environment used here is 3.6.
On Windows, activate the Python 3.6 environment:
activate python36
On Mac:
mac@macdeMacBook-Pro:~$ source activate python36
(python36) mac@macdeMacBook-Pro:~$
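If the python36 environment does not exist yet, it can be created with conda first (assuming Anaconda is installed, which the paths later in this post suggest):

conda create -n python36 python=3.6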
Install Scrapy:
(python36) mac@macdeMacBook-Pro:~$ pip install scrapy
(python36) mac@macdeMacBook-Pro:~$ scrapy --version
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

(python36) mac@macdeMacBook-Pro:~$ scrapy startproject images
New Scrapy project 'images', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/mac/images

You can start your first spider with:
    cd images
    scrapy genspider example example.com
(python36) mac@macdeMacBook-Pro:~$ cd images
(python36) mac@macdeMacBook-Pro:~/images$ scrapy genspider -t crawl pexels www.pexels.com
Created spider 'pexels' using template 'crawl' in module:
  images.spiders.pexels
(python36) mac@macdeMacBook-Pro:~/images$
In settings.py, disable robots.txt compliance:
ROBOTSTXT_OBEY = False
Analyze the URL patterns of the target site www.pexels.com:
https://www.pexels.com/photo/man-using-black-camera-3136161/
https://www.pexels.com/video/beach-waves-and-sunset-855633/
https://www.pexels.com/photo/white-vehicle-2569855/
https://www.pexels.com/photo/monochrome-photo-of-city-during-daytime-3074526/
From these, the rule to crawl with is:
rules = (
Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=True),
)
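As a quick sanity check (just a sketch, not part of the project code), the pattern can be tested against the sample URLs above with Python's re module; the photo URLs should match and the video URL should not:

import re

pattern = r'^https://www.pexels.com/photo/.*/$'
urls = [
    'https://www.pexels.com/photo/man-using-black-camera-3136161/',  # photo page: should match
    'https://www.pexels.com/video/beach-waves-and-sunset-855633/',   # video page: should not match
]
for url in urls:
    print(url, bool(re.match(pattern, url)))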
For the images pipeline, two fields need to be defined on the item:
# items.py
import scrapy


class ImagesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
image_urls holds the scraped image URLs and has to be filled in by the spider.
images is used to record and verify the downloaded images, but when I printed the item I didn't seem to see this field (it is only populated by the images pipeline after the downloads finish, so it is still empty at the point the spider prints the item).
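To actually see what ends up in images, one option (a sketch only; LoggingImagesPipeline is a hypothetical name, not part of this post's project) is to subclass the built-in images pipeline and print the download results after they complete:

from scrapy.pipelines.images import ImagesPipeline as ScrapyImagesPipeline


class LoggingImagesPipeline(ScrapyImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; on success the info
        # dict holds the image 'url', its 'path' under IMAGES_STORE and a 'checksum'
        item = super().item_completed(results, item, info)
        print(item['images'])
        return item

Such a class would be registered in ITEM_PIPELINES instead of scrapy.pipelines.images.ImagesPipeline.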
In pexels.py, import the item and create an instance of it:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from images.items import ImagesItem


class PexelsSpider(CrawlSpider):
    name = 'pexels'
    allowed_domains = ['www.pexels.com']
    start_urls = ['http://www.pexels.com/']

    rules = (
        Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = ImagesItem()
        item['image_urls'] = response.xpath('//img[contains(@src,"photos")]/@src').extract()
        print(item['image_urls'])
        return item
In settings.py, enable the images pipeline and set the storage path:
ITEM_PIPELINES = {
    # 'images.pipelines.ImagesPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = '/www/crawl'  # directory the images are downloaded to
# which item field holds the URLs to download
IMAGES_URLS_FIELD = 'image_urls'
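Note that 'image_urls' is already the default value of IMAGES_URLS_FIELD, so that last line only matters if the item uses a differently named field.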
Start the spider:
scrapy crawl pexels --nolog
The images have indeed been downloaded.
However, the downloaded images are not the high-resolution versions; the suffix (the query string) on the image URLs needs to be handled.
In settings.py, enable the project's own pipeline as well and give it a higher priority (a smaller number runs earlier):
ITEM_PIPELINES = {
    'images.pipelines.ImagesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2
}
In the pipeline file, strip the suffix from the URLs:
class ImagesPipeline(object):
    def process_item(self, item, spider):
        # drop everything after '?' so the full-size image URL is requested
        tmp = item['image_urls']
        item['image_urls'] = []
        for i in tmp:
            if '?' in i:
                item['image_urls'].append(i.split('?')[0])
            else:
                item['image_urls'].append(i)
        return item
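An equivalent way to drop the query string (just an alternative sketch, same behavior) is to use urllib.parse instead of splitting on '?':

from urllib.parse import urlsplit, urlunsplit


class ImagesPipeline(object):
    def process_item(self, item, spider):
        # keep scheme, host and path; drop query string and fragment
        item['image_urls'] = [
            urlunsplit(urlsplit(u)._replace(query='', fragment=''))
            for u in item['image_urls']
        ]
        return item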
Now the final downloads are the large images. The images pipeline still compresses the images by default, though, so only the files pipeline downloads the completely untouched originals, and those are very large.
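If the untouched originals are really needed, the files pipeline can be used instead. A minimal sketch following Scrapy's files-pipeline conventions (the OriginalsItem name and the FILES_STORE path below are made up for illustration, not part of this post's project):

# items.py
import scrapy


class OriginalsItem(scrapy.Item):
    file_urls = scrapy.Field()  # URLs to download, read by FilesPipeline
    files = scrapy.Field()      # filled in by FilesPipeline after download

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/www/crawl_originals'  # hypothetical download directory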
If you don't want to download the images and only want to store the image URLs in MySQL, see:
https://www.cnblogs.com/php-linux/p/11792393.html
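For reference, a minimal sketch of such a pipeline using pymysql (the connection parameters, the crawl database and the images(url) table are assumptions, not taken from the linked post); it would be enabled in ITEM_PIPELINES like the pipelines above:

import pymysql


class MysqlUrlPipeline(object):
    def open_spider(self, spider):
        # assumes a local MySQL instance with an existing table images(url VARCHAR(255))
        self.conn = pymysql.connect(host='localhost', user='root', password='',
                                    db='crawl', charset='utf8mb4')

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            for url in item.get('image_urls', []):
                cursor.execute('INSERT INTO images (url) VALUES (%s)', (url,))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()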
Images pipeline: configure a minimum width and height; images smaller than this are skipped:
IMAGES_MIN_HEIGHT=800
IMAGES_MIN_WIDTH=600
IMAGES_EXPIRES = 90  # in days; images already downloaded within this period are not downloaded again
Generate thumbnails:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (600, 600),
}
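With these settings the images pipeline stores the full-size files under <IMAGES_STORE>/full/ and the thumbnails under <IMAGES_STORE>/thumbs/small/ and <IMAGES_STORE>/thumbs/big/, named by the SHA1 hash of the image URL.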