• Scrapy in practice: scraping the girl images from 干货集中营 (gank.io), yet again, with Scrapy


     Topics covered:

    1. How to handle the JSON data returned by the API

    2. Saving items to a JSON file

    3. Saving items to a MongoDB database

    4. Downloading the item images (including thumbnails)

    1. Create the project

    scrapy startproject gank

    2. Generate the spider file

    scrapy genspider gank_img gank.io

    Note: the spider name (gank_img) cannot be the same as the project name (gank).
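
    After these two commands the project directory should look roughly like this (the standard Scrapy layout; gank_img.py is the spider file generated above):

    gank/
        scrapy.cfg
        gank/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                gank_img.py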

    3. gank_img.py

    import json
    import scrapy
    from gank.items import GankItem
    
    
    class GankImgSpider(scrapy.Spider):
        name = 'gank_img'
        allowed_domains = ['gank.io']
        # See https://www.cnblogs.com/sanduzxcvbnm/p/10271493.html for why the start URL is written this way
        start_urls = ['https://gank.io/api/data/福利/700/1']
    
        def parse(self, response):
            # the response body is a JSON string: parse it into a dict and pull out the fields we need
            results = json.loads(response.text)['results']
    
            for i in results:
                item = GankItem()
                item['who'] = i['who']
                item['url'] = i['url']
    
                yield item
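
    For reference, here is a rough sketch of what json.loads(response.text) yields for this API; only the fields the spider actually reads are shown, and the real response carries more:

    # illustrative shape of the parsed API response
    data = {
        'results': [
            {
                'who': 'someone',      # uploader name, copied to item['who']
                'url': 'https://...',  # image link, copied to item['url']
                # ...other fields are ignored by this spider...
            },
            # ...up to 700 entries, per the start URL...
        ],
        # ...other top-level keys omitted...
    }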

    4. items.py

    import scrapy
    
    class GankItem(scrapy.Item):
        # define the fields for your item here like:
        who = scrapy.Field()
        url = scrapy.Field()
        # filled in by the images pipeline with the paths of the saved images
        image_paths = scrapy.Field()

    5. pipelines.py

    import json
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.exceptions import DropItem
    import pymongo
    import scrapy
    
    
    # Enable these pipelines in settings.py; items yielded by the spider are then passed here for processing.
    
    # Write each item to a JSON file, one JSON object per line
    class JsonWriterPipeline(object):
    
        def open_spider(self, spider):
            self.file = open('items.json', 'w')
    
        def close_spider(self, spider):
            self.file.close()
    
        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            return item
    
    
    # Save items to a MongoDB database
    class MongoPipeline(object):
        # collection (table) name
        collection_name = 'scrapy_items'
    
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
    
        @classmethod
        # read the MongoDB parameters from settings.py
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')  # database name
            )
    
        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
    
        def close_spider(self, spider):
            self.client.close()
    
        def process_item(self, item, spider):
            self.db[self.collection_name].insert_one(dict(item))
            return item
    
    # Download the item images
    class MyImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # convert https image links to http
            if item['url'][0:5] == 'https':
                item['url'] = item['url'].replace(item['url'][0:5], 'http')
            # for image_url in item['url']:
            #     print('400',image_url)
            yield scrapy.Request(item['url'])
    
        def item_completed(self, results, item, info):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            item['image_paths'] = image_paths
            return item
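
    For context, the results argument that ImagesPipeline hands to item_completed() is a list of (success, info) tuples, one per download request, which is what the list comprehension above unpacks. An illustrative value (placeholders, not real data):

    # paths are relative to IMAGES_STORE; Scrapy names files after the SHA-1 hash of the URL
    # and reports an MD5 checksum of the image contents.
    # A failed download would appear as (False, <Failure>) instead of (True, {...}).
    results = [
        (True, {'url': 'http://...',              # the image URL that was requested
                'path': 'full/<sha1 of url>.jpg', # file path under IMAGES_STORE
                'checksum': '<md5 of image>'}),
    ]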

    6. settings.py

    Only the following settings are changed; everything else keeps its default value.

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,en-US;q=0.8,zh;q=0.5,en;q=0.3',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'
    }
    
    # MongoDB connection parameters
    MONGO_URI = '127.0.0.1'
    MONGO_DATABASE = 'gank'
    
    # Lower numbers run first: the images pipeline (1) fills in item['image_paths']
    # before the JSON (300) and MongoDB (400) pipelines store the item.
    ITEM_PIPELINES = {
        'gank.pipelines.JsonWriterPipeline': 300,
        'gank.pipelines.MyImagesPipeline': 1,
        'gank.pipelines.MongoPipeline': 400,
    }
    # where downloaded images are stored (raw string so the backslashes are not treated as escapes)
    IMAGES_STORE = r'D:\gank\images'
    
    # images downloaded within the last 90 days are not re-downloaded
    IMAGES_EXPIRES = 90
    
    # thumbnail sizes to generate
    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (270, 270),
    }

    7. Run the spider

    scrapy crawl gank_img
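
    As a side note, Scrapy's built-in feed export can also dump the scraped items straight to a file, without the JsonWriterPipeline above; the file name here is just an example:

    scrapy crawl gank_img -o items_feed.json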

    8. Results

    The JSON file

    The MongoDB database

    The saved images and thumbnails

    The full directory holds the original-size images, while the thumbs directory holds the thumbnails, which come in two sizes: big and small.
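
    Concretely, with the settings above the tree under IMAGES_STORE should look roughly like this (each file is named after the SHA-1 hash of its source URL):

    D:\gank\images\
        full\
            <sha1 of url>.jpg
        thumbs\
            big\
                <sha1 of url>.jpg
            small\
                <sha1 of url>.jpg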

     Scrapy prints summary statistics at the end of the crawl.

    561 images were downloaded; 108 images could not be downloaded.

    For why some images cannot be downloaded, see the earlier article: https://www.cnblogs.com/sanduzxcvbnm/p/10271493.html

  • Original article: https://www.cnblogs.com/sanduzxcvbnm/p/10303280.html