• [Scrapy spider] Crawling 360 beauty-channel images into MySQL (a MongoDB version will be added once I've learned it)


    1. Create the Scrapy project

    At the DOS prompt, enter:

    scrapy startproject images360
    
    cd images360
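
    The command generates a project skeleton roughly like this (the standard Scrapy layout):

    images360/
        scrapy.cfg              # deploy configuration
        images360/
            __init__.py
            items.py            # item definitions (step 2)
            middlewares.py
            pipelines.py        # item pipelines (step 5)
            settings.py         # project settings (step 6)
            spiders/            # spider modules (step 3)
                __init__.py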
    

    2. Write items.py (think of it as the data template: the fields to crawl are defined here)

    import scrapy
    
    
    class Images360Item(scrapy.Item):
        # define the fields for your item here like:
        # image ID
        image_id = scrapy.Field()
        # image URL
        url = scrapy.Field()
        # title
        title = scrapy.Field()
        # thumbnail
        thumb = scrapy.Field()
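
    Items behave like dicts, which is what the pipeline in step 5 relies on when it calls dict(item). A quick sanity check (a minimal sketch; run it in any Python session inside the project):

    from images360.items import Images360Item

    item = Images360Item()
    item['title'] = 'demo'    # declared fields can be assigned like dict keys
    print(dict(item))         # {'title': 'demo'}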
    


    3. Create the spider file

    At the DOS prompt, enter:

    scrapy genspider myspider images.so.com
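
    This drops a stub into images360/spiders/myspider.py, roughly like the following (the standard genspider template; the placeholder start_urls gets replaced in step 4):

    # -*- coding: utf-8 -*-
    import scrapy


    class MyspiderSpider(scrapy.Spider):
        name = 'myspider'
        allowed_domains = ['images.so.com']
        start_urls = ['http://images.so.com/']

        def parse(self, response):
            pass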

    4. Write myspider.py (receive the responses and extract the data)

    # -*- coding: utf-8 -*-
    from urllib.parse import urlencode
    import scrapy
    from images360.items import Images360Item
    import json
    
    
    class MyspiderSpider(scrapy.Spider):
        name = 'myspider'
        allowed_domains = ['images.so.com']
        # build the 50 paginated API URLs up front; sn is the result offset
        # and the endpoint serves 30 results per page
        urls = []
        data = {'ch': 'beauty', 'listtype': 'new'}
        base_url = 'https://images.so.com/zj?'    # must match allowed_domains
        for page in range(1, 51):
            data['sn'] = page * 30
            params = urlencode(data)
            urls.append(base_url + params)
        start_urls = urls

        # query parameters seen in the browser's developer tools:
        # ch: beauty
        # sn: 120
        # listtype: new
        # temp: 1

        def parse(self, response):
            # the endpoint returns JSON; each entry in 'list' is one image
            result = json.loads(response.text)
            for each in result.get('list', []):
                item = Images360Item()
                item['image_id'] = each.get('imageid')
                item['url'] = each.get('qhimg_url')
                item['title'] = each.get('group_title')
                item['thumb'] = each.get('qhimg_thumb_url')
                yield item
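
    For reference, the JSON that the /zj endpoint sends back looks roughly like this. The field names are taken from the parse() method above; the values are invented placeholders:

    {
        "list": [
            {
                "imageid": "0123456789abcdef",
                "group_title": "sample title",
                "qhimg_url": "https://<full-size-image-url>",
                "qhimg_thumb_url": "https://<thumbnail-url>"
            }
        ]
    }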
    


    5. Write pipelines.py (store the data)

    import pymysql
    
    
    class Images360Pipeline(object):
        def __init__(self):
            self.connect = pymysql.connect(
                host='localhost',
                user='root',
                password='',
                database='quotes',
                charset='utf8',
            )
            self.cursor = self.connect.cursor()
        
        def process_item(self, item, spider):
            item = dict(item)
            sql = 'insert into images360(image_id,url,title,thumb) values(%s,%s,%s,%s)'
            self.cursor.execute(sql, (item['image_id'], item['url'], item['title'],item['thumb']))
            self.connect.commit()
            return item
        
        def close_spider(self, spider):
            self.cursor.close()
            self.connect.close()
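
    The insert assumes an images360 table already exists in the quotes database. A minimal one-off script to create it (the column types and lengths are my own assumptions; adjust as needed):

    import pymysql

    connect = pymysql.connect(host='localhost', user='root', password='',
                              database='quotes', charset='utf8')
    cursor = connect.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS images360 (
            image_id VARCHAR(64),      -- 360's image id
            url      VARCHAR(512),     -- full-size image URL
            title    VARCHAR(512),     -- group title
            thumb    VARCHAR(512)      -- thumbnail URL
        ) DEFAULT CHARSET = utf8
    ''')
    connect.commit()
    cursor.close()
    connect.close()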
    


    6. Write settings.py (headers, pipelines, etc.)

    Robots protocol (turned off so crawling is not blocked by the site's robots.txt):

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False  

    headers

    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      # 'Accept-Language': 'en',
    }
    

    pipelines

    ITEM_PIPELINES = {
        # module path must match this project (images360, not quote);
        # 300 is the pipeline's priority (0-1000, lower runs first)
        'images360.pipelines.Images360Pipeline': 300,
    }
    


    7. Run the spider

    At the DOS prompt, enter:

    scrapy crawl myspider

    Run result: (screenshot omitted)

  • Original article: https://www.cnblogs.com/shuimohei/p/10492905.html