• 5. Scrapy request passing


    1. Passing data between requests

    Deep crawling: the data to be scraped is not all stored on a single page.
    Create the project:
        scrapy startproject moviePro
    Create the spider file:
        cd moviePro
        scrapy genspider movie www.xxx.com
    Run the project:
        scrapy crawl movie
    

    Configuration file settings.py

    BOT_NAME = 'moviePro'
    
    SPIDER_MODULES = ['moviePro.spiders']
    NEWSPIDER_MODULE = 'moviePro.spiders'
    
    # Request header (User-Agent)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Only log errors
    LOG_LEVEL = 'ERROR'
    # Enable the item pipeline
    ITEM_PIPELINES = {
       'moviePro.pipelines.MovieproPipeline': 300,
    }
    

    items.py: define two fields

    import scrapy
    
    class MovieproItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        actor = scrapy.Field()
    

    spiders/movie.py

    # 4567 movie site: https://www.4567kan.com/index.php/vod/show/id/1.html
    
    import scrapy
    from moviePro.items import MovieproItem
    
    class MovieSpider(scrapy.Spider):
        name = 'movie'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1.html']
        
        def parse(self, response):
            li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
            for li in li_list:
                title = li.xpath('./div/a/@title').extract_first()
                # Build the detail page URL
                detail_url = 'https://www.4567kan.com'+li.xpath('./div/a/@href').extract_first()
                item = MovieproItem()
                item['title'] = title
    
                # Send a GET request to the detail page URL
                # The meta dict is passed on to the callback function
                yield scrapy.Request(detail_url,callback=self.parse_detail,meta={'item':item})
        
        def parse_detail(self,response):
            actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
            # Retrieve the item passed along via response.meta
            item = response.meta['item']
            item['actor'] = actor
    
            yield item  # submit the item to the pipeline
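
    For reference, Scrapy 1.7+ can also hand the item to the callback through cb_kwargs instead of meta. A minimal sketch of that variant, reusing the names from the spider above (the rest of MovieSpider is unchanged):

                # in parse(): pass the item as a keyword argument instead of via meta
                yield scrapy.Request(detail_url, callback=self.parse_detail, cb_kwargs={'item': item})

        # parse_detail() then receives the item directly as a parameter
        def parse_detail(self, response, item):
            actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
            item['actor'] = actor
            yield item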
    

    pipelines.py

    class MovieproPipeline:
        def process_item(self, item, spider):
            print(item)  # print to check that the item arrived
            return item
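
    The pipeline above only prints each item. To also persist the results, here is a minimal sketch that writes every movie to a local file; the filename movie.txt is an assumption, not part of the original notes:

    class MovieproPipeline:
        fp = None

        def open_spider(self, spider):
            # Runs once when the spider starts: open the output file
            self.fp = open('movie.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # Write one line per movie
            self.fp.write(f"{item['title']}:{item['actor']}\n")
            return item

        def close_spider(self, spider):
            # Runs once when the spider closes: release the file handle
            self.fp.close()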
    

    2. Middleware

    - Middleware
        - Types of middleware:
            - Spider middleware
            - Downloader middleware (**)
        - Purpose: intercept requests and responses
        - Intercepting requests:
            - Modify request header information
            - Apply proxy settings
    

    middlewares.py

    from scrapy import signals
    
    # useful for handling different item types with a single interface
    from itemadapter import is_item, ItemAdapter
    
    class MiddleproDownloaderMiddleware:
        # Intercepts every request
        # request is the intercepted request
        # spider is the instance of the spider class defined in the spider file
        def process_request(self, request, spider):
            print('i am process_request')
            # Modify request header information
            # request.headers['User-Agent'] = 'xxx'
            return None

        # Intercepts responses
        # response: the intercepted response object
        # request: the request object that produced the intercepted response
        def process_response(self, request, response, spider):
            print('i am process_response')
            return response

        # Intercepts requests that raised an exception -- fix the request object
        def process_exception(self, request, exception, spider):
            print('i am process_exception')

            # Proxy handling -- useful when the site has banned the local IP
            request.meta['proxy'] = 'https://ip:port'
            return request  # resend the corrected request
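
    Note that a downloader middleware only takes effect after it is enabled in settings.py. A minimal sketch, assuming the project is named middlePro (suggested by the generated class name MiddleproDownloaderMiddleware); the User-Agent pool is illustrative:

    # settings.py -- enable the downloader middleware (the number is its priority)
    DOWNLOADER_MIDDLEWARES = {
       'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
    }

    # middlewares.py -- a common use of process_request: rotate the User-Agent
    import random

    # Illustrative pool; fill in any real UA strings
    USER_AGENT_POOL = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    ]

    class MiddleproDownloaderMiddleware:
        def process_request(self, request, spider):
            # Pick a random User-Agent for every outgoing request
            request.headers['User-Agent'] = random.choice(USER_AGENT_POOL)
            return None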
    

    3. Full-site data crawling

    - The CrawlSpider parent class
        1. Purpose: full-site data crawling
        2. Usage:
            - Create the project:
                scrapy startproject crawlPro (project name)
            - Create the spider file:
                cd crawlPro
                scrapy genspider -t crawl sun (spider file name) www.xx.com
            - Run the project:
                scrapy crawl sun
        3. Link extractor
            - Extracts links (URLs) according to the given rule: LinkExtractor(allow=regex)
        4. Rule parser
            - Parses the page behind each link extracted by the link extractor, using the specified callback function
    

    spiders/sun.py

    # Movie site: https://www.4567kan.com/index.php/vod/show/id/1.html
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class SunSpider(CrawlSpider):
        name = 'sun'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1.html']
    
        # Link extractor: extracts links (URLs) according to the given rule (allow=regex)
        link = LinkExtractor(allow=r'page/\d+\.html')
    
        rules = (
            # Instantiate a Rule (rule parser) object
            # Purpose: parse the page behind each link extracted by link, using the specified callback function
            # follow=False only parses the pages extracted from the start page; follow=True keeps following links on every new page, so all pages get crawled
            Rule(link, callback='parse_item', follow=False),
        )
        # This method is called once for each link extracted by the link extractor
        def parse_item(self, response):
            print(response)
    
    

    4. Distributed crawling

    Concept:

    1. Concept: build a distributed cluster in which several machines jointly crawl the same web resource.
    2. Why native Scrapy cannot do distributed crawling:
        - The scheduler cannot be shared
        - The pipeline cannot be shared
    3. Solution:
        - Component: scrapy-redis
        - Role of the component: provides a shareable pipeline and scheduler
        - Install: pip install scrapy-redis
    

    Implementation steps

    0. Create the project
        - Create the project:
            scrapy startproject fbsPro (project name)
        - Create the spider file:
            cd fbsPro
            scrapy genspider -t crawl fbs (spider file name) www.xx.com
    1. Modify the spider file
        - Import: from scrapy_redis.spiders import RedisCrawlSpider
        - Change the spider class's parent class to RedisCrawlSpider
        - Delete or comment out start_urls
        - Add a new attribute: redis_key = 'xxx' -- the name of the shared scheduler queue
        - Write the request and data-parsing logic
    2. Modify settings.py
        - Specify the scrapy_redis pipeline:
            ITEM_PIPELINES = {
                'scrapy_redis.pipelines.RedisPipeline': 400
            }
        - Specify the scheduler:
            # Use the scrapy-redis dedup filter
            DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
            # Use the scheduler provided by scrapy-redis
            SCHEDULER = "scrapy_redis.scheduler.Scheduler"
            # Allow pausing and resuming
            SCHEDULER_PERSIST = True
        - Specify the Redis database address
            REDIS_HOST = 'IP address of the Redis server'
            REDIS_PORT = 6379
            REDIS_ENCODING = 'utf-8'
            REDIS_PARAMS = {'password':'123456'}
    3. Configure Redis
        - Locate the Redis-x64 folder -- e.g. C:\Users\Administrator\Downloads\Redis-x64-5.0.10
        - Config file: redis.windows.conf or redis.windows-service.conf
        - Comment out line 56 to disable the default binding (#bind 127.0.0.1)
        - Disable protected mode on line 75 (protected-mode no)
        - Start the Redis server: redis-server.exe
        - Start the Redis client: redis-cli.exe

    4. Run the project:
        - cd into the spiders directory
        - scrapy runspider fbs.py (spider file name)
    5. Push a starting URL into the scheduler queue (the queue lives in Redis):
        Movie site: https://www.4567kan.com/index.php/vod/show/id/1.html
        - In the redis-cli client: lpush fbsQueue <site URL>, e.g. lpush fbsQueue https://www.4567kan.com/index.php/vod/show/id/1.html
    

    Code

    settings.py

    BOT_NAME = 'fbsPro'
    
    SPIDER_MODULES = ['fbsPro.spiders']
    NEWSPIDER_MODULE = 'fbsPro.spiders'
    
    # Request header (User-Agent)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    LOG_LEVEL = 'ERROR'  # only log errors
    
    # Shared pipeline provided by scrapy-redis
    ITEM_PIPELINES = {
       'scrapy_redis.pipelines.RedisPipeline': 400,
    }
    
    # Shared scheduler provided by scrapy-redis
    # Use the scrapy-redis dedup filter
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scheduler provided by scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Allow pausing and resuming
    SCHEDULER_PERSIST = True
    
    # Redis database
    REDIS_HOST = '192.168.19.47'
    REDIS_PORT = 6379
    

    items.py

    import scrapy
    
    class FbsproItem(scrapy.Item):
        # define the fields for your item here like:
        # add fields
        title = scrapy.Field()
        actor = scrapy.Field()
    

    spiders/fbs.py

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_redis.spiders import RedisCrawlSpider
    from fbsPro.items import FbsproItem
    
    class FbsSpider(RedisCrawlSpider):
        name = 'fbs'
        # allowed_domains = ['www.xxx.com']
        # start_urls = ['http://www.xxx.com/']
    
        # Name of the shared scheduler queue
        redis_key = 'fbsQueue'
    
        # Extract pagination links
        link = LinkExtractor(allow=r'page/\d+\.html')
        rules = (
            Rule(link,callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
           li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
           for li in li_list:
               title = li.xpath('./div/a/@title').extract_first()
               # detail page URL
               detail_url = 'https://www.4567kan.com'+li.xpath('./div/a/@href').extract_first()
               item = FbsproItem()
               item['title'] = title
               yield scrapy.Request(detail_url,callback=self.parse_detail,meta={'item':item})
        
        def parse_detail(self,response):
            item = response.meta['item']
            actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
            item['actor'] = actor
            yield item
    

    5. Incremental crawling

    Overview

    Incremental crawling:
        1. Purpose: monitor a website for data updates.
        2. Record table: stores identifiers (data fingerprints) of data that has already been crawled, so it is not crawled again.
            - What serves as the record table: a Redis set

    0. Create the project
        - Create the project:
            scrapy startproject zlsPro (project name)
        - Create the spider file:
            cd zlsPro
            scrapy genspider -t crawl zls (spider file name) www.xx.com
            
    

    Code

    settings.py

    BOT_NAME = 'zlsPro'
    
    SPIDER_MODULES = ['zlsPro.spiders']
    NEWSPIDER_MODULE = 'zlsPro.spiders'
    
    # Request header (User-Agent)
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    # Only log errors
    LOG_LEVEL = 'ERROR'
    # Enable the item pipeline
    ITEM_PIPELINES = {
       'zlsPro.pipelines.ZlsproPipeline': 300,
    }
    

    items.py

    import scrapy
    
    class ZlsproItem(scrapy.Item):
        # add fields
        title = scrapy.Field()
        actor = scrapy.Field()
    
    

    spiders/zls.py

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from redis import Redis
    from zlsPro.items import ZlsproItem
    
    class ZlsSpider(CrawlSpider):
        name = 'zls'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.4567kan.com/index.php/vod/show/id/1/page/1.html']
    
        # Create a Redis connection
        conn = Redis(host='127.0.0.1',port=6379)
    
        # Extract pagination links
        link = LinkExtractor(allow=r'page/\d+\.html')
    
        rules = (
            # Rule parser
            Rule(link, callback='parse_item', follow=False),
        )
    
        def parse_item(self, response):
            li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
            for li in li_list:
                title = li.xpath('./div/a/@title').extract_first()
                detail_url = 'https://www.4567kan.com' + li.xpath('./div/a/@href').extract_first()
                item = ZlsproItem()
                item['title'] = title
    
                # Add the record to the record table (sadd returns 1 if it is new, 0 if it already exists)
                res = self.conn.sadd('movie_urls',detail_url)
                if res == 1:
                    print('New data found, crawling......')
                    # Send the request to fetch the detail data
                    yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
                else:
                    print('No new data to crawl!')
    
        def parse_detail(self, response):
            item = response.meta['item']
            actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
            item['actor'] = actor
            yield item
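
    The settings above enable zlsPro.pipelines.ZlsproPipeline, which the original notes do not show. A minimal sketch of what it could look like: it builds the data fingerprint from the item's content rather than from the detail URL used in the spider, so a record reached through a different URL is still recognized as already crawled. The Redis key names movie_fingerprints and movieData are assumptions for illustration:

    # zlsPro/pipelines.py (sketch)
    import hashlib
    from redis import Redis

    class ZlsproPipeline:
        def open_spider(self, spider):
            # Reuse one Redis connection for the whole crawl
            self.conn = Redis(host='127.0.0.1', port=6379)

        def process_item(self, item, spider):
            # Content-based data fingerprint: a hash of the fields we care about
            raw = f"{item['title']}:{item['actor']}"
            fingerprint = hashlib.md5(raw.encode('utf-8')).hexdigest()

            # sadd returns 1 only when the fingerprint is new to the set
            if self.conn.sadd('movie_fingerprints', fingerprint):
                # New record: persist it (here pushed into a Redis list)
                self.conn.lpush('movieData', str(dict(item)))
            return item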
    
    