• Scraping Part 6: manual request sending, the five core components, request meta passing, a first look at middleware, full-site Huya crawl


    - Persistent storage via pipelines:
      - Parse the data (spider class)
      - Pack the parsed data into an item object (spider class)
      - Submit the item to the pipeline: yield item (spider class)
      - Receive the item in the pipeline class's process_item and persist it in any form you like (pipeline class)
      - Enable the pipeline in the settings file

    - Details:
    - How do you back up the crawled data to a second destination?
      - Use one pipeline class per storage target
    - With multiple pipeline classes registered, do they all receive the items submitted by the spider?
      - Only the highest-priority pipeline receives items directly; the remaining pipeline classes receive each item from the higher-priority class, which must return it from process_item (see the sketch below)
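
    A minimal sketch of that backup setup, assuming a huyaAll project with items that carry title/author/hot fields; the class names FilePipeline and BackupPipeline are placeholders, not from the original project:

    # pipelines.py (sketch)
    class FilePipeline:
        # highest priority: receives every item from the spider first
        def open_spider(self, spider):
            self.fp = open('./huya.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.fp.write('{}:{}:{}\n'.format(item['title'], item['author'], item['hot']))
            return item  # returning the item hands it to the next pipeline class

        def close_spider(self, spider):
            self.fp.close()


    class BackupPipeline:
        # lower priority: gets the item that FilePipeline returned
        def process_item(self, item, spider):
            # write the same item to a second destination here (database, another file, ...)
            return item

    # settings.py (sketch): lower number means higher priority
    ITEM_PIPELINES = {
        'huyaAll.pipelines.FilePipeline': 300,
        'huyaAll.pipelines.BackupPipeline': 301,
    }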

    - Full-site crawling based on the Spider parent class
    - Full-site crawl: scrape the page data for every page number
    - Manual request sending (GET):
    yield scrapy.Request(url, callback)
    - Summary of yield in a spider:
    - Submitting an item to the pipeline: yield item
    - Sending a request manually: yield scrapy.Request(url, callback)
    - Sending a POST request manually (a sketch of both forms follows below):
    yield scrapy.FormRequest(url, formdata, callback)  # formdata is a dict of request parameters
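
    A minimal sketch of both manual request forms inside a spider callback; the URLs and form fields are placeholders:

    # inside some scrapy.Spider subclass (sketch)
    def parse(self, response):
        # manual GET: schedule another page and route it to a callback
        yield scrapy.Request(url='https://example.com/list?page=2', callback=self.parse)

        # manual POST: formdata is a dict of request parameters
        yield scrapy.FormRequest(url='https://example.com/api/search',
                                 formdata={'kw': 'scrapy'},
                                 callback=self.parse)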

    Full-site Huya crawl (multiple pages)

    Define a URL template for the paginated list pages (the url attribute in the spider below):

    import scrapy
    from huyaAll.items import HuyaallItem
    
    
    class HuyaSpider(scrapy.Spider):
        name = 'huya'
        # allowed_domains = ['www.ccc.com']
        start_urls = ['https://www.huya.com/g/xingxiu']
        url = 'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId=1663&tagAll=0&page=%s'
    
        def parse(self, response):
            li_list = response.xpath('//*[@id="js-live-list"]/li')
            all_data = []
            for li in li_list:
                title = li.xpath('./a[2]/text()').extract_first()  # title (the 【】 part still needs stripping)
                author = li.xpath('./span/span[1]/i/text()').extract_first()
                hot = li.xpath('./span/span[2]/i[2]/text()').extract_first()
                #     print(title,author,hot)
                #     dic = {
                #         'title':title,
                #         'author':author,
                #         'hot':hot
                #     }
                #     all_data.append(dic)
                # return all_data
                item = HuyaallItem()
                item['title'] = title
                item['author'] = author
                item['hot'] = hot
                yield item  # submit to the pipeline
                print(1)
            for page in range(2, 5):
                new_url = format(self.url % page)
                print(2)
                yield scrapy.Request(url=new_url, callback=self.parse_other)
    
        def parse_other(self, response):
            print(3)
            print(response.text)
            # parsing for these pages not implemented yet

     Ugh, annoying: parse_other above never actually parses anything.

    So I parsed the JSON response myself (this fragment also needs import json at the top of the spider file):

            for page in range(2, 3):
                new_url = format(self.url % page)

                yield scrapy.Request(url=new_url, callback=self.parse_other, meta={'item': item})

        def parse_other(self, response):
            item = response.meta['item']  # note: reusing one item instance; a fresh HuyaallItem() per entry would be safer
            s = ''
            htmlBody = response.xpath('//text()').extract()
            for aa in htmlBody:
                s = s + str(aa)
            res = json.loads(s)           # the paging API returns JSON, not HTML
            res_data = res['data']['datas']
            for i in res_data:
                item['title'] = i['roomName']
                item['author'] = i['nick']
                item['hot'] = i['totalCount']
                yield item
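
    The spider imports HuyaallItem from huyaAll/items.py, which is not shown above; inferred from the fields it assigns, that file would look roughly like this:

    # huyaAll/items.py (sketch)
    import scrapy

    class HuyaallItem(scrapy.Item):
        title = scrapy.Field()
        author = scrapy.Field()
        hot = scrapy.Field()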

    - The five core Scrapy components
    Engine (Scrapy)
      Handles the data flow of the entire system and triggers events (the core of the framework)


    Scheduler
      Accepts requests handed over by the engine, pushes them into a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is fetched next and de-duplicates URLs along the way.

    Downloader
      Downloads page content and returns it to the spiders (the Scrapy downloader is built on twisted, an efficient asynchronous model)

    Spiders
      The spiders do the main extraction work: they pull the information you need, i.e. items, out of specific pages. Links can also be extracted from them so Scrapy goes on to crawl the next page.

    Item Pipeline
      Processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and discarding unneeded data. After a page is parsed by a spider, the result is sent to the item pipeline and processed through several stages in a fixed order.


    - Passing parameters between requests in Scrapy (request meta)
    - Purpose: enables deep crawling.
    - When to use: when the data you need is not all on the same page.
    - Pass the item along: yield scrapy.Request(url, callback, meta)
    - Receive the item: response.meta

    import scrapy
    from moivePro.items import MoiveproItem
    
    
    class MovieSpider(scrapy.Spider):
        name = 'movie'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.4567kan.com/index.php/vod/show/class/喜剧/id/1.html']
        url = 'http://www.4567kan.com/index.php/vod/show/class/喜剧/id/1/page/%s.html'
        page = 1
    
        def parse(self, response):
            print('Crawling movie list page {}...'.format(self.page))
    
            li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
            url_h = 'http://www.4567kan.com'
            for li in li_list:
                item = MoiveproItem()
                name = li.xpath('./div/div/h4/a').extract_first()
                item['name'] = name
                # pass data to the callback: Request hands the meta dict over to it
                detail_url = url_h + li.xpath('./div/div/h4/a/@href').extract_first()
                yield scrapy.Request(detail_url, callback=self.parse_other, meta={'item': item})
            if self.page < 5:
                self.page += 1
                new_url = format(self.url % self.page)
                yield scrapy.Request(new_url, callback=self.parse)
    
        def parse_other(self, response):
            # receive the dict passed via meta
            item = response.meta['item']
            desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[3]/text()').extract_first()
            item['desc'] = desc
            yield item
    The spider above crawls movie names and synopses across the whole listing.

    - Improving Scrapy's crawl throughput
      - All of the following are simple entries in the settings file (a combined sketch follows this list):


    Increase concurrency:
      Scrapy runs 16 concurrent requests by default, which can be raised. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.

    Lower the log level:
      Running Scrapy produces a lot of log output; to cut CPU usage, restrict logging to INFO or ERROR. In the settings file: LOG_LEVEL = 'INFO'

    Disable cookies:
      Unless cookies are genuinely needed, disable them while crawling to reduce CPU usage and speed things up. In the settings file: COOKIES_ENABLED = False

    Disable retries:
      Re-requesting failed HTTP requests (retrying) slows the crawl down, so retries can be turned off. In the settings file: RETRY_ENABLED = False

    Reduce the download timeout:
      When crawling a very slow link, a shorter download timeout lets stuck requests be abandoned quickly, improving throughput. In the settings file: DOWNLOAD_TIMEOUT = 10 (a 10-second timeout)
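
    Put together, the settings.py entries suggested above look like this:

    # settings.py (sketch of the efficiency-related options above)
    CONCURRENT_REQUESTS = 100    # raise concurrency from the default of 16
    LOG_LEVEL = 'ERROR'          # or 'INFO'; trims log output
    COOKIES_ENABLED = False      # skip cookie handling
    RETRY_ENABLED = False        # do not retry failed requests
    DOWNLOAD_TIMEOUT = 10        # abandon downloads that take longer than 10 s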


    - Scrapy middleware
      - Spider middleware
      - Downloader middleware (***): sits between the engine and the downloader
      - Role: intercepts all requests and responses in batch

    - Why intercept requests
      - To tamper with request headers (UA spoofing)

    1. Enable the downloader middleware in settings (a settings sketch follows the code below)

    2. Do not set a UA in settings

    3. Override process_request in middlewares.py, in class MiddleproDownloaderMiddleware

    You can build a pool of User-Agents:

    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    from scrapy import signals
    
    # useful for handling different item types with a single interface
    from itemadapter import is_item, ItemAdapter
    import random
    
    # class MiddleproSpiderMiddleware:
    #     # Not all methods need to be defined. If a method is not defined,
    #     # scrapy acts as if the spider middleware does not modify the
    #     # passed objects.
    #
    #     @classmethod
    #     def from_crawler(cls, crawler):
    #         # This method is used by Scrapy to create your spiders.
    #         s = cls()
    #         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    #         return s
    #
    #     def process_spider_input(self, response, spider):
    #         # Called for each response that goes through the spider
    #         # middleware and into the spider.
    #
    #         # Should return None or raise an exception.
    #         return None
    #
    #     def process_spider_output(self, response, result, spider):
    #         # Called with the results returned from the Spider, after
    #         # it has processed the response.
    #
    #         # Must return an iterable of Request, or item objects.
    #         for i in result:
    #             yield i
    #
    #     def process_spider_exception(self, response, exception, spider):
    #         # Called when a spider or process_spider_input() method
    #         # (from other spider middleware) raises an exception.
    #
    #         # Should return either None or an iterable of Request or item objects.
    #         pass
    #
    #     def process_start_requests(self, start_requests, spider):
    #         # Called with the start requests of the spider, and works
    #         # similarly to the process_spider_output() method, except
    #         # that it doesn’t have a response associated.
    #
    #         # Must return only requests (not items).
    #         for r in start_requests:
    #             yield r
    #
    #     def spider_opened(self, spider):
    #         spider.logger.info('Spider opened: %s' % spider.name)
    user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    class MiddleproDownloaderMiddleware:
    
        # intercept every request
        def process_request(self, request, spider):
            # UA spoofing: assign a random User-Agent from the pool
            request.headers['User-Agent'] = random.choice(user_agent_list)
            print(request.headers['User-Agent'])
            return None

        # intercept every response
        def process_response(self, request, response, spider):
            return response

        # intercept requests that raised an exception
        def process_exception(self, request, exception, spider):
            pass
    
        # def spider_opened(self, spider):
        #     spider.logger.info('Spider opened: %s' % spider.name)
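
    Step 1 above (enabling the downloader middleware) happens in settings.py; a minimal sketch, assuming the project is named middlePro:

    # settings.py (sketch)
    DOWNLOADER_MIDDLEWARES = {
        'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
    }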


      - To change the IP a request goes out from (proxy)

    # middlewares.py: same file as above; the imports, the commented-out
    # spider middleware template, and user_agent_list are identical.
    # Only process_request below gains a proxy assignment.
    
    class MiddleproDownloaderMiddleware:
    
        # intercept every request
        def process_request(self, request, spider):
            # UA spoofing
            request.headers['User-Agent'] = random.choice(user_agent_list)
            print(request.headers['User-Agent'])

            # proxy: route this request through the given proxy server
            request.meta['proxy'] = 'http://163.204.94.131:9999'
            print(request.meta['proxy'])
            return None

        # intercept every response
        def process_response(self, request, response, spider):
            return response

        # intercept requests that raised an exception
        def process_exception(self, request, exception, spider):
            pass
    
        # def spider_opened(self, spider):
        #     spider.logger.info('Spider opened: %s' % spider.name)
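
    process_exception is left empty above; a common pattern (a sketch, not part of the original notes, with placeholder proxy addresses) is to swap in a fresh proxy there and resubmit the request. random is already imported at the top of middlewares.py:

    # sketch: rotate to another proxy when a request fails
    PROXY_POOL = [
        'http://163.204.94.131:9999',
        'http://another-proxy.example.com:8888',  # hypothetical second proxy
    ]

    class RetryProxyDownloaderMiddleware:
        def process_exception(self, request, exception, spider):
            request.meta['proxy'] = random.choice(PROXY_POOL)
            return request  # returning the Request re-schedules it through the engine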

    - Why intercept responses
      - To tamper with the response data or swap out the response object
      - Example: crawling NetEase news titles and article bodies

    - Workflow for using selenium inside Scrapy
    - Define a bro attribute on the spider class: the instantiated browser object
    - In the spider class, override the parent's closed(self, spider) and close bro in it
    - Do the browser automation in the downloader middleware (a minimal sketch follows this list)
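
    A minimal sketch of that flow (spider plus downloader middleware fragments; the spider name, target site, and fixed wait are placeholders, not from the original notes):

    # spider fragment (sketch): hold a browser instance on the spider
    import scrapy
    from selenium import webdriver

    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        start_urls = ['https://news.163.com/']

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.bro = webdriver.Chrome()   # the bro attribute from the notes

        def closed(self, spider):
            self.bro.quit()                 # close the browser when the spider closes


    # middlewares.py fragment (sketch): swap in a selenium-rendered response
    from time import sleep
    from scrapy.http import HtmlResponse

    class SeleniumDownloaderMiddleware:
        def process_response(self, request, response, spider):
            # in a real project, only do this for URLs that need JS rendering
            bro = spider.bro
            bro.get(request.url)
            sleep(2)                        # crude wait for dynamic content to load
            return HtmlResponse(url=request.url, body=bro.page_source,
                                encoding='utf-8', request=request)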
    - Homework:
    - NetEase news
    - Crawl the image data from http://sc.chinaz.com/tupian/xingganmeinvtupian.html
