• Request parameter passing, log levels, and spider optimization


    Request parameter passing

    In some cases the data we want to scrape is not all on the same page. For example, when crawling a movie site, the movie name and score may sit on the first-level listing page, while the remaining details sit on a second-level sub-page. In that situation we need request parameter passing.
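
    The mechanism is the meta argument of scrapy.Request: attach data to the request in one callback and read it back from response.meta in the next callback. Below is a minimal, generic sketch of the pattern; the spider name, URLs, and field names are placeholders and not part of the case study that follows (newer Scrapy versions also offer cb_kwargs for the same purpose).

    import scrapy


    class SketchSpider(scrapy.Spider):
        name = 'sketch'                                    # placeholder spider
        start_urls = ['http://example.com/list']           # placeholder URL

        def parse(self, response):
            item = {'title': 'parsed from the list page'}  # fields found on page one
            detail_url = 'http://example.com/detail/1'     # link found on page one
            # attach the half-filled item to the request for the detail page
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

        def parse_detail(self, response):
            item = response.meta['item']                   # get the item back
            item['extra'] = 'parsed from the detail page'  # fill the remaining fields
            yield item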

    Case study: crawl the movie site http://www.55xia.com, scraping the movie name and score from the first-level page, and the director and description from the second-level detail page.

    Spider file

    # -*- coding: utf-8 -*-
    import scrapy
    
    from moviePro.items import MovieproItem
    class MovieSpider(scrapy.Spider):
        name = 'movie'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.55xia.com/']
    
        def parse(self, response):
            div_list = response.xpath('//div[@class="col-xs-1-5 movie-item"]')
            for div in div_list:
                item = MovieproItem()
                item['name'] = div.xpath('.//div[@class="meta"]/h1/a/text()').extract_first()
                item['score'] = div.xpath('.//div[@class="meta"]/h1/em/text()').extract_first()
                if item['score'] is None:
                    item['score'] = '0'
                detail_url = 'https:'+div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()
    
                # send a request to the detail page URL
                # use the meta parameter to pass the item along with the request
                yield scrapy.Request(url=detail_url,callback=self.getDetailPage,meta={'item':item})
    
        def getDetailPage(self,response):
            item = response.meta['item']
            deactor = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]/a/text()').extract_first()
            desc = response.xpath('/html/body/div[1]/div/div/div[1]/div[2]/div[2]/p/text()').extract_first()
            item['desc'] = desc
            item['deactor'] = deactor
    
            yield item
    
    
            # Summary: when scraping with Scrapy, if the data to be collected is not stored on a single page, request parameter passing must be used (so the item can be persisted as a whole).
    movie.py
    import scrapy
    
    
    class MovieproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()
        score = scrapy.Field()
        deactor = scrapy.Field()
        desc = scrapy.Field()
    items.py
    class MovieproPipeline(object):
        fp = None
    
        def open_spider(self, spider):
            self.fp = open('./movie.txt', 'w', encoding='utf-8')
    
        def process_item(self, item, spider):
            self.fp.write(item['name'] + ':' + item['score'] + ':' + item['deactor'] + ':' + item['desc'] + '\n')
            return item
    
        def close_spider(self, spider):
            self.fp.close()
    pipelines.py
    # UA
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    
    
    ROBOTSTXT_OBEY = False
    # item pipelines
    ITEM_PIPELINES = {
       'moviePro.pipelines.MovieproPipeline': 300,
    }
    settings.py
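
    With the files above in place, the spider is started from the project root with Scrapy's standard command (assuming the project is named moviePro, as the import path suggests); the pipeline then writes the results to ./movie.txt:

    scrapy crawl movie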

    Improving crawling efficiency

    1. Increase concurrency:
        By default Scrapy handles 16 concurrent requests, which can be raised as appropriate. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.
    
    2. Lower the log level:
        Scrapy emits a large amount of log output while running. To cut CPU usage, set the log level to INFO or ERROR. In the settings file: LOG_LEVEL = 'INFO'
    
    3. Disable cookies:
        If cookies are not actually needed, disabling them while crawling reduces CPU usage and speeds things up. In the settings file: COOKIES_ENABLED = False
    
    4. Disable retries:
        Re-requesting failed HTTP requests (retries) slows crawling down, so retries can be turned off. In the settings file: RETRY_ENABLED = False
    
    5. Reduce the download timeout:
        When crawling very slow links, a smaller download timeout lets stuck requests be dropped quickly, improving efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 (a 10-second timeout)

    Example: scraping the netbian wallpaper site (pic.netbian.com)

    Spider

    # -*- coding: utf-8 -*-
    import scrapy
    from picPro.items import PicproItem
    
    
    class PicSpider(scrapy.Spider):
        name = 'pic'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://pic.netbian.com/']
    
        def parse(self, response):
            li_list = response.xpath('//div[@class="slist"]/ul/li')  # get the list of image entries
            for li in li_list:
                img_url = 'http://pic.netbian.com' + li.xpath('./a/span/img/@src').extract_first()  # build the absolute image URL
                img_name = img_url.split('/')[-1]  # derive the image file name
                item = PicproItem()
                item['name'] = img_name
    
                yield scrapy.Request(url=img_url, callback=self.getImgData, meta={'item': item})
    
        def getImgData(self, response):
            item = response.meta['item']
            item['img_data'] = response.body
    
            yield item
    pic.py

    Pipeline file

    import os
    
    
    class PicproPipeline(object):
        def open_spider(self, spider):
            if not os.path.exists('picLib'):
                os.mkdir('./picLib')
    
        def process_item(self, item, spider):
            imgPath = './picLib/' + item['name']
            with open(imgPath, 'wb') as fp:
                fp.write(item['img_data'])
                print(imgPath + ' downloaded successfully!')
            return item
    pipelines.py

    Items file

    import scrapy
    
    
    class PicproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()
        img_data = scrapy.Field()
    items.py

    Settings file

    ROBOTSTXT_OBEY = False
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    CONCURRENT_REQUESTS = 30  # number of concurrent requests
    LOG_LEVEL = 'ERROR'  # lower the log level
    COOKIES_ENABLED = False  # disable cookies
    RETRY_ENABLED = False  # disable retries
    DOWNLOAD_TIMEOUT = 5  # download timeout in seconds
    settings.py

    UA pool and IP proxy pool

    First, take a look at the middlewares.py file.

    It mainly contains two kinds of middleware:

    Spider Middleware 

    Its main role is to do some processing while the spider is running; in practice it is rarely needed. Its hook methods are listed below, followed by a minimal skeleton.

           - process_spider_input: receives a response object and processes it;

             its position is Downloader --> process_spider_input --> Spiders (Downloader and Spiders are components in the official Scrapy architecture diagram)

           - process_spider_exception: called when the spider raises an exception

           - process_spider_output: called when the Spider has processed a response and returns its results

           - process_start_requests: called when the spider issues its start requests;

        its position is Spiders --> process_start_requests --> Scrapy Engine (the Scrapy Engine is a component in the official Scrapy architecture diagram)
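
    For reference, here is a minimal skeleton of a spider middleware exposing these hooks; the method signatures follow Scrapy's standard spider-middleware interface, while the class name and pass-through bodies are only illustrative:

    class SketchSpiderMiddleware(object):
        def process_spider_input(self, response, spider):
            # called for each response passed from the Downloader to the Spider
            return None

        def process_spider_output(self, response, result, spider):
            # called with the items/requests the Spider returns for a response
            for i in result:
                yield i

        def process_spider_exception(self, response, exception, spider):
            # called when a spider callback or process_spider_input() raises an exception
            pass

        def process_start_requests(self, start_requests, spider):
            # called with the spider's start requests, on their way to the Engine
            for r in start_requests:
                yield r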

    Downloader Middleware

    Its main role is to do some processing on requests and responses while pages are being downloaded.

    Usage

    To add the UA pool and IP proxy pool, simply put them in the process_request method of the ProxyproDownloaderMiddleware class:

    import random


    class ProxyproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
        # Intercepting requests: the request argument is the intercepted request.

        # HTTP and HTTPS proxy pools
        PROXY_http = [
            '58.45.195.51:9000',
            '111.230.113.238:9999',
        ]
        PROXY_https = [
            '120.83.49.90:9000',
            '106.14.162.110:8080',
        ]

        # UA pool
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader middleware.
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called

            # pick a random proxy based on the request scheme
            print('downloader middleware', request)
            if request.url.split(':')[0] == 'http':
                # the target URL uses plain http
                request.meta['proxy'] = random.choice(self.PROXY_http)
            else:
                request.meta['proxy'] = random.choice(self.PROXY_https)

            # random User-Agent spoofing
            request.headers['User-Agent'] = random.choice(self.user_agent_list)
            print(request.headers['User-Agent'])
            return None

    Note: be sure to enable the middleware in settings.py!

    # enable the downloader middleware
    DOWNLOADER_MIDDLEWARES = {
        'proxyPro.middlewares.ProxyproDownloaderMiddleware': 543,
    }
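
    One caveat the post does not cover: proxies in a hard-coded pool can stop working at any time. A common complement, sketched here only as an illustration, is to also implement the downloader middleware's process_exception hook inside the same ProxyproDownloaderMiddleware class, so that a request whose proxy failed is resent through a different proxy:

        def process_exception(self, request, exception, spider):
            # sketch only: swap in another proxy from the pool when a download fails
            if request.url.split(':')[0] == 'https':
                request.meta['proxy'] = random.choice(self.PROXY_https)
            else:
                request.meta['proxy'] = random.choice(self.PROXY_http)
            # returning the request hands it back to the engine to be scheduled again
            return request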
  • Original article: https://www.cnblogs.com/clbao/p/10269384.html