• scrapy框架(2)


    一、使用scrapy框架发送post请求

    1、需求一:使用scrapy发送百度翻译中的ajax请求

      创建一个项目,如下目录,修改settings.py文件中的 "ROBOTSTXT_OBEY"和"USER_AGENT"

    # postPro/postPro/spiders/post.py
    
    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class PostSpider(scrapy.Spider):
        name = 'post'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://fanyi.baidu.com/sug']
    
        def start_requests(self):
            data = {
                "kw":"dog"
            }
            for url in self.start_urls:
                yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse)
    
        def parse(self, response):
            print(response.text)

    2、需求二:基于scrapy实现模拟登录

      自己尝试实现!

    二、请求传参

      我们知道有时候需要爬取的数据并不是都在一个页面中,而是在不同页面中,这时候用scrapy框架该如何做呢?下面以爬取:https://www.4567tv.tv/frim/index1.html 中的数据为例说明。

    1、创建一个项目,目录结构如下,并修改settings.py文件中的 "ROBOTSTXT_OBEY"和"USER_AGENT"

    2、各文件内容如下

    # moviePro/moviePro/spiders/movie.py
    
    # -*- coding: utf-8 -*-
    import scrapy
    
    from moviePro.items import MovieproItem
    
    class MovieSpider(scrapy.Spider):
        name = 'movie'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.4567tv.tv/frim/index1.html']
    
      # 解析详情页中的数据
        def parse_detail(self, response):
         # response.meta 返回接收到的meta字典
            item = response.meta['item'] 
            actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first()
            item['actor'] = actor
    
            yield item
    
    
        def parse(self, response):
            li_list = response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]')
            for li in li_list:
                item = MovieproItem()
                name = li.xpath('./div/a/@title').extract_first()
                detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
                item['name'] = name
                yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})
    # moviePro/moviePro/items.py
    
    import scrapy
    
    class MovieproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()
        actor = scrapy.Field()
    # moviePro/moviePro/pipelines.py
    
    class MovieproPipeline(object):
        def process_item(self, item, spider):
            print(item)
            return item

      注意:要将settings.py中的"ITEM_PIPELINES"放开注释!

    3、运行结果

    三、日志等级

    1、修改日志等级

      通过修改settings.py中的参数 LOG_LEVEL 来修改日志等级,比如可以改为 ERROR

    # settings.py
    LOG_LEVEL = "ERROR"
    2、指定日志输出文件
    # settings.py
    LOG_FILE = './log.txt'

    四、scrapy的五大核心组件

    1、五大核心组件工作流程如下图:

      解释如下:

      - 引擎(Scrapy)

        用来处理整个系统的数据流处理,触发事务(框架核心)

      - 调度器(Scheduler)

        用来接收引擎发过来的请求,压入队列中,并在引擎再次请求的时候返回,可以想象成一个URL(抓取网页的网址或者说是链接)的优先队列,由它来决定下一个要抓取的网址是什么,同时去除重复的网址

      - 下载器(Downloader)

        用于下载网页内容,并将网页内容返回给蜘蛛(Scrapy下载器是建立在twisted这个高效的异步模型上的)

      - 爬虫(Spiders)

        爬虫是主要干活的,用于从特定的网页中提取自己需要的信息,即所谓的实体(Item)。用户也可以从中提取出链接,让scrapy继续抓取下一个页面

      - 项目管道(Pipeline)

        负责处理爬虫从网页中抽取的实体,主要的功能是持久化实体,验证实体的有效性,清除不需要的信息。当页面被爬虫解析后,将被发送到项目管道,并经过几个特定的次序处理数据

    2、案例一

      当批量请求时,对不同的请求使用不同的UA伪装和IP代理,可以利用DownloaderMiddleware,新建一个项目,目录结构如下:

      各文件内容如下:

    # middlePro/middlePro/middlewares.py
    
    import random
    
    from scrapy import signals
    
    
    class MiddleproDownloaderMiddleware(object):
    
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
        # 可被选用的代理IP
        PROXY_http = [
            '153.180.102.104:80',
            '195.208.131.189:56055',
        ]
        PROXY_https = [
            '120.83.49.90:9000',
            '95.189.112.214:35508',
        ]
    
        # 拦截所有未发生异常的请求
        def process_request(self, request, spider):
    
            # 使用UA池进行请求的UA伪装
            request.headers["User-Agent"] = random.choice(self.user_agent_list)
            print(request.headers['User-Agent'])
            return None
    
    
        # 拦截所有的响应
        def process_response(self, request, response, spider):
            return response
    
    
        # 拦截到产生异常的请求
        def process_exception(self, request, exception, spider):
            print('this is process_exception!')
            if request.url.split(':')[0] == 'http':
                request.meta['proxy'] = random.choice(self.PROXY_http)
            else:
                request.meta['proxy'] = random.choice(self.PROXY_https)
    # middlePro/middlePro/spiders/middle.py
    
    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class MiddleSpider(scrapy.Spider):
        name = 'middle'
        allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.baidu.com/s?wd=ip']
    
        def parse(self, response):
            pass

      注意:settings.py中的参数 "DOWNLOADER_MIDDLEWARES" 解开注释!

    3、案例二

      抓取网易新闻"军事"模块(http://war.163.com/)的页面信息,注意动态加载。

      思路提示:页面是有动态加载信息,因此我们要使用selenium模块。

      新建一个scrapy项目,目录结构如下:

      各文件代码如下:

    # wangyiPro/wangyiPro/spieders/wangyi.py
    
    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver
    
    class WangyiSpider(scrapy.Spider):
        name = 'wangyi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://war.163.com/']
    
        def __init__(self):
            self.bro = webdriver.Chrome(executable_path=r'D:@Lilymyprojectpachongchromedriver.exe')
    
    
        def parse(self, response):
            div_list = response.xpath('//div[@class="data_row news_article clearfix "]')
            for div in div_list:
                title = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
                print(title)
    
    
        def closed(self, spider):
            print('关闭浏览器对象!')
            self.bro.quit()
    # wangyiPro/wangyiPro/middlewares.py
    
    from time import sleep
    from scrapy import signals
    from scrapy.http import HtmlResponse
    
    
    class WangyiproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_request(self, request, spider):
            return None
    
        def process_response(self, request, response, spider):
            # 获取动态加载出来的数据
            bro = spider.bro
            bro.get(url=request.url)
            sleep(3)
            # 包含了动态加载出来的新闻数据
            page_text = bro.page_source
            sleep(3)
            return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request)
    
    
        def process_exception(self, request, exception, spider):
            pass
    

      注意:settings.py文件中要配置好参数"ROBOTSTXT_OBEY"和"ROBOTSTXT_OBEY",并且解开中间件参数"DOWNLOADER_MIDDLEWARES"的注释。

      运行结果如下:

      总结:在scrapy中使用selenium的编码流程:

      1)在spider的构造方法中创建一个浏览器对象(作为当前spider的一个属性)
      2)重写spider的一个方法closed(self,spider),在该方法中执行浏览器关闭的操作
      3)在下载中间件的process_response方法中,通过spider参数获取浏览器对象
      4)在中间件的process_response中定制基于浏览器自动化的操作代码(获取动态加载出来的页面源码数据)
      5)实例化一个响应对象,且将page_source返回的页面源码封装到该对象中
      6)返回该新的响应对象

    五、相关参考博客

      https://www.cnblogs.com/bobo-zhang/p/10069001.html

      https://www.cnblogs.com/bobo-zhang/p/10069004.html

      https://www.cnblogs.com/bobo-zhang/p/10013011.html

  • 相关阅读:
    Red and Black POJ
    Catch That Cow HDU
    Lotus and Horticulture HDU
    进击的绿色
    北京办护照
    女码农真诚征gg
    bitset
    long long
    cnblogs latex公式
    2050 Programming Competition (CCPC)
  • 原文地址:https://www.cnblogs.com/li-li/p/10469495.html
Copyright © 2020-2023  润新知