• CrawlSpider usage (extracting and following links found in pages, e.g. "next page" links)


    Create a spider file based on CrawlSpider

      scrapy genspider -t crawl <spider_name> <url>

    Pay attention to the follow parameter

    Example 1: follow = False

    spider/chouti.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class ChoutiSpider(CrawlSpider):
        name = 'chouti'
        allowed_domains = ['dig.chouti.com']
        start_urls = ['https://dig.chouti.com/']
        # Instantiate a link extractor.
        # A link extractor pulls the specified links (URLs) out of a page.
        # allow: a regular expression; only links that match it are extracted.
        # Every extracted link is handed to the rule below.
        link = LinkExtractor(allow=r'/all/hot/recent/\d+')
        rules = (
            # Instantiate a rule.
            # When the rule receives links from the link extractor, it requests
            # each one and fetches the corresponding page.
            # callback: the method used to parse each fetched page.
            # follow: whether to apply the link extractor again to the pages
            # fetched from the extracted links.
            Rule(link, callback='parse_item', follow=False),
        )
    
        def parse_item(self, response):
            print(response)

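    Result: the link extractor is not applied again to the pages reached through the extracted links, so only the pages linked from the start page are fetched.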

    C:\Users\Administrator\PycharmProjects\newCrawlspiderPro>scrapy crawl chouti --nolog
    <200 https://dig.chouti.com/all/hot/recent/1>
    <200 https://dig.chouti.com/all/hot/recent/3>
    <200 https://dig.chouti.com/all/hot/recent/9>
    <200 https://dig.chouti.com/all/hot/recent/6>
    <200 https://dig.chouti.com/all/hot/recent/2>
    <200 https://dig.chouti.com/all/hot/recent/4>
    <200 https://dig.chouti.com/all/hot/recent/10>
    <200 https://dig.chouti.com/all/hot/recent/7>
    <200 https://dig.chouti.com/all/hot/recent/8>
    <200 https://dig.chouti.com/all/hot/recent/5>
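
    The allow pattern is an ordinary regular expression, so the digit class has to be written \d+ with its backslash. The sketch below uses only the standard re module to approximate how such a filter picks URLs; it is not Scrapy's own code, and the candidate links are hypothetical.

```python
import re

# Same pattern as the spider's LinkExtractor(allow=...).
ALLOW = re.compile(r'/all/hot/recent/\d+')

# Hypothetical links, as they might appear on the start page.
candidates = [
    'https://dig.chouti.com/all/hot/recent/1',
    'https://dig.chouti.com/all/hot/recent/10',
    'https://dig.chouti.com/all/hot/recent/',
    'https://dig.chouti.com/about',
]


def allowed(url):
    """Return True when the pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


matched = [u for u in candidates if allowed(u)]

# Without the backslash, d+ means "one or more literal letter d",
# so numeric page URLs would not match at all.
broken = re.compile(r'/all/hot/recent/d+')
broken_matches = [u for u in candidates if broken.search(u)]

print(matched)
print(broken_matches)
```

    Only the two URLs that end in a page number pass the filter, while the pattern without the backslash matches none of them.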

    Example 2: follow = True

    spider/chouti.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class ChoutiSpider(CrawlSpider):
        name = 'chouti'
        allowed_domains = ['dig.chouti.com']
        start_urls = ['https://dig.chouti.com/']
        # Instantiate a link extractor.
        # A link extractor pulls the specified links (URLs) out of a page.
        # allow: a regular expression; only links that match it are extracted.
        # Every extracted link is handed to the rule below.
        link = LinkExtractor(allow=r'/all/hot/recent/\d+')
        rules = (
            # Instantiate a rule.
            # When the rule receives links from the link extractor, it requests
            # each one and fetches the corresponding page.
            # callback: the method used to parse each fetched page.
            # follow: whether to apply the link extractor again to the pages
            # fetched from the extracted links.
            Rule(link, callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            print(response)

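    Result: the link extractor is applied to every fetched page as well, so pagination links that only appear on later pages are also crawled (Scrapy filters out URLs it has already requested).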

    C:\Users\Administrator\PycharmProjects\newCrawlspiderPro>scrapy crawl chouti --nolog
    <200 https://dig.chouti.com/all/hot/recent/1>
    <200 https://dig.chouti.com/all/hot/recent/3>
    <200 https://dig.chouti.com/all/hot/recent/5>
    <200 https://dig.chouti.com/all/hot/recent/2>
    <200 https://dig.chouti.com/all/hot/recent/10>
    <200 https://dig.chouti.com/all/hot/recent/4>
    <200 https://dig.chouti.com/all/hot/recent/6>
    <200 https://dig.chouti.com/all/hot/recent/7>
    <200 https://dig.chouti.com/all/hot/recent/8>
    <200 https://dig.chouti.com/all/hot/recent/9>
    <200 https://dig.chouti.com/all/hot/recent/13>
    <200 https://dig.chouti.com/all/hot/recent/14>
    <200 https://dig.chouti.com/all/hot/recent/11>
    <200 https://dig.chouti.com/all/hot/recent/12>
    <200 https://dig.chouti.com/all/hot/recent/16>
    <200 https://dig.chouti.com/all/hot/recent/17>
    <200 https://dig.chouti.com/all/hot/recent/15>
    <200 https://dig.chouti.com/all/hot/recent/18>
    <200 https://dig.chouti.com/all/hot/recent/20>
    <200 https://dig.chouti.com/all/hot/recent/19>
    <200 https://dig.chouti.com/all/hot/recent/22>
    <200 https://dig.chouti.com/all/hot/recent/21>
    <200 https://dig.chouti.com/all/hot/recent/24>
    <200 https://dig.chouti.com/all/hot/recent/23>
    <200 https://dig.chouti.com/all/hot/recent/26>
    <200 https://dig.chouti.com/all/hot/recent/25>
    <200 https://dig.chouti.com/all/hot/recent/28>
    <200 https://dig.chouti.com/all/hot/recent/27>
    <200 https://dig.chouti.com/all/hot/recent/30>
    <200 https://dig.chouti.com/all/hot/recent/29>
    <200 https://dig.chouti.com/all/hot/recent/31>
    <200 https://dig.chouti.com/all/hot/recent/32>
    <200 https://dig.chouti.com/all/hot/recent/33>
    <200 https://dig.chouti.com/all/hot/recent/34>
    <200 https://dig.chouti.com/all/hot/recent/37>
    <200 https://dig.chouti.com/all/hot/recent/36>
    <200 https://dig.chouti.com/all/hot/recent/38>
    <200 https://dig.chouti.com/all/hot/recent/35>
    <200 https://dig.chouti.com/all/hot/recent/40>
    <200 https://dig.chouti.com/all/hot/recent/41>
    <200 https://dig.chouti.com/all/hot/recent/39>
    <200 https://dig.chouti.com/all/hot/recent/42>
    <200 https://dig.chouti.com/all/hot/recent/45>
    <200 https://dig.chouti.com/all/hot/recent/43>
    <200 https://dig.chouti.com/all/hot/recent/44>
    <200 https://dig.chouti.com/all/hot/recent/46>
    <200 https://dig.chouti.com/all/hot/recent/49>
    <200 https://dig.chouti.com/all/hot/recent/48>
    <200 https://dig.chouti.com/all/hot/recent/47>
    <200 https://dig.chouti.com/all/hot/recent/50>
    <200 https://dig.chouti.com/all/hot/recent/51>
    <200 https://dig.chouti.com/all/hot/recent/52>
    <200 https://dig.chouti.com/all/hot/recent/53>
    <200 https://dig.chouti.com/all/hot/recent/54>
    <200 https://dig.chouti.com/all/hot/recent/55>
    <200 https://dig.chouti.com/all/hot/recent/56>
    <200 https://dig.chouti.com/all/hot/recent/58>
    <200 https://dig.chouti.com/all/hot/recent/57>
    <200 https://dig.chouti.com/all/hot/recent/60>
    <200 https://dig.chouti.com/all/hot/recent/59>
    <200 https://dig.chouti.com/all/hot/recent/61>
    <200 https://dig.chouti.com/all/hot/recent/62>
    <200 https://dig.chouti.com/all/hot/recent/64>
    <200 https://dig.chouti.com/all/hot/recent/63>
    <200 https://dig.chouti.com/all/hot/recent/65>
    <200 https://dig.chouti.com/all/hot/recent/66>
    <200 https://dig.chouti.com/all/hot/recent/68>
    <200 https://dig.chouti.com/all/hot/recent/67>
    <200 https://dig.chouti.com/all/hot/recent/69>
    <200 https://dig.chouti.com/all/hot/recent/70>
    <200 https://dig.chouti.com/all/hot/recent/71>
    <200 https://dig.chouti.com/all/hot/recent/72>
    <200 https://dig.chouti.com/all/hot/recent/73>
    <200 https://dig.chouti.com/all/hot/recent/74>
    <200 https://dig.chouti.com/all/hot/recent/75>
    <200 https://dig.chouti.com/all/hot/recent/76>
    <200 https://dig.chouti.com/all/hot/recent/78>
    <200 https://dig.chouti.com/all/hot/recent/77>
    <200 https://dig.chouti.com/all/hot/recent/79>
    <200 https://dig.chouti.com/all/hot/recent/80>
    <200 https://dig.chouti.com/all/hot/recent/82>
    <200 https://dig.chouti.com/all/hot/recent/81>
    <200 https://dig.chouti.com/all/hot/recent/84>
    <200 https://dig.chouti.com/all/hot/recent/83>
    <200 https://dig.chouti.com/all/hot/recent/85>
    <200 https://dig.chouti.com/all/hot/recent/86>
    <200 https://dig.chouti.com/all/hot/recent/87>
    <200 https://dig.chouti.com/all/hot/recent/88>
    <200 https://dig.chouti.com/all/hot/recent/89>
    <200 https://dig.chouti.com/all/hot/recent/90>
    <200 https://dig.chouti.com/all/hot/recent/91>
    <200 https://dig.chouti.com/all/hot/recent/92>
    <200 https://dig.chouti.com/all/hot/recent/94>
    <200 https://dig.chouti.com/all/hot/recent/93>
    <200 https://dig.chouti.com/all/hot/recent/96>
    <200 https://dig.chouti.com/all/hot/recent/95>
    <200 https://dig.chouti.com/all/hot/recent/98>
    <200 https://dig.chouti.com/all/hot/recent/97>
    <200 https://dig.chouti.com/all/hot/recent/100>
    <200 https://dig.chouti.com/all/hot/recent/99>
    <200 https://dig.chouti.com/all/hot/recent/102>
    <200 https://dig.chouti.com/all/hot/recent/101>
    <200 https://dig.chouti.com/all/hot/recent/103>
    <200 https://dig.chouti.com/all/hot/recent/104>
    <200 https://dig.chouti.com/all/hot/recent/105>
    <200 https://dig.chouti.com/all/hot/recent/106>
    <200 https://dig.chouti.com/all/hot/recent/107>
    <200 https://dig.chouti.com/all/hot/recent/108>
    <200 https://dig.chouti.com/all/hot/recent/109>
    <200 https://dig.chouti.com/all/hot/recent/110>
    <200 https://dig.chouti.com/all/hot/recent/111>
    <200 https://dig.chouti.com/all/hot/recent/112>
    <200 https://dig.chouti.com/all/hot/recent/113>
    <200 https://dig.chouti.com/all/hot/recent/114>
    <200 https://dig.chouti.com/all/hot/recent/115>
    <200 https://dig.chouti.com/all/hot/recent/116>
    <200 https://dig.chouti.com/all/hot/recent/118>
    <200 https://dig.chouti.com/all/hot/recent/117>
    <200 https://dig.chouti.com/all/hot/recent/119>
    <200 https://dig.chouti.com/all/hot/recent/120>
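
    The difference between the two runs can be modelled without any network access. The following is a toy simulation, not Scrapy itself: SITE is a hypothetical in-memory site, and crawl() is an illustrative function that mimics a single rule with a callback, including duplicate filtering.

```python
import re
from collections import deque

# Hypothetical site: each path maps to the links found on that page.
SITE = {
    "/": ["/all/hot/recent/1", "/all/hot/recent/2", "/about"],
    "/all/hot/recent/1": ["/all/hot/recent/2", "/all/hot/recent/3"],
    "/all/hot/recent/2": ["/all/hot/recent/3", "/all/hot/recent/4"],
    "/all/hot/recent/3": ["/all/hot/recent/4"],
    "/all/hot/recent/4": [],
    "/about": [],
}


def crawl(start, allow, follow):
    """Return the pages that would reach the callback, in request order."""
    pattern = re.compile(allow)
    seen = {start}                   # duplicate filter
    parsed = []                      # pages handed to the callback
    queue = deque([(start, True)])   # (url, extract links from it?)
    while queue:
        url, extract = queue.popleft()
        if url != start:
            parsed.append(url)       # the start page is not sent to the callback
        if not extract:
            continue                 # follow=False: stop after one hop
        for link in SITE.get(url, []):
            if pattern.search(link) and link not in seen:
                seen.add(link)
                queue.append((link, follow))
    return parsed


no_follow = crawl("/", r"/all/hot/recent/\d+", follow=False)
with_follow = crawl("/", r"/all/hot/recent/\d+", follow=True)
print(no_follow)
print(with_follow)
```

    With follow=False only the pages linked directly from the start page are parsed; with follow=True the pages discovered on later pages (3 and 4 in this toy site) are reached too, which mirrors the 10 versus 120 pages in the two outputs above.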

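    Note:

      To process the crawled pages further, extract the data with XPath in the callback and yield it to an item pipeline, which then performs the storage.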
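
    As a rough sketch of that flow: extract_fields() and JsonLinesPipeline are hypothetical names, the //title XPath is only a placeholder for whatever fields are actually needed, and the output file name is an assumption. The pipeline relies on the open_spider / process_item / close_spider hooks that Scrapy calls on item pipelines, and it would still have to be enabled through ITEM_PIPELINES in settings.py.

```python
import json


def extract_fields(response):
    """Build one item from a response exposing .url and .xpath()."""
    return {
        "url": response.url,
        # Placeholder XPath: the page <title>; adapt to the real fields.
        "title": response.xpath("//title/text()").get(),
    }


# In the spider, the callback would pass the item on with yield:
#
#     def parse_item(self, response):
#         yield extract_fields(response)


class JsonLinesPipeline:
    """Hypothetical pipeline that stores each item as one JSON line."""

    path = "chouti_pages.jl"  # assumed output file name

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item yielded by the spider.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

    Keeping the extraction in a plain function makes it easy to check against a saved page, while the pipeline stays the only place that knows how items are stored.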

  • Original article: https://www.cnblogs.com/cjj-zyj/p/10144860.html