Creating a spider file based on CrawlSpider
scrapy genspider -t crawl <spider_name> <url>
Pay attention to the follow parameter of Rule.
Example 1: follow = False
spider/chouti.py
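# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # Instantiate a link extractor.
    # A link extractor pulls the specified links (URLs) out of a page.
    # allow: a regular expression; links in the page that match it are extracted.
    # Every extracted link is handed to the rule (rule parser) below.
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')

    rules = (
        # Instantiate a rule (rule parser).
        # After receiving links from the link extractor, the rule requests each
        # link and fetches the corresponding page.
        # callback: the parsing method to call on each response
        # follow: whether to apply the link extractor again to the pages
        #         reached through the links it has already extracted
        Rule(link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        print(response)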
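Result: the link extractor is not allowed to run again on the pages it extracted, so only the links found on the start page are crawled.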
C:\Users\Administrator\PycharmProjects\newCrawlspiderPro>scrapy crawl chouti --nolog
<200 https://dig.chouti.com/all/hot/recent/1>
<200 https://dig.chouti.com/all/hot/recent/3>
<200 https://dig.chouti.com/all/hot/recent/9>
<200 https://dig.chouti.com/all/hot/recent/6>
<200 https://dig.chouti.com/all/hot/recent/2>
<200 https://dig.chouti.com/all/hot/recent/4>
<200 https://dig.chouti.com/all/hot/recent/10>
<200 https://dig.chouti.com/all/hot/recent/7>
<200 https://dig.chouti.com/all/hot/recent/8>
<200 https://dig.chouti.com/all/hot/recent/5>
Example 2: follow = True
spider/chouti.py
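# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # Instantiate a link extractor.
    # A link extractor pulls the specified links (URLs) out of a page.
    # allow: a regular expression; links in the page that match it are extracted.
    # Every extracted link is handed to the rule (rule parser) below.
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')

    rules = (
        # Instantiate a rule (rule parser).
        # After receiving links from the link extractor, the rule requests each
        # link and fetches the corresponding page.
        # callback: the parsing method to call on each response
        # follow: whether to apply the link extractor again to the pages
        #         reached through the links it has already extracted
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)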
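Result: the link extractor keeps running on every page it extracts, so the crawl continues through the pagination links found on those pages.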
C:\Users\Administrator\PycharmProjects\newCrawlspiderPro>scrapy crawl chouti --nolog
<200 https://dig.chouti.com/all/hot/recent/1>
<200 https://dig.chouti.com/all/hot/recent/3>
<200 https://dig.chouti.com/all/hot/recent/5>
<200 https://dig.chouti.com/all/hot/recent/2>
<200 https://dig.chouti.com/all/hot/recent/10>
<200 https://dig.chouti.com/all/hot/recent/4>
<200 https://dig.chouti.com/all/hot/recent/6>
<200 https://dig.chouti.com/all/hot/recent/7>
<200 https://dig.chouti.com/all/hot/recent/8>
<200 https://dig.chouti.com/all/hot/recent/9>
<200 https://dig.chouti.com/all/hot/recent/13>
<200 https://dig.chouti.com/all/hot/recent/14>
<200 https://dig.chouti.com/all/hot/recent/11>
<200 https://dig.chouti.com/all/hot/recent/12>
<200 https://dig.chouti.com/all/hot/recent/16>
<200 https://dig.chouti.com/all/hot/recent/17>
<200 https://dig.chouti.com/all/hot/recent/15>
<200 https://dig.chouti.com/all/hot/recent/18>
<200 https://dig.chouti.com/all/hot/recent/20>
<200 https://dig.chouti.com/all/hot/recent/19>
<200 https://dig.chouti.com/all/hot/recent/22>
<200 https://dig.chouti.com/all/hot/recent/21>
<200 https://dig.chouti.com/all/hot/recent/24>
<200 https://dig.chouti.com/all/hot/recent/23>
<200 https://dig.chouti.com/all/hot/recent/26>
<200 https://dig.chouti.com/all/hot/recent/25>
<200 https://dig.chouti.com/all/hot/recent/28>
<200 https://dig.chouti.com/all/hot/recent/27>
<200 https://dig.chouti.com/all/hot/recent/30>
<200 https://dig.chouti.com/all/hot/recent/29>
<200 https://dig.chouti.com/all/hot/recent/31>
<200 https://dig.chouti.com/all/hot/recent/32>
<200 https://dig.chouti.com/all/hot/recent/33>
<200 https://dig.chouti.com/all/hot/recent/34>
<200 https://dig.chouti.com/all/hot/recent/37>
<200 https://dig.chouti.com/all/hot/recent/36>
<200 https://dig.chouti.com/all/hot/recent/38>
<200 https://dig.chouti.com/all/hot/recent/35>
<200 https://dig.chouti.com/all/hot/recent/40>
<200 https://dig.chouti.com/all/hot/recent/41>
<200 https://dig.chouti.com/all/hot/recent/39>
<200 https://dig.chouti.com/all/hot/recent/42>
<200 https://dig.chouti.com/all/hot/recent/45>
<200 https://dig.chouti.com/all/hot/recent/43>
<200 https://dig.chouti.com/all/hot/recent/44>
<200 https://dig.chouti.com/all/hot/recent/46>
<200 https://dig.chouti.com/all/hot/recent/49>
<200 https://dig.chouti.com/all/hot/recent/48>
<200 https://dig.chouti.com/all/hot/recent/47>
<200 https://dig.chouti.com/all/hot/recent/50>
<200 https://dig.chouti.com/all/hot/recent/51>
<200 https://dig.chouti.com/all/hot/recent/52>
<200 https://dig.chouti.com/all/hot/recent/53>
<200 https://dig.chouti.com/all/hot/recent/54>
<200 https://dig.chouti.com/all/hot/recent/55>
<200 https://dig.chouti.com/all/hot/recent/56>
<200 https://dig.chouti.com/all/hot/recent/58>
<200 https://dig.chouti.com/all/hot/recent/57>
<200 https://dig.chouti.com/all/hot/recent/60>
<200 https://dig.chouti.com/all/hot/recent/59>
<200 https://dig.chouti.com/all/hot/recent/61>
<200 https://dig.chouti.com/all/hot/recent/62>
<200 https://dig.chouti.com/all/hot/recent/64>
<200 https://dig.chouti.com/all/hot/recent/63>
<200 https://dig.chouti.com/all/hot/recent/65>
<200 https://dig.chouti.com/all/hot/recent/66>
<200 https://dig.chouti.com/all/hot/recent/68>
<200 https://dig.chouti.com/all/hot/recent/67>
<200 https://dig.chouti.com/all/hot/recent/69>
<200 https://dig.chouti.com/all/hot/recent/70>
<200 https://dig.chouti.com/all/hot/recent/71>
<200 https://dig.chouti.com/all/hot/recent/72>
<200 https://dig.chouti.com/all/hot/recent/73>
<200 https://dig.chouti.com/all/hot/recent/74>
<200 https://dig.chouti.com/all/hot/recent/75>
<200 https://dig.chouti.com/all/hot/recent/76>
<200 https://dig.chouti.com/all/hot/recent/78>
<200 https://dig.chouti.com/all/hot/recent/77>
<200 https://dig.chouti.com/all/hot/recent/79>
<200 https://dig.chouti.com/all/hot/recent/80>
<200 https://dig.chouti.com/all/hot/recent/82>
<200 https://dig.chouti.com/all/hot/recent/81>
<200 https://dig.chouti.com/all/hot/recent/84>
<200 https://dig.chouti.com/all/hot/recent/83>
<200 https://dig.chouti.com/all/hot/recent/85>
<200 https://dig.chouti.com/all/hot/recent/86>
<200 https://dig.chouti.com/all/hot/recent/87>
<200 https://dig.chouti.com/all/hot/recent/88>
<200 https://dig.chouti.com/all/hot/recent/89>
<200 https://dig.chouti.com/all/hot/recent/90>
<200 https://dig.chouti.com/all/hot/recent/91>
<200 https://dig.chouti.com/all/hot/recent/92>
<200 https://dig.chouti.com/all/hot/recent/94>
<200 https://dig.chouti.com/all/hot/recent/93>
<200 https://dig.chouti.com/all/hot/recent/96>
<200 https://dig.chouti.com/all/hot/recent/95>
<200 https://dig.chouti.com/all/hot/recent/98>
<200 https://dig.chouti.com/all/hot/recent/97>
<200 https://dig.chouti.com/all/hot/recent/100>
<200 https://dig.chouti.com/all/hot/recent/99>
<200 https://dig.chouti.com/all/hot/recent/102>
<200 https://dig.chouti.com/all/hot/recent/101>
<200 https://dig.chouti.com/all/hot/recent/103>
<200 https://dig.chouti.com/all/hot/recent/104>
<200 https://dig.chouti.com/all/hot/recent/105>
<200 https://dig.chouti.com/all/hot/recent/106>
<200 https://dig.chouti.com/all/hot/recent/107>
<200 https://dig.chouti.com/all/hot/recent/108>
<200 https://dig.chouti.com/all/hot/recent/109>
<200 https://dig.chouti.com/all/hot/recent/110>
<200 https://dig.chouti.com/all/hot/recent/111>
<200 https://dig.chouti.com/all/hot/recent/112>
<200 https://dig.chouti.com/all/hot/recent/113>
<200 https://dig.chouti.com/all/hot/recent/114>
<200 https://dig.chouti.com/all/hot/recent/115>
<200 https://dig.chouti.com/all/hot/recent/116>
<200 https://dig.chouti.com/all/hot/recent/118>
<200 https://dig.chouti.com/all/hot/recent/117>
<200 https://dig.chouti.com/all/hot/recent/119>
<200 https://dig.chouti.com/all/hot/recent/120>
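The difference between the two runs can be reproduced with a small standard-library simulation. This is a toy model, not Scrapy itself: `fake_html`, `crawl`, `LAST_PAGE` and `WINDOW` are hypothetical names, and the assumption is that each page only shows pagination links to the next few pages.

```python
import re
from collections import deque

ALLOW = re.compile(r'/all/hot/recent/\d+')  # same pattern as the spider's allow
LAST_PAGE = 12  # hypothetical: number of pages on the toy site
WINDOW = 3      # hypothetical: pagination links shown on each page


def fake_html(url):
    """Return toy HTML for a URL with links to the next WINDOW pages."""
    match = re.search(r'/all/hot/recent/(\d+)$', url)
    current = int(match.group(1)) if match else 0  # the start page counts as page 0
    pages = range(current + 1, min(current + WINDOW, LAST_PAGE) + 1)
    return ''.join(f'<a href="/all/hot/recent/{n}">{n}</a>' for n in pages)


def crawl(start_url, follow):
    """Collect the URLs a rule would request, with or without follow."""
    seen = set()
    crawled = []
    # Links on the start page are always extracted.
    queue = deque(ALLOW.findall(fake_html(start_url)))
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # duplicate requests are dropped
        seen.add(url)
        crawled.append(url)
        if follow:
            # follow=True: run the extractor again on the page just fetched.
            queue.extend(ALLOW.findall(fake_html(url)))
    return crawled


print(len(crawl('/', follow=False)))  # 3
print(len(crawl('/', follow=True)))   # 12
```

With `follow=False` only the links visible on the start page are requested; with `follow=True` each fetched page contributes new links until no unseen ones remain. This matches the pattern above, where Example 1 stops at page 10 and Example 2 continues to page 120.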
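Note:
To process the crawled page data further, extract the fields with XPath in the callback and yield them to an item pipeline, which then performs the storage.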