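Heibanke (黑板客) Crawler Challenge: Level 1
Analysis: read the address of the next page from the start page and jump to it, then read the next address from that page, and repeat until the level is cleared.
Challenge URL: http://www.heibanke.com/lesson/crawler_ex00/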
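Note the difference between the two kinds of page: on the start page the number follows '数字' directly, while on every follow-up page the source has an extra '是', so the number follows '数字是'.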
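The difference between the two kinds of page can be checked offline with a small sketch. The sample strings below are hypothetical stand-ins for the page text (only the '数字' and '数字是' prefixes come from the patterns used in this post), and `extract_number` is an illustrative helper name. Making '是' optional is one possible way to let a single pattern cover both kinds of page.

```python
import re

# Optional '是' so the same pattern matches the start page and the follow-up pages.
NUMBER_PATTERN = re.compile(r'数字是?([0-9]{5})')


def extract_number(text):
    """Return the first 5-digit number after '数字' or '数字是', or None if absent."""
    match = NUMBER_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical page fragments, not copied from the real site.
start_page = '<h3>数字65392</h3>'
next_page = '<h3>数字是36133</h3>'
last_page = '<h3>no number here</h3>'

for page in (start_page, next_page, last_page):
    print(extract_number(page))
```

Returning None on a page without a number gives the caller an explicit stop signal instead of relying on an exception.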
import re
import datetime
import requests


def Go1(url, i):
    # Start page: the number follows '数字' directly.
    headers = {'authorization': 'Client-ID c94869b36aa272dd62dfaeefed769d4115fb3189a9d1ec88ed457207747be626'}
    html = requests.get(url=url, headers=headers)
    text = html.text
    number = re.findall(r'数字([0-9]{5})', text)  # match the 5-digit number
    url = url + number[0]
    print(url + ' ' + str(i))
    return url


def Go2(url, i):
    headers = {'authorization': 'Client-ID c94869b36aa272dd62dfaeefed769d4115fb3189a9d1ec88ed457207747be626'}
    html = requests.get(url=url, headers=headers)
    text = html.text
    # This is where the follow-up pages differ from the start page:
    # their source has an extra '是' before the number.
    number = re.findall(r'数字是([0-9]{5})', text)
    url = 'http://www.heibanke.com/lesson/crawler_ex00/' + number[0]
    print(url + ' ' + str(i))
    return url


def main():
    i = 1
    url = 'http://www.heibanke.com/lesson/crawler_ex00/'
    begin_time = datetime.datetime.now()
    url = Go1(url, i)
    while True:
        i = i + 1
        try:
            url = Go2(url, i)
        except IndexError:
            # No number found on the page, so this is the final page.
            print('Final page URL: ' + url)
            print('Elapsed time: ' + str(datetime.datetime.now() - begin_time))
            break


main()

"""
Result:
http://www.heibanke.com/lesson/crawler_ex00/65392 1
http://www.heibanke.com/lesson/crawler_ex00/36133 2
http://www.heibanke.com/lesson/crawler_ex00/72324 3
http://www.heibanke.com/lesson/crawler_ex00/57633 4
http://www.heibanke.com/lesson/crawler_ex00/91251 5
http://www.heibanke.com/lesson/crawler_ex00/87016 6
http://www.heibanke.com/lesson/crawler_ex00/77055 7
http://www.heibanke.com/lesson/crawler_ex00/30366 8
http://www.heibanke.com/lesson/crawler_ex00/83679 9
http://www.heibanke.com/lesson/crawler_ex00/31388 10
http://www.heibanke.com/lesson/crawler_ex00/99446 11
http://www.heibanke.com/lesson/crawler_ex00/69428 12
http://www.heibanke.com/lesson/crawler_ex00/34798 13
http://www.heibanke.com/lesson/crawler_ex00/16780 14
http://www.heibanke.com/lesson/crawler_ex00/36499 15
http://www.heibanke.com/lesson/crawler_ex00/21070 16
http://www.heibanke.com/lesson/crawler_ex00/96749 17
http://www.heibanke.com/lesson/crawler_ex00/71822 18
http://www.heibanke.com/lesson/crawler_ex00/48739 19
http://www.heibanke.com/lesson/crawler_ex00/62816 20
http://www.heibanke.com/lesson/crawler_ex00/80182 21
http://www.heibanke.com/lesson/crawler_ex00/68171 22
http://www.heibanke.com/lesson/crawler_ex00/45458 23
http://www.heibanke.com/lesson/crawler_ex00/56056 24
http://www.heibanke.com/lesson/crawler_ex00/87450 25
http://www.heibanke.com/lesson/crawler_ex00/52695 26
http://www.heibanke.com/lesson/crawler_ex00/36675 27
http://www.heibanke.com/lesson/crawler_ex00/25997 28
http://www.heibanke.com/lesson/crawler_ex00/73222 29
http://www.heibanke.com/lesson/crawler_ex00/93891 30
http://www.heibanke.com/lesson/crawler_ex00/29052 31
http://www.heibanke.com/lesson/crawler_ex00/72996 32
http://www.heibanke.com/lesson/crawler_ex00/73999 33
http://www.heibanke.com/lesson/crawler_ex00/23814 34
http://www.heibanke.com/lesson/crawler_ex00/98084 35
http://www.heibanke.com/lesson/crawler_ex00/51103 36
http://www.heibanke.com/lesson/crawler_ex00/39603 37
http://www.heibanke.com/lesson/crawler_ex00/34316 38
http://www.heibanke.com/lesson/crawler_ex00/55719 39
http://www.heibanke.com/lesson/crawler_ex00/53685 40
http://www.heibanke.com/lesson/crawler_ex00/77771 41
http://www.heibanke.com/lesson/crawler_ex00/69187 42
http://www.heibanke.com/lesson/crawler_ex00/89677 43
http://www.heibanke.com/lesson/crawler_ex00/71935 44
http://www.heibanke.com/lesson/crawler_ex00/98538 45
http://www.heibanke.com/lesson/crawler_ex00/79152 46
http://www.heibanke.com/lesson/crawler_ex00/70999 47
http://www.heibanke.com/lesson/crawler_ex00/35102 48
http://www.heibanke.com/lesson/crawler_ex00/75956 49
http://www.heibanke.com/lesson/crawler_ex00/19122 50
Final page URL: http://www.heibanke.com/lesson/crawler_ex00/19122
Elapsed time: 0:01:40.219459
"""
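As a possible variation, the jump-and-repeat loop can be written once and tested without touching the network. This is a minimal sketch, not the approach used in the post: `crawl`, its `fetch` parameter, the `max_steps` guard, and the fake pages are all illustrative assumptions, and the combined pattern with an optional '是' is a substitute for the two separate patterns used in the post.

```python
import re

BASE_URL = 'http://www.heibanke.com/lesson/crawler_ex00/'
# Optional '是' covers both the start page and the follow-up pages.
NUMBER_PATTERN = re.compile(r'数字是?([0-9]{5})')


def crawl(fetch, base_url=BASE_URL, max_steps=1000):
    """Follow the chain of numbers until a page contains none.

    fetch is any callable that maps a URL to the page text.
    Returns the visited URLs in order, excluding the start page.
    """
    visited = []
    url = base_url
    for _ in range(max_steps):  # guard against an endless chain
        match = NUMBER_PATTERN.search(fetch(url))
        if match is None:
            break  # no number on this page: it is the final one
        url = base_url + match.group(1)
        visited.append(url)
    return visited


# Offline demonstration with a fake three-page site (hypothetical content).
fake_pages = {
    BASE_URL: '数字11111',
    BASE_URL + '11111': '数字是22222',
    BASE_URL + '22222': 'done',
}
print(crawl(fake_pages.__getitem__))
```

To run it against the real site with the `requests` library installed, something like `crawl(lambda u: requests.get(u, timeout=10).text)` should work; the timeout value is an arbitrary choice.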