黑板客 (Heibanke) Crawler Challenge: Level 1

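    Analysis: starting from the initial page, extract the address of the next page and jump to it; on that page, extract the address of the following page in the same way, and repeat until the level is cleared.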
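    The loop described above can be sketched independently of the network. This is only an illustration under stated assumptions: `follow_chain`, `extract_number`, the `example.invalid` base URL and the made-up pages in `FAKE_PAGES` are all hypothetical and not part of the original post. `fetch` stands in for whatever downloads a page (for example a wrapper around `requests.get(url).text`).

```python
import re


def follow_chain(base_url, fetch, extract, max_steps=1000):
    """Follow a chain of pages and return the URLs that were jumped to.

    fetch:   callable mapping a URL to the page text (hypothetical helper).
    extract: callable returning the next number as a string, or None
             when the page contains no number (end of the chain).
    """
    visited = []
    url = base_url
    for _ in range(max_steps):      # hard cap so a loop in the chain cannot run forever
        number = extract(fetch(url))
        if number is None:          # no further number: the chain is finished
            break
        url = base_url + number     # the next page lives at base_url + number
        visited.append(url)
    return visited


# Made-up pages standing in for the real site (assumption, for illustration only).
BASE = 'http://example.invalid/'
FAKE_PAGES = {
    BASE: 'next: 11111',
    BASE + '11111': 'next: 22222',
    BASE + '22222': 'done',
}


def extract_number(text):
    """Return the first five-digit run in text, or None if there is none."""
    match = re.search(r'[0-9]{5}', text)
    return match.group(0) if match else None


chain = follow_chain(BASE, FAKE_PAGES.get, extract_number)
print(chain)
```

    Passing `fetch` and `extract` in as arguments keeps the traversal logic testable without touching the real site; the `max_steps` cap is an extra safeguard that the original script does not have.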

    Challenge URL: http://www.heibanke.com/lesson/crawler_ex00/

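     Note the difference between the two functions: on the start page the number directly follows '数字', while on every jump page it follows '数字是', so Go1 and Go2 use different regular expressions.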

    import re
    import datetime
    import requests

    BASE_URL = 'http://www.heibanke.com/lesson/crawler_ex00/'
    # Request header carried over unchanged from the original script.
    HEADERS = {'authorization': 'Client-ID c94869b36aa272dd62dfaeefed769d4115fb3189a9d1ec88ed457207747be626'}


    def Go1(url, i):
        """Fetch the start page and return the URL of the first jump page."""
        html = requests.get(url=url, headers=HEADERS)
        text = html.text
        # On the start page the number directly follows '数字'.
        number = re.findall(r'数字([0-9]{5})', text)
        url = url + number[0]
        print(url + '     ' + str(i))
        return url


    def Go2(url, i):
        """Fetch a jump page and return the URL of the next jump page."""
        html = requests.get(url=url, headers=HEADERS)
        text = html.text
        # Jump pages differ from the start page: the page source has an extra '是'.
        number = re.findall(r'数字是([0-9]{5})', text)
        url = BASE_URL + number[0]
        print(url + '     ' + str(i))
        return url


    def main():
        i = 1
        url = BASE_URL
        begin_time = datetime.datetime.now()
        url = Go1(url, i)
        while True:
            i = i + 1
            try:
                url = Go2(url, i)
            except (IndexError, requests.RequestException):
                # No number found (last page reached) or the request failed: stop.
                print('Final page URL: ' + url)
                print('Elapsed time: ' + str(datetime.datetime.now() - begin_time))
                break


    if __name__ == '__main__':
        main()

    """
    Result:
    http://www.heibanke.com/lesson/crawler_ex00/65392     1
    http://www.heibanke.com/lesson/crawler_ex00/36133     2
    http://www.heibanke.com/lesson/crawler_ex00/72324     3
    http://www.heibanke.com/lesson/crawler_ex00/57633     4
    http://www.heibanke.com/lesson/crawler_ex00/91251     5
    http://www.heibanke.com/lesson/crawler_ex00/87016     6
    http://www.heibanke.com/lesson/crawler_ex00/77055     7
    http://www.heibanke.com/lesson/crawler_ex00/30366     8
    http://www.heibanke.com/lesson/crawler_ex00/83679     9
    http://www.heibanke.com/lesson/crawler_ex00/31388     10
    http://www.heibanke.com/lesson/crawler_ex00/99446     11
    http://www.heibanke.com/lesson/crawler_ex00/69428     12
    http://www.heibanke.com/lesson/crawler_ex00/34798     13
    http://www.heibanke.com/lesson/crawler_ex00/16780     14
    http://www.heibanke.com/lesson/crawler_ex00/36499     15
    http://www.heibanke.com/lesson/crawler_ex00/21070     16
    http://www.heibanke.com/lesson/crawler_ex00/96749     17
    http://www.heibanke.com/lesson/crawler_ex00/71822     18
    http://www.heibanke.com/lesson/crawler_ex00/48739     19
    http://www.heibanke.com/lesson/crawler_ex00/62816     20
    http://www.heibanke.com/lesson/crawler_ex00/80182     21
    http://www.heibanke.com/lesson/crawler_ex00/68171     22
    http://www.heibanke.com/lesson/crawler_ex00/45458     23
    http://www.heibanke.com/lesson/crawler_ex00/56056     24
    http://www.heibanke.com/lesson/crawler_ex00/87450     25
    http://www.heibanke.com/lesson/crawler_ex00/52695     26
    http://www.heibanke.com/lesson/crawler_ex00/36675     27
    http://www.heibanke.com/lesson/crawler_ex00/25997     28
    http://www.heibanke.com/lesson/crawler_ex00/73222     29
    http://www.heibanke.com/lesson/crawler_ex00/93891     30
    http://www.heibanke.com/lesson/crawler_ex00/29052     31
    http://www.heibanke.com/lesson/crawler_ex00/72996     32
    http://www.heibanke.com/lesson/crawler_ex00/73999     33
    http://www.heibanke.com/lesson/crawler_ex00/23814     34
    http://www.heibanke.com/lesson/crawler_ex00/98084     35
    http://www.heibanke.com/lesson/crawler_ex00/51103     36
    http://www.heibanke.com/lesson/crawler_ex00/39603     37
    http://www.heibanke.com/lesson/crawler_ex00/34316     38
    http://www.heibanke.com/lesson/crawler_ex00/55719     39
    http://www.heibanke.com/lesson/crawler_ex00/53685     40
    http://www.heibanke.com/lesson/crawler_ex00/77771     41
    http://www.heibanke.com/lesson/crawler_ex00/69187     42
    http://www.heibanke.com/lesson/crawler_ex00/89677     43
    http://www.heibanke.com/lesson/crawler_ex00/71935     44
    http://www.heibanke.com/lesson/crawler_ex00/98538     45
    http://www.heibanke.com/lesson/crawler_ex00/79152     46
    http://www.heibanke.com/lesson/crawler_ex00/70999     47
    http://www.heibanke.com/lesson/crawler_ex00/35102     48
    http://www.heibanke.com/lesson/crawler_ex00/75956     49
    http://www.heibanke.com/lesson/crawler_ex00/19122     50
    Final page URL: http://www.heibanke.com/lesson/crawler_ex00/19122
    Elapsed time: 0:01:40.219459
    """
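
    The two patterns in the listing differ only by the character '是'. As a possible simplification, a single pattern with an optional '是' might cover both page types, which would let one function replace Go1 and Go2. The sketch below checks this on two sample strings; `start_text` and `jump_text` are made-up stand-ins, not the site's actual page source, and `EITHER_PATTERN` is a name introduced here for illustration.

```python
import re

START_PATTERN = re.compile(r'数字([0-9]{5})')     # pattern used in Go1
JUMP_PATTERN = re.compile(r'数字是([0-9]{5})')    # pattern used in Go2
EITHER_PATTERN = re.compile(r'数字是?([0-9]{5})')  # '是' made optional (assumption)

# Hypothetical sample texts; the real pages may be worded differently.
start_text = '请输入数字12345'
jump_text = '下一个数字是67890'

# Each original pattern only matches its own page type.
start_on_start = START_PATTERN.findall(start_text)
start_on_jump = START_PATTERN.findall(jump_text)
jump_on_start = JUMP_PATTERN.findall(start_text)
jump_on_jump = JUMP_PATTERN.findall(jump_text)

# The combined pattern matches both.
either_on_start = EITHER_PATTERN.findall(start_text)
either_on_jump = EITHER_PATTERN.findall(jump_text)

print(start_on_start, start_on_jump)
print(jump_on_start, jump_on_jump)
print(either_on_start, either_on_jump)
```

    Whether the combined pattern is safe on the real site depends on the actual page text, so it is worth verifying against the live pages before relying on it.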
  • Original post: https://www.cnblogs.com/yinbiao/p/8145547.html