Python Scrapy Crawler Framework (3) -- Anti-Crawler


    Scraping proxies

    For detailed usage of urllib in Python 3 (headers, proxies, timeouts, authentication, exception handling), see https://www.cnblogs.com/ifso/p/4707135.html
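    As a quick reminder of the pieces used below, here is a minimal urllib sketch combining a header, a proxy, a timeout, and exception handling (the proxy address and user-agent string are placeholders):

    import urllib.request
    import urllib.error

    # route traffic through a proxy (placeholder address)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'}))

    # disguise the request as coming from a browser
    request = urllib.request.Request('http://www.baidu.com/',
                                     headers={'User-Agent': 'Mozilla/5.0'})

    try:
        response = opener.open(request, timeout=3)
        html = response.read().decode('utf-8')
    except urllib.error.URLError as e:
        print('request failed:', e.reason)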

    Validating proxies

     1 import urllib.request
     2 import re
     3 import threading
     4 
     5 
     6 class TestProxy(object):
     7     def __init__(self):
     8         self.sFile = r'proxy.txt'              # source: one proxy per line, "ip<TAB>port<TAB>protocol"
     9         self.dFile = r'alive.txt'              # destination: proxies that pass the check
    10         self.URL = r'http://www.baidu.com/'    # test URL
    11         self.threads = 10
    12         self.timeout = 3
    13         self.regex = re.compile(r'baidu.com')  # pattern expected in the response body
    14         self.aliveList = []
    15 
    16         self.run()
    17 
    18     def run(self):
    19         with open(self.sFile, 'r') as fp:
    20             lines = fp.readlines()
    21             line = lines.pop()
    22             while lines:
    23                 for i in range(self.threads):  # check up to self.threads proxies concurrently
    24                     t = threading.Thread(target=self.linkWithProxy, args=(line,))
    25                     t.start()
    26                     if lines:
    27                         line = lines.pop()
    28                     else:
    29                         break
    30 
    31         with open(self.dFile, 'w') as fp:      # note: the worker threads are not joined, so slow proxies may be missed
    32             for i in range(len(self.aliveList)):
    33                 fp.write(self.aliveList[i])
    34 
    35 
    36 
    37     def linkWithProxy(self, line):
    38         lineList = line.split('\t')
    39         protocol = lineList[2].lower()
    40         server = protocol + r'://' + lineList[0] + ':' + lineList[1]
    41         opener = urllib.request.build_opener(urllib.request.ProxyHandler({protocol: server}))
    42         urllib.request.install_opener(opener)
    43         try:
    44             response = urllib.request.urlopen(self.URL, timeout=self.timeout)
    45         except:
    46             print('%s connect failed' % server)
    47             return
    48         else:
    49             try:
    50                 strli = response.read()
    51             except:
    52                 print('%s connect failed' % server)
    53                 return
    54             if self.regex.search(strli):
    55                 print('%s connect success ..........' % server)
    56                 self.aliveList.append(line)
    57 
    58 if __name__ == '__main__':
    59     TP = TestProxy()

    Running this, line 50 raises an error: TypeError: cannot use a string pattern on a bytes-like object

    Change it to strli = response.read().decode('utf-8')
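    The error occurs because urlopen(...).read() returns bytes, while self.regex was compiled from a str pattern. An alternative to decoding is to compile a bytes pattern instead; a small self-contained illustration (the sample data is assumed):

    import re

    data = b'<html>baidu.com</html>'           # what response.read() returns

    str_pattern = re.compile(r'baidu.com')     # str pattern: needs decoded text
    bytes_pattern = re.compile(rb'baidu.com')  # bytes pattern: matches raw bytes

    # str_pattern.search(data) would raise the TypeError above
    assert bytes_pattern.search(data)                # works without decoding
    assert str_pattern.search(data.decode('utf-8'))  # the fix used above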

    Anti-crawler measures

    1. The robots protocol: when a crawler visits a site, it checks whether robots.txt exists at the site root; if it does, the crawler is expected to restrict its access scope according to the file's contents.

          Workarounds: A. Disguise the crawler as a browser   B. In the settings file, set ROBOTSTXT_OBEY = False (see the sketch below)
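    A minimal settings.py sketch covering both points (the user-agent string is only an example):

    # settings.py
    ROBOTSTXT_OBEY = False   # stop Scrapy from honouring robots.txt

    # disguise the crawler as an ordinary browser
    USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/80.0.3987.132 Safari/537.36')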

    2. Abnormal IP traffic: when a site detects an abnormal surge of traffic from a single IP, it bans that IP.

         Workarounds: A. Increase the crawl interval and randomize it   B. Change the IP (see the sketch below)
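    A sketch of both workarounds in Scrapy (the proxy address is a placeholder and the middleware name is hypothetical):

    # settings.py -- slow down and randomize requests
    DOWNLOAD_DELAY = 2               # base delay between requests, in seconds
    RANDOMIZE_DOWNLOAD_DELAY = True  # actual delay varies between 0.5x and 1.5x of the base

    # middlewares.py -- change IP by routing each request through a proxy
    class RandomProxyMiddleware(object):
        def process_request(self, request, spider):
            # Scrapy's built-in HttpProxyMiddleware honours meta['proxy']
            request.meta['proxy'] = 'http://127.0.0.1:8080'

    The middleware still has to be registered under DOWNLOADER_MIDDLEWARES in settings.py before Scrapy will call it.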

     

    Original post: https://www.cnblogs.com/Hyacinth-Yuan/p/8005730.html