• Web crawlers: "doing more with less" via keep-alive


    Background

    A crawler issues many requests per unit of time, which puts pressure on both its own machine and the target server, all the more so if every request opens a new connection. If the server supports keep-alive, the crawler can share one connection across multiple requests and "do more with less": fewer connections are opened and closed per unit of time, yet more effective requests get through, and the load imposed on the target server drops as well.

    The benefits of keep-alive (HTTP persistent connections):

    • Lower CPU and memory usage (because fewer connections are open simultaneously).
    • Enables HTTP pipelining of requests and responses.
    • Reduced network congestion (fewer TCP connections).
    • Reduced latency in subsequent requests (no handshaking).
    • Errors can be reported without the penalty of closing the TCP connection.
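    The reuse behind these benefits can be observed directly with the standard library: a minimal sketch that starts a throwaway local HTTP/1.1 server and sends two requests over one `http.client.HTTPConnection`, checking that the underlying socket object never changes (the handler class and paths here are illustrative, not from the original post).

    ```python
    import http.client
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        # HTTP/1.1 keeps connections open by default
        protocol_version = "HTTP/1.1"

        def do_GET(self):
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # silence per-request logging

    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # One persistent connection carries several requests: after the first
    # request, the underlying socket object stays the same.
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request("GET", "/page/1")
    conn.getresponse().read()
    first_socket = conn.sock

    conn.request("GET", "/page/2")
    resp = conn.getresponse()
    body = resp.read()
    same_socket = conn.sock is first_socket

    print("status:", resp.status, "reused socket:", same_socket)
    conn.close()
    server.shutdown()
    ```

    No TCP or TLS handshake happens for the second request, which is exactly the latency saving listed above.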

    Implementation

    HTTP clients vary in how completely they implement the HTTP protocol, and some do not support keep-alive at all. As far as Python goes:
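    The variation is easy to demonstrate with the standard library alone: `urllib.request.urlopen` sends `Connection: close` and opens a fresh TCP connection on every call, while `http.client.HTTPConnection` keeps one connection open across requests. A sketch (counting handler instances on a throwaway local server, one instance per TCP connection; all names here are illustrative):

    ```python
    import http.client
    import threading
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CountingHandler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"
        connections = 0  # one handler instance is created per TCP connection

        def setup(self):
            type(self).connections += 1
            super().setup()

        def do_GET(self):
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # silence per-request logging

    server = HTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port

    # urllib.request sends "Connection: close": two calls, two connections
    for path in ("/a", "/b"):
        urllib.request.urlopen(base + path).read()
    urllib_connections = CountingHandler.connections

    # http.client reuses the connection: two requests, one connection
    CountingHandler.connections = 0
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    for path in ("/a", "/b"):
        conn.request("GET", path)
        conn.getresponse().read()
    conn.close()
    persistent_connections = CountingHandler.connections

    print(urllib_connections, persistent_connections)
    server.shutdown()
    ```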

    Below is an implementation based on requests (reduced and abstracted from my own project code).

    import sys
    import time
    import requests
    
    def getSession():
        s = requests.Session()
        # mount both schemes so the https:// URLs below actually use this adapter
        adapter = requests.adapters.HTTPAdapter(pool_connections=1, pool_maxsize=1, max_retries=0, pool_block=False)
        s.mount('http://', adapter)
        s.mount('https://', adapter)
        return s
    
    def main(tasks):
        # start time of the current session
        st = time.time()
        # init the first session
        s = getSession()
        # init the keep-alive timeout value (seconds)
        kato = 5
        # count of finished tasks
        sn = 0
        # pop tasks instead of iterating, since failed tasks are requeued
        while tasks:
            task = tasks.pop(0)
            info = ""
            # use time of the current session
            ut = time.time() - st
            # rebuild the session once the keep-alive timeout has passed
            if ut >= kato:
                s = getSession()
                # reset the start time of the current session
                st = time.time()
            url = "https://www.example.com/%s" % task
            # rotate these to bypass anti-spider solutions
            headers = {'user-agent': "a new ua", "Cookie": "a new cookie id"}
            # get response
            try:
                r = s.get(url, headers=headers, allow_redirects=False)
                # e.g. "Keep-Alive: timeout=5, max=100"; keep the old value on failure
                try:
                    kato = int(r.headers["Keep-Alive"].split("timeout=")[1].split(",")[0]) - 3
                except (KeyError, IndexError, ValueError):
                    pass
            except requests.RequestException as e:
                tasks.insert(0, task)
                print(str(e))
                continue
            # handle the response according to the status_code, etc.
            if r.status_code == 404:
                pass
            elif r.status_code == 301:
                pass
            elif r.status_code == 200:
                info = "your info"
            elif r.status_code == 403:
                # anti-spider triggered: requeue, back off, start a fresh session
                tasks.insert(0, task)
                print("antispider triggered, will sleep for 5 minutes")
                time.sleep(300)
                s = getSession()
                st = time.time()
                continue
            else:
                print(task, r.status_code)
                sys.exit()
            sn += 1
            print("%s, %s, %s" % (sn, task, info))
    
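    The session lifetime above comes from the server's `Keep-Alive` response header (e.g. `Keep-Alive: timeout=5, max=100`). A defensive parser for that header could look like the sketch below (the helper name is my own, not part of the original code):

    ```python
    import re

    def parse_keep_alive_timeout(header_value, default=5):
        """Extract the timeout= parameter from a Keep-Alive header value.

        Returns `default` when the header is missing or carries no usable
        timeout parameter.
        """
        if not header_value:
            return default
        m = re.search(r"timeout\s*=\s*(\d+)", header_value)
        return int(m.group(1)) if m else default

    print(parse_keep_alive_timeout("timeout=5, max=100"))  # 5
    print(parse_keep_alive_timeout(None))                  # falls back to 5
    print(parse_keep_alive_timeout("max=100"))             # falls back to 5
    ```

    In the loop above, `kato = parse_keep_alive_timeout(r.headers.get("Keep-Alive")) - 3` would then replace the inline parsing.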
    
    This article was originally published at http://www.cnblogs.com/qijj; please keep this notice when reposting.
  • Original source: https://www.cnblogs.com/qijj/p/6344118.html