Python web scraping: the User-Agent request header
A crawler fetches site content automatically, but at its core it is just a piece of code, not a real browser user. Adding a User-Agent (UA) header lets a request masquerade as one coming from a browser. A single "user" that hits the same site frequently is still easy to spot, though, and since we can impersonate one browser, we can just as easily rotate the UA string and change our apparent identity from request to request.
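Before putting on any disguise, it is worth seeing what urllib announces by default: the opener identifies itself as Python-urllib/3.x, which is exactly the kind of signature anti-scraping checks look for. A minimal sketch (standard library only, no request is actually sent):

```python
import urllib.request

# build_opener() pre-populates addheaders with urllib's default identity,
# e.g. [('User-agent', 'Python-urllib/3.11')] -- a dead giveaway to servers.
default_headers = urllib.request.build_opener().addheaders
print(default_headers)

# Replacing addheaders swaps that default for a browser-like UA.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')]
print(opener.addheaders)
```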
A selection of UA strings
Opera
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60
Opera/8.0 (Windows NT 5.1; U; en)
Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50
Firefox
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0
Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10
Safari
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2
Chrome
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16
360
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Taobao browser
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11
Liebao (Cheetah) browser
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)
QQ browser
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)
Sogou browser
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)
Maxthon browser
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36
UC browser
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36
Source of the UA strings: https://blog.csdn.net/tao_627/article/details/42297443
How to add a User-Agent
There are three ways to add a UA: 1. pass it when instantiating the Request class; 2. add it dynamically with the Request instance method add_header(); 3. create an opener and assign the headers to opener.addheaders.
Method 1
import urllib.request


def load_message():
    url = 'https://www.baidu.com'

    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36',
        'test': '123'
    }

    request = urllib.request.Request(url, headers=header)  # attach the request headers

    response = urllib.request.urlopen(request)
    response_str = response.read().decode('utf-8')

    request_header_get = request.get_header('User-agent')  # pitfall: only the first letter may be upper-case ('User-agent'), any other spelling returns None
    print(request_header_get)  # how to read back a single request header

    # request_header_get = request.get_header('Test')
    # print(request_header_get)
    #
    # request_header_get = request.get_header('test')
    # print(request_header_get)

    return response.headers, request.headers, response_str


response_header, request_header, response_data = load_message()
print(response_header)
print('------------------------------------')
print(request_header)
print('------------------------------------')
print(response_data)
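The 'User-agent' capitalization pitfall flagged in the comment above can be verified without any network access: Request stores header keys via str.capitalize(), so only the first letter is upper-cased, while get_header() looks the key up verbatim. A quick sketch (example.com is only a placeholder URL; nothing is fetched):

```python
import urllib.request

request = urllib.request.Request('http://example.com')
request.add_header('user-agent', 'TestUA/1.0')

# Internally the key is stored as 'user-agent'.capitalize() == 'User-agent'
print(request.get_header('User-agent'))   # 'TestUA/1.0'
print(request.get_header('User-Agent'))   # None -- exact-match lookup fails
```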
Method 2
import urllib.request


def load_message():
    url = 'https://www.baidu.com'

    request = urllib.request.Request(url)

    request.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36')  # add the header dynamically

    response = urllib.request.urlopen(request)
    response_str = response.read().decode('utf-8')

    request_header_get = request.get_header('User-agent')  # pitfall: only the first letter may be upper-case ('User-agent'), any other spelling returns None
    print(request_header_get)  # how to read back a single request header

    return response.headers, request.headers, response_str


response_header, request_header, response_data = load_message()
print(response_header)
print('------------------------------------')
print(request_header)
print('------------------------------------')
print(response_data)
Method 3
import urllib.request


url = "http://blog.csdn.net/weiwei_pig/article/details/51178226"
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")

opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
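A possible follow-up to Method 3 (a sketch, not part of the original post): urllib.request.install_opener() makes the customized opener the process-wide default, so every later urlopen() call carries the same headers without the opener being passed around explicitly.

```python
import urllib.request

headers = ('User-Agent',
           'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
           '(KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36')

opener = urllib.request.build_opener()
opener.addheaders = [headers]

# From here on, plain urllib.request.urlopen(...) uses this opener (and UA).
urllib.request.install_opener(opener)
```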
Example: adding a random UA to the request headers
#!/usr/bin/env python
# -*- coding=utf-8 -*-
# Author: Snow

import urllib.request
import random


def random_agent():
    url = 'https://www.baidu.com'

    user_agent_list = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11'
    ]

    user_agent_value = random.choice(user_agent_list)  # pick one UA at random

    request = urllib.request.Request(url)
    request.add_header('User-Agent', user_agent_value)

    request_user_agent = request.get_header('User-agent')

    response = urllib.request.urlopen(request)
    response_str = response.read().decode('utf-8')

    return request_user_agent, response_str


cat_user_agent, _ = random_agent()
print(cat_user_agent)
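One caveat with random.choice: it can pick the same UA several times in a row. If every request should use a different UA until the list wraps around, itertools.cycle is a simple alternative (a sketch using a shortened UA list from the collection above):

```python
import itertools

user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
]

# cycle() yields the UAs in order and restarts from the top when exhausted.
ua_pool = itertools.cycle(user_agent_list)

first_five = [next(ua_pool) for _ in range(5)]
print(first_five)
```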