• Python Crawlers: User-Agent Information

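      A crawler fetches website content automatically. In reality it is just a piece of code, not a real browser user. Adding User-Agent (UA) information merely lets the crawler pose as a browser user when it visits a site. However, a single user visiting one site at high frequency is easy to detect. Since we can pose as a browser, we can equally use the UA string to vary the identity we present.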
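To see what the disguise replaces, the short sketch below inspects the header that urllib attaches when no UA is set. It assumes the standard library's urllib.request, whose default opener carries a User-agent entry starting with 'Python-urllib/'; the variable names are only illustrative, and no network request is made.

```python
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# headers that are sent when a request does not set them itself.
default_opener = urllib.request.build_opener()
default_headers = dict(default_opener.addheaders)

# The default value identifies the client as a Python script, not a browser.
default_ua = default_headers['User-agent']
print(default_ua)
```

A server that filters on the UA can recognise this value at a glance, which is why the methods in this article override it.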

      A selection of UA strings

    Opera
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60
    Opera/8.0 (Windows NT 5.1; U; en)
    Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50
    Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50

    Firefox
    Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0
    Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10

    Safari
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 

    Chrome
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
    Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16

    360
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
    Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko

    Taobao Browser
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11

    Cheetah Browser (LBBROWSER)
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER
    Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER) 
    Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)

    QQ Browser
    Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)
    Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E) 

    Sogou Browser
    Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0
    Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) 

    Maxthon Browser
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36

    UC Browser
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36

    ------ Source of the UA strings: https://blog.csdn.net/tao_627/article/details/42297443 ------

     Ways to add a User-Agent

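      There are three ways to add a UA: 1. pass it when instantiating the Request class; 2. call the Request instance method add_header() to add it dynamically; 3. create an opener and assign to opener.addheaders.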

      Method 1

    import urllib.request


    def load_message():
        url = 'https://www.baidu.com'

        header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36',
            'test': '123'
        }

        request = urllib.request.Request(url, headers=header)  # attach the request headers

        response = urllib.request.urlopen(request)
        response_str = response.read().decode('utf-8')

        # Pitfall: Request stores header names via str.capitalize(), so the lookup key
        # must have only its first letter capitalized; any other casing returns None.
        request_header_get = request.get_header('User-agent')
        print(request_header_get)  # how to read one specific request header

        # request_header_get = request.get_header('Test')
        # print(request_header_get)
        #
        # request_header_get = request.get_header('test')
        # print(request_header_get)

        return response.headers, request.headers, response_str


    response_header, request_header, response_data = load_message()
    print(response_header)
    print('------------------------------------')
    print(request_header)
    print('------------------------------------')
    print(response_data)
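The capitalization pitfall mentioned in the comment can be checked offline. The sketch below builds a Request without sending it and looks at how the header names are stored; the UA value 'demo-agent/1.0' and the header 'x-custom-header' are made-up placeholders, not values from the article.

```python
import urllib.request

# Header names are written in different casings on purpose.
request = urllib.request.Request(
    'https://www.baidu.com',
    headers={'user-agent': 'demo-agent/1.0', 'TEST': '123'},  # placeholder values
)
request.add_header('x-custom-header', 'abc')  # placeholder header

# Request stores every name through str.capitalize():
# first character upper-case, everything else lower-case.
stored_names = sorted(request.headers)
print(stored_names)

# Only the capitalized spelling is found; the other spellings give None.
print(request.get_header('User-agent'))
print(request.get_header('user-agent'))
print(request.get_header('User-Agent'))

# Capitalizing the name before the lookup sidesteps the pitfall.
lookup_name = 'USER-AGENT'.capitalize()
print(lookup_name, request.has_header(lookup_name))
```

Normalizing the key with str.capitalize() before calling get_header() or has_header() means the caller does not have to remember the stored spelling.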

      Method 2

    import urllib.request


    def load_message():
        url = 'https://www.baidu.com'

        request = urllib.request.Request(url)

        # add the request header dynamically
        request.add_header('user-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36')

        response = urllib.request.urlopen(request)
        response_str = response.read().decode('utf-8')

        # Pitfall: the lookup key must have only its first letter capitalized; any other casing returns None.
        request_header_get = request.get_header('User-agent')
        print(request_header_get)  # how to read one specific request header

        return response.headers, request.headers, response_str


    response_header, request_header, response_data = load_message()
    print(response_header)
    print('------------------------------------')
    print(request_header)
    print('------------------------------------')
    print(response_data)

    Method 3

    import urllib.request


    url = "http://blog.csdn.net/weiwei_pig/article/details/51178226"
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")

    opener = urllib.request.build_opener()  # create an opener
    opener.addheaders = [headers]  # replace the opener's default headers with the UA
    data = opener.open(url).read()  # send the request through the opener; returns bytes
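If the opener should also apply to later urllib.request.urlopen() calls, it can be installed globally with urllib.request.install_opener(). The sketch below only builds and installs the opener and sends nothing; the helper name make_opener is a hypothetical name chosen for illustration.

```python
import urllib.request

UA = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')


def make_opener(user_agent):
    """Return an opener whose default headers carry the given UA (hypothetical helper)."""
    opener = urllib.request.build_opener()
    # Assigning a new list replaces the default 'Python-urllib/x.y' entry.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


opener = make_opener(UA)

# After this call, urllib.request.urlopen() goes through this opener as well.
urllib.request.install_opener(opener)

print(opener.addheaders)
```

Installing the opener suits scripts that call urlopen() in many places; calling opener.open() directly, as in Method 3, keeps the change local.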

    Example: adding a random UA to the request headers

    #!/usr/bin/env python
    # -*- coding=utf-8 -*-
    # Author: Snow

    import urllib.request
    import random


    def random_agent():
        url = 'https://www.baidu.com'

        user_agent_list = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
            'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11'
        ]

        # pick one UA at random for this request
        user_agent_value = random.choice(user_agent_list)

        request = urllib.request.Request(url)
        request.add_header('User-Agent', user_agent_value)

        # read the UA back; the lookup key has only its first letter capitalized
        request_user_agent = request.get_header('User-agent')

        response = urllib.request.urlopen(request)
        response_str = response.read().decode('utf-8')

        return request_user_agent, response_str


    cat_user_agent, _ = random_agent()
    print(cat_user_agent)
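The same idea extends to a batch of pages: build a fresh Request per URL so the UA can differ from one request to the next. The following is a minimal offline sketch; the helper build_request and the '?page=N' URLs are hypothetical placeholders, and the actual urlopen() call is left out so the example runs without a network.

```python
import random
import urllib.request

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
]


def build_request(url, agents=USER_AGENTS, rng=random):
    """Return a Request carrying a UA picked at random (hypothetical helper)."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', rng.choice(agents))
    return request


# Placeholder URLs; each one gets its own Request and its own random UA.
urls = ['https://www.baidu.com/?page=%d' % n for n in range(1, 4)]
requests_to_send = [build_request(u) for u in urls]

for req in requests_to_send:
    print(req.full_url, '->', req.get_header('User-agent'))
    # urllib.request.urlopen(req) would send it; omitted to keep the sketch offline.
```

Passing rng as a parameter lets a seeded random.Random instance be supplied when a reproducible sequence of UAs is wanted.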

     

  • Original article: https://www.cnblogs.com/snow-lanuage/p/10362287.html