A Detailed Guide to the requests Module for Python Web Scraping

The requests module

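requests lets you simulate the requests a browser makes. Compared with urllib, which was used earlier, its API is more convenient (under the hood it is a wrapper around urllib3).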

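Note: after requests downloads a page, it does not execute any JavaScript. You have to analyze the target site yourself and then issue the additional requests the page would have made.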

Official documentation: http://cn.python-requests.org/zh_CN/latest/

Installation: pip3 install requests

Request methods in the requests module

The source is structured as follows:

# get, post, and the other convenience methods are all built on top of this method

requests.request(method, url, **kwargs)

The most commonly used of these are post and get. In essence, post and get are simply wrappers around request.

>>> r = requests.get('https://api.github.com/events')
# equivalent to requests.request(method='get', url='https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
# equivalent to requests.request(method='post', url='http://httpbin.org/post', data={'key': 'value'})
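
The wrapper relationship can be sketched without touching the network. The snippet below only illustrates the pattern and is not the library's actual source: `request` here is a stand-in that merely records its arguments, and `get` and `post` are hypothetical re-implementations that delegate to it.

```python
# Illustrative stand-in for requests.request: it only records what it was given.
def request(method, url, **kwargs):
    return {"method": method.upper(), "url": url, **kwargs}


# Convenience wrappers that fix the method and forward everything else.
def get(url, params=None, **kwargs):
    return request("get", url, params=params, **kwargs)


def post(url, data=None, json=None, **kwargs):
    return request("post", url, data=data, json=json, **kwargs)


via_wrapper = post("http://httpbin.org/post", data={"key": "value"})
via_request = request("post", "http://httpbin.org/post",
                      data={"key": "value"}, json=None)
print(via_wrapper == via_request)  # True
```

Both calls end up in the same function with the same arguments, which is why anything that works with the generic call also works with the shortcuts.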

The requests.request method in detail

Source of request()

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

Each parameter that appears in the source is examined below.

method and url

These specify the request method and the request URL.

requests.request(method='get', url='http://127.0.0.1:8000/test/')
requests.request(method='post', url='http://127.0.0.1:8000/test/')

params

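requests offers three ways to send parameters with a request: data, json, and params.

params is typically used with GET requests, while data and json are typically used with POST requests.

params accepts:

- a dict
- a string
- bytes (ASCII characters only)

Dicts and strings are automatically URL-encoded and appended to the URL, as shown below: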

import requests
wd='egon老师'
pn=1

response=requests.get('https://www.baidu.com/s',
                      params={
                          'wd':wd,
                          'pn':pn
                      },
                      headers={
                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                      })
print(response.url)
# Output: https://www.baidu.com/s?wd=egon%E8%80%81%E5%B8%88&pn=1
# As you can see, the URL has been encoded automatically

The code above is equivalent to the following. Under the hood, the encoding of params is done with urlencode.

import requests
from urllib.parse import urlencode
wd='egon老师'
encode_res=urlencode({'k':wd},encoding='utf-8')
keyword=encode_res.split('=')[1]
print(keyword)
# Then build the URL by string formatting
url='https://www.baidu.com/s?wd=%s&pn=1' %keyword

response=requests.get(url,
                      headers={
                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                      })
print(response.url)
# Output: https://www.baidu.com/s?wd=egon%E8%80%81%E5%B8%88&pn=1
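
The manual `split('=')` step in the snippet above can be avoided by encoding the whole dict in one call. This is a minimal standard-library sketch, not the code path requests itself uses; `build_url` is a hypothetical helper and assumes the base URL has no query string yet.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append percent-encoded query parameters to a base URL without a query."""
    return base + "?" + urlencode(params, encoding="utf-8")


url = build_url("https://www.baidu.com/s", {"wd": "egon老师", "pn": 1})
print(url)  # https://www.baidu.com/s?wd=egon%E8%80%81%E5%B8%88&pn=1
```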

One more thing to note: when params is given as bytes, it must not contain any non-ASCII characters. The following, for example, is wrong:

import requests

# re = requests.request(method='get',
#                  url='http://127.0.0.1:8000/test/',
#                  params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))

data

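As noted above, requests offers three ways to send parameters: data, json, and params. params is typically used with GET requests, while data and json are typically used with POST requests.

data accepts a dict, a string, bytes, or a file object. The difference between data and json lies in the request body: with data (given a dict) the body is form-encoded, for example name=alex&age=18, whereas with json the body is a JSON string, for example '{"k1": "v1", "k2": "v2"}'.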

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': '水电费'})

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data="k1=v1; k2=v2; k3=v3; k3=v4"
                 )

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data="k1=v1;k2=v2;k3=v3;k3=v4",
                 headers={'Content-Type': 'application/x-www-form-urlencoded'}
                 )

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data=open('data_file.py', mode='r', encoding='utf-8'),  # file contents: k1=v1;k2=v2;k3=v3;k3=v4
                 headers={'Content-Type': 'application/x-www-form-urlencoded'}
                 )

json

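The value passed as json is serialized to a string with json.dumps(...).

It is then sent to the server in the request body, with the Content-Type header set to application/json.

Telltale sign: the body is sent as a payload (a raw JSON string rather than form fields).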

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 json={'k1': 'v1', 'k2': '水电费'})
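
The difference between data and json can be inspected offline by preparing a request without sending it. This is a sketch that assumes requests is installed; `as_text` is a hypothetical helper, added because the prepared body may be `str` or `bytes` depending on the requests version.

```python
import requests


def as_text(body):
    # The prepared body may be str or bytes depending on the requests version.
    return body.decode("utf-8") if isinstance(body, bytes) else body


url = "http://127.0.0.1:8000/test/"  # never contacted: the request is only prepared

form_req = requests.Request("POST", url, data={"k1": "v1", "k2": "v2"}).prepare()
json_req = requests.Request("POST", url, json={"k1": "v1", "k2": "v2"}).prepare()

print(form_req.headers["Content-Type"], as_text(form_req.body))
print(json_req.headers["Content-Type"], as_text(json_req.body))
```

The first line shows a form-encoded body with the application/x-www-form-urlencoded content type, the second a JSON string with application/json.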

headers

Sends request headers to the server.

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 json={'k1': 'v1', 'k2': '水电费'},
                 headers={'Content-Type': 'application/x-www-form-urlencoded'}
                 )

cookies

# Send cookies to the server
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': 'v2'},
                 cookies={'cook1': 'value1'},
                 )
# A CookieJar can also be used (the dict form is a convenience wrapper built on top of it)
from http.cookiejar import CookieJar
from http.cookiejar import Cookie

obj = CookieJar()
obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                      discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                      port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
               )
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data={'k1': 'v1', 'k2': 'v2'},
                 cookies=obj)

files

Sending a file:
file_dict = {
    'f1': open('readme', 'rb')
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

Sending a file with a custom file name:
file_dict = {
    'f1': ('test.txt', open('readme', 'rb'))
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

Sending string content as a file, with a custom file name:
file_dict = {
    'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)

Sending string content as a file, with a custom file name, content type, and extra headers:
file_dict = {
    'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
}
requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 files=file_dict)
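
To see what the files argument turns into, the request can be prepared without being sent. This is a sketch that assumes requests is installed; the file name, content, and content type below are made-up illustration values.

```python
import requests

url = "http://127.0.0.1:8000/test/"  # never contacted: the request is only prepared
file_dict = {"f1": ("test.txt", "hello world", "text/plain")}

prepared = requests.Request("POST", url, files=file_dict).prepare()

# The body is multipart/form-data with a generated boundary.
print(prepared.headers["Content-Type"])
# The custom file name is carried in the part's Content-Disposition header.
print(b'filename="test.txt"' in prepared.body)
```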

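auth (authentication)

This handles the authentication built into the browser (HTTP authentication).

Authentication setup: when you open the site, a dialog pops up asking for a username and password (much like an alert box). Until you authenticate, the HTML cannot be retrieved. Under the hood, the credentials are simply assembled into a request header and sent: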

r.headers['Authorization'] = _basic_auth_str(self.username, self.password)

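Most sites do not use this default scheme and implement their own instead. In that case we need to write a function similar to _basic_auth_str that follows the site's scheme, and add the resulting string to the request headers: r.headers['Authorization'] = func('.....')

HTTPBasicAuth actually sends the server a request carrying an Authorization: ................. header.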
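
How such an Authorization value is assembled for HTTP Basic authentication can be sketched with the standard library alone. `basic_auth_header` is a hypothetical helper written for illustration, not the library's `_basic_auth_str`; it follows the Basic scheme of base64-encoding `username:password`.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("latin1")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_auth_header("user", "pass")
print(header)  # Basic dXNlcjpwYXNz
```

Base64 is an encoding, not encryption, so anyone who sees the header can recover the password. A site-specific scheme would replace the body of this function with whatever that site expects.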

HTTPBasicAuth
import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
print(ret.text)

Other ways to use auth

# ret = requests.get('http://192.168.1.1',
# auth=HTTPBasicAuth('admin', 'admin'))
# ret.encoding = 'gbk'
# print(ret.text)

# ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
# print(ret)

timeout

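Two forms of timeout: float or tuple
timeout=0.1  # a single value applies to both the connect timeout and the read timeout
timeout=(0.1, 0.2)  # 0.1 is the connect timeout, 0.2 is the read timeout

import requests
response = requests.get('https://www.baidu.com',
                        timeout=0.0001)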
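
With a timeout as small as the one above, the call is expected to raise an exception rather than return a response. Below is a minimal sketch of handling that case, assuming requests is installed. `fetch_or_none` and `always_times_out` are hypothetical names, and the stand-in getter simulates an unreachable server so that the example runs without network access.

```python
import requests


def fetch_or_none(url, timeout=(3.05, 10), get=requests.get):
    """Return the response, or None if the connect or read timeout expires."""
    try:
        return get(url, timeout=timeout)
    except requests.exceptions.Timeout as exc:
        # ConnectTimeout and ReadTimeout both derive from Timeout.
        print(f"timed out: {type(exc).__name__}")
        return None


def always_times_out(url, timeout):
    # Stand-in for requests.get that simulates an unreachable server.
    raise requests.exceptions.ConnectTimeout("simulated timeout")


result = fetch_or_none("http://127.0.0.1:8000/test/", get=always_times_out)
print(result)  # None
```

In real use the `get` argument would be left at its default, so the function calls `requests.get` directly.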

redirects

ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
print(ret.text)

proxies

Proxy settings

# Select the proxy by the protocol (scheme) of the outgoing request
proxies = {
    "http": "61.172.249.96:80",
    "https": "http://61.185.219.126:3128",
}

# Select the proxy by the destination host of the request

proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
print(ret.headers)
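
The two kinds of keys can be combined in one dict, and the lookup can be observed without sending anything. The sketch below relies on `select_proxy` from `requests.utils`, a helper rather than part of the top-level API, so its behaviour in your installed version is an assumption. Merging both key styles into one dict is done here for illustration only.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://61.172.249.96:80",               # any http:// request
    "http://10.20.1.128": "http://10.10.1.10:5323",  # only requests to this host
}

# A scheme-plus-host key is more specific and is expected to win over a scheme key.
by_host = select_proxy("http://10.20.1.128/test/", proxies)
by_scheme = select_proxy("http://www.proxy360.cn/Proxy", proxies)
print(by_host)
print(by_scheme)
```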

from requests.auth import HTTPProxyAuth

proxyDict = {
    'http': '77.75.105.165',
    'https': '77.75.105.165'
}
auth = HTTPProxyAuth('username', 'mypassword')

r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
print(r.text)

# SOCKS proxies are supported; install with: pip install requests[socks]
import requests
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)

print(response.status_code)

stream

ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
print(ret.content)
ret.close()

# from contextlib import closing
# with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
#     # Handle the response here.
#     for i in r.iter_content():
#         print(i)

session

import requests

session = requests.Session()

### 1. First visit any page to obtain a cookie

i1 = session.get(url="http://dig.chouti.com/help/service")

### 2. Log in, carrying the cookie from the previous request; the backend authorizes the gpsd value in the cookie
i2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8615131255089",
        'password': "xxxxxx",
        'oneMonth': ""
    }
)

i3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=8589623",
)
print(i3.text)
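
The reason a Session works for this login flow is that cookies stored on the session are attached to every later request made through it. The sketch below shows this offline, assuming requests is installed. The cookie value is made up, and the cookie is set by hand to stand in for one that an earlier response would have set.

```python
import requests

session = requests.Session()
# Simulate a cookie that an earlier response would have stored (value is made up).
session.cookies.set("gpsd", "abc123")

# Prepare, but do not send, a follow-up request through the same session.
follow_up = requests.Request("POST", "http://dig.chouti.com/login",
                             data={"oneMonth": ""})
prepared = session.prepare_request(follow_up)

print(prepared.headers.get("Cookie"))
```

A plain `requests.post(...)` call creates a fresh session each time, so nothing would be carried over between the three steps.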

Encoding issues

import requests
response=requests.get('http://www.autohome.com/news')
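# response.encoding='gbk'  # autohome returns pages encoded as gb2312, while requests falls back to ISO-8859-1 when the response headers declare no charset; without setting gbk the Chinese text is garbled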
print(response.text)
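
Why the wrong encoding garbles the text can be shown with plain byte strings and no network access. This is a standard-library sketch: the sample text is made up, and the two decode calls stand in for what `response.text` would produce under each `response.encoding` setting.

```python
raw = "汽车之家".encode("gbk")       # bytes as a GBK-encoded page would send them

garbled = raw.decode("iso-8859-1")   # result of a wrong ISO-8859-1 guess
correct = raw.decode("gbk")          # result after setting the encoding to gbk

print(garbled == "汽车之家")  # False
print(correct == "汽车之家")  # True
```

ISO-8859-1 can decode any byte sequence without raising an error, which is why the mistake shows up as garbled characters rather than as an exception.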

Original article: https://www.cnblogs.com/wlx97e6/p/9950678.html