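Python Web Scraping: The requests Library

Introduction to requests

requests is a third-party library for sending HTTP requests. Older releases supported both Python 2 and Python 3; current releases support Python 3 only.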

Installation:

pip install requests

Usage:

import requests
Sending a request
response = requests.get(url)
response = requests.post(url)

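Response content
A request returns a Response object, which wraps the data the server sent back over HTTP.
Main attributes and methods of the Response object:

response.status_code	HTTP status code
response.headers	response headers, a case-insensitive dict-like object
response.content	raw response body as bytes; typically used for images, audio, and video
response.encoding	encoding used to decode response.text; set it before reading text to fix garbled characters
response.text	response body (page source) decoded to a string using response.encoding
response.cookies	cookies returned by the server
response.json()	parses a JSON response body into Python objects (usually a dict or a list)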

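response = requests.get('http://www.baidu.com')
print(response.status_code)  # 200
print(response.headers)  # headers returned by the server
print(response.content)  # raw data, bytes
print(response.content.decode())  # page source decoded as UTF-8 (the default for bytes.decode)
print(response.text)  # page source; the Chinese is garbled because ISO-8859-1 was used for decoding.
                      # If the Content-Type header has a charset, that charset is used;
                      # if there is no charset but the type is text/*, ISO-8859-1 is used.
response.encoding = 'utf-8'
print(response.text)  # page source decoded as UTF-8, which fixes the garbled Chinese
print(response.cookies)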
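
The fallback rule for text encoding can be checked without any network access. The sketch below calls requests.utils.get_encoding_from_headers on two hand-written header dicts (hypothetical examples, not real server output), then shows why decoding UTF-8 bytes as ISO-8859-1 produces garbled Chinese.

```python
from requests.utils import get_encoding_from_headers

# Hypothetical headers, for illustration only.
with_charset = {'content-type': 'text/html; charset=utf-8'}
without_charset = {'content-type': 'text/html'}

encoding_a = get_encoding_from_headers(with_charset)     # charset taken from the header
encoding_b = get_encoding_from_headers(without_charset)  # text/* without a charset

raw = '中文'.encode('utf-8')         # bytes as a UTF-8 server would send them
garbled = raw.decode('iso-8859-1')   # wrong codec: unreadable characters
correct = raw.decode('utf-8')        # right codec: the original text

print(encoding_a, encoding_b)
print(repr(garbled), repr(correct))
```

This is why setting response.encoding before reading response.text fixes the output: the bytes are the same, only the codec used to decode them changes.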

Query parameters

Passing parameters to a GET request (they are appended to the URL as a query string)

import requests
payload = {'wd':'python'}
response = requests.get('http://www.baidu.com/s?',params=payload)
response.encoding = 'utf-8'
print(response.text)
print(response.url)  # prints the final request URL: http://www.baidu.com/s?wd=python
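
To see how params is turned into a query string without sending anything, a request can be built and prepared locally. This is a minimal sketch; the second payload with a Chinese keyword is an added assumption, included to show percent-encoding.

```python
import requests

# Build the requests without sending them, then inspect the final URLs.
prepared = requests.Request(
    'GET', 'http://www.baidu.com/s', params={'wd': 'python'}
).prepare()

# Non-ASCII values are UTF-8 encoded and percent-escaped.
prepared_cn = requests.Request(
    'GET', 'http://www.baidu.com/s', params={'wd': '爬虫'}
).prepare()

print(prepared.url)
print(prepared_cn.url)
```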

Submitting parameters with a POST request

import requests
data = {'user':'qqq'}  # form fields
response = requests.post('http://httpbin.org/post',data=data)
response.encoding = 'utf-8'
print(response.text)
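
A dict passed as data is sent as a URL-encoded form body. The sketch below prepares the same request locally, without sending it, so the body and the Content-Type header can be inspected.

```python
import requests

data = {'user': 'qqq'}
prepared = requests.Request('POST', 'http://httpbin.org/post', data=data).prepare()

body = prepared.body
if isinstance(body, bytes):  # normalise in case the body is bytes
    body = body.decode('utf-8')

print(prepared.headers['Content-Type'])
print(body)
```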

Setting a timeout

import requests
response = requests.get('https://www.google.com',timeout=5)  # raises requests.exceptions.Timeout if the server does not respond within 5 seconds; handle it with try/except
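
One possible way to handle the timeout error is sketched below. The helper name fetch and its return-None convention are assumptions made for illustration, not something requests prescribes.

```python
import requests


def fetch(url, timeout=5):
    """Return the response, or None if the request timed out."""
    try:
        # A (connect, read) tuple is also accepted, e.g. timeout=(3, 10).
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        print(f'request to {url} timed out (timeout={timeout})')
        return None
    return response
```

Calling fetch('https://www.google.com') then returns either a Response or None, so the caller can decide whether to retry or skip.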

Handling cookies

For example, after logging in, save the cookies and pass them along with later requests

import requests

data = {
    'account_name': 'asda',
    'password':'qwe123'
}
result = requests.post('https://qiye.163.com/login/',data=data)
if result.status_code == 200:
    cookies = result.cookies
    response = requests.get('https://qiye.163.com/News',cookies=cookies)

A request that carries the post-login cookies can then access data that requires login

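session

A Session keeps state between the client and the server across requests

import requests

session = requests.Session()
session.get(url)  # the Session API is almost the same as the requests API; a Session saves cookies automatically and sends them with later requests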
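
The automatic cookie handling can be observed offline. In this sketch a cookie is placed in the session's jar by hand (the name sessionid and its value are hypothetical, standing in for what a login response would set), and a follow-up request is prepared through the session without being sent.

```python
import requests

session = requests.Session()
# Pretend an earlier login response set this cookie (hypothetical name and value).
session.cookies.set('sessionid', 'abc123')

# Prepare a follow-up request through the session without sending it.
request = requests.Request('GET', 'http://httpbin.org/cookies')
prepared = session.prepare_request(request)

# The session has added the Cookie header on its own.
print(prepared.headers.get('Cookie'))
```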

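SSL certificate verification

Requesting a site whose certificate cannot be verified

import requests

response = requests.get('https://www.12306.cn')
# For HTTPS requests, requests verifies the server certificate and raises requests.exceptions.SSLError if verification fails
print(response.status_code)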

Disabling certificate verification

import requests

# Verification is disabled, but an InsecureRequestWarning is still emitted
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)

Suppressing the warning caused by disabling verification

import urllib3
import requests

# Suppress the warning
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)

Specifying a certificate manually

import requests

# Use a local client certificate: (certificate file, private key file)
response = requests.get('https://www.12306.cn', cert=('/path/server.crt','/path/key'))
print(response.status_code)

Sending custom headers

import requests

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}
r = requests.get('https://www.zhihu.com',headers=headers,verify=False)  # send the request with custom headers; without a browser User-Agent, Zhihu rejects the request
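
When many requests need the same headers, they can be set once on a Session instead of being passed to every call. The sketch below does that and prepares a request locally to confirm that the default python-requests User-Agent has been replaced; nothing is sent over the network.

```python
import requests

user_agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')

session = requests.Session()
default_ua = session.headers['User-Agent']  # python-requests/<version>
session.headers.update({'User-Agent': user_agent})

prepared = session.prepare_request(requests.Request('GET', 'https://www.zhihu.com'))
# Header names are case-insensitive in requests.
print(prepared.headers['user-agent'])
```

Note that the header name must be spelled User-Agent; a misspelled key is sent as a separate, unrelated header and the default User-Agent stays in place.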

Disabling redirects: allow_redirects=False

import requests

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}
r = requests.get('https://www.zhihu.com',headers=headers,verify=False,allow_redirects=False)  # do not follow redirects

Using proxies

Plain proxy

import requests

proxies = {'http':'http://183.232.188.18:80','https':'http://183.232.188.18:80'}
r = requests.get(url='http://www.baidu.com',proxies=proxies)  # send the request through the proxy

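Proxy with a username and password

import requests

proxies = {
    "http":"http://user:password@127.0.0.1:9743/",
    "https":"http://user:password@127.0.0.1:9743/",  # needed because the URL below uses https
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)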
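
The keys of the proxies dict are matched against the scheme of the target URL, so an https:// request only goes through a proxy if the dict has an "https" entry. The sketch below illustrates this with select_proxy from requests.utils. That function is a helper inside the library rather than part of its top-level documented API, and it is used here only to show the lookup, on the assumption that it mirrors what happens during a real request.

```python
from requests.utils import select_proxy

http_only = {'http': 'http://user:password@127.0.0.1:9743/'}
both = {
    'http': 'http://user:password@127.0.0.1:9743/',
    'https': 'http://user:password@127.0.0.1:9743/',
}

url = 'https://www.taobao.com'
print(select_proxy(url, http_only))  # no 'https' entry, so no proxy is selected
print(select_proxy(url, both))       # the 'https' entry is selected
```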

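SOCKS proxies

Since version 2.10.0, Requests has supported SOCKS proxies. To use them, an extra third-party dependency must be installed:

pip install requests[socks]

Using a SOCKS proxy is much like using an HTTP proxy: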

import requests

proxies = {
  "http": "socks5://user:pass@host:port",
  "https": "socks5://user:pass@host:port",
}

requests.get("http://example.org", proxies=proxies)

Parsing JSON responses

import requests

r = requests.get('http://httpbin.org/ip')
print(r.json())  # when the response body is JSON, json() parses it directly into a dict
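
If the body is not valid JSON (an HTML error page, for instance), json() raises an exception. A minimal sketch of a defensive wrapper follows; the name parse_json and the return-None convention are assumptions made for illustration.

```python
import requests


def parse_json(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # the JSON decode errors raised here are ValueError subclasses
        return None
```

With this helper, parse_json(requests.get('http://httpbin.org/ip')) returns the parsed data, while a non-JSON page yields None instead of a traceback.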

File upload

import requests

# The file is sent in the body of the POST request as multipart/form-data
with open('favicon.ico','rb') as f:
    response = requests.post('http://httpbin.org/post',files={'file':f})
print(response.text)
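
To see what the files argument produces, the request can be prepared locally instead of being sent. In this sketch the file is replaced by in-memory bytes (hypothetical content), passed as a (filename, content, content type) tuple.

```python
import requests

# (filename, content, content type); the bytes stand in for a real file.
files = {'file': ('hello.txt', b'hello world', 'text/plain')}
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

# multipart/form-data with a randomly generated boundary
print(prepared.headers['Content-Type'])
# The filename and the file content both travel in the request body.
print(b'filename="hello.txt"' in prepared.body)
```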

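Uploading multiple multipart-encoded files

You can send multiple files in one request. For example, suppose you want to upload several image files to an HTML form with a multiple-file field named "images":

<input type="file" name="images" multiple="true" required="true"/>

To do that, set files to a list of tuples of the form (form_field_name, file_info):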

>>> url = 'http://httpbin.org/post'
>>> multiple_files = [
        ('images', ('foo.png', open('foo.png', 'rb'), 'image/png')),
        ('images', ('bar.png', open('bar.png', 'rb'), 'image/png'))]
>>> r = requests.post(url, files=multiple_files)
>>> r.text
{
  ...
  'files': {'images': 'data:image/png;base64,iVBORw ....'}
  'Content-Type': 'multipart/form-data; boundary=3131623adb2043caaeb5538cc7aa0b3a',
  ...
}

Authentication

Some sites prompt for a username and password (HTTP Basic authentication) before granting access:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user','123'))
# Shorthand with the same effect:
# r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)
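
Basic authentication simply adds an Authorization header to the request. The sketch below prepares the request locally, without contacting the server, to show that header.

```python
import requests

prepared = requests.Request(
    'GET', 'http://120.27.34.24:9001', auth=('user', '123')
).prepare()

# The credentials are only base64-encoded, not encrypted,
# so Basic auth should be used over HTTPS where possible.
print(prepared.headers['Authorization'])
```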

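Downloading large files

When downloading large files or large amounts of data with requests.get, use stream mode.

With stream=False (the default), the whole response body is downloaded into memory immediately; a very large file can exhaust memory.

With stream=True, only the response headers are fetched at first. The body is downloaded when you iterate over it with iter_content or iter_lines, or when you access the content attribute. Note that the connection stays open until the body has been fully read or the response is closed.

iter_content: iterates over the body chunk by chunk
iter_lines: iterates over the body line by line

Downloading with these two methods keeps memory usage low, because only a small part of the data is held in memory at a time.

Example:

import requests

# url_file is the URL of the file to download
with requests.get(url_file, stream=True) as r:
    with open("file_path", "wb") as f:
        for chunk in r.iter_content(chunk_size=512):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)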
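
iter_lines works the same way for line-oriented data such as logs or large text exports. A minimal sketch follows; the generator name stream_lines, the 10-second timeout, and the assumption that the body is UTF-8 text are all illustrative choices.

```python
import requests


def stream_lines(url, timeout=10):
    """Yield the response body line by line as str, without loading it all."""
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()  # fail early on 4xx/5xx responses
        for raw in r.iter_lines():
            if raw:  # skip empty keep-alive lines
                yield raw.decode('utf-8')  # assumes the body is UTF-8 text
```

A caller can then write for line in stream_lines(url): ... and process each line as it arrives.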

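Example: simulating a GitHub login with requests

'''
Approach: logging in to GitHub requires the cookies from the login page and a User-Agent header, and the POST form contains a token parameter that can only be obtained by requesting the login page first
'''
import re
import requests
import urllib3
urllib3.disable_warnings()  # suppress the warnings caused by verify=False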


def get_params():
    start_url = 'https://github.com/login'  # get the cookies and the token parameter from the login page
    response = requests.get(start_url,verify=False)  # SSL verification disabled
    cookies = response.cookies
    # print(response.text)
    token = re.findall(r'<input type="hidden" name="authenticity_token" value="(.*?)" /> ',response.text)[0]  # extract the token with a regex
    return cookies,token


def login():
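    post_url = 'https://github.com/session'  # the URL the login form is actually submitted to
    cookies,token = get_params()
    # The headers should include Referer, which shows the page the request came from (anti-hotlinking check)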
    headers = {
        'Host': 'github.com',
        'Referer': 'https://github.com/login',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'Accept-Encoding': 'gzip, deflate, br',
    }
    # data holds the form fields submitted at login, found by capturing the browser's traffic
    data = {
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': token,
        'login': 'xxxxxx',
        'password': 'xxxxxxxx',
    }
    r = requests.post(url=post_url,data=data,headers=headers,cookies=cookies,verify=False)
    print(r.text)


if __name__ == '__main__':
    login()

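Finally, search the printed text for "Start a project" (the GitHub home page shows this text when you are logged in from a browser)

If it is found, the login succeeded!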
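
One fragile spot in this kind of script is the token extraction: indexing the result of re.findall with [0] raises IndexError whenever the page markup differs from the pattern. The sketch below is a more tolerant variant. The function name extract_token and the sample HTML are hypothetical, and the real login page may be structured differently.

```python
import re

# Hypothetical snippet shaped like the hidden input the script looks for.
sample_html = ('<form><input type="hidden" name="authenticity_token" '
               'value="abc123==" /></form>')


def extract_token(html):
    """Return the authenticity_token value, or None if it is not in the page."""
    match = re.search(r'name="authenticity_token"\s+value="([^"]+)"', html)
    return match.group(1) if match else None


print(extract_token(sample_html))
```

Returning None lets the caller stop with a clear message instead of failing with an IndexError.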

Original article: https://www.cnblogs.com/woaixuexi9999/p/9251292.html