Python Web Scraping: The requests Module

1. The requests Module

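requests is a third-party Python library for making HTTP requests. It is not part of the standard library and is installed with pip install requests. In web scraping its main job is to simulate the requests a browser would send. It is powerful, and its API is concise.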

1.1 Module overview and request flow

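The requests module sends HTTP requests the way a browser would.

Request flow: specify the URL --> send the request --> read the data stored in the response object --> persist it to storage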
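The four steps above can be sketched as one small function. This is a minimal sketch, not the article's own code: the name fetch_and_save, the 10-second timeout, and the optional client argument are assumptions added for illustration.

```python
import requests


def fetch_and_save(url, path, headers=None, client=requests):
    """Fetch url and write the decoded body to path; return characters written.

    `client` is a hypothetical hook (anything with a requests-style .get),
    added so the function can be exercised without network access.
    """
    # Steps 1 and 2: specify the URL and send the request.
    response = client.get(url, headers=headers, timeout=10)
    response.raise_for_status()          # raise on 4xx/5xx instead of saving an error page

    # Step 3: read the data stored in the response object.
    page_text = response.text

    # Step 4: persist it.
    with open(path, mode='w', encoding='utf-8') as f:
        return f.write(page_text)
```

With the default client, a call such as fetch_and_save('https://www.baidu.com/', './baidu.html') would perform a real request and save the page.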

1.2 Scraping the Baidu home page

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
url = 'https://www.baidu.com/'

response = requests.get(url=url, headers=headers)   # pass headers so the User-Agent is actually sent
response.encoding = 'utf-8'                         # set the encoding used to decode response.text
page_text = response.text                           # decoded body, <class 'str'>

with open('./baidu.html', mode='w', encoding='utf-8') as f:
    f.write(page_text)

# page_text = response.content                       # raw body, <class 'bytes'>
# response.status_code                               # HTTP status code of the response
# response.headers['Content-Type']                   # if this indicates JSON (e.g. 'application/json'), response.json() can be used
# response.json()                                    # parses a JSON response body and returns the resulting Python object
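
The commented-out attributes above can be combined into a small helper that chooses between response.json() and response.text. This is only a sketch: the helper name decode_body and the rule of looking for 'json' in the Content-Type header are assumptions, not something the requests library prescribes.

```python
import requests


def decode_body(response):
    """Return parsed JSON when the server labels the body as JSON, else the text."""
    content_type = response.headers.get('Content-Type', '')
    if 'json' in content_type.lower():
        return response.json()           # parses the body into Python objects
    return response.text                 # decoded str, using response.encoding
```

It would typically be called on the result of requests.get(...) after checking response.status_code.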

1.3 Scraping the Baidu search results page for a given keyword

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
# Fixed query parameters stay in the URL; the search term is added through params.
url = 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu'
kw = input('Enter a search term: ')
param = {'wd': kw}                                  # requests URL-encodes the value and appends it to the query string
response = requests.get(url=url, params=param, headers=headers)

page_text = response.content                        # raw bytes, written unchanged in binary mode
fileName = kw + '.html'
with open(fileName, 'wb') as fp:
    fp.write(page_text)
print(fileName + ' saved.')
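
To see how params is merged into the query string, the request can be prepared without being sent. A minimal sketch: build_search_url is a hypothetical helper name and the shortened parameter set is chosen for illustration, while requests.Request and its prepare() method are part of the requests API.

```python
import requests


def build_search_url(keyword):
    """Return the URL requests would use for a Baidu search, without sending it."""
    request = requests.Request(
        'GET',
        'https://www.baidu.com/s',
        params={'ie': 'utf-8', 'wd': keyword},
    )
    return request.prepare().url         # the query string is percent-encoded here


print(build_search_url('python'))        # https://www.baidu.com/s?ie=utf-8&wd=python
print(build_search_url('美女'))           # non-ASCII text is sent as UTF-8 percent escapes
```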

1.4 Getting Baidu Translate results with a POST request

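The page loads its results with AJAX requests. The request URL and the form data it submits were found by capturing the traffic in the browser's developer tools.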
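
A form submission like the captured one is sent by passing a dict as data=. The body that requests would send can be inspected without contacting the server. The sketch below assumes only the kw field seen in the captured form, and preview_form_post is a hypothetical helper name.

```python
import requests


def preview_form_post(url, form):
    """Prepare (but do not send) a form POST; return its Content-Type and body."""
    prepared = requests.Request('POST', url, data=form).prepare()
    return prepared.headers['Content-Type'], prepared.body


content_type, body = preview_form_post('https://fanyi.baidu.com/sug', {'kw': 'dog'})
print(content_type)   # application/x-www-form-urlencoded
print(body)           # kw=dog
```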

#!/usr/bin/env python
# -*- coding:utf-8-*-

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
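url = 'https://fanyi.baidu.com/sug'                 # AJAX endpoint found in the browser's network panel

kw = input('Enter the text to translate: ')
data = {
    'kw': kw                                        # form field submitted by the page
}
response = requests.post(url=url, data=data, headers=headers)   # data= sends a form-encoded body
dic = response.json()                               # parse the JSON response body into a dict
print(dic['data'])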

----------------------------------- Output --------------------------------------
Enter the text to translate: 美女
[{'k': '美女', 'v': '[měi nǚ] beauty; belle; beautiful woman; femme fat'}, {'k': '美女与野兽', 'v': '名 Beauty and the Beast;'}, {'k': '美女蛇', 'v': 'merino;'}]
--------------------------------------------------------------------------------
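
Each entry in dic['data'] pairs a matched term (k) with its translation text (v), as the output above shows. A small sketch of turning that list into a dict follows. The function name is hypothetical and the sample data is copied from the output above.

```python
def suggestions_to_dict(entries):
    """Map each suggested term ('k') to its translation text ('v')."""
    return {entry['k']: entry['v'].strip() for entry in entries}


sample = [
    {'k': '美女', 'v': '[měi nǚ] beauty; belle; beautiful woman; femme fat'},
    {'k': '美女与野兽', 'v': '名 Beauty and the Beast;'},
    {'k': '美女蛇', 'v': 'merino;'},
]
for term, meaning in suggestions_to_dict(sample).items():
    print(term, '->', meaning)
```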

1.5 Scraping the Douban movie chart

#!/usr/bin/env python
# -*- coding:utf-8-*-

import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

url = 'https://movie.douban.com/j/chart/top_list'

param = {
    'type': '5',
    'interval_id': '100:90',
    'action': '',
    'start': '0',
    'limit': '20'
}

json_data = requests.get(url=url, headers=headers, params=param).json()

print(json_data)
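
The start and limit parameters look like an offset and a page size, which would allow the chart to be fetched page by page. That reading is an assumption based on the parameter names, not something the source confirms, and chart_page_params is a hypothetical helper.

```python
def chart_page_params(pages, limit=20, movie_type='5', interval_id='100:90'):
    """Yield one params dict per page, advancing start by limit each time."""
    for page in range(pages):
        yield {
            'type': movie_type,
            'interval_id': interval_id,
            'action': '',
            'start': str(page * limit),
            'limit': str(limit),
        }


for params in chart_page_params(pages=3):
    print(params['start'], params['limit'])
```

Each yielded dict could be passed as params= to requests.get with the same url and headers as in the script above.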

Original article: https://www.cnblogs.com/cyleon/p/10577791.html