一、requests模块
requests模块是python中原生的基于网络请求的模块,其主要作用是用来模拟浏览器发起请求。功能强大,用法简洁高效。
1.1 模块介绍及请求过程
requests模块模拟浏览器发送请求
请求流程:指定url --> 发起请求 --> 获取响应对象中存储的数据 --> 持久化存储
1.2 爬取百度首页
#!/usr/bin/env python # -*- coding:utf-8-*- import requests headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36' } url = 'https://www.baidu.com/' response = requests.get(url=url) response.encoding = 'utf-8' # 修改字符编码 page_text = response.text # 获取的类型为字符型<class 'str'> with open('./baidu.html', mode='w', encoding='utf-8') as f: f.write(page_text) # page_text = response.content # 返回二进制数据类型 <class 'bytes'> # response.status_code # 获取响应状态码 # response.headers['Content-Type'] == 'text/json' # 类型是 'text/json' 则可以使用response.json方法 # response.json # 如果响应头中存储了json数据,该方法可以返回json数据
1.3 爬取百度指定词条搜索后的页面数据
#!/usr/bin/env python # -*- coding:utf-8-*- import requests headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36' } url = 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&' kw = input('请输入要搜索的内容:') param = {'wd': kw} response = requests.get(url=url, params=param, headers=headers) page_text = response.content fileName = kw+'.html' with open(fileName, 'wb') as fp: fp.write(page_text) print(fileName+'爬取成功。')
1.4 获取百度翻译的翻译结果使用post方法
页面使用的ajax的请求方式,通过浏览器抓包得到请求的地址和提交From表单的内容。
#!/usr/bin/env python # -*- coding:utf-8-*- import requests headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36' } url = 'https://fanyi.baidu.com/sug' kw = input('请输入要翻译的内容:') data = { 'kw': kw } response = requests.post(url=url, data=data, headers=headers) dic = response.json() print(dic['data'])
-----------------------------------执行结果-------------------------------------- 请输入要翻译的内容:美女 [{'k': '美女', 'v': '[měi nǚ] beauty; belle; beautiful woman; femme fat'}, {'k': '美女与野兽', 'v': '名 Beauty and the Beast;'}, {'k': '美女蛇', 'v': 'merino;'}] --------------------------------------------------------------------------------
1.5 爬取豆瓣电影排名电影
#!/usr/bin/env python # -*- coding:utf-8-*- import requests headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' } url = 'https://movie.douban.com/j/chart/top_list' param = { 'type': '5', 'interval_id': '100:90', 'action': '', 'start': '0', 'limit': '20' } json_data = requests.get(url=url, headers=headers, params=param).json() print(json_data)