The urllib module

urllib is a module in Python's standard library for sending HTTP requests; it is commonly used to write web crawlers.

Purpose

It lets code simulate a browser sending requests.

Submodules

request

parse

Workflow

Specify the URL; the URL must not contain non-ASCII characters
Send a request to that URL
Read the page data
Persist the data to storage

Fetching a given URL with urllib:

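# Goal: fetch the page data at a given URL
from urllib import request

# Specify the URL
url = "http://127.0.0.1:8888"

# Send a request to the URL; urlopen returns a response object
response = request.urlopen(url=url)

# Read the page data; read() on the response object returns bytes
page_text = response.read()

# Persist the data to disk
with open('local.html', 'wb') as f:
    f.write(page_text)
    print("Data written successfully!")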
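
Because read() returns bytes, the data usually has to be decoded before it can be handled as text. The following is a minimal sketch of one way to do that. It assumes a data: URL as a stand-in for a real page so that it runs without a server, and the helper name fetch_text is hypothetical, not part of urllib.

```python
from urllib import request


def fetch_text(url, default_charset="utf-8"):
    """Fetch url and return the body decoded to str (hypothetical helper)."""
    # The with block closes the response once the body has been read
    with request.urlopen(url) as response:
        raw = response.read()  # bytes
        # Use the charset declared in Content-Type, falling back to a default
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# Assumption: a data: URL stands in for a real page such as http://127.0.0.1:8888
demo_url = "data:text/html;charset=utf-8,%3Ch1%3Ehello%3C%2Fh1%3E"
page_text = fetch_text(demo_url)
print(page_text)
```

When the text is going to be parsed rather than saved verbatim, decoding it once up front is usually simpler than carrying bytes around.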

URL encoding

# Goal: fetch the search results for a given term
import urllib.request
import urllib.parse

# Specify the URL
# A URL must not contain non-ASCII characters, so percent-encode the search term
url = "https://www.sogou.com/web?query="
word = urllib.parse.quote("人民币")
url += word

# Send the request
response = urllib.request.urlopen(url=url)

# Read the page data
page_text = response.read()
print(page_text)
with open('renminbi.html', 'wb') as fp:
    fp.write(page_text)
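
To see what the encoding step actually produces, the sketch below runs quote, unquote, and urlencode from urllib.parse on the same search term without sending any request. The variable names are illustrative only.

```python
from urllib import parse

term = "人民币"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character
encoded = parse.quote(term)
print(encoded)  # %E4%BA%BA%E6%B0%91%E5%B8%81

# unquote() reverses the encoding
print(parse.unquote(encoded) == term)  # True

# urlencode() builds a whole query string from a dict and quotes each value
query = parse.urlencode({"query": term})
url = "https://www.sogou.com/web?" + query
print(url)
```

urlencode is usually the more convenient choice when a URL carries several parameters, since it also inserts the & separators.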

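UA spoofing

Anti-crawling mechanism:

The site checks the User-Agent (UA) of each request; if the UA identifies a crawler program, the site refuses to serve data.

Countermeasure:

Disguise the crawler's request UA as a browser's.

User-Agent: the request header that identifies the client sending the request.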

import urllib.request
url = "https://www.baidu.com/"

# Spoof the UA request header
headers = {
    # Any request headers can be stored in this dict
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0"
}
request = urllib.request.Request(url=url, headers=headers)

# Send the request
response = urllib.request.urlopen(request)

page_text = response.read()
print(page_text)
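
To make the difference concrete, the sketch below compares the User-Agent that urllib sends by default with the one set on a Request object. It only inspects the objects and sends nothing over the network.

```python
import urllib.request

# What urllib sends when no User-Agent is supplied, e.g. "Python-urllib/3.x"
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])

browser_ua = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0"
req = urllib.request.Request(
    "https://www.baidu.com/",
    headers={"User-Agent": browser_ua},
)

# Request normalizes header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
```

The default value names Python explicitly, which is why a site that filters on the UA can tell such requests apart from browser traffic.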

POST requests

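# Goal: fetch a Baidu Translate result
import urllib.request
import urllib.parse

# 1. Specify the URL
url = "https://fanyi.baidu.com/sug/"

# 2. Prepare the parameters carried by the POST request
# 2.1 Put the POST parameters in a dict
data = {
    "kw": "欢迎"
}
# 2.2 Encode them with urlencode from the parse module; the result is a str
data = urllib.parse.urlencode(data)

# 2.3 Convert the encoded string from step 2.2 to bytes
data = data.encode()

# 3. Send the POST request; data holds the processed POST parameters
response = urllib.request.urlopen(url=url, data=data)

print(response.read())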
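
The sketch below shows, without sending anything, what steps 2.2 and 2.3 produce and how supplying data changes the HTTP method that urllib will use. The variable names form, body, get_req, and post_req are illustrative.

```python
import urllib.parse
import urllib.request

url = "https://fanyi.baidu.com/sug/"

form = {"kw": "欢迎"}
encoded = urllib.parse.urlencode(form)  # str
body = encoded.encode("utf-8")          # bytes, as required for the data argument
print(encoded)  # kw=%E6%AC%A2%E8%BF%8E

# A Request without data is sent as GET; with data it is sent as POST
get_req = urllib.request.Request(url)
post_req = urllib.request.Request(url, data=body)
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```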

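Advanced urllib operations

1. Proxies

Some sites deploy anti-crawling measures such as counting how many requests an IP address makes within a time window; if the request rate is too high, the site may block that IP. To work around this, route requests through a proxy IP and switch to a different proxy at intervals.

Types of proxy:

Forward proxy: fetches data on behalf of the client; it shields the client so that requests cannot be traced back to it
Reverse proxy: serves data on behalf of the server; it protects the server or handles load balancing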

# Goal: fetch data through a proxy
import urllib.request

# 1. Create a handler object that wraps the proxy IP and port
handler = urllib.request.ProxyHandler(proxies={"http": "61.128.128.68:8888"})
# 2. Create an opener object, then use it to send requests
opener = urllib.request.build_opener(handler)

url = "http://www.baidu.com/s?ie=UTF-8&wd=ip"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0"
}
# Build the Request object
request = urllib.request.Request(url=url, headers=headers)
# Send the request through the custom opener
response = opener.open(request)

with open('daili_get.html', 'wb') as f:
    f.write(response.read())
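
Switching proxies can be sketched as cycling through a pool and building a fresh opener for each proxy. This is only an illustration under stated assumptions: the pool contents and the helper next_opener are hypothetical, the second address is a documentation-only placeholder, and the sketch rotates on every call rather than on a timer.

```python
import itertools
import urllib.request

# Hypothetical proxy pool; the second address is a placeholder, not a real proxy
PROXY_POOL = ["61.128.128.68:8888", "192.0.2.10:8080"]
_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_opener():
    """Return (proxy, opener) for the next proxy in the pool (hypothetical helper)."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy})
    return proxy, urllib.request.build_opener(handler)


# Each call picks the next proxy; nothing is sent until opener.open() is called
used = []
for _ in range(3):
    proxy, opener = next_opener()
    used.append(proxy)
print(used)
```

In a real crawler, the opener returned here would replace the single opener in the example above, and a proxy that fails would typically be dropped from the pool.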

2. Cookies

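# Goal: log in to Renren using a CookieJar

from urllib import request, parse
from http import cookiejar

# Create a CookieJar that stores cookies automatically
cj = cookiejar.CookieJar()

# Create a handler object that carries the CookieJar
handler = request.HTTPCookieProcessor(cj)

# Create an opener object built on that handler
opener = request.build_opener(handler)

# The login URL and form fields are left blank in the original
url = ""
data = {}
data = parse.urlencode(data).encode()
# Named req so that it does not shadow the imported request module
req = request.Request(url, data=data)

# Send the request with the opener; cookies are saved automatically, so later
# requests made with the same opener to URLs that require login are authenticated
response = opener.open(req)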
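
To check what the opener has collected, the CookieJar can be iterated directly. The sketch below only builds the objects and inspects the jar, so it runs without a login URL; the helper cookie_names is hypothetical.

```python
from http import cookiejar
from urllib import request

cj = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cj))

# The jar starts empty; it is filled from the Set-Cookie headers of
# responses received through this opener
print(len(cj))  # 0


def cookie_names(jar):
    """Return the names of the cookies currently stored in jar (hypothetical helper)."""
    return [cookie.name for cookie in jar]


print(cookie_names(cj))  # []
```

Calling cookie_names(cj) again after opener.open(...) on a login request is a quick way to confirm that the session cookie was actually stored.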

Original article: https://www.cnblogs.com/yaya625202/p/10308505.html