Python Web Crawling (Practice)


01 Fetching Web Pages Quickly

1.1 The urlopen() function

    import urllib.request

    file = urllib.request.urlopen("http://www.baidu.com")
    data = file.read()
    fhandle = open("./1.html", "wb")
    fhandle.write(data)
    fhandle.close()

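There are three common ways to read the response:
file.read() reads the entire content and returns it as a single bytes object (call decode() on it to get a string).
file.readlines() reads the entire content and returns it as a list, one item per line.
file.readline() reads a single line.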
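The three read methods can be compared side by side in a small sketch. To keep it runnable without a network connection, it opens a data: URL, which is a stand-in chosen for this illustration and not something the article uses; a response from a real http:// address offers the same three methods. Each variant calls urlopen() again because a response body can only be consumed once.

```python
import urllib.request

# Assumption: a data: URL stands in for a real page so the sketch needs no network.
URL = "data:text/plain;charset=utf-8,first%0Asecond%0Athird%0A"

# read(): the whole body as one bytes object
with urllib.request.urlopen(URL) as resp:
    whole = resp.read()

# readlines(): the whole body as a list of bytes, one item per line
with urllib.request.urlopen(URL) as resp:
    lines = resp.readlines()

# readline(): only the next line
with urllib.request.urlopen(URL) as resp:
    first = resp.readline()

# The body is bytes; decode it to get a str
text = whole.decode("utf-8")
print(type(whole).__name__, len(lines), first)
```

For large pages, reading line by line avoids holding the whole body in memory at once.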

1.2 The urlretrieve() function

urlretrieve() writes the fetched content directly to a local file.

    import urllib.request

    filename = urllib.request.urlretrieve("http://edu.51cto.com", filename="./1.html")
    # urlretrieve() can leave cached data behind; urlcleanup() clears it
    urllib.request.urlcleanup()

1.3 Other common urllib usage

    import urllib.request

    file = urllib.request.urlopen("http://www.baidu.com")

    # Get information about the response (the response headers)
    print(file.info())
    # Bdpagetype: 1
    # Bdqid: 0xb36679e8000736c1
    # Cache-Control: private
    # Content-Type: text/html;charset=utf-8
    # Date: Sun, 24 May 2020 10:53:30 GMT
    # Expires: Sun, 24 May 2020 10:52:53 GMT
    # P3p: CP=" OTI DSP COR IVA OUR IND COM "
    # P3p: CP=" OTI DSP COR IVA OUR IND COM "
    # Server: BWS/1.1
    # Set-Cookie: BAIDUID=D5BBF02F4454CBA7D3962001F33E17C6:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
    # Set-Cookie: BIDUPSID=D5BBF02F4454CBA7D3962001F33E17C6; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
    # Set-Cookie: PSTM=1590317610; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
    # Set-Cookie: BAIDUID=D5BBF02F4454CBA7FDDF8A87AF5416A6:FG=1; max-age=31536000; expires=Mon, 24-May-21 10:53:30 GMT; domain=.baidu.com; path=/; version=1; comment=bd
    # Set-Cookie: BDSVRTM=0; path=/
    # Set-Cookie: BD_HOME=1; path=/
    # Set-Cookie: H_PS_PSSID=31729_1436_21118_31592_31673_31464_31322_30824; path=/; domain=.baidu.com
    # Traceid: 1590317610038396263412927153817753433793
    # Vary: Accept-Encoding
    # Vary: Accept-Encoding
    # X-Ua-Compatible: IE=Edge,chrome=1
    # Connection: close
    # Transfer-Encoding: chunked

    # Get the status code of the fetched page
    print(file.getcode())    # 200

    # Get the URL that was actually fetched
    print(file.geturl())     # http://www.baidu.com

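In general, the URL standard only allows a subset of ASCII characters, such as digits, letters, and certain symbols. Other characters, such as Chinese characters, are not valid in a URL and have to be URL-encoded first.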

    import urllib.request

    print(urllib.request.quote("http://www.baidu.com"))
    # http%3A//www.baidu.com
    print(urllib.request.unquote("http%3A//www.baidu.com"))
    # http://www.baidu.com
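Since the motivation for URL encoding is non-ASCII text, here is a short sketch that encodes Chinese characters. It imports quote() and unquote() from urllib.parse, the module where these functions are defined; the search URL in the last step is only an illustration.

```python
from urllib.parse import quote, unquote

keyword = "你好"

# Non-ASCII text is encoded as UTF-8 and each byte becomes %XX
encoded = quote(keyword)
print(encoded)            # %E4%BD%A0%E5%A5%BD

# unquote() reverses the encoding
print(unquote(encoded))   # 你好

# By default "/" is the only reserved character left untouched;
# pass safe= to keep the characters that structure a full URL
full_url = quote("http://www.baidu.com/s?wd=你好", safe=":/?=&")
print(full_url)           # http://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD
```

In practice it is usually simpler to encode only the value being inserted and leave the rest of the URL alone.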

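02 Simulating a Browser: the Headers Attribute

To prevent malicious harvesting of their content, some sites deploy anti-crawler measures, and a crawler that requests them receives a 403 error.
Setting some header information lets the crawler present itself as a browser when it visits these sites.
There are two ways to configure this.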


2.1 Modifying headers with build_opener()

    import urllib.request

    url = "http://www.baidu.com"
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    data = opener.open(url).read()
    fhandle = open("./2.html", "wb")
    fhandle.write(data)
    fhandle.close()

2.2 Adding headers with add_header()

    import urllib.request

    url = "http://www.baidu.com"
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
    data = urllib.request.urlopen(req).read()
    fhandle = open("./2.html", "wb")
    fhandle.write(data)
    fhandle.close()
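When several headers are needed, a variation on the add_header() approach is to pass them all to the Request constructor through its headers parameter. The sketch below only builds the request and inspects it, so it runs without a network connection; the Referer value is a made-up example and not something the article uses.

```python
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36")

# Pass all headers at once instead of calling add_header() repeatedly.
# The Referer value is a made-up example.
req = urllib.request.Request(
    "http://www.baidu.com",
    headers={"User-Agent": USER_AGENT, "Referer": "http://www.baidu.com/"},
)

# Request stores header names capitalized ("User-agent"), so look them up that way
print(req.get_header("User-agent"))
print(req.header_items())
```

The resulting req object can then be passed to urllib.request.urlopen() exactly as in section 2.2.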

03 Timeout Settings

When a page does not respond for a long time, the request is considered to have timed out, meaning the page cannot be opened.

    import urllib.request

    # timeout sets the timeout in seconds
    file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
    data = file.read()
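A timeout surfaces as an exception, so a crawler usually wraps the call. Below is a minimal sketch under two assumptions: the helper name fetch_or_none is hypothetical, and the demo target is a local socket that accepts connections but never replies, used only so the timeout can be reproduced without depending on a real slow site.

```python
import socket
import urllib.error
import urllib.request

def fetch_or_none(url, timeout):
    """Return the body as bytes, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as e:
        # Raised when connecting or sending the request fails or times out
        print("request failed:", e.reason)
    except socket.timeout:
        # Raised when the connection succeeds but the response arrives too slowly
        print("timed out while waiting for the response")
    return None

# Demo: a local socket that accepts connections but never answers
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(1)
url = "http://127.0.0.1:%d/" % silent.getsockname()[1]

result = fetch_or_none(url, timeout=0.5)
silent.close()
print(result)
```

Catching both exception types matters because a timeout while connecting is wrapped in URLError, whereas a timeout while waiting for the response is raised as a plain socket timeout.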

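04 Proxy Servers

When a proxy server is used to crawl a site, the site sees the proxy server's IP address instead of our real one. If the site blocks that address, it hardly matters, because we can switch to another IP address and keep crawling.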

    import urllib.request

    def use_proxy(proxy_addr, url):
        proxy = urllib.request.ProxyHandler({'http': proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(url).read().decode('utf-8')
        return data

    proxy_addr = "xxx.xx.xxx.xx:xxxx"
    data = use_proxy(proxy_addr, "http://www.baidu.com")
    print(len(data))

urllib.request.install_opener() installs the opener as the global default, so later calls to urlopen() use it as well.
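Because install_opener() affects every later urlopen() call, it can be convenient to keep the proxy local to one opener instead. A minimal sketch of that alternative follows; the helper name make_proxy_opener and the address 127.0.0.1:8080 are placeholders for illustration, and the actual request is left commented out since it needs a working proxy.

```python
import urllib.request

def make_proxy_opener(proxy_addr):
    """Build an opener that sends http and https traffic through proxy_addr."""
    proxy = urllib.request.ProxyHandler({"http": proxy_addr, "https": proxy_addr})
    return urllib.request.build_opener(proxy)

# Placeholder address (assumption): replace with a proxy you are allowed to use
opener = make_proxy_opener("127.0.0.1:8080")

# The opener is not installed globally: only opener.open(...) goes through the
# proxy, while plain urllib.request.urlopen(...) keeps its default behaviour.
# html = opener.open("http://www.baidu.com", timeout=5).read().decode("utf-8")

proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

Switching to another proxy then only means building a new opener with a different address.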

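05 Cookie

With plain HTTP alone, a successful login to a site is forgotten as soon as we visit another page on that site, and we would have to log in again. We therefore need some way to preserve the session information, such as the fact that the login succeeded.
There are two common approaches:
1) Store the session information in a Cookie
2) Store the session information in a Session
Whichever approach is used for session control, Cookies are involved most of the time.
A common procedure for handling Cookies is:
1) Import the Cookie handling module http.cookiejar.
2) Create a CookieJar object with http.cookiejar.CookieJar().
3) Create a cookie processor with HTTPCookieProcessor and build an opener object from it.
4) Install the opener as the global default.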

    import urllib.request
    import urllib.parse
    import http.cookiejar

    url = "http://xx.xx.xx/1.html"
    postdata = urllib.parse.urlencode({
        "username": "xxxxxx",
        "password": "xxxxxx"
    }).encode("utf-8")
    req = urllib.request.Request(url, postdata)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')

    # Create a CookieJar object with http.cookiejar.CookieJar()
    cjar = http.cookiejar.CookieJar()
    # Create a cookie processor with HTTPCookieProcessor and build an opener from it
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
    # Install the opener as the global default
    urllib.request.install_opener(opener)

    file = opener.open(req)
    data = file.read()
    fhandle = open("./4.html", "wb")
    fhandle.write(data)
    fhandle.close()

    # The second page is fetched with the cookies saved by the login request
    url1 = "http://xx.xx.xx/2.html"
    data1 = urllib.request.urlopen(url1).read()
    fhandle1 = open("./5.html", "wb")
    fhandle1.write(data1)
    fhandle1.close()
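To see the CookieJar at work without a real login page, the following sketch runs a toy HTTP server on the local machine. Everything about the server is hypothetical (the /login and /profile paths, the cookie name session, and its value); it exists only so that the first request stores a cookie and the second request sends it back automatically.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class DemoHandler(BaseHTTPRequestHandler):
    """Toy server (hypothetical): /login sets a cookie, every path echoes the Cookie header."""

    def do_GET(self):
        body = (self.headers.get("Cookie") or "no cookie").encode("utf-8")
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),           # talk to the local server directly
    urllib.request.HTTPCookieProcessor(cjar),
)

first = opener.open(base + "/login").read().decode("utf-8")
second = opener.open(base + "/profile").read().decode("utf-8")
server.shutdown()
server.server_close()

print(first)                    # no cookie
print(second)                   # session=abc123
print([c.name for c in cjar])   # ['session']
```

The first request carries no cookie; the server's Set-Cookie header is stored in cjar, and the cookie processor attaches it to the second request.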

06 DebugLog

Print debug log output while the program is running.

    import urllib.request

    httphd = urllib.request.HTTPHandler(debuglevel=1)
    httpshd = urllib.request.HTTPSHandler(debuglevel=1)
    opener = urllib.request.build_opener(httphd, httpshd)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen("http://www.baidu.com")

07 Exception Handling: URLError

    import urllib.request
    import urllib.error

    try:
        urllib.request.urlopen("http://blog.baidusss.net")
    except urllib.error.HTTPError as e:
        print(e.code)
        print(e.reason)
    except urllib.error.URLError as e:
        print(e.reason)

Alternatively:

    import urllib.request
    import urllib.error

    try:
        urllib.request.urlopen("http://blog.csdn.net")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
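Both versions rely on HTTPError being a subclass of URLError, which is why HTTPError has to be handled first in the two-clause form. The sketch below makes that ordering explicit; the helper name describe_error and the example.com URL are assumptions for illustration, and the exceptions are built by hand so the code runs offline.

```python
import email.message
import io
import urllib.error

def describe_error(exc):
    """Turn a urllib exception into a short message."""
    # HTTPError is a subclass of URLError, so it has to be tested first
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: %s" % exc.reason
    return "unexpected error: %r" % exc

# Hand-built exceptions stand in for real failures so the sketch runs offline
not_found = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found",
    email.message.Message(), io.BytesIO(b""))
unreachable = urllib.error.URLError("Name or service not known")

print(describe_error(not_found))     # HTTP error 404: Not Found
print(describe_error(unreachable))   # URL error: Name or service not known
```

A plain URLError has a reason but no code, which is what the hasattr() checks in the second version guard against.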

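08 HTTP Requests in Practice

HTTP requests mainly fall into six types, each with the following purpose:
1) GET: passes information through the URL. The information can be written directly into the URL or submitted through a form.
When a form is used, the form data is automatically converted into data in the URL and passed along with it.
2) POST: submits data to the server. It is a mainstream and comparatively safer way to pass data, since the data travels in the request body rather than in the URL.
3) PUT: asks the server to store a resource, usually at a specified location.
4) DELETE: asks the server to delete a resource.
5) HEAD: requests only the HTTP headers of the corresponding resource.
6) OPTIONS: returns the request types that the current URL supports.
In addition, there are TRACE and CONNECT requests; TRACE is mainly used for testing or diagnostics.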
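In urllib the request type is a property of the Request object. As a small sketch, the code below builds a few requests without sending them and prints the method each one would use; the b"k=v" payload is a dummy value.

```python
import urllib.request

url = "http://www.baidu.com"

# Without data the method defaults to GET
get_req = urllib.request.Request(url)
print(get_req.get_method())     # GET

# Supplying data switches the default to POST
post_req = urllib.request.Request(url, data=b"k=v")
print(post_req.get_method())    # POST

# method= overrides the default, e.g. for HEAD, PUT, DELETE or OPTIONS
head_req = urllib.request.Request(url, method="HEAD")
put_req = urllib.request.Request(url, data=b"k=v", method="PUT")
print(head_req.get_method(), put_req.get_method())   # HEAD PUT
```

Any of these objects can be passed to urllib.request.urlopen() to actually send the request.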

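8.1 GET request example

To make a GET request:
1) Build the URL, which contains the field names and field values of the GET request.
GET request format: http://address?field1=value1&field2=value2
2) Build a Request object with that URL as its argument.
3) Open the Request object with urlopen().
4) Process the result as needed.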

    import urllib.request

    url = "http://www.baidu.com/s?wd="
    key = "你好"
    key_code = urllib.request.quote(key)
    url_all = url + key_code
    req = urllib.request.Request(url_all)
    data = urllib.request.urlopen(req).read()
    fh = open("./3.html", "wb")
    fh.write(data)
    fh.close()
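When a GET request carries more than one field, urllib.parse.urlencode() can build the whole field1=value1&field2=value2 part in one step, encoding each value along the way. In this sketch the pn field is hypothetical and is included only to show how fields are joined.

```python
import urllib.parse

base = "http://www.baidu.com/s"
# "pn" is a hypothetical second field, used only to show how fields are joined
params = {"wd": "你好", "pn": 10}

query = urllib.parse.urlencode(params)
url_all = base + "?" + query
print(url_all)   # http://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD&pn=10
```

The resulting url_all can be passed to urllib.request.Request() in the same way as the hand-built URL.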

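8.2 POST request example

To make a POST request:
1) Set the URL.
2) Build the form data and encode it with urllib.parse.urlencode.
3) Create a Request object whose arguments include the URL and the data to send.
4) Use add_header() to add header information so that the crawler looks like a browser.
5) Open the Request object with urllib.request.urlopen() to send the data.
6) Process the result.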

    import urllib.request
    import urllib.parse

    url = "http://www.xxxx.com/post/"
    postdata = urllib.parse.urlencode({"name": "xxx@xxx.com", "pass": "xxxxxxx"}).encode('utf-8')
    req = urllib.request.Request(url, postdata)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
    data = urllib.request.urlopen(req).read()
    fhandle = open("D:/Python35/myweb/part4/6.html", "wb")
    fhandle.write(data)
    fhandle.close()
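It can help to look at what the encoding step actually produces before sending anything. The sketch below reuses the placeholder form values and URL from the example, builds the request body, and inspects it without contacting a server.

```python
import urllib.parse
import urllib.request

form = {"name": "xxx@xxx.com", "pass": "xxxxxxx"}

# urlencode() gives a str; the request body must be bytes, hence encode()
postdata = urllib.parse.urlencode(form).encode("utf-8")
print(postdata)   # b'name=xxx%40xxx.com&pass=xxxxxxx'

# Passing data makes this a POST request
req = urllib.request.Request("http://www.xxxx.com/post/", postdata)
print(req.get_method())   # POST

# parse_qs() decodes the body again, which is handy when debugging
decoded = urllib.parse.parse_qs(postdata.decode("utf-8"))
print(decoded)    # {'name': ['xxx@xxx.com'], 'pass': ['xxxxxxx']}
```

Passing a str instead of bytes as the data argument makes urlopen() raise a TypeError, which is why the encode() call is needed.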

Gitee: https://gitee.com/lyc96/projects

Original article: https://www.cnblogs.com/chenlove/p/14038631.html