Three Ways to Scrape Web Pages with Python

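Abstract: This article covers three ways to scrape web page data with Python: regular expressions (re), the BeautifulSoup module, and the lxml module. All code in this article was run under Python 3.5.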

The target is the headline section on the home page of the [National Meteorological Center (中央气象台)](http://www.nmc.cn/):

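Its HTML hierarchy, which matches the selector copied in section 2, is an element with id alarmtip that contains a ul, whose li elements of class waring hold the a tags.

From each a tag we extract the href attribute, the title attribute, and the tag text.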

1. Regular Expressions
Copy the element's outerHTML from the browser developer tools:

<a target="_blank" href="/publish/country/warning/megatemperature.html" title="中央气象台7月13日18时继续发布高温橙色预警">高温预警</a>

Code:
```
# coding=utf-8
import re
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
html = html.decode('utf-8')     # read() returns bytes in Python 3, so decode to str
# Each pattern captures one field; .+? is non-greedy and stops at the first possible match
links = re.findall(r'<a target="_blank" href="(.+?)" title', html)
titles = re.findall(r'<a target="_blank" .+? title="(.+?)">', html)
tags = re.findall(r'<a target="_blank" .+? title=.+?>(.+?)</a>', html)
for link, title, tag in zip(links, titles, tags):
    print(tag, url + link, title)
```
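In a regular expression, '.' matches any single character except \n; '+' matches one or more repetitions of the preceding expression; '?' matches zero or one repetition of the preceding expression, and when it follows another quantifier, as in '.+?', it makes that quantifier non-greedy so that the match is as short as possible. For more, see a tutorial on regular expressions in Python.
Output: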

高温预警 http://www.nmc.cn/publish/country/warning/megatemperature.html 中央气象台7月13日18时继续发布高温橙色预警
山洪灾害气象预警 http://www.nmc.cn/publish/mountainflood.html 水利部和中国气象局7月13日18时联合发布山洪灾害气象预警
强对流天气预警 http://www.nmc.cn/publish/country/warning/strong_convection.html 中央气象台7月13日18时继续发布强对流天气蓝色预警
地质灾害气象风险预警 http://www.nmc.cn/publish/geohazard.html 国土资源部与中国气象局7月13日18时联合发布地质灾害气象风险预警
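
To make the greedy versus non-greedy distinction concrete, here is a small offline sketch. It runs on two made-up anchors shaped like the one copied from the page, not on the live site. The names sample and PATTERN, and the idea of capturing all three fields with a single pattern, are illustrative assumptions rather than part of the original code.

```python
import re

# Two made-up anchors on one line, shaped like the outerHTML copied above.
sample = (
    '<a target="_blank" href="/publish/a.html" title="Title A">Tag A</a>'
    '<a target="_blank" href="/publish/b.html" title="Title B">Tag B</a>'
)

# Greedy .+ runs to the LAST '" title' on the line, swallowing both anchors.
greedy = re.findall(r'href="(.+)" title', sample)

# Non-greedy .+? stops at the FIRST '" title', giving one match per anchor.
lazy = re.findall(r'href="(.+?)" title', sample)

# One pattern with three groups returns (href, title, text) per anchor,
# so the three fields cannot drift out of step with each other.
PATTERN = re.compile(r'<a target="_blank" href="(.+?)" title="(.+?)">(.+?)</a>')
rows = PATTERN.findall(sample)

print(len(greedy), lazy)
for href, title, text in rows:
    print(text, href, title)
```

With three separate findall calls, an anchor that matches one pattern but not another would silently misalign the zip; the single-pattern form avoids that at the cost of a longer expression.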

2. The BeautifulSoup Module
Beautiful Soup is a very popular Python module. It parses web pages and provides a convenient interface for locating content.
Copy selector:

#alarmtip > ul > li.waring > a:nth-child(1)

Because we want every entry rather than only the first, change it to:

#alarmtip > ul > li.waring > a
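
The effect of dropping :nth-child(1) can be checked offline. The sketch below assumes a made-up snippet that mimics the hierarchy named in the selector, and it uses the standard library's html.parser backend so that lxml is not needed for this check; the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure named in the selector.
snippet = """
<div id="alarmtip">
  <ul>
    <li class="waring">
      <a href="/publish/a.html" title="Title A">Tag A</a>
      <a href="/publish/b.html" title="Title B">Tag B</a>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# With :nth-child(1), only an <a> that is the first child of its <li> matches.
first_only = soup.select("#alarmtip > ul > li.waring > a:nth-child(1)")

# Without it, every <a> directly under li.waring matches.
all_links = soup.select("#alarmtip > ul > li.waring > a")

first_texts = [a.text for a in first_only]
all_texts = [a.text for a in all_links]
print(first_texts, all_texts)
```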

Code:
```
from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,'lxml')
content = soup.select('#alarmtip > ul > li.waring > a')

for n in content:
    link = n.get('href')
    title = n.get('title')
    tag = n.text
    print(tag, url + link, title)
```
The output is the same as above.
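
The code builds each absolute link with url + link, which works here because every href on the page starts with a slash. As a hedged alternative, the standard library's urllib.parse.urljoin also handles hrefs that are relative without a leading slash or already absolute. The example paths other than /publish/geohazard.html are made up for illustration.

```python
from urllib.parse import urljoin

base = 'http://www.nmc.cn'

# A root-relative href, as found on the page: same result as url + link.
rooted = urljoin(base, '/publish/geohazard.html')

# A hypothetical href without the leading slash: urljoin inserts it.
relative = urljoin(base, 'publish/geohazard.html')

# A hypothetical href that is already absolute: urljoin leaves it alone.
absolute = urljoin(base, 'http://example.com/other.html')

# Plain concatenation breaks in the second case.
concatenated = base + 'publish/geohazard.html'

print(rooted)
print(relative)
print(absolute)
print(concatenated)
```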

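3. The lxml Module
lxml is a Python wrapper around the libxml2 XML parsing library. The module is written in C and parses faster than Beautiful Soup, although it is also more complicated to install.
Code: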
```
import urllib.request
import lxml.html

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
tree = lxml.html.fromstring(html)
content = tree.cssselect('li.waring > a')

for n in content:
    link = n.get('href')
    title = n.get('title')
    tag = n.text
    print(tag, url + link, title)
```
The output is the same as above.
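
Note that cssselect() in lxml relies on the separately installed cssselect package. If that package is unavailable, the same elements can be located with lxml's built-in XPath support. The sketch below is an alternative to the code above, not the original method, and it runs on a made-up snippet rather than the live page.

```python
import lxml.html

# Made-up markup that mimics the li.waring > a structure used above.
snippet = """
<div id="alarmtip">
  <ul>
    <li class="waring">
      <a href="/publish/a.html" title="Title A">Tag A</a>
      <a href="/publish/b.html" title="Title B">Tag B</a>
    </li>
  </ul>
</div>
"""

tree = lxml.html.fromstring(snippet)

# XPath equivalent of the CSS selector 'li.waring > a' for an <li> whose
# class attribute is exactly "waring".
anchors = tree.xpath('.//li[@class="waring"]/a')

rows = [(a.text, a.get('href'), a.get('title')) for a in anchors]
for tag, link, title in rows:
    print(tag, link, title)
```

The exact-match test on @class is simpler than the CSS class selector: it would miss an element written as class="waring other", which li.waring would still match.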

4. Storing the Scraped Data in Lists or Dictionaries
Using the BeautifulSoup module as an example:
```
from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,'lxml')
content = soup.select('#alarmtip > ul > li.waring > a')

######### Append to lists
link = []
title = []
tag = []
for n in content:
    link.append(url+n.get('href'))
    title.append(n.get('title'))
    tag.append(n.text)

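######## Build one dictionary per entry and collect them in a list
# (assigning each dictionary to a single variable would keep only the last entry)
records = []
for n in content:
    records.append({
        'tag'   : n.text,
        'link'  : url+n.get('href'),
        'title' : n.get('title')
    })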
```
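
Once each entry is a dictionary, the collection can be indexed or serialized with the standard library. The following is a minimal sketch under stated assumptions: the scraped list is a hard-coded stand-in for the values read from each a element, and the names records, by_tag, and text are illustrative only.

```python
import json

url = 'http://www.nmc.cn'

# Stand-in for the (text, href, title) values read from each <a> element.
scraped = [
    ('Tag A', '/publish/a.html', 'Title A'),
    ('Tag B', '/publish/b.html', 'Title B'),
]

# One dictionary per entry, kept in page order.
records = [
    {'tag': tag, 'link': url + href, 'title': title}
    for tag, href, title in scraped
]

# Optional lookup table keyed by tag text (assumes tag texts are unique).
by_tag = {record['tag']: record for record in records}

# ensure_ascii=False keeps Chinese titles readable in the JSON text.
text = json.dumps(records, ensure_ascii=False, indent=2)
print(text)
```

A list of dictionaries preserves order and duplicates, while the dictionary keyed by tag gives direct lookup; which one fits depends on how the data is used afterwards.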
5. Summary
Table 2.1 summarizes the advantages and disadvantages of each scraping method.
Original article: https://www.cnblogs.com/interdrp/p/15911839.html