This article uses BeautifulSoup 3. BeautifulSoup 4 is now available; it is installed as beautifulsoup4 and imported as bs4.
(1) Download and installation
    # Download and install BeautifulSoup
    pip install BeautifulSoup
Alternatively, you can download the installation package and install it manually.
(2) Quick start
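    # BeautifulSoup quick start
    from BeautifulSoup import BeautifulSoup

    # html_doc is assumed to already hold the HTML source of the page
    soup = BeautifulSoup(html_doc)
    print soup.title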
Result:
    # BeautifulSoup result
    <title>前门大街_百度百科</title>
(3) Introduction to BeautifulSoup objects
BeautifulSoup mainly contains three types of objects:
- BeautifulSoup.BeautifulSoup
- BeautifulSoup.Tag
- BeautifulSoup.NavigableString
    # BeautifulSoup example
    from BeautifulSoup import BeautifulSoup
    import urllib2

    # html_doc is assumed to already hold the HTML source of the page
    soup = BeautifulSoup(html_doc)

    print type(soup)
    print type(soup.title)
    print type(soup.title.string)

    print soup.title
    print soup.title.string
The result is:
    # BeautifulSoup example output
    <class 'BeautifulSoup.BeautifulSoup'>
    <class 'BeautifulSoup.Tag'>
    <class 'BeautifulSoup.NavigableString'>
    <title>百度一下,你就知道</title>
    百度一下,你就知道
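The example above shows clearly that BeautifulSoup mainly comprises three types of objects:
- BeautifulSoup.BeautifulSoup // the BeautifulSoup object (the whole parsed document)
- BeautifulSoup.Tag // a tag object
- BeautifulSoup.NavigableString // a navigable string (text) object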
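For readers on the newer bs4 package, the same three object types can be checked with a small sketch like the one below. It assumes Python 3 with beautifulsoup4 installed, and the `html_doc` string is a made-up sample rather than the Baidu page used in this article.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample document (not the page used in the article).
html_doc = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)               # the document object
print(type(soup.title).__name__)         # a tag object
print(type(soup.title.string).__name__)  # the text inside the tag

print(soup.title)
print(soup.title.string)
```

In bs4 the class names are the same as in BeautifulSoup 3, but they live under the `bs4` package instead of the `BeautifulSoup` module.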
(4) The BeautifulSoup parse tree
1. BeautifulSoup.Tag object methods
Getting a tag object: use dot notation to get a Tag object
    # BeautifulSoup example
    title = soup.title
    print type(title.contents)
    print title.contents
    print title.contents[0]

    # BeautifulSoup example output
    <type 'list'>
    [u'\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053']
    百度一下,你就知道
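The contents attribute
Returns the children of the current tag as a list. If the tag contains only text and no child tags, string and contents[0] give the same result; see the example above.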
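The relationship between `contents` and `string` can be sketched as follows. This is a minimal illustration that assumes Python 3 and the bs4 package (not the BeautifulSoup 3 module used in this article); the HTML string is a made-up sample.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: <title> holds only text, <p> holds text plus a child tag.
html_doc = (
    "<html><head><title>Example page</title></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.title
print(title.contents)                     # a list with a single text node
print(title.string == title.contents[0])  # same text when there are no child tags

para = soup.p
print(len(para.contents))  # the text node plus the <b> tag
print(para.string)         # None, because <p> has more than one child
```

When a tag has several children, `string` is `None`, so iterating over `contents` is the safer way to reach each child.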
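next and parent
next returns the next element in parse order (for a tag that has children, this is its first child); parent returns the tag that encloses the current tag.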
    # BeautifulSoup example
    html = soup.html
    print html.next
    print ''
    print html.next.next
    print html.next.next.nextSibling

    # BeautifulSoup example output
    <head><meta http-equiv="content-type" content="text/html;charset=utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=Edge" /><meta content="always" name="referrer" /><meta name="theme-color" content="#2932e1" /><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="icon" sizes="any" mask="mask" href="//www.baidu.com/img/baidu.svg" /><link rel="dns-prefetch" href="//s1.bdstatic.com" /><link rel="dns-prefetch" href="//t1.baidu.com" /><link rel="dns-prefetch" href="//t2.baidu.com" /><link rel="dns-prefetch" href="//t3.baidu.com" /><link rel="dns-prefetch" href="//t10.baidu.com" /><link rel="dns-prefetch" href="//t11.baidu.com" /><link rel="dns-prefetch" href="//t12.baidu.com" /><link rel="dns-prefetch" href="//b1.bdstatic.com" /><title>百度一下,你就知道</title> ...... </head>

    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
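In bs4 the `next` attribute is spelled `next_element`, while `parent` keeps its name. The sketch below walks a small made-up document in both directions; it assumes Python 3 with bs4 installed and is only an illustration, not the code used to produce the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with no whitespace between tags,
# so that no whitespace-only text nodes appear in the tree.
html_doc = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

html = soup.html

# next_element follows parse order: <html> -> <head> -> <title> -> title text.
print(html.next_element.name)
print(html.next_element.next_element.name)
print(soup.title.next_element)

# parent moves upward: <title> -> <head> -> <html>.
print(soup.title.parent.name)
print(soup.title.parent.parent.name)
```

Because `next_element` follows parse order, it only coincides with "first child" when the tag actually has children; for an empty tag it jumps to whatever was parsed next.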
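nextSibling and previousSibling
Return the next sibling and the previous sibling of the current tag, that is, the adjacent nodes under the same parent.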
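As a rough illustration of sibling navigation, the following sketch uses the bs4 spellings `next_sibling` and `previous_sibling` (the counterparts of `nextSibling` and `previousSibling` in BeautifulSoup 3). It assumes Python 3 with bs4 installed, and the three-paragraph document is a made-up sample.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: three sibling <p> tags with no whitespace between them.
html_doc = "<html><body><p>one</p><p>two</p><p>three</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.p                # the first <p> in the document
second = first.next_sibling   # the node right after it under the same parent
third = second.next_sibling

print(second.string)
print(third.previous_sibling.string)
print(first.previous_sibling)  # None: the first <p> has nothing before it
print(third.next_sibling)      # None: the last <p> has nothing after it
```

If the HTML contains line breaks or spaces between tags, the sibling of a tag is often a whitespace text node rather than the next tag, which is a common source of confusion.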