Official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Beautiful Soup has one major advantage over other ways of parsing HTML: the document is decomposed into objects, with the whole tree exposed through dict- and list-like access.
Compared with regex-based scraping, it spares you the steep cost of learning regular expressions.
Compared with XPath-based parsing it also saves learning time, although XPath is already fairly simple. (The scraping framework Scrapy uses XPath.)
Installation
On Linux you can run
- apt-get install python-bs4
or use one of Python's package tools:
- easy_install beautifulsoup4
- pip install beautifulsoup4
Basic usage
Here is a quick tour of how to use BeautifulSoup.
Parsing HTML to extract data mainly comes down to three tasks:
1: Getting the content of a given tag.
- <p>hello, watsy</p><br><p>hello, beautiful soup.</p>
2: Getting an attribute of a given tag.
- <a href="http://blog.csdn.net/watsy">watsy's blog</a>
3: Finding the tags in the first place, which is where the search methods come in.
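The three tasks above can be sketched on a tiny snippet (just the two fragments shown above, stitched together):

```python
from bs4 import BeautifulSoup

html = ('<p>hello, watsy</p><br><p>hello, beautiful soup.</p>'
        '<a href="http://blog.csdn.net/watsy">watsy\'s blog</a>')
soup = BeautifulSoup(html, "html.parser")

text = soup.p.string             # 1: content of the first <p> tag
href = soup.a["href"]            # 2: the href attribute of the <a> tag
paragraphs = soup.find_all("p")  # 3: searching for every <p> tag
```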
The examples below use the document from the official docs:
- html_doc = """
- <html><head><title>The Dormouse's story</title></head>
- <body>
- <p class="title"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were three little sisters; and their names were
- <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
- <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
- <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
- and they lived at the bottom of a well.</p>
- <p class="story">...</p>
- """
Pretty-printing the document:
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html_doc, 'html.parser')
- print(soup.prettify())
- # <html>
- # <head>
- # <title>
- # The Dormouse's story
- # </title>
- # </head>
- # <body>
- # <p class="title">
- # <b>
- # The Dormouse's story
- # </b>
- # </p>
- # <p class="story">
- # Once upon a time there were three little sisters; and their names were
- # <a class="sister" href="http://example.com/elsie" id="link1">
- # Elsie
- # </a>
- # ,
- # <a class="sister" href="http://example.com/lacie" id="link2">
- # Lacie
- # </a>
- # and
- # <a class="sister" href="http://example.com/tillie" id="link2">
- # Tillie
- # </a>
- # ; and they lived at the bottom of a well.
- # </p>
- # <p class="story">
- # ...
- # </p>
- # </body>
- # </html>
Getting the content of a tag
- soup.title
- # <title>The Dormouse's story</title>
- soup.title.name
- # u'title'
- soup.title.string
- # u'The Dormouse's story'
- soup.title.parent.name
- # u'head'
- soup.p
- # <p class="title"><b>The Dormouse's story</b></p>
- soup.a
- # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
The example above shows four things:
1: getting a tag
soup.title
2: getting the tag's name
soup.title.name
3: getting the title tag's text
soup.title.string
4: getting the name of the title tag's parent
soup.title.parent.name
See how object-like the usage is?
Extracting tag attributes
Next, how to extract href and other attributes.
- soup.p['class']
- # ['title']
To get an attribute, the pattern is
soup.tag['attribute name']
(In bs4, class is a multi-valued attribute, so it comes back as a list of values.)
- <a href="http://blog.csdn.net/watsy">watsy's blog</a>
The most common case is extracting a link, as in the snippet above. The code is
- soup.a['href']
Pretty easy, right?
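A common follow-up is pulling the href out of every link at once. tag.get('href') is the safe variant of tag['href']: it returns None instead of raising KeyError when the attribute is missing. A small sketch using the sisters document from above:

```python
from bs4 import BeautifulSoup

html_doc = '''
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# collect the href of every <a> tag; missing attributes would give None
hrefs = [a.get("href") for a in soup.find_all("a")]
```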
Searching and filtering
Now for the important part: searching the whole document to find and extract matches.
soup provides find and find_all for searching. find is implemented internally as a call to find_all, so we only cover find_all here. Its signature:
- def find_all(self, name=None, attrs={}, recursive=True, text=None,
- limit=None, **kwargs):
Looking at the parameters: the first is the tag name, the second an attribute dict. recursive toggles recursive search, text matches on text content, limit caps the number of results, and **kwargs lets you pass attribute filters as keyword arguments.
Examples:
- by tag name
- soup.find_all('b')
- # [<b>The Dormouse's story</b>]
- by regular expression
- import re
- for tag in soup.find_all(re.compile("^b")):
- print(tag.name)
- # body
- # b
- for tag in soup.find_all(re.compile("t")):
- print(tag.name)
- # html
- # title
- by a list of names
- soup.find_all(["a", "b"])
- # [<b>The Dormouse's story</b>,
- # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
- by a function
- def has_class_but_no_id(tag):
- return tag.has_attr('class') and not tag.has_attr('id')
- soup.find_all(has_class_but_no_id)
- # [<p class="title"><b>The Dormouse's story</b></p>,
- # <p class="story">Once upon a time there were...</p>,
- # <p class="story">...</p>]
- by tag name plus attribute
- soup.find_all("p", "title")
- # [<p class="title"><b>The Dormouse's story</b></p>]
- filtering by tag name
- soup.find_all("a")
- # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
- # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
- # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
- filtering by tag attribute
- soup.find_all(id="link2")
- # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
- regex filtering on text
- import re
- soup.find(text=re.compile("sisters"))
- # u'Once upon a time there were three little sisters; and their names were '
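The remaining parameters (limit, recursive) and the class filter deserve a sketch of their own. Note that class is a Python keyword, so bs4 accepts class_= (or an attrs dict) for class filters:

```python
from bs4 import BeautifulSoup

html_doc = '''<html><body>
<p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>
</body></html>'''
soup = BeautifulSoup(html_doc, "html.parser")

first_two = soup.find_all("a", limit=2)              # limit caps the result count
direct = soup.body.find_all("p", recursive=False)    # direct children of <body> only
stories = soup.find_all("p", class_="story")         # class_ avoids the keyword clash
same = soup.find_all("p", attrs={"class": "story"})  # equivalent attrs-dict form
```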
Getting content and strings
Getting a tag's string:
- soup.title.string
- # u'The Dormouse's story'
Note: in practice you may want to convert the result to a plain string, e.g. unicode(title_tag.string) on Python 2 or str(...) on Python 3, since .string returns a NavigableString tied to the tree.
The strings attribute returns an iterator over all the text nodes under the tag:
- for string in soup.strings:
- print(repr(string))
- # u"The Dormouse's story"
- # u' '
- # u"The Dormouse's story"
- # u' '
- # u'Once upon a time there were three little sisters; and their names were '
- # u'Elsie'
- # u', '
- # u'Lacie'
- # u' and '
- # u'Tillie'
- # u'; and they lived at the bottom of a well.'
- # u' '
- # u'...'
- # u' '
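The whitespace-only entries above are rarely wanted in practice; the stripped_strings variant trims each string and skips the whitespace-only ones. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  hello  </p>\n<p>world</p>", "html.parser")

# stripped_strings trims each text node and drops whitespace-only ones
texts = list(soup.stripped_strings)
```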
Getting contents
.contents returns a tag's child nodes as a list.
- head_tag = soup.head
- head_tag
- # <head><title>The Dormouse's story</title></head>
- head_tag.contents
- # [<title>The Dormouse's story</title>]
- title_tag = head_tag.contents[0]
- title_tag
- # <title>The Dormouse's story</title>
- title_tag.contents
- # [u'The Dormouse's story']
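.contents has a sibling, .children, which yields the same nodes as a generator instead of building a list; handy for large documents. A sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

as_list = head_tag.contents                          # a real list of child nodes
names = [child.name for child in head_tag.children]  # generator over the same nodes
```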
Summary
In practice, the core usage boils down to:
- soup = BeautifulSoup(data, 'html.parser')
- soup.title
- soup.p['class']
- divs = soup.find_all('div', content='tpc_content')
- divs[0].contents[0].string
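The summary lines can be put together as a runnable sketch (the page snippet and the tpc_content class here are hypothetical, mirroring the summary; the class filter is written with an attrs dict):

```python
from bs4 import BeautifulSoup

# Hypothetical page snippet; "tpc_content" mirrors the class used in the summary.
data = '<html><body><div class="tpc_content"><p>first post</p></div></body></html>'
soup = BeautifulSoup(data, "html.parser")

divs = soup.find_all("div", attrs={"class": "tpc_content"})
text = divs[0].contents[0].string  # first child of the div is the <p>
```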
Translated from http://blog.csdn.net/watsy/article/details/14161201