Reading notes on O'Reilly "Web Scraping with Python" (June 2015) --- BeautifulSoup --- findAll

1. Using the BeautifulSoup library

BeautifulSoup is typically used to parse the web documents that a crawler has fetched.

A typical use case for its findAll function:

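The page at http://www.pythonscraping.com/pages/warandpeace.html is laid out as follows:

Its text appears in black, red, and green. The color is determined mainly by the span tag that wraps the text:

<span class="red">

<span class="green">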

Goal: extract all of the green text from this article.

The code:

# -*- coding: utf-8 -*-
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Download the page and parse it (the "lxml" parser requires the lxml package).
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "lxml")

# Collect every <span class="green"> tag, then print its text without markup.
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())

Output:

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna

Analysis: the script extracted the content of every piece of green text in the article.
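The same extraction can be tried without network access. The sketch below is only an illustration: SAMPLE_HTML is a made-up fragment that imitates the page's markup, green_text is a hypothetical helper name, and it uses the built-in "html.parser" so that lxml is not needed. It calls find_all, the current bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the page's markup (an assumption, not copied from the site).
SAMPLE_HTML = """
<div id="text">
<span class="red">Well, Prince, so Genoa and Lucca are now just family estates.</span>
It was with these words that <span class="green">Anna Pavlovna</span>
greeted <span class="green">the prince</span>.
</div>
"""


def green_text(markup):
    """Return the text of every <span class="green"> found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    return [span.get_text() for span in soup.find_all("span", {"class": "green"})]


names = green_text(SAMPLE_HTML)
print(names)
```

The red span is skipped because its class attribute does not match the filter.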

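About the call bsObj.findAll(tagName, tagAttributes)

The most commonly used arguments of .findAll() are tagName and tagAttributes.

tagName is the name of the tag to look for, such as "h1", "h2", or "h3".

tagAttributes is a dictionary of attributes to match, such as {"class": "green"}. A Python dictionary cannot hold the same key twice, so {"class": "green", "class": "red"} would keep only "red"; to match several values, pass them as a list: {"class": ["green", "red"]}.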
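A minimal sketch of the two arguments, run against a made-up fragment (SAMPLE_HTML and the variable names are assumptions for illustration). It uses find_all, the current bs4 spelling of findAll, and shows that both the tag name and an attribute value may be given as a list.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
SAMPLE_HTML = """
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<p><span class="green">Anna Pavlovna</span> said
<span class="red">Well, Prince</span> to
<span class="green">Prince Vasili</span>.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# tagName: a single name, or a list of names to match any of them.
headings = [tag.get_text() for tag in soup.find_all(["h1", "h2", "h3"])]

# tagAttributes: a dictionary mapping an attribute name to the wanted value.
green_only = [s.get_text() for s in soup.find_all("span", {"class": "green"})]

# A list of values matches tags carrying any one of them.
green_or_red = [
    s.get_text() for s in soup.find_all("span", {"class": ["green", "red"]})
]

print(headings)
print(green_only)
print(green_or_red)
```

Results come back in document order, so the red span sits between the two green ones in the last list.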


Original post: https://www.cnblogs.com/chensimin1990/p/6600971.html