BeautifulSoup库的使用

BeautifulSoup

网页解析库
提取网页数据
第三方库,使用的时候进行安装: pip3 install beautifulsoup4

解析库

解析器	使用方法
python标准库	BeautifulSoup(markup,"html.parser")
lxml HTML解析器	BeautifulSoup(markup,"lxml")
lxml XML解析器	BeautifulSoup(markup,"xml")
html5lib	BeautifulSoup(markup,"html5lib")

基本使用

html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

标签选择器

选择元素

 html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.link)
print(soup.head)

获取名称

 html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # title

获取内容

html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta  content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.link.attrs['rel'])

获取内容

html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta  content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)

嵌套选择

html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta  content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string) # 嵌套选择
# liudemeng - 博客园

子节点和子孙节点

html = """<div class="postCon"><div class="c_b_p_desc">摘要: 数据提取方法 json 数据交换格式,看起来像python中的(字典)的字符串 使用之前进行导包处理 import json 哪里会有json的数据 浏览器切换到手机版 抓包app json.loads 把json字符串转化为python类型 json.loads(json字符串) json.dum<a href="https://www.cnblogs.com/liudemeng/p/10715075.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.div.contents)

html = """<div class="postCon"><div class="c_b_p_desc">摘要: 数据提取方法 json 数据交换格式,看起来像python中的(字典)的字符串 使用之前进行导包处理 import json 哪里会有json的数据 浏览器切换到手机版 抓包app json.loads 把json字符串转化为python类型 json.loads(json字符串) json.dum<a href="https://www.cnblogs.com/liudemeng/p/10715075.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.div.children)
for i, child in enumerate(soup.div.children):
    print(i, child)

父节点和祖先节点


html ="""
<html lang="zh-cn">     
<head>
    <meta charset="utf-8" />
    <meta  content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.meta.parent)

兄弟节点

html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta  content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.meta.next_siblings)
print(soup.meta.previous_siblings)

标准选择器

find_all(name, attrs, recursive,text,**kwargs)

可根据标签名.属性.内容查找文档

name

html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta  content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('link'))
print(soup.find_all('link')[0])

attrs

html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta  content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link class='element' type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link id="list-1" title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id':"list-1"}))
print(soup.find_all(class_='element'))
# 以上两种写法都是正确的

text

html ="""
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
    <meta  content="width=device-width, initial-scale=1" />
<meta name="referrer" content="origin" />
    <title>liudemeng - 博客园</title>
    <link class='element' type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=svlZeZMvc3He7PqOD4T7SOuQn0_kIfLBYiy3idqd35Y1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden.css?v=R6EW1cwbYc7SqZ5y0CMKPNjYaFnIdEGDIwRo4NL-lHw1"/>
<link id="mobile-style" media="only screen and (max- 767px)" type="text/css" rel="stylesheet" href="/skins/summerGarden/bundle-summerGarden-mobile.css?v=0rD5dztz_pRczoCFf4jzWyb4-oTf_yCZ8ttZxIagC2s1"/>
    <link title="RSS" type="application/rss+xml" rel="alternate" href="https://www.cnblogs.com/liudemeng/rss"/>
    <link id="list-1" title="RSD" type="application/rsd+xml" rel="EditURI" href="https://www.cnblogs.com/liudemeng/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="https://www.cnblogs.com/liudemeng/wlwmanifest.xml"/>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text="liudemeng - 博客园"))
# ['liudemeng - 博客园']

find(name, attrs, recursive,text,**kwargs)

find返回单个元素,find_all返回所有的元素

find_parents() 和 find_parent()

find_parents() 返回所有的父节点, find_parent()返回父节点

总结:

建议使用lxml解析库, 必要时使用html.parser
标签选择器筛选功能弱但是速度快
建议使用find(), find_all() 查询匹配单个结果或者多个结果
记住常用的获取属性以及文本值的方法

相关阅读:
2020软件工程作业04
2020软件工程作业02
2020软件工程作业01
2020软件工程个人作业06——软件工程实践总结作业
 【软件工程小组-冲刺日志(第二天)】
软件工程逃课小组【团队名称-凡事预则立】
2020软件工程作业——团队02
2020软件工程作业05
2020软件工程作业00——问题清单
 2020软件工程作业04
原文地址：https://www.cnblogs.com/liudemeng/p/10724281.html