Python 3: The pyquery Parsing Library

I. Introducing pyquery

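Earlier we covered Beautiful Soup, where you may have noticed that its CSS selector support is not especially powerful. pyquery makes up for that shortcoming with jQuery-style selectors.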

Installation:

pip install pyquery

II. Basic usage

html ='''
<!DOCTYPE html>
<html>
<head>
    <title>Story</title>
</head>
<body>
   <p class="title" name="dromouse"><b>This is dromouse</b></p>
   <p class="story">Once upon a time there were three little sister;
       and their names were
       <a href="http://www.baidu.com" class="sister" id="link1"><!--GH--></a>
       <a href="http://www.baidu.com/oracle" class="sister" id="link2">Local</a>and
       <a href="http://www.baidu.com/title" class="sister" id="link3">Tillie</a>;
   and they lived at the bottom of a well.</p>
   <p class="story">...</p>

</body>
</html>

'''

from pyquery import PyQuery as pq

# Initialize a PyQuery object from the HTML string
doc = pq(html)

# Print all <p> elements
print(doc('p'))

You can also pass a URL directly:

from pyquery import PyQuery as pq

# Initialize a PyQuery object from the URL
doc = pq(url="https://www.baidu.com")

print(doc('title'))

Fetching the page yourself with requests and passing in the response text works too:

import requests

from pyquery import PyQuery as pq

url = "https://www.baidu.com"
# Initialize a PyQuery object from the response text
doc = pq(requests.get(url).text)

print(doc('title'))

Initializing from a local file:

from pyquery import PyQuery as pq

# Initialize a PyQuery object from a local HTML file
doc = pq(filename="demo.html")

print(doc('title'))

III. Basic CSS selectors

1. Basic usage

from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")

div = doc('.card .lazyload ')

print(div)

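2. Finding nodes

The query methods below are used the same way as their jQuery counterparts.

1. Child nodes

The find() method searches all descendant nodes, at any depth: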

from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")

div = doc('.card')

# Use find() to search the descendants for <img> tags
img = div.find('img')

print(img)

If you only want direct child nodes, use the children() method instead:

from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")

div = doc('.card')

# Use children() to select only direct child nodes
links = div.children('a')

print(links)

2. Parent nodes

Use the parent() method to get a node's direct parent:

from pyquery import PyQuery as pq

doc = pq(url="")

# First locate the child nodes
items = doc('.fa')

# Direct parent of each matched node
contains = items.parent()

print(contains)

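To get the ancestors of a node (its parent, its grandparent, and so on up the tree), use parents(). It returns every ancestor, not just the grandparent, and accepts an optional selector to filter them: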

from pyquery import PyQuery as pq

doc = pq(url="https://.com")

# First locate the child nodes
items = doc('.fa')

# All ancestors of the matched nodes
contains = items.parents()

print(contains)

3. Sibling nodes

To get sibling nodes, use the siblings() method:

from pyquery import PyQuery as pq

doc = pq(url="")

# First locate the element
items = doc('.card-text')

# Sibling nodes, i.e. nodes at the same level
contains = items.siblings()

print(contains)

4. Iteration

A pyquery selection may contain a single node or several. To process several nodes one by one, iterate over the selection with items():

from pyquery import PyQuery as pq

doc = pq(url="")

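# Select multiple nodes; items() returns a generator of PyQuery objects
# (named "cards" to avoid shadowing the built-in list)
cards = doc('.card').items()

print(type(cards))

# Iterate over the matches and print each one
for div in cards:
    print(div)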

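5. Extracting information

Now that we can locate nodes, the next step is to extract information from them: their attributes or their text.

Getting attributes with attr()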

from pyquery import PyQuery as pq

doc = pq(url="")

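# Multiple nodes: items() lets us iterate over every match
img = doc('.card .lazyload').items()

for i in img:
    # Get an attribute value
    img_href = i.attr('data-src')
    # Equivalent item-style access: i.attr['data-src']
    # (i.attr.name only works when the name is a valid Python identifier)
    print(img_href)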

Getting text with text() and html()

from pyquery import PyQuery as pq

doc = pq(url="")

# Multiple nodes: items() lets us iterate over every match
infos = doc('.btn').items()

for i in infos:
    # Get the plain text
    info = i.text()
    # Get the inner HTML, markup included
    info_html = i.html()
    print(info, info_html)

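attr(), text(), and html() can also modify a node when given a value: attr(name, value) sets an attribute, text("new text") replaces the text, and html("<a></a>") replaces the inner HTML.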

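6. Manipulating nodes

pyquery provides a set of methods for modifying nodes dynamically, such as adding a class to a node or removing a node. These operations can sometimes make extracting information much easier.

addClass() and removeClass() change a node's class attribute dynamically: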

from pyquery import PyQuery as pq

doc = pq(url="")

# Select the images
img = doc('.card a img')

print(img)
# Remove the "lazyload" class
img.removeClass('lazyload')

print(img)

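7. Removing nodes with remove()

The remove() method deletes the selected elements from the document. Here it strips the nested <p> so that only the wrapper's own text remains: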

from pyquery import PyQuery as pq
html = '''

<div class="wrap">
     Hello,world
     <p>This is a man</p>
    </div>
'''
doc = pq(html)

# Goal: extract only "Hello,world", so select the wrapper and remove its <p> first
wrap = doc('.wrap')

wrap.find('p').remove()

print(wrap.text())

Other commonly used methods include append(), empty(), and prepend(); they are used the same way as in jQuery.

Official documentation: http://pyquery.readthedocs.io/en/latest/api.html

8. Pseudo-class selectors

from pyquery import PyQuery as pq
html = '''

<!DOCTYPE html>
<html>
<head>
    <title>Story</title>
</head>
<body>
   <p class="title" name="dromouse"><b>This is dromouse</b></p>
   <p class="story">Once upon a time there were three little sister;
       and their names were
       <a href="http://www.baidu.com" class="sister" id="link1"><!--GH--></a>
       <a href="http://www.baidu.com/oracle" class="sister" id="link2">Local</a>and
       <a href="http://www.baidu.com/title" class="sister" id="link3">Tillie</a>;
   and they lived at the bottom of a well.</p>
   <p class="story">...</p>
   <ul>
       <li>1</li>
       <li>2</li>
       <li>3</li>
       <li>4</li>
   </ul>
    
</body>
</html>

'''
doc = pq(html)

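li_f = doc("li:first-child")       # first <li>
li_l = doc("li:last-child")        # last <li>
li_n = doc("li:nth-child(2)")      # second <li>
li_even = doc("li:nth-child(2n)")  # even-numbered <li> elements
# No <li> contains "were" in this document, so match the <p> that does
p_text = doc("p:contains(were)")
li_gt = doc("li:gt(2)")            # <li> elements with a zero-based index greater than 2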

Original article: https://www.cnblogs.com/Crown-V/p/12733771.html