XPath is a way to filter an HTML page to find the data we need. In lxml, a query that selects nodes returns its result as a list.
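A minimal sketch of the "result is a list" point, assuming lxml is installed. The HTML snippet and variable names here are made up for illustration and are not part of the sample page used in the rest of this post.

```python
from lxml import etree

# Hypothetical snippet, for illustration only
html = '<div><p class="a">one</p><p class="b">two</p></div>'
root = etree.HTML(html)

many = root.xpath('//p/text()')                   # several matches
single = root.xpath('//p[@class="a"]/text()')     # one match, still a list
missing = root.xpath('//p[@class="zzz"]/text()')  # no match gives an empty list

print(many, single, missing)
```

Because even a single match comes back wrapped in a list, index it (for example `single[0]`) before using it, and check for an empty list first when the element may be absent.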
HTML page to filter:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8"/>
    <title>Xpath 测试</title>
</head>
<body>
<div class="song">
    火药
    <b>指南针</b>
    <b>造纸术</b>
    <b>印刷术</b>
</div>
<div class = "tang">
    <ul>
        <li class = "balove">停车坐爱枫林晚,霜叶红于二月花</li>
        <li id = "hua">商女不知亡国恨,隔江犹唱后庭花</li>
        <li class = "love" name="yang">一骑红尘妃子笑,无人知是荔枝来</li>
        <li id = "bei">葡萄美酒夜光杯,欲饮琵琶马上催</li>
        <li><a href="http://www.baidi.com">百度一下</a></li>
    </ul>
    <ol>
        <li class="lucy">寻寻觅觅,冷冷清清,凄凄惨惨戚戚</li>
        <li class="balily">咋暖还寒时候,最难将息</li>
        <li class="lilei">三杯两盏淡酒</li>
        <li>怎敌他晚来风急</li>
        <li>雁过也,正伤心,却是旧时相识</li>
        <li>爱就一个字,我只说一次</li>
        <li>爱过,不后悔,保大</li>
    </ol>
</div>
</body>
</html>
XPath examples:
# Query the contents of the local file xpath.html
from lxml import etree

# Build the tree object (etree.parse uses the XML parser by default,
# so the file has to be well-formed)
tree = etree.parse('xpath.html')
# print(tree)

ret = tree.xpath('//div[@class="tang"]/ul/li[1]')
# ret is a list of elements
print(ret[0].text)
ret = tree.xpath('//div[@class="tang"]/ul/li[1]/text()')
print(ret)

# href attribute of the "百度一下" link
baidu = tree.xpath('//div[@class="tang"]/ul/li[5]/a/@href')
print(baidu)

# Logical and: both attribute conditions must hold
luoji = tree.xpath('//div[@class="tang"]/ul/li[@class="love" and @name="yang"]/text()')
print(luoji)

# Fuzzy matching with contains()
mohu = tree.xpath('//li[contains(@class,"l")]')
print(mohu)
mohu1 = tree.xpath('//li[contains(text(),"爱")]/text()')
print(mohu1)

# Prefix matching with starts-with()
start = tree.xpath('//li[starts-with(@class,"ba")]/text()')
print(start)

# Get all the text under an element with string(.)
text = tree.xpath('//div[@class="song"]')
string = text[0].xpath('string(.)')
print(string.replace('\n', '').replace('\t', ''))  # strip all newlines and tabs
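For readers who want to try the queries without creating xpath.html, here is a self-contained sketch. It assumes lxml is installed, inlines a trimmed copy of the sample page, and parses it with `etree.HTML`, which is lenient about malformed markup. The `PAGE` constant, the variable names, and the `split`/`join` whitespace cleanup are illustrative choices rather than the original approach.

```python
from lxml import etree

# Trimmed copy of the sample page, inlined so no file is needed
PAGE = """
<html><body>
<div class="song">
    火药
    <b>指南针</b>
</div>
<div class="tang">
  <ul>
    <li class="balove">停车坐爱枫林晚,霜叶红于二月花</li>
    <li class="love" name="yang">一骑红尘妃子笑,无人知是荔枝来</li>
    <li><a href="http://www.baidi.com">百度一下</a></li>
  </ul>
</div>
</body></html>
"""

tree = etree.HTML(PAGE)

# Attribute value: @href returns a list of strings
href = tree.xpath('//div[@class="tang"]/ul/li[3]/a/@href')

# Logical and: both attribute conditions must match
both = tree.xpath('//li[@class="love" and @name="yang"]/text()')

# Prefix and substring matching
prefix = tree.xpath('//li[starts-with(@class,"ba")]/text()')
has_ai = tree.xpath('//li[contains(text(),"爱")]/text()')

# string(.) concatenates every text node under the element,
# including the newlines and indentation between tags
raw = tree.xpath('//div[@class="song"]')[0].xpath('string(.)')
cleaned = "".join(raw.split())  # drop all whitespace, not only \n and \t

print(href, both, prefix, has_ai, cleaned)
```

Unlike the chained `replace` calls, `"".join(raw.split())` also removes ordinary spaces, so it is only suitable when the text itself contains no meaningful spaces.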
Filter results: