• XPath operations


     # Import packages
     import requests
     from lxml import etree
    
     headers = {
         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
     }
     res = requests.get(url='https://www.tuli.cc/index.html', headers=headers)
    
     # Instantiate the etree object
     tree = etree.HTML(res.text)
     ret = tree.xpath('//*[@id="img-container"]/div[6]/div/div/div[1]/a/img')  # Returns a list whose elements are Element node objects
     ret_src = ret[0].xpath('./@src')
     print(ret_src)
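
The request above needs network access, so here is a minimal offline sketch of the same idea. The markup is a hypothetical stand-in for the downloaded page, not the real site's HTML. It shows that an element path yields a list of Element objects, while a path ending in `@src` yields a list of strings, and that an empty result should be guarded before indexing.

```python
from lxml import etree

# Hypothetical markup standing in for a downloaded page (assumption, not the real site).
html = '<div id="img-container"><a href="/p/1"><img src="/img/a.jpg" alt="first"></a></div>'
tree = etree.HTML(html)

# A path that ends on an element returns a list of Element objects.
imgs = tree.xpath('//*[@id="img-container"]//img')

# A path that ends on @attr returns a list of strings.
srcs = tree.xpath('//*[@id="img-container"]//img/@src')

# xpath() returns an empty list when nothing matches, so guard before indexing.
first_img = imgs[0] if imgs else None
src_via_get = first_img.get('src') if first_img is not None else None

print(srcs, src_via_get)
```

`Element.get('src')` is an alternative to running a second `./@src` query on the node; which one reads better is a matter of taste.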
    
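     1) /: starts from the root node (between steps, selects direct children)
     2) //: matches at any depth
     3) .: the current node
     4) nodename: locate by node name; * matches any node name
     5) Locate by node attribute: nodename[@attr="value"], e.g. div[@id="divtag"]
     6) Get a node's attribute: @attr
     7) Get a node's text: text()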
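
The seven rules above can be tried out on a small inline document. This is only a sketch: the markup and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical markup used only to exercise each rule.
html = '''
<html><body>
  <div id="divtag"><p>hello</p><a href="https://example.com">link</a></div>
  <div id="other"><p>world</p></div>
</body></html>
'''
tree = etree.HTML(html)

# 1) / : an absolute path from the root, one direct child per step
top_divs = len(tree.xpath('/html/body/div'))

# 2) // : match p elements at any depth
all_p = tree.xpath('//p/text()')

# 5) attribute predicate: pick the div whose id is "divtag"
div = tree.xpath('//div[@id="divtag"]')[0]

# 3) . : continue from the current node
own_p = div.xpath('./p/text()')

# 4) * : any node name (here, every direct child of the div)
child_tags = [el.tag for el in div.xpath('./*')]

# 6) @attr and 7) text()
href = div.xpath('./a/@href')
link_text = div.xpath('./a/text()')

print(top_divs, all_p, own_p, child_tags, href, link_text)
```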
    
    Single attribute with multiple values & multiple-attribute matching
    1). Single attribute, multiple values: <div class="divtag clear"></div><div class="divtag clear item"></div>
    contains function: tree.xpath('//div[contains(@class, "divtag")]')
    2). Multiple attributes: <div class="divtag" name="item"></div> <div class="divtag"></div>
    and keyword: tree.xpath('//div[@class="divtag" and @name="item"]')
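
A short sketch comparing exact attribute matching, contains(), and the and keyword. The markup is hypothetical. Because contains() is a plain substring test, it also matches class values such as "divtagged"; the concat/normalize-space expression is a common XPath 1.0 idiom for matching a whole class token, shown here as one possible workaround rather than something the original notes use.

```python
from lxml import etree

# Hypothetical markup; the last div exists only to show the substring pitfall.
html = '''
<div class="divtag clear" name="item">A</div>
<div class="divtag clear item">B</div>
<div class="divtag" name="item">C</div>
<div class="divtagged">D</div>
'''
tree = etree.HTML(html)

# Exact match: the whole attribute string must equal "divtag".
exact = tree.xpath('//div[@class="divtag"]/text()')

# Substring match: any class value containing "divtag", including "divtagged".
substring = tree.xpath('//div[contains(@class, "divtag")]/text()')

# Whole-token match: pad with spaces so only the token "divtag" matches.
token = tree.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " divtag ")]/text()'
)

# Multiple attributes: both conditions need the @ prefix.
both = tree.xpath('//div[@class="divtag" and @name="item"]/text()')

print(exact, substring, token, both)
```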
    
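     Selecting by position:
     1). Index: note that XPath indexes start at 1
     2). last() function: selects the last node
     3). position() function: returns the node's position, for use in comparisons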
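
The three positional tools can be sketched on a five-item list. The markup and names are illustrative only.

```python
from lxml import etree

# Hypothetical list with five items.
html = '<ul><li>li1</li><li>li2</li><li>li3</li><li>li4</li><li>li5</li></ul>'
tree = etree.HTML(html)

first = tree.xpath('//ul/li[1]/text()')                # indexes start at 1, not 0
last = tree.xpath('//ul/li[last()]/text()')            # the last li
second_last = tree.xpath('//ul/li[last()-1]/text()')   # arithmetic on last()
first_two = tree.xpath('//ul/li[position()<3]/text()') # positions 1 and 2
middle = tree.xpath('//ul/li[position()>2 and position()<5]/text()')  # positions 3 and 4

print(first, last, second_last, first_two, middle)
```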
    

    Example (HTML file, parsed below as ./xpath_test.html):

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>XPath practice file</title>
    </head>
    <body>
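    <div id="007">
        "I am the text content of the div tag, a sibling of the p and div tags below"
        <p>This is the text content inside the p tag</p>
        <div>This is the div tag at the same level as the p tag</div>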
    </div>
    
    <div class="divtag">
        <ul>
            <li>li tag 1</li>
            <li>li tag 2</li>
            <li>li tag 3</li>
            <li>li tag 4</li>
            <li>li tag 5</li>
        </ul>
        <a href="https://www.baidu.com">This is a link to Baidu</a>
    </div>
    
    
    <div class="c1" name="laoda">The eldest is here</div>
    <div class="c1 c3" name="laoer">The second is willful: class has two values</div>
    <div class="c1" name="laosan">I am the third</div>
    
    <div id="112">
        <div>
            <h1>lsdjfihlfdjofjwjbcl</h1>
            <span>就返回使得开发</span>
        </div>
        <div>
            <h1>lsdjfihlfdjofjwjbcl</h1>
            <span>Hello</span>
        </div>
        <div>
            <h1>lsdjfihlfdjofjwjbcl</h1>
            <span>I'm fine</span>
        </div>
        <div>
            <h1>lsdjfihlfdjofjwjbcl</h1>
            <span>He's fine</span>
        </div>
        <div>
            <h1>lsdjfihlfdjofjwjbcl</h1>
            <span>All is well</span>
        </div>
    </div>
    
    </body>
    </html>

    from lxml import etree

    # Parse the HTML file above, saved as ./xpath_test.html
    tree = etree.parse('./xpath_test.html', etree.HTMLParser())
    # // matches at any depth; locate by node name
    ret1 = tree.xpath('//title/text()')
    
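    # Locate by node attribute
    ret2 = tree.xpath('//div[@id="007"]/text()')   # /text(): only the div's own direct text nodes
    # print(ret2)
    ret3 = tree.xpath('//div[@id="007"]//text()')  # //text(): text nodes of the div and all its descendants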
    # print(ret3)
    
    # Get an attribute value
    ret4 = tree.xpath('//div[@class="divtag"]/a/@href')
    # print(ret4)
    
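    # .: the current node; ./ selects direct children, .// selects descendants at any depth
    # Output format: <h1 text>-----<span text>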
    div_list = tree.xpath('//div[@id="112"]/div')
    for div in div_list:
        first = div.xpath('.//h1/text()')[0]
        second = div.xpath('./span/text()')[0]
        # print('%s-----%s' % (first, second))
    
    # Single attribute with multiple values
    ret5 = tree.xpath('//div[contains(@class, "c1")]/text()')
    print(ret5)
    
    # Multiple-attribute matching
    ret6 = tree.xpath('//div[@class="c1" and @name="laosan"]/text()')
    print(ret6)
    
    # Select by position: last()-1 is the second-to-last li
    ret7 = tree.xpath('//div[@class="divtag"]/ul/li[last()-1]/text()')
    # print(ret7)
    
    # li elements at positions 3 and 4
    ret8 = tree.xpath('//div[@class="divtag"]/ul/li[position()>2 and position()<5]/text()')
    print(ret8)
    
    # li elements at positions 1, 2 and 5
    ret9 = tree.xpath('//div[@class="divtag"]/ul/li[position()<3 or position()>4]/text()')
    print(ret9)
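
Results from /text() and //text() usually include whitespace-only strings from the indentation between tags. The sketch below shows one way to clean them up, using a reduced, hypothetical copy of the id="007" block. The XPath functions string() and normalize-space() are not used in the notes above; they are shown as an optional alternative that returns a single string instead of a list.

```python
from lxml import etree

# Reduced, hypothetical copy of the id="007" block from the practice file.
html = '''
<div id="007">
    "direct text of the div"
    <p>text inside p</p>
    <div>nested div</div>
</div>
'''
tree = etree.HTML(html)
div = tree.xpath('//div[@id="007"]')[0]

# ./text(): only the div's own text nodes; drop whitespace-only entries.
direct = [t.strip() for t in div.xpath('./text()') if t.strip()]

# .//text(): text nodes of the div and all of its descendants.
everything = [t.strip() for t in div.xpath('.//text()') if t.strip()]

# normalize-space(.) joins all descendant text and collapses runs of whitespace.
normalized = div.xpath('normalize-space(.)')

print(direct)
print(everything)
print(normalized)
```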
  • Original article: https://www.cnblogs.com/gaodenghan/p/13626704.html