• Parsing HTML with Python and BeautifulSoup


    * Function signatures of BeautifulSoup's .find() and .findAll()

    findAll(tag, attributes, recursive, text, limit, keywords)
    find(tag, attributes, recursive, text, keywords)
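
    The two signatures differ mainly in what they return and in the limit parameter. The sketch below illustrates this on a small, made-up HTML string (SAMPLE_HTML and its contents are assumptions, not part of the original page). It uses find_all, the current bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<div id="box">
  <p class="a">one</p>
  <p class="b">two</p>
  <section><p class="a">three</p></section>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list of every match; limit caps how many are returned.
all_p = soup.find_all("p")
first_two = soup.find_all("p", limit=2)

# find returns only the first match, or None when nothing matches.
first_p = soup.find("p")
missing = soup.find("table")

# recursive=False restricts the search to direct children of the tag.
box = soup.find("div", {"id": "box"})
direct_p = box.find_all("p", recursive=False)

print(len(all_p), len(first_two), first_p.get_text(), missing, len(direct_p))
```

    Because find can return None, it is worth checking the result before calling methods such as get_text() on it.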
    

      

    * Get span.green

    bsObj.findAll("span", {"class":"green"})

    #!/usr/local/bin/python
    # -*- coding: UTF-8 -*-
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
    from bs4 import BeautifulSoup
    
    def getBsObj(url):
        try:
            html = urlopen(url, None, 3)
        except(HTTPError, URLError) as e:
            print(e)
            return None
        try:
            bsObj = BeautifulSoup(html.read(), "html.parser")
        except AttributeError as e:
            return None
        return bsObj
    
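    bsObj = getBsObj("http://www.pythonscraping.com/pages/warandpeace.html")
    # getBsObj returns None when the request or parsing fails, so check first
    if bsObj is not None:
        # Print the text of every <span class="green">
        nameList = bsObj.findAll("span", {"class":"green"})
        for name in nameList:
            print(name.get_text())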
    

      

    * Get h1, h2, h3, h4, h5, h6

    bsObj.findAll({"h1","h2","h3","h4","h5","h6"})
    

      

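    // JavaScript: build a string in which each comma-separated element is wrapped in double quotes

    function quote(s) {
        return '"' + s.split(",").join('","') + '"';
    }
    var s = "h1,h2,h3,h4,h5,h6";
    console.log(quote(s));  // "h1","h2","h3","h4","h5","h6"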
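
    The list of heading names can also be built directly in Python, without a separate helper. The sketch below is one possible way, run against a made-up HTML string (SAMPLE_HTML is an assumption). It passes either a list of tag names or a compiled regular expression to find_all, the current bs4 spelling of findAll.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = (
    "<h1>Title</h1><p>intro</p>"
    "<h2>Part 1</h2><h3>Detail</h3><h2>Part 2</h2>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Build ["h1", ..., "h6"] and match any tag whose name is in the list.
heading_names = ["h%d" % i for i in range(1, 7)]
by_list = soup.find_all(heading_names)

# Alternative: a regular expression that is matched against the tag name.
by_regex = soup.find_all(re.compile(r"^h[1-6]$"))

for tag in by_list:
    print(tag.name, tag.get_text())
```

    Both calls return the headings in document order. The <p> element is skipped because its name is neither in the list nor matched by the pattern.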
    

      

    * Get span.green and span.red

    bsObj.findAll("span", {"class":{"green", "red"}})
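
    A small self-contained check of the multi-class lookup, assuming a made-up HTML string (SAMPLE_HTML is hypothetical). A list of class names is used here, and a span matches if its class is any one of them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<p>
  <span class="green">Anna</span>
  <span class="red">Well, Prince</span>
  <span class="blue">ignored</span>
</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match <span> tags whose class is "green" OR "red".
spans = soup.find_all("span", {"class": ["green", "red"]})
texts = [span.get_text() for span in spans]
print(texts)
```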

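    * Count the occurrences of the text "the prince" on the page (text= matches text nodes whose whole content is exactly this string)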

    nameList = bsObj.findAll(text="the prince")
    print(len(nameList))
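
    The difference between an exact match and a "contains" match is easy to miss, so here is a minimal sketch on a made-up HTML string (SAMPLE_HTML is an assumption). It uses string=, the newer bs4 name for the text= argument, and a compiled regular expression for the substring case.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<p><span class="green">the prince</span> spoke.</p>
<p>Then <span class="green">the prince</span> left, and the prince's aide stayed.</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact match: only text nodes that are exactly "the prince".
exact = soup.find_all(string="the prince")

# Substring match: any text node that contains "the prince".
containing = soup.find_all(string=re.compile("the prince"))

print(len(exact), len(containing))

# The results are text nodes, not tags; .parent gives the enclosing tag.
print(exact[0].parent.name)
```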

    * Find #text (the element with id="text")

    allText = bsObj.find(id="text")
    print(allText.get_text())

    * Find div#text

    allText = bsObj.find("div", {"id":"text"})
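
    The keyword form and the attribute-dictionary form can be compared side by side. The sketch below assumes a made-up HTML string (SAMPLE_HTML is hypothetical) and shows that adding a tag name narrows the match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = (
    '<div id="text">War and <span class="red">Peace</span></div>'
    '<p id="note">n</p>'
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword argument: any tag with id="text".
by_keyword = soup.find(id="text")

# Tag name plus attribute dictionary: only a <div> with id="text".
by_attrs = soup.find("div", {"id": "text"})

# A tag name that does not fit the id yields None.
wrong_tag = soup.find("p", {"id": "text"})

print(by_keyword.get_text(), by_keyword is by_attrs, wrong_tag)
```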

    * Find div#text > span.red (the first span.red that is a direct child of div#text)

    red = bsObj.find("div", {"id":"text"}).find("span", {"class":"red"}, recursive=False)
    print(red.get_text())
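
    To see what recursive=False changes, the sketch below runs the same lookup with and without it on a made-up HTML string (SAMPLE_HTML is an assumption). It also shows the equivalent CSS child selector through select_one, which relies on the soupsieve package that is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<div id="text">
  <p><span class="red">nested</span></p>
  <span class="green">first child span</span>
  <span class="red">direct child</span>
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", {"id": "text"})

# Default (recursive=True): searches all descendants, so the nested span wins.
any_depth = container.find("span", {"class": "red"})

# recursive=False: only direct children of div#text are considered.
direct = container.find("span", {"class": "red"}, recursive=False)

# The same idea written as a CSS child selector.
via_css = soup.select_one("div#text > span.red")

print(any_depth.get_text(), "|", direct.get_text(), "|", via_css.get_text())
```

    Note that this picks the first span.red among the direct children, which is not the same as the CSS :first-child pseudo-class: here the first child element of div#text is a <p>.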
    

      

  • Original article: https://www.cnblogs.com/mingzhanghui/p/9424791.html