• Parsing HTML in Python with the sgmllib module


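    """
    Example approach to parsing HTML text: when a tag starts, inspect its attrs and print the value of every href attribute.
    Install the dependency: pip install sgmllib3k
    Usage:
        1. Define a class that inherits from sgmllib's SGMLParser.
        2. Override SGMLParser methods to add your own tag-handling functions.
        3. Pass the data to be parsed to the parser via .feed(data) on an instance of your class; the overridden methods are then invoked automatically.
    """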
    from urllib import request
    import sgmllib
    
    
    class HandleHtml(sgmllib.SGMLParser):
        """
        Custom HTML parser class
        """
    
        def unknown_starttag(self, tag, attrs):
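            """
            Called at the start of any tag that has no dedicated handler method.
            :param tag: tag name
            :param attrs: the tag's attributes as a list of (name, value) pairs
            :return: None
            """
            # Print the value of every href attribute on this tag
            for name, value in attrs:
                if name == 'href':
                    print(f"{name}:{value}")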
    
    
    if __name__ == '__main__':
        # Fetch the page and decode it as UTF-8; the with block closes the connection
        with request.urlopen("http://freebuf.com/") as response:
            page = response.read().decode('utf-8')
    
        # Create the HTML parser object
        handle_html = HandleHtml()
        # Feed the data into the parser
        handle_html.feed(page)

    Output:

    href:https://www.freebuf.com/buf/plugins/wp-favorite-posts/wpfp.css
    href:https://static.3001.net/css/recentcomments/wp-recentcomments.css?ver=2.2.3
    href:https://www.freebuf.com/buf/plugins/gold/assets/css/widget.css?ver=1.3.2.1
    href:https://static.3001.net/css/highslide/highslide.css
    href:https://www.freebuf.com/buf/plugins/cartpauj-pm/style/style.css
    href: https://www.freebuf.com/buf/plugins/simditor/highlight/styles/default.css
    href:https://static.freebuf.com/images/favicon.ico
    href:https://static.3001.net/css/new/header.css
    href:https://static.3001.net/css/new/bootstrap.min.css?ver=2016051701
    href:https://static.3001.net/css/new/swiper-3.4.2.min.css
    href:https://static.3001.net/css/new/model.css?ver=2017112156855
    href:https://static.3001.net/css/new/style.css?ver=2018112123749359438534
    href:http://www.freebuf.com
    href:http://www.freebuf.com
    href:http://job.freebuf.com
    href:#
    ......
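
sgmllib was removed from the standard library in Python 3, which is why the example above depends on the third-party sgmllib3k package. As a rough sketch of the same three steps (subclass, override, feed) without that dependency, the standard library's html.parser.HTMLParser can be used instead. Note that this swaps in a different parser than the one the article uses. The class name HrefCollector and the sample markup are hypothetical, chosen only for illustration, and the hrefs are collected into a list rather than printed directly.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from every start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, the same shape sgmllib passes
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Inline sample markup (an assumption) so the sketch runs without network access
sample = (
    '<html><head><link rel="stylesheet" href="/css/style.css"></head>'
    '<body><a href="http://www.freebuf.com">home</a><a href="#">top</a>'
    '<p class="x">no link here</p></body></html>'
)

collector = HrefCollector()
collector.feed(sample)
collector.close()

for href in collector.hrefs:
    print(f"href:{href}")
```

Collecting into a list keeps the parsing step separate from whatever is done with the links afterwards, such as filtering or de-duplicating them.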
  • Original article: https://www.cnblogs.com/Jimc/p/10307684.html