• Screen scraping 2


    Using HTMLParser

    Using HTMLParser simply means subclassing it, and overriding various event-handling methods such as handle_starttag or handle_data.

    handle_starttag(tag, attrs): called when a start tag is found. attrs is a sequence of (name, value) pairs.

    handle_startendtag(tag, attrs): called for empty tags such as <br/>; the default implementation calls handle_starttag and then handle_endtag.

    handle_endtag(tag): called when an end tag is found.

    handle_data(data): called for textual data.

    handle_charref(ref): called for character references of the form &#ref;

    handle_entityref(name): called for entity references of the form &name;

    handle_decl(decl): called for declarations of the form <!...>

    handle_pi(data): called for processing instructions.
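To see when each handler fires, here is a minimal sketch that records every callback instead of acting on it. It assumes Python 3, where the class lives in html.parser; the names EventLogger and trace are made up for this illustration. Passing convert_charrefs=False is also a choice made here: with the Python 3 default, character and entity references are folded into handle_data and the two reference handlers are not called.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback as a tuple in self.events."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active
        # (assumption of this sketch; the Python 3 default is True).
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startendtag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


def trace(html):
    """Feed a complete HTML string to the logger and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in trace('<p class="a">Tom &amp; Jerry &#62;<br/></p>'):
        print(event)
```

Running it on a small snippet like the one above is a quick way to check which handlers a given piece of markup triggers before writing the real scraper.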

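    from urllib.request import urlopen
    from html.parser import HTMLParser

    class Scraper(HTMLParser):
        # State flags: are we currently inside an <h2> or an <a href=...>?
        in_h2 = False
        in_link = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
            if tag == 'h2':
                self.in_h2 = True
            if tag == 'a' and 'href' in attrs:
                # Start collecting the link text and remember its target
                self.in_link = True
                self.chunks = []
                self.url = attrs['href']

        def handle_data(self, data):
            # Link text may arrive in several pieces, so accumulate it
            if self.in_link:
                self.chunks.append(data)

        def handle_endtag(self, tag):
            if tag == 'h2':
                self.in_h2 = False
            if tag == 'a':
                # Only report links that sit inside an <h2> heading
                if self.in_h2 and self.in_link:
                    print('%s (%s)' % (''.join(self.chunks), self.url))
                self.in_link = False

    # feed() expects str, so decode the downloaded bytes
    # (assuming the page is UTF-8 encoded)
    text = urlopen("http://www.python.org/community/jobs/").read().decode("utf-8")
    parser = Scraper()
    parser.feed(text)
    parser.close()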
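The same state-machine idea can be tried without a network connection. The sketch below is a variant, not the listing above: it assumes Python 3, collects results in a list instead of printing them, and runs on an invented HTML snippet. LinkCollector, collect_links and SAMPLE are hypothetical names, and the markup is not taken from python.org.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (text, url) pairs for <a href=...> links found inside <h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_link = False
        self.chunks = []
        self.url = None
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2":
            self.in_h2 = True
        if tag == "a" and "href" in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs["href"]

    def handle_data(self, data):
        # Nested tags such as <b> split the text, so gather every piece
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        if tag == "a":
            if self.in_h2 and self.in_link:
                self.results.append(("".join(self.chunks), self.url))
            self.in_link = False


# Invented sample markup, used only for illustration
SAMPLE = """
<h2><a href="/jobs/1">Python <b>Developer</b></a></h2>
<p><a href="/about">About</a></p>
<h2><a name="anchor">No href</a></h2>
"""


def collect_links(html):
    """Parse a complete HTML string and return the collected (text, url) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    print(collect_links(SAMPLE))
```

Returning a list rather than printing makes the parser easier to test and reuse. The link outside any h2 and the anchor without an href are both skipped, which is the same filtering the flags perform in the original scraper.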
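    Author: Shane
    Source: http://bluescorpio.cnblogs.com
    The copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be placed prominently on the page; otherwise the author reserves the right to pursue legal liability.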
  • Original article: https://www.cnblogs.com/bluescorpio/p/2513950.html