Screen scraping 2

Using HTMLParser

Using HTMLParser simply means subclassing it, and overriding various event-handling methods such as handle_starttag or handle_data.
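As a rough illustration of the subclass-and-override pattern, here is a minimal sketch. It assumes Python 3, where the class is imported from html.parser (the Python 2 module was named HTMLParser). The class name TagCounter and the sample markup are made up for this example.

```python
from html.parser import HTMLParser  # Python 3 location of the class


class TagCounter(HTMLParser):
    """Hypothetical example: count start tags and gather non-blank text."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters.
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        if data.strip():
            self.text_parts.append(data.strip())


counter = TagCounter()
counter.feed("<p>Hello <b>world</b></p><p>Bye</p>")
counter.close()
print(counter.tag_counts)
print(counter.text_parts)
```

Only the overridden handlers do anything; every other event falls through to the base class, whose handlers are no-ops.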

handle_starttag(tag, attrs): called when a start tag is found. attrs is a sequence of (name, value) pairs.

handle_startendtag(tag, attrs): called for empty tags such as <br/>. The default implementation calls handle_starttag and then handle_endtag.

handle_endtag(tag): called when an end tag is found.

handle_data(data): called for textual data.

handle_charref(ref): called for character references of the form &#ref;

handle_entityref(name): called for entity references of the form &name;

handle_decl(decl): called for declarations of the form <!...>

handle_pi(data): called for processing instructions.
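To see when each of the handlers listed above fires, the following sketch records every callback in a list. It assumes Python 3 (html.parser); the class name EventLogger and the sample markup are invented for illustration. In Python 3.5 and later the parser converts character and entity references into plain text by default, so the sketch passes convert_charrefs=False to keep handle_charref and handle_entityref active.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical example: record each parser callback as a tuple."""

    def __init__(self):
        # convert_charrefs=False so that &name; and &#ref; reach their handlers
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default start-then-end behaviour.
        self.events.append(("startend", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


logger = EventLogger()
logger.feed('<!DOCTYPE html><?php x ?><p class="a">A&amp;B&#62;<br/></p>')
logger.close()
for event in logger.events:
    print(event)
```

Recording events instead of printing them directly makes the order of callbacks easy to inspect or compare in a test.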

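# Python 2 code: in Python 3 these names live in urllib.request and html.parser.
from urllib import urlopen
from HTMLParser import HTMLParser

class Scraper(HTMLParser):
    # State flags: are we currently inside an <h2> element or an <a href=...> link?
    in_h2 = False
    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # turn the (name, value) pairs into a dict
        if tag == 'h2':
            self.in_h2 = True
        if tag == 'a' and 'href' in attrs:
            # Start collecting the text of a new link and remember its target.
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        # Link text may arrive in several pieces, so gather them in a list.
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        if tag == 'a':
            # Only report links that appear inside an <h2> heading.
            if self.in_h2 and self.in_link:
                print '%s (%s)' % (''.join(self.chunks), self.url)
            self.in_link = False

# Download the page, feed it to the parser, and flush any buffered input.
text = urlopen("http://www.python.org/community/jobs/").read()
parser = Scraper()
parser.feed(text)
parser.close()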
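The same state-machine idea can be tried without network access. The sketch below is one possible Python 3 variant, not the original script: the class name HeadingLinkScraper, the results list, and the SAMPLE markup are assumptions made for this example. It stores (title, url) pairs instead of printing them inside the handler.

```python
from html.parser import HTMLParser


class HeadingLinkScraper(HTMLParser):
    """Hypothetical variant: collect links that appear inside <h2> elements."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_link = False
        self.chunks = []
        self.url = None
        self.results = []  # list of (link text, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2':
            self.in_h2 = True
        if tag == 'a' and 'href' in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        if tag == 'a':
            if self.in_h2 and self.in_link:
                self.results.append((''.join(self.chunks), self.url))
            self.in_link = False


# Made-up sample markup standing in for a downloaded page.
SAMPLE = """
<h2><a href="/jobs/1">Python <b>Developer</b></a></h2>
<p><a href="/about">About</a></p>
<h2><a href="/jobs/2">Data Engineer</a></h2>
"""

scraper = HeadingLinkScraper()
scraper.feed(SAMPLE)
scraper.close()
for title, url in scraper.results:
    print('%s (%s)' % (title, url))
```

Keeping the results on the instance leaves the decision about printing, filtering, or saving to the caller, and initialising all state in __init__ avoids sharing class-level attributes between parser instances.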

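Author: Shane
Source: http://bluescorpio.cnblogs.com
Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a clearly visible link to the original article must be given on the page; otherwise the author reserves the right to pursue legal liability.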

Original article: https://www.cnblogs.com/bluescorpio/p/2513950.html