Using HTMLParser
Using HTMLParser simply means subclassing it, and overriding various event-handling methods such as handle_starttag or handle_data.
handle_starttag(tag, attrs): called when a start tag is found. attrs is a sequence of (name, value) pairs.
handle_startendtag(tag, attrs): called for empty tags such as <br/>; the default implementation calls handle_starttag and then handle_endtag.
handle_endtag(tag): called when an end tag is found.
handle_data(data): called for textual data.
handle_charref(ref): called for character references of the form &#ref;.
handle_entityref(name): called for entity references of the form &name;.
handle_decl(decl): called for declarations of the form <!...>.
handle_pi(data): called for processing instructions.
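To see when each of the callbacks above fires, a small event recorder can help. The following is only a sketch: EventLogger and log_events are hypothetical names, and it assumes Python 3, where the class is imported from html.parser rather than from the HTMLParser module. In Python 3, character and entity references in text are converted and passed to handle_data by default, so the sketch passes convert_charrefs=False to keep handle_charref and handle_entityref active.

```python
# Python 3 import; in Python 2 this would be `from HTMLParser import HTMLParser`.
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records every handler call as a tuple."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref in use
        # (Python 3 otherwise folds references into handle_data).
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default start-then-end behaviour.
        self.events.append(('startend', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_charref(self, ref):
        self.events.append(('charref', ref))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_decl(self, decl):
        self.events.append(('decl', decl))

    def handle_pi(self, data):
        self.events.append(('pi', data))


def log_events(html):
    """Feed a complete HTML string and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == '__main__':
    sample = '<!DOCTYPE html><p class="x">a &amp; b &#62;<br/></p>'
    for event in log_events(sample):
        print(event)
```

Running the script prints one tuple per callback, which makes it easy to check how a given piece of markup is split into events before writing a real scraper.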
from urllib import urlopen
from HTMLParser import HTMLParser

class Scraper(HTMLParser):
    # State flags: are we currently inside an <h2> or an <a href=...>?
    in_h2 = False
    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2':
            self.in_h2 = True
        if tag == 'a' and 'href' in attrs:
            # Start collecting the link text and remember its target.
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        if tag == 'a':
            # Only links that appear inside an <h2> are printed.
            if self.in_h2 and self.in_link:
                print '%s (%s)' % (''.join(self.chunks), self.url)
            self.in_link = False

text = urlopen("http://www.python.org/community/jobs/").read()
parser = Scraper()
parser.feed(text)
parser.close()
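The scraper above targets Python 2 and depends on the live page. As a sketch of the same idea that can be tried offline, the variant below assumes Python 3 (html.parser) and collects results in a list instead of printing them. LinkScraper, scrape_links, and the SAMPLE markup are made up for illustration and are not taken from the python.org page.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class LinkScraper(HTMLParser):
    """Collects (text, url) pairs for links that appear inside <h2> tags."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_link = False
        self.chunks = []
        self.url = None
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h2':
            self.in_h2 = True
        if tag == 'a' and 'href' in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        # Link text may arrive in several pieces (e.g. around nested tags).
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        if tag == 'a':
            if self.in_h2 and self.in_link:
                self.results.append((''.join(self.chunks), self.url))
            self.in_link = False


def scrape_links(html):
    """Parse a complete HTML string and return the collected pairs."""
    parser = LinkScraper()
    parser.feed(html)
    parser.close()
    return parser.results


# Invented sample markup, standing in for a downloaded page.
SAMPLE = '''
<h2><a href="/jobs/1">Python &amp; Django Developer</a></h2>
<p><a href="/about">About</a></p>
<h2><a href="/jobs/2">Data <b>Engineer</b></a></h2>
'''

if __name__ == '__main__':
    for title, url in scrape_links(SAMPLE):
        print('%s (%s)' % (title, url))
```

For the sample markup, scrape_links returns the two headline links and skips the /about link, because that link is not inside an h2 element. Returning a list rather than printing keeps the parser easy to test and reuse.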