Screen scraping 3

This version uses BeautifulSoup to parse the page:

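from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

# Download the job board page as raw bytes
text = urlopen("http://www.python.org/community/jobs/").read()
# Decode leniently, then parse with the built-in HTML parser
soup = BS(text.decode('utf-8', 'ignore'), 'html.parser')

jobs = set()
for header in soup('h2'):
    # Find <a class="reference"> links inside this heading
    links = header('a', 'reference')
    if not links:
        continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))

# Print the unique entries, sorted case-insensitively
print('\n'.join(sorted(jobs, key=lambda s: s.lower())))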
Storing the entries in a set eliminates duplicates, and the names are then printed in case-insensitive sorted order.
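
The deduplicate-then-sort step can be tried on its own without touching the network. This is only a minimal sketch: the sample entries and the helper name unique_sorted are made up for illustration and are not part of the original script.

```python
# Hypothetical sample entries; the real script builds these from the page.
entries = ['Zope Corp (/z)', 'acme (/a)', 'Zope Corp (/z)', 'Beta Inc (/b)']

def unique_sorted(items):
    """Return the distinct items, ordered without regard to letter case."""
    # set() drops the repeated entry; key=str.lower compares lowercased copies
    return sorted(set(items), key=str.lower)

result = unique_sorted(entries)
print('\n'.join(result))
```

Without the key, a plain sort would place every uppercase name before every lowercase one, which is why the script lowercases the strings for comparison.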

soup('h2') returns a list of all h2 elements in the document (shorthand for soup.find_all('h2')).
header('a', 'reference') returns a list of the a elements inside that header whose class is reference.
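
To see what those two calls return, here is a small sketch run against an inline HTML string instead of the live page. The markup is invented for illustration and only imitates the structure the script assumes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the script expects.
html = """
<h2><a class="reference" href="/jobs/1">Acme</a></h2>
<h2>No link here</h2>
<h2><span><a class="reference external" href="/jobs/2">Beta</a></span></h2>
<h2><a class="other" href="/jobs/3">Gamma</a></h2>
"""

soup = BeautifulSoup(html, 'html.parser')

# Calling the soup (or any tag) is shorthand for find_all().
headers = soup('h2')

found = []
for header in headers:
    # The second positional argument is matched against the CSS class.
    links = header('a', 'reference')
    found.append([(str(a.string), a['href']) for a in links])

print(found)
```

The second heading has no link and the fourth has a link with a different class, so both produce empty lists; this is the case the "if not links: continue" check in the script skips. The third heading shows that the search also reaches links nested deeper than direct children.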

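Author: Shane
Source: http://bluescorpio.cnblogs.com
Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be displayed prominently on the page; otherwise the right to pursue legal liability is reserved.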

Original article: https://www.cnblogs.com/bluescorpio/p/2513951.html