1. A focused crawler for Chinese university rankings
Focused crawler: crawls only the given input URL; it does not expand the crawl to other pages.
Step 1: fetch the ranking page from the web — getHTMLText()
Best Chinese Universities Ranking (最好大学网): http://www.zuihaodaxue.com/zuihaodaxuepaiming2016.html
Step 2: extract the information from the page into a suitable data structure — fillUnivList()
Step 3: use that data structure to display the results — printUnivList(), with formatted output
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    # Fetch the page; return its text, or an error string on failure
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "获取失败"  # "fetch failed"

def fillUnivList(ulist, html):
    # Each university is one <tr> inside <tbody>; skip the bare
    # NavigableString nodes (newlines) between the <tr> tags
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            # tds[0]: rank, tds[1]: name, tds[3]: total score
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # {3} supplies chr(12288), the fullwidth space, as the fill character
    tplt = "{0:^10} {1:{3}^10} {2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    # without the fullwidth fill the Chinese column would misalign:
    # print("{:^10} {:^10} {:^10}".format("排名", "学校名称", "总分"))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming2016.html"
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)

main()
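The ranking page has changed over the years, so the live crawl may fail. As a sanity check, the parsing step can be exercised against a small inline HTML fragment with the same table layout; the fragment below is made up for illustration and only mimics the page's rank / name / province / score columns:

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the ranking table's column layout
demo_html = """
<table><tbody>
  <tr><td>1</td><td>清华大学</td><td>北京</td><td>95.3</td></tr>
  <tr><td>2</td><td>北京大学</td><td>北京</td><td>82.3</td></tr>
</tbody></table>
"""

def fillUnivList(ulist, html):
    # Same logic as the crawler: keep only <tr> Tag children,
    # then pull rank (td 0), name (td 1) and total score (td 3)
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

uinfo = []
fillUnivList(uinfo, demo_html)
print(uinfo)  # [['1', '清华大学', '95.3'], ['2', '北京大学', '82.3']]
```

Running the parser on a local fragment like this confirms the column indices before pointing it at the real page.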
Solving the Chinese alignment problem: by default, format() pads fields with Western (halfwidth) space characters, but Chinese and Western characters occupy different widths, so columns containing Chinese text drift out of alignment. The fix is to pad with the fullwidth Chinese space, chr(12288).
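A minimal illustration of the trick: chr(12288) is the fullwidth space U+3000, which occupies the same width as a CJK character, and the nested {3} field in the format spec lets us pass it in as the fill character at call time:

```python
# chr(12288) is the fullwidth space U+3000, same width as a CJK character
assert chr(12288) == '\u3000'

# Default halfwidth-space padding: the Chinese column drifts
print("{0:^10} {1:^10} {2:^10}".format("排名", "学校名称", "总分"))
print("{0:^10} {1:^10} {2:^10}".format("1", "清华大学", "95.3"))

# Fullwidth-space padding via the nested {3} fill field: columns line up
tplt = "{0:^10} {1:{3}^10} {2:^10}"
print(tplt.format("排名", "学校名称", "总分", chr(12288)))
print(tplt.format("1", "清华大学", "95.3", chr(12288)))
```

Either way the field is 10 characters wide; only the width of the padding characters differs, which is exactly what fixes the visual alignment.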
2. Extracting all URL links from an HTML page
Approach:
1. Find all <a> tags.
2. Parse each <a> tag and extract the link from its href attribute.
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "http://python123.io/ws/demo.html"
>>> r = requests.get(url)
>>> demo = r.text
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for link in soup.find_all('a'):
        print(link.get('href'))

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
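The same two-step approach works without a network round trip. The sketch below applies find_all('a') plus .get('href') to an inline snippet containing the two course links from demo.html (the link texts are placeholders):

```python
from bs4 import BeautifulSoup

# Inline snippet mirroring the two <a> tags in demo.html
html = ('<a href="http://www.icourse163.org/course/BIT-268001">Basic</a>'
        '<a href="http://www.icourse163.org/course/BIT-1001870001">Advanced</a>')

soup = BeautifulSoup(html, "html.parser")
# Step 1: find all <a> tags; Step 2: read each tag's href attribute
links = [a.get('href') for a in soup.find_all('a')]
print(links)
```

Using .get('href') rather than a['href'] returns None instead of raising KeyError when an <a> tag has no href attribute.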