What is the BeautifulSoup module for?
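Answer: it locates HTML tags and quickly extracts the content inside them. This is usually far more convenient than writing regular expressions by hand, and should be roughly on par with the xpath module.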
1. Parsers:
- BeautifulSoup(html,"html.parser")
- BeautifulSoup(html,'lxml')
- BeautifulSoup(html,'xml')
- BeautifulSoup(html,'html5lib')
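
The parser names above are passed as the second argument to BeautifulSoup. "html.parser" ships with Python, while lxml and html5lib are separate packages (the 'xml' parser also depends on lxml). As a minimal sketch, assuming you want to fall back to whichever parser is installed, something like the following could work. The helper name make_soup and the preference order are illustrative choices, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: build a soup with the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hello <b>world</b></p>")
print(used)               # name of the parser that was actually used
print(soup.b.get_text())  # text inside the <b> tag
```

Which parser gets picked depends on what is installed, so the first printed line may differ from machine to machine.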
Suppose we want to match the href attribute of an a tag:
    from bs4 import BeautifulSoup

    html = "<a href='http://baidu.com/'>this is baidu.com</a>"
    bs = BeautifulSoup(html, "lxml")
    all_href = bs.find_all('a')  # every <a> tag in the document
    for i in all_href:
        print(i['href'])  # read the href attribute of each tag
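
Indexing a tag with ['href'] raises KeyError when the attribute is missing. A small sketch of two ways around that, assuming a page that mixes links with and without href (the sample markup here is made up for illustration, and "html.parser" is used only so that no extra package is needed):

```python
from bs4 import BeautifulSoup

# Illustrative markup: one real link and one anchor without href
html = """
<a href='http://baidu.com/'>this is baidu.com</a>
<a name='top'>anchor without href</a>
"""
bs = BeautifulSoup(html, "html.parser")

# Option 1: only match <a> tags that actually carry an href attribute
hrefs = [a["href"] for a in bs.find_all("a", href=True)]
print(hrefs)

# Option 2: use .get(), which returns None instead of raising KeyError
maybe_hrefs = [a.get("href") for a in bs.find_all("a")]
print(maybe_hrefs)
```

The first form is handy when you only care about real links; the second keeps one entry per tag so you can see which ones lack the attribute.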
A complete script that parses a small page and prints every h1 tag:

    #!/usr/bin/env python
    # encoding: utf-8
    # by i3ekr

    from bs4 import BeautifulSoup

    html = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>title test demo</title>
    </head>
    <body>
        <h1>this is h1</h1>
        <h1>this is h1 two</h1>
        <h1>this is h1 three</h1>
        <a href="http://baidu.com">this is a href.</a>
    </body>
    </html>
    """
    bs = BeautifulSoup(html, "lxml")
    print(bs.find_all('h1'))  # a list of all <h1> Tag objects
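
find_all returns Tag objects, so printing the result shows the full markup. If only the text is wanted, a sketch along these lines may help. It uses a shortened version of the page and "html.parser" to stay dependency-free; both are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Shortened sample page, for illustration only
html = """
<html>
<head><title>title test demo</title></head>
<body>
<h1>this is h1</h1>
<h1>this is h1 two</h1>
<a href="http://baidu.com">this is a href.</a>
</body>
</html>
"""
bs = BeautifulSoup(html, "html.parser")

# get_text() returns the text inside each tag, without the markup
h1_texts = [h1.get_text() for h1 in bs.find_all("h1")]
print(h1_texts)

# find() returns only the first match (or None if nothing matches)
first_h1 = bs.find("h1").get_text()
print(first_h1)

# Attribute-style access is a shortcut to the first tag with that name
page_title = bs.title.string
print(page_title)
```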