Python Web Scraping 2: Extracting Information by HTML Tag and Handling Chinese Characters in URLs (a First Look at BeautifulSoup)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# python3
import string
from urllib import parse, request

from bs4 import BeautifulSoup

url = "https://ne0matrix.com/2020/01/08/伊朗，赢了"
# Passing a URL that contains Chinese characters straight to urlopen raises
# an error, so percent-encode it with quote() first.
# The safe= parameter lists the characters that must not be encoded; it
# defaults to '/'. Setting it to string.printable leaves every printable
# ASCII character untouched, so only the non-ASCII (Chinese) part is encoded.

url_quote = parse.quote(url, safe=string.printable)
# unquote() is the inverse of quote():
# url_unquote = parse.unquote(url_quote)
print(url_quote)

page_read = request.urlopen(url_quote).read()
page_decode = page_read.decode('utf-8')
# Use an explicit encoding so the file does not depend on the platform default.
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(page_decode)

with open('output.html', 'r', encoding='utf-8') as f:
    alltext = f.read()
bsobj = BeautifulSoup(alltext, 'html.parser')
# If the parser argument is omitted, BeautifulSoup picks the best parser that
# is installed (lxml if available) and emits a warning; parsing still works.

print(bsobj.title)
# The title element: <title>...</title>
print(bsobj.title.get_text())
# get_text() returns the title as plain text
date = bsobj.find('p', {'class': 'mt-3'}).get_text()
print(date.strip())
# strip() removes leading and trailing whitespace
count = bsobj.find('span', {'class': 'post-count'})
print(count.get_text().strip())
text = bsobj.find('div', {'class': 'markdown-body'})
print(text.get_text())
# The article body

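Note that find() returns None when no element matches, so chaining .get_text() directly onto it, as the script above does, raises AttributeError if the page layout changes. Below is a small sketch of a more defensive lookup. The sample HTML is invented for the example (only the class names are taken from the script above), and find_text is a hypothetical helper, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the downloaded page; only the class names
# (mt-3, post-count, markdown-body) come from the script above.
SAMPLE_HTML = """
<html><head><title>Sample post</title></head>
<body>
  <p class="mt-3"> 2020-01-08 </p>
  <span class="post-count"> 1.2k words </span>
  <div class="markdown-body"><p>Body text.</p></div>
</body></html>
"""


def find_text(soup, tag, css_class, default=""):
    """Return the stripped text of the first matching tag, or default if none matches."""
    node = soup.find(tag, {'class': css_class})
    return node.get_text().strip() if node is not None else default


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
title = soup.title.get_text()
date = find_text(soup, 'p', 'mt-3')
count = find_text(soup, 'span', 'post-count')
body = find_text(soup, 'div', 'markdown-body')
missing = find_text(soup, 'div', 'no-such-class', default='(not found)')

print(title, date, count, body, missing, sep=' | ')
```

Returning a default instead of raising keeps the scraper running when one field is absent; whether that is preferable to failing loudly depends on how the results are used.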
Original article: https://www.cnblogs.com/cityfckr/p/12357493.html