5.2 Individual Assignment 2

Problems encountered while writing a web crawler

How to fix garbled Chinese in lxml output when the crawled page is not encoded as utf-8

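After fetching the response with requests.get, response.content holds the raw bytes of the page, so it can be passed to lxml directly: with body = etree.HTML(response.content), lxml works out the encoding from the document itself and the Chinese text is no longer garbled.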

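response.text, on the other hand, returns a str (Unicode text) that requests has already decoded using response.encoding. If that guess was wrong, the text is garbled, so it has to be encoded back to bytes with the same encoding before it goes to lxml: body = etree.HTML(response.text.encode(response.encoding))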
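Why encoding response.text back with response.encoding helps can be shown without any network access. The sketch below is only an illustration under stated assumptions: the GBK page fragment is made up, and ISO-8859-1 stands in for response.encoding because that is what requests falls back to for text responses whose Content-Type header carries no charset.

```python
# Hypothetical page fragment, encoded the way a GBK site would send it.
original_text = '<p>海贼王</p>'
raw = original_text.encode('gbk')          # plays the role of response.content

# Assumed stand-in for response.encoding when the header has no charset.
guessed_encoding = 'ISO-8859-1'

# Plays the role of response.text: decoded with the wrong codec, so it is mojibake.
text = raw.decode(guessed_encoding)
looks_garbled = text != original_text

# Plays the role of response.text.encode(response.encoding):
# ISO-8859-1 maps every byte to one code point, so the round trip is lossless.
recovered = text.encode(guessed_encoding)

print(looks_garbled, recovered == raw)      # True True
print(recovered.decode('gbk'))              # <p>海贼王</p>
```

The round trip is only lossless when the guessed codec can decode every byte. requests decodes response.text with errors='replace', so with a multi-byte guess some bytes may already have been replaced and cannot be recovered, which is one reason passing response.content is the safer choice.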

In fact, in Python 3 it is enough to understand the relationship between str and bytes.
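A minimal sketch of that relationship, using only the standard library (the sample string is arbitrary): a str is text, bytes are one particular serialization of it, and encode/decode convert between the two with a named codec.

```python
text = '中文'                        # str: a sequence of Unicode code points
utf8_bytes = text.encode('utf-8')    # bytes: one serialization of the text
gbk_bytes = text.encode('gbk')       # bytes: a different serialization of the same text

print(len(text), len(utf8_bytes), len(gbk_bytes))   # 2 6 4

# Decoding with the codec that produced the bytes returns the original text.
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text

# Decoding with a different codec raises an error or yields garbage.
try:
    gbk_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(wrong_codec_failed)                            # True
```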

This article explains it very well: http://www.ituring.com.cn/article/1116

PS: Python 3 already has far fewer encoding-conversion problems than Python 2...

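When crawling one particular site, parsing response.content directly with lxml.etree gave different results from saving the page to a local file first and then parsing the file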

import codecs

import requests
from lxml import etree

url = 'http://op.hanhande.com/mh/'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:51.0) Gecko/20100101 Firefox/51.0'}

# Parse the response directly
response = requests.get(url, headers=HEADERS)
body = etree.HTML(response.content)
us = body.xpath('//div[@class="mh_list_li"]/ul/li/a/@href')
ts = body.xpath('//div[@class="mh_list_li"]/ul/li/a/text()')
print(len(us), len(ts))

# Save the page to a local file, then parse the file
with codecs.open('body.html', 'w', encoding=response.encoding) as f:
    f.write(response.content.decode(response.encoding))
# No encoding is given here, so the file is read as text in the platform default encoding
with codecs.open('body.html', 'r') as f:
    body = f.read()
    body = etree.HTML(body)
    us = body.xpath('//div[@class="mh_list_li"]/ul/li/a/@href')
    ts = body.xpath('//div[@class="mh_list_li"]/ul/li/a/text()')
    print(len(us), len(ts))

The output is:

14 14
582 582

My guess is that this is an encoding problem, but I do not know how to confirm it.
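One possible way to test that guess is to strictly decode the raw bytes with the charset the page declares and see whether, and where, decoding fails. The sketch below is an assumption-laden illustration, not a confirmed diagnosis of this site: first_decode_error is a hypothetical helper, and the sample bytes are made up to imitate a page that declares gb2312 but contains a character ('镕') that exists only in the larger GBK set. In real use, data would be response.content.

```python
def first_decode_error(data: bytes, encoding: str):
    """Return (offset, offending bytes) for the first strict decode failure, or None."""
    try:
        data.decode(encoding)               # strict decoding: raises on the first bad sequence
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.end]
    return None


# Hypothetical sample: bytes produced with GBK, as a page declaring gb2312 might contain.
sample = '<li><a href="/mh/1">镕</a></li>'.encode('gbk')

for codec in ('gb2312', 'gbk', 'gb18030'):
    # Expect a failure report for gb2312 and None for the two superset codecs.
    print(codec, first_decode_error(sample, codec))
```

If the declared charset fails while a superset such as gb18030 succeeds, that would support the encoding hypothesis: a parser that trusts the declared charset may stop at the first bad sequence, which could explain the shorter result. It would then be worth trying etree.HTML(response.content.decode('gb18030')) or etree.HTML(response.content, parser=etree.HTMLParser(encoding='gb18030')) and comparing the counts.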


Original article: https://www.cnblogs.com/dty602511/p/14914499.html