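For example:
import requests
from lxml import etree

url = 'https://zozo.jp/shop/mrolive/goods-sale/44057773/?did=73037089'
resp = requests.get(url=url)
# resp.text is decoded with whatever encoding requests guessed
html = etree.HTML(resp.text)
title = html.xpath('//div[@id="item-intro"]/h1/text()')[0]
print(title)
The printed result is: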
AeB[N ubN JEU[ / MA-1 U[ u]
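The output above is garbled (mojibake). To fix it:
1 First, find out which encoding the page uses:
import requests

url = 'https://zozo.jp/shop/mrolive/goods-sale/44057773/?did=73037089'
resp = requests.get(url=url)
# Look for a charset declaration inside the HTML itself
encodings = requests.utils.get_encodings_from_content(resp.text)
print(encodings)
The result is: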
['Shift_JIS']
This shows that the site's encoding is Shift_JIS.
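To see why the first attempt came out garbled and why the charset can still be read from the garbled text, here is a minimal offline sketch. It assumes the server sent no charset in its Content-Type header, in which case requests falls back to ISO-8859-1 when building resp.text. The sample HTML and the find_declared_charset helper are made up for illustration, and the regex is only a rough stand-in for requests.utils.get_encodings_from_content, not its actual implementation.

```python
import re

# Hypothetical page fragment, encoded the way a Shift_JIS site would send it.
sample_html = '<meta charset="Shift_JIS"><h1>レザー ブルゾン</h1>'
raw_bytes = sample_html.encode('shift_jis')

# Decoding with the wrong codec (ISO-8859-1 here) yields mojibake,
# but the ASCII-only charset declaration survives intact.
wrong_text = raw_bytes.decode('iso-8859-1')


def find_declared_charset(text):
    """Return the first charset named in a <meta> tag, or None."""
    match = re.search(r'<meta[^>]*charset=["\']?([\w-]+)', text, flags=re.I)
    return match.group(1) if match else None


declared = find_declared_charset(wrong_text)
# Decode the original bytes again, this time with the declared charset.
right_text = raw_bytes.decode(declared)
print(declared)
print(right_text == sample_html)
```

The key point is that the fix works on the raw bytes, not on the already mis-decoded string.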
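2 Decode the raw bytes in resp.content as Shift_JIS, then run the request again; the text is no longer garbled.
import requests
from lxml import etree

url = 'https://zozo.jp/shop/mrolive/goods-sale/44057773/?did=73037089'
resp = requests.get(url=url)
# Decode the raw bytes ourselves instead of relying on resp.text
resp_txt = resp.content.decode('Shift_JIS')
html = etree.HTML(resp_txt)
title = html.xpath('//div[@id="item-intro"]/h1/text()')[0]
print(title)
The result is: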
アンティーク ブラック カウレザー / MA-1 レザー ブルゾン
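The two steps can be folded into one small helper. This is only a sketch: decode_body, its parameters, and the UTF-8 fallback are assumptions for illustration, not part of requests. It tries each detected encoding in turn, so a wrong or unknown charset name does not crash the scraper.

```python
def decode_body(raw_bytes, candidate_encodings, fallback='utf-8'):
    """Decode raw_bytes with the first candidate encoding that works.

    Unknown codec names and bytes that do not fit a candidate are skipped;
    if nothing fits, decode with the fallback and replace bad bytes.
    """
    for name in candidate_encodings:
        try:
            return raw_bytes.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue
    return raw_bytes.decode(fallback, errors='replace')


# Stand-in for resp.content from a Shift_JIS page.
body = 'アンティーク ブラック'.encode('shift_jis')
print(decode_body(body, ['Shift_JIS']))
print(decode_body(body, ['no-such-codec', 'shift_jis']))
```

With a real response, this would be called as decode_body(resp.content, requests.utils.get_encodings_from_content(resp.text)). Another option is to set resp.encoding = 'Shift_JIS' before reading resp.text, which makes requests decode with that encoding.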