Python Web Scraping Tutorial: Beautiful Soup (Part 1)

# Python Web Scraping with Beautiful Soup (BeautifulSoup)

Author: jwang106

1. Fetching a page's HTML source with requests

import requests
from bs4 import BeautifulSoup


response = requests.get('https://www.autohome.com.cn/news/')
response.encoding = response.apparent_encoding
response.text
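
The call above sets no timeout and does not check whether the request succeeded. As a minimal sketch (the helper name fetch_html and the 10-second default timeout are assumptions made for illustration, not part of the original code), the fetch step could be wrapped like this:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of `url` (hypothetical helper, not from the original post)."""
    # Without a timeout, requests.get can wait indefinitely on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # Same step as above: decode using the encoding guessed from the body.
    response.encoding = response.apparent_encoding
    return response.text
```

Assuming the site is reachable, fetch_html('https://www.autohome.com.cn/news/') should return the same text as response.text above.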

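Summary of requests usage

response = requests.get(url)
# Passing query parameters with GET
>>> payload = {'key1': 'value1', 'key2': 'value2', 'key3': None}
>>> r = requests.get('http://httpbin.org/get', params=payload)

# A parameter value can also be a list
>>> payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

>>> r = requests.get('http://httpbin.org/get', params=payload)
>>> print(r.url)
http://httpbin.org/get?key1=value1&key2=value2&key2=value3

# Encoding used to decode the body (taken from the HTTP headers; can be overridden)
response.encoding

# Body decoded to str using response.encoding
response.text

# Body as raw bytes
response.content

# Body parsed as JSON; raises an exception if the body is not valid JSON
response.json()

# Encoding guessed from the body content itself
response.apparent_encoding

# HTTP status code, e.g. 200 or 404
response.status_code
# For convenience, Requests also ships with a built-in status code lookup object:
print(r.status_code == requests.codes.ok)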
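
To see how params is turned into a query string without sending anything over the network, one option is to build the request with requests.Request and inspect the prepared URL. This is only a sketch for offline experimentation; the tutorial itself uses requests.get throughout.

```python
import requests

payload = {'key1': 'value1', 'key2': ['value2', 'value3'], 'key3': None}

# Build and prepare the request locally; nothing is sent over the network.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

# List values are expanded into repeated keys, and keys whose value is None are dropped.
print(prepared.url)

# requests.codes maps readable names to numeric status codes.
print(requests.codes.ok, requests.codes.not_found)
```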

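2. Using Beautiful Soup

Example: suppose the goal is to download the images inside the a tags under the element with a given id in some HTML page.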

soup = BeautifulSoup(response.text,features='html.parser')

import uuid

# soup.find(id='xxx') is simple and easy to remember.
# The result of every find can itself be searched with find again.
# find returns the first match; find_all returns all matches as a list.
target = soup.find(id='auto-channel-lazyload-article')
li_list = target.find_all('li')
for i in li_list:
    a = i.find('a')
    if a:
        print(a.attrs.get('href'))
        txt = a.find('h3').text
        print(txt)
        # The src value omits the scheme, so prepend https: to get a full URL
        img_url = 'https:' + a.find('img').attrs.get('src')
        print(img_url)

        # Download the image and save it under a random, unique file name
        img_response = requests.get(url=img_url)
        file_name = str(uuid.uuid4()) + '.jpg'
        with open(file_name, 'wb') as f:
            f.write(img_response.content)

Printing the type of each of these objects makes things easier to understand:

print(type(soup))
print(type(target))
print(type(li_list[0]))

output:

<class 'bs4.BeautifulSoup'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>

Now look at the attributes of a:

a = li_list[0].find('a')
a.attrs

output:

{'href': '//www.autohome.com.cn/news/201901/928448.html#pvareaid=102624'}

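As you can see, attrs is a dictionary. Note that the href omits the scheme (there is no https:). This is a protocol-relative URL, which a browser resolves using the scheme of the current page. It poses no difficulty: we simply prepend https: ourselves.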
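
Instead of concatenating 'https:' by hand, the standard library's urllib.parse.urljoin can resolve such an href against the URL of the page it came from. A small sketch, using the href printed above:

```python
from urllib.parse import urljoin

page_url = 'https://www.autohome.com.cn/news/'
href = '//www.autohome.com.cn/news/201901/928448.html#pvareaid=102624'

# urljoin fills in the missing scheme from the page the link was found on.
full_url = urljoin(page_url, href)
print(full_url)
```

The same call also handles ordinary relative paths, so it is a reasonable default whenever links are read out of a page.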

The rest of the code is easy to follow: it uses requests to fetch each image and then writes it to a local file.

Beautiful Soup summary

soup = BeautifulSoup(response.text, features='html.parser')
soup.find('div')
soup.find(id='1')
soup.find('div', id='1')

find returns the first match (or None if nothing matches); find_all returns all matches as a list.
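
The calls in this summary can be tried without any network access by parsing a small HTML string. The HTML below is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = '''
<div id="1"><a href="/first">first</a></div>
<div id="2"><a href="/second">second</a><a href="/third">third</a></div>
'''
soup = BeautifulSoup(html, features='html.parser')

first_div = soup.find('div')            # first <div> in the document
second_div = soup.find('div', id='2')   # the <div> whose id is "2"
links = soup.find_all('a')              # every <a>, in document order
missing = soup.find('table')            # no match: find returns None

print(first_div.attrs)
print([a.attrs.get('href') for a in links])
# A find result is itself a Tag, so the search can be narrowed step by step.
print([a.text for a in second_div.find_all('a')])
print(missing)
```

Because find returns None when nothing matches, chained calls such as a.find('h3').text raise AttributeError on pages that lack the expected tag, so a None check before the second call is usually worth adding.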

3. A note on uuid

UUID stands for Universally Unique Identifier.

uuid.uuid1([node[, clock_seq]])
Generate a UUID from a host ID, sequence number, and the current time. 

uuid.uuid3(namespace, name)
Generate a UUID based on the MD5 hash of a namespace identifier (which is a UUID) and a name (which is a string).

uuid.uuid4()
Generate a random UUID.
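
The three functions can be compared side by side. This is a short sketch; the helper unique_file_name is hypothetical and simply mirrors the file_name line used in the scraper earlier.

```python
import uuid

NEWS_URL = 'https://www.autohome.com.cn/news/'

u1 = uuid.uuid1()                          # host ID + sequence number + current time
u3 = uuid.uuid3(uuid.NAMESPACE_URL, NEWS_URL)  # MD5 hash of namespace + name
u4 = uuid.uuid4()                          # random

print(u1.version, u3.version, u4.version)

# uuid3 is deterministic: the same namespace and name always give the same UUID.
print(u3 == uuid.uuid3(uuid.NAMESPACE_URL, NEWS_URL))


def unique_file_name(extension='.jpg'):
    """Hypothetical helper: a random UUID plus a file extension."""
    return str(uuid.uuid4()) + extension


print(unique_file_name())
```

For saving downloaded images, uuid4 is the natural choice because it needs no input and does not embed host or time information in the file name.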

Original article: https://www.cnblogs.com/wangjiale1024/p/10255024.html