Attack of the Crawlers 003: Scraping the Maoyan Top 100 Movies with BeautifulSoup

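BeautifulSoup
- BeautifulSoup is a Python library for parsing XML and HTML. It uses the structure and attributes of a page to navigate the parsed document, so a few short statements are enough to extract data from a web page.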
Choosing a parser

BeautifulSoup relies on an underlying parser to do the actual parsing:
- 1. Python standard library: BeautifulSoup(text, 'html.parser')
- 2. lxml HTML parser: BeautifulSoup(text, 'lxml')
- 3. lxml XML parser: BeautifulSoup(text, 'xml')
- 4. html5lib: BeautifulSoup(text, 'html5lib')
- The lxml HTML parser is the recommended choice.
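Since lxml and html5lib are third-party packages that may not be installed, one possible way to prefer lxml while still working without it is sketched below. The helper name `make_soup` is hypothetical and not part of BeautifulSoup; `FeatureNotFound` is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(text):
    """Hypothetical helper: parse with lxml if available, else with html.parser."""
    try:
        return BeautifulSoup(text, 'lxml')
    except FeatureNotFound:
        # lxml is optional; html.parser ships with the Python standard library
        return BeautifulSoup(text, 'html.parser')


soup = make_soup('<p class="title">Hello</p>')
print(soup.p.string)
```

Either parser finds the same p tag here; the parsers mainly differ in speed and in how they repair malformed markup.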
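Basic usage
- soup = BeautifulSoup(text, 'lxml')
- soup.prettify() returns the parsed document as a string with standard indentation
- soup.p.string: the string attribute returns the tag's text content (it is None when the tag has more than one child node)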
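The calls above can be tried on a small document. This is a minimal sketch on made-up HTML; it uses 'html.parser' only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration
html = '<html><body><p class="intro">Hello, <b>world</b></p><p>Second</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# prettify() returns the document as an indented string
print(soup.prettify())

first_p = soup.p                          # the first p tag
first_string = first_p.string             # None, because this p has two child nodes
first_text = first_p.get_text()           # concatenates all text inside the tag
second_string = soup.find_all('p')[1].string  # a p with a single text child

print(first_string, first_text, second_string)
```

When a tag mixes text and child tags, get_text() is usually the safer way to read its content.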
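Node selectors
- Selecting an element: soup.p. If there are several p elements, only the first one is returned
- Extracting attributes: soup.p.attrs returns a dictionary
- soup.p['class'] returns the value of an attribute, which may be a string or a list (multi-valued attributes such as class come back as a list)
- Nested selection: soup.p.a.string
- Related selection, downward: children for direct child nodes, descendants for all descendant nodes
- Related selection, upward: parent for the parent node, parents for all ancestor nodes
- Sibling nodes: next_sibling, previous_sibling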
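The node selectors listed above can be illustrated on a small made-up fragment. This is a sketch, not part of the original post; the sample HTML deliberately has no whitespace between tags, so the sibling of the first p is the second p rather than a text node.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration
html = '<div id="box"><p class="a b">one <a href="/x">link</a></p><p id="second">two</p></div>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p                                   # first p only
attrs = p.attrs                              # dictionary of all attributes
classes = p['class']                         # class is multi-valued, so this is a list
link_text = soup.p.a.string                  # nested selection
children = list(p.children)                  # direct children: a text node and the a tag
descendants = list(p.descendants)            # children plus the text inside the a tag
parent_name = p.parent.name                  # name of the enclosing tag
ancestor_names = [t.name for t in p.parents] # every ancestor up to the document
second_id = p.next_sibling['id']             # id is single-valued, so this is a string

print(attrs, link_text, parent_name, ancestor_names, second_id)
```

Note that children and descendants are iterators, which is why they are wrapped in list() here.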
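Method selectors
- find_all() finds every tag that matches the conditions and returns them in a list
- find() returns the first tag that matches the conditions (or None if nothing matches)
- CSS selectors: BeautifulSoup also provides CSS selectors through select(). Readers who know web development well and prefer to pick tags with CSS selectors can use the pyquery parsing library, which is not covered here.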
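A short sketch of the method selectors on made-up HTML follows. The tag and class names are illustrative only; `select()` and `select_one()` are BeautifulSoup's own CSS selector methods.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration
html = '''
<ul>
  <li class="item">A</li>
  <li class="item hot">B</li>
  <li>C</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

all_li = soup.find_all('li')                # every li tag, in document order
items = soup.find_all('li', class_='item')  # class_ matches any one of a tag's classes
first = soup.find('li')                     # first match only
missing = soup.find('table')                # no match gives None, not an empty list
hot = soup.select('li.hot')                 # CSS selector, returns a list
first_item = soup.select_one('li.item')     # CSS selector, first match only

print(len(all_li), len(items), first.string, missing, hot[0].string, first_item.string)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.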
Scraping the Maoyan top 100 movies with BeautifulSoup
```
from bs4 import BeautifulSoup as bs
import requests

def get_movie_info(ret):
    soup = bs(ret.text, 'lxml')   # parse the HTML page with BeautifulSoup
    all_dd = soup.find_all('dd')  # find every dd tag on the page
    for dd in all_dd:             # each dd tag holds the information for one movie
        num = dd.i.string         # rank of the current movie
        movie_infos = [num]
        for p in dd.find_all('p'):
            if p.string:   # collect the title, starring actors and release date in turn
                movie_infos.append(p.string.strip())

        movie = f'Rank: {movie_infos[0]}, Title: {movie_infos[1]}, {movie_infos[2]}, {movie_infos[3]}'
        print(movie)

url = 'https://maoyan.com/board/4'

for offset in range(10):
    data = {
        'offset': offset * 10  # 10 movies per page, so the offsets are 0, 10, ..., 90
    }
    ret = requests.get(url, params=data)  # fetch the HTML page
    get_movie_info(ret)  # parse the page and print its movies
```
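To try the parsing logic without hitting the network, the same dd/i/p traversal can be run on a hand-written fragment. This is only a sketch: `SAMPLE` is a made-up simplification and not the real Maoyan markup, and `parse_movies` is a hypothetical name. It returns the fields instead of printing them and uses 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the dd/i/p layout the scraper relies on (an assumption)
SAMPLE = '''
<dl>
  <dd><i>1</i><p><a>Movie A</a></p><p> Starring: X, Y </p><p>Released: 1993-01-01</p></dd>
  <dd><i>2</i><p><a>Movie B</a></p><p> Starring: Z </p><p>Released: 1994-09-10</p></dd>
</dl>
'''


def parse_movies(html):
    """Hypothetical helper: return one list of fields per dd tag."""
    soup = BeautifulSoup(html, 'html.parser')
    movies = []
    for dd in soup.find_all('dd'):
        fields = [str(dd.i.string)]            # rank
        for p in dd.find_all('p'):
            if p.string:                       # skip p tags without a single string
                fields.append(p.string.strip())
        movies.append(fields)
    return movies


movies = parse_movies(SAMPLE)
for rank, title, stars, released in movies:
    print(rank, title, stars, released)
```

Separating parsing from fetching in this way makes it easier to check the extraction rules before running all ten requests.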
Original post: https://www.cnblogs.com/zhangjian0092/p/11215750.html