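The BeautifulSoup module
- Brief introduction
It is a tool for extracting the data we need from HTML source code, and it is usually more convenient and more robust than hand-written regular expressions.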
BeautifulSoup is also known as bs4, which is the name of the package you import.
- Installation
pip install beautifulsoup4
- Simple example
import requests
import bs4
url = 'https://www.lagou.com/'
res = requests.get(url)
res.raise_for_status()
no = bs4.BeautifulSoup(res.text)
print(type(no))
bs4.BeautifulSoup('a string holding the contents of an HTML file'): returns a BeautifulSoup object
Running the code above as-is produces a warning:
D:/JavaSoft/pycharm-professional-2019.3/WorkSpace/python_learning/python_base/webcrawle/webcrawle_demo3.py:11: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 11 of the file D:/JavaSoft/pycharm-professional-2019.3/WorkSpace/python_learning/python_base/webcrawle/webcrawle_demo3.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
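In other words: no parser was explicitly specified, so Beautiful Soup picked the best HTML parser available on this system ("lxml"). That is usually not a problem, but on another system or in a different virtual environment it may pick a different parser and behave differently.
Summary: the parser was not named explicitly. Make sure the lxml module is installed (pip install lxml), then pass 'lxml' as the second argument when constructing the BeautifulSoup object, and the warning goes away.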
import requests
import bs4
url = 'https://www.lagou.com/'
res = requests.get(url)
res.raise_for_status()
no = bs4.BeautifulSoup(res.text, 'lxml')
print(type(no))
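If lxml is not available, the warning can also be avoided with the parser that ships with Python. The following is a minimal sketch, assuming a made-up inline HTML string (`html_doc` is hypothetical) in place of a live request, so it runs without network access.

```python
import bs4

# Hypothetical inline document, used so the example needs no network access.
html_doc = '<html><body><p id="greeting">hello</p></body></html>'

# Naming the parser explicitly keeps the behaviour the same on every machine.
# 'html.parser' is part of the standard library; 'lxml' needs `pip install lxml`.
soup = bs4.BeautifulSoup(html_doc, 'html.parser')

print(type(soup))
print(soup.p.getText())
```

Different parsers can repair malformed HTML in slightly different ways, which is why pinning one parser is worthwhile.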
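- The select method
Purpose: once bs4 has loaded the whole HTML source into an object, call this method with a CSS selector to find the elements and data we need.
How it works: according to the source code, every call to select() returns a ResultSet object from the element module. ResultSet inherits from list, so what comes back is effectively a list of Tag objects. A Tag object can be passed to str(), and it has an attrs attribute that stores all of the tag's HTML attributes as a dictionary.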
import requests
import bs4
# Download the page from Lagou and save it to a local file (written in binary mode)
# url = 'https://www.lagou.com/'
# res = requests.get(url)
# res.raise_for_status()
# with open('lagou.txt', 'wb') as op:
#     for line in res.iter_content(1000):
#         op.write(line)
# Parse the saved copy; the with block closes the file once parsing is done
with open('lagou.txt', 'r', encoding='utf-8') as file:
    soup = bs4.BeautifulSoup(file, 'lxml')
print(type(soup))
elems = soup.select('#search_input')
print(elems)
print(type(elems))
print(len(elems))
print(elems[0])
print(type(elems[0]))
print(elems[0].getText())
print(elems[0].attrs)
print(elems[0].get('placeholder'))
Output after running the code:
<class 'bs4.BeautifulSoup'>
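[<input maxlength="64" placeholder="搜索职位、公司或地点" type="text" id="search_input" class="search_input" autocomplete="off" tabindex="1" value=""/>]
<class 'bs4.element.ResultSet'>
1
<input maxlength="64" placeholder="搜索职位、公司或地点" type="text" id="search_input" class="search_input" autocomplete="off" tabindex="1" value=""/>
<class 'bs4.element.Tag'>

{'maxlength': '64', 'placeholder': '搜索职位、公司或地点', 'type': 'text', 'id': 'search_input', 'class': ['search_input'], 'autocomplete': 'off', 'tabindex': '1', 'value': ''}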
搜索职位、公司或地点
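
The same behaviour can be tried without the saved lagou.txt file. This is a small sketch, assuming a made-up HTML snippet loosely modelled on the search box above (`html_doc`, the `tab` class and the link targets are all hypothetical) and the built-in 'html.parser' so that lxml is not required.

```python
import bs4

# Hypothetical snippet loosely modelled on the search box discussed above.
html_doc = '''
<form>
  <input id="search_input" class="search_input" type="text"
         placeholder="search jobs" value=""/>
  <a class="tab" href="/jobs">Jobs</a>
  <a class="tab" href="/companies">Companies</a>
</form>
'''

soup = bs4.BeautifulSoup(html_doc, 'html.parser')

# select() takes a CSS selector and always returns a list-like ResultSet.
inputs = soup.select('#search_input')
links = soup.select('a.tab')

print(isinstance(inputs, list))
print(len(inputs), len(links))

# Each item is a Tag; attrs is a plain dict of its HTML attributes.
box = inputs[0]
print(box.attrs['placeholder'])
print(box.get('missing'))  # get() returns None when the attribute is absent

for link in links:
    print(link.get('href'), link.getText())

# A selector with no match gives an empty ResultSet, so check before indexing.
no_match = soup.select('#does_not_exist')
print(len(no_match))
```

Because select() returns a list even for an id selector, indexing with [0] is only safe after checking that the result is not empty.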