Winter Break Study Progress 7 (Python Web Scraping)

1. Fetching a web page with Python's built-in urllib

# -*- coding: UTF-8 -*-

from urllib import request

if __name__ == "__main__":
    response = request.urlopen("https://www.cnblogs.com/")
    html = response.read()
    html = html.decode("utf-8")
    print(html)

request.urlopen sends a request to https://www.cnblogs.com/, and the returned data is stored in response.

html.decode("utf-8") decodes the returned data. response.read() returns bytes, so decoding is what turns them into a string.
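Hard-coding "utf-8" only works when the page really is UTF-8. As a rough sketch, the charset declared in the Content-Type header can be tried first, with a fallback when it is missing or wrong. The helper names decode_body and fetch_text are my own for illustration and are not part of urllib.

```python
from typing import Optional
from urllib import request


def decode_body(raw: bytes, declared_charset: Optional[str] = None,
                fallback: str = "utf-8") -> str:
    """Decode raw bytes with the declared charset, else fall back."""
    charset = declared_charset or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong declaration: decode with the
        # fallback and replace any bytes that cannot be decoded.
        return raw.decode(fallback, errors="replace")


def fetch_text(url: str) -> str:
    """Fetch a page and decode it (hypothetical helper, needs network)."""
    with request.urlopen(url) as response:
        raw = response.read()
        # The charset from the Content-Type header, or None if absent
        declared = response.headers.get_content_charset()
    return decode_body(raw, declared)
```

Calling fetch_text("https://www.cnblogs.com/") would then stand in for the read() and decode() steps above, assuming the site is reachable.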

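Install chardet with pip install chardet. This third-party library can detect the encoding of the target page automatically; chardet.detect returns a dict that includes the guessed encoding and a confidence score.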

# -*- coding: UTF-8 -*-
from urllib import request
import chardet

if __name__ == "__main__":
    response = request.urlopen("https://www.cnblogs.com/")
    html = response.read()
    charset = chardet.detect(html)
    print(charset)

urllib is the most basic library to master when learning Python web scraping. It contains four main modules:

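urllib.request: the basic HTTP request module. It can send requests to a target server much as a browser would.
urllib.error: the exception handling module. If a request fails, the exception can be caught.
urllib.parse: a utility module. It provides URL handling functions, such as encoding and decoding URLs.
urllib.robotparser: parses a site's robots.txt to decide which URLs may be crawled and which may not.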
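To make the module list a little more concrete, here is a small offline sketch that touches urllib.parse, urllib.robotparser and urllib.error without sending any request. The example.com URLs and the robots.txt rules are made up for illustration, and describe_error is a hypothetical helper of my own.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote, unquote, urlencode, urlparse
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into parts and percent-encode values
parts = urlparse("https://example.com/search?q=python")
encoded = quote("爬虫")
query = urlencode({"q": "python", "page": 2})

# urllib.robotparser: evaluate rules passed in as lines of text.
# These rules are invented here, not taken from a real site.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed_public = robots.can_fetch("*", "https://example.com/public/page")
allowed_private = robots.can_fetch("*", "https://example.com/private/page")


# urllib.error: HTTPError is a subclass of URLError, so check it first
def describe_error(exc):
    if isinstance(exc, HTTPError):
        return "HTTP error %d" % exc.code
    if isinstance(exc, URLError):
        return "URL error: %s" % exc.reason
    return "other error"


print(parts.netloc, parts.path, parts.query)
print(encoded, unquote(encoded))
print(query)
print(allowed_public, allowed_private)
```

For a real site, RobotFileParser would normally be pointed at the site's robots.txt with set_url() and read() instead of being given hand-written rules.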

Extracting a tag

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

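def getTitle(url):
    # Return None if the server answers with an HTTP error status
    try:
        html = urlopen(url)
    except HTTPError:
        return None
    # A page without a <head> makes .title raise AttributeError
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.head.title
    except AttributeError:
        return None
    return title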

title = getTitle("http://www.baidu.com")
if title is None:
    print("Title could not be found!")
else:
    print(title)

Original article: https://www.cnblogs.com/liujinxin123/p/12258385.html