A Hands-On Introduction to Web Crawling with Python

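I. Web Crawlers

A web crawler (also called a web spider or web robot) is a program that automatically fetches information from the World Wide Web according to a set of rules.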

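The basic workflow of a crawler:

Send the request:

Use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.

Get the response:

If the server responds normally, you receive a Response whose body is the page content you asked for. It may be HTML, a JSON string, or binary data such as an image or a video.

Parse the content:

HTML can be parsed with regular expressions or a page-parsing library. JSON can be converted directly into objects. Binary data can be saved or processed further.

Save the data:

The data can be saved in many forms: as plain text, in a database, or in a file of a specific format.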
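The four steps above can be sketched with the standard library alone. This is only a minimal illustration, not a complete crawler: the function names `fetch`, `parse_title` and `save`, the User-Agent string and the 10-second timeout are all assumptions made for this example.

```python
import re
from urllib import request


def fetch(url, timeout=10):
    """Steps 1-2: send the request and return the response body as text."""
    # The User-Agent value is a placeholder chosen for this sketch.
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_title(html):
    """Step 3: pull the <title> text out of the HTML with a regular expression."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def save(text, path):
    """Step 4: save the extracted data as a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

With network access, the steps chain together as `save(parse_title(fetch(url)), "title.txt")` for whatever `url` you want to try.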

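II. Preparation

Prepare the following libraries. urllib is part of the Python standard library and needs no installation. BeautifulSoup (the beautifulsoup4 package, imported as bs4) and the lxml parser used in the later exercises must be installed separately, for example with pip.

1. The urllib library

urllib is a package in Python's standard library. With it you can read the content of a web page much as you would read a local text file. It consists of the following four modules:

urllib.request: opens and reads URLs
urllib.error: the exceptions raised by urllib.request
urllib.parse: URL parsing
urllib.robotparser: robots.txt parsing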
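To get a feel for the modules other than urllib.request, here is a small offline sketch. The example.com URLs and the two robots.txt rules are made up for the illustration. Nothing is fetched from the network.

```python
from urllib import error, parse, robotparser

# urllib.parse: split a URL into its parts and resolve a relative link
parts = parse.urlparse("http://www.example.com/book/index.html?page=2")
print(parts.netloc, parts.path, parts.query)
chapter_url = parse.urljoin("http://www.example.com/book/", "chapter1/")
print(chapter_url)

# urllib.robotparser: check URLs against robots.txt rules
# (the rules are passed in directly here instead of being downloaded)
rules = robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
blocked = rules.can_fetch("*", "http://www.example.com/private/a.html")
allowed = rules.can_fetch("*", "http://www.example.com/book/")
print(blocked, allowed)

# urllib.error: HTTPError is a subclass of URLError, so catching
# URLError also covers HTTP error statuses such as 404
print(issubclass(error.HTTPError, error.URLError))
```

`urljoin` is handy later on, when the links scraped from a page turn out to be relative.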

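2. Common usage of the urllib.request module

Basic steps:

(1) Import the urllib.request module

from urllib import request

(2) Connect to the site you want to visit and send the request

resp = request.urlopen("http://<website address>")

(3) Read the page source

print(resp.read().decode())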
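The three steps above work for simple pages, but real sites often expect a browser-like header and sometimes fail to respond. The sketch below wraps the same calls with a header, a timeout and error handling. The helper names `build_request` and `fetch_text`, the User-Agent value and the timeout are assumptions for this example only.

```python
from urllib import error, request


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request that carries a custom User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded page text, or None if the request fails."""
    try:
        with request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read().decode(encoding, errors="replace")
    except error.URLError as exc:   # also covers HTTPError (404, 500, ...)
        print("request failed:", exc.reason)
        return None
```

Returning None instead of raising keeps the calling code short, at the cost of hiding the exact failure. For a larger crawler you may prefer to let the exception propagate.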

3. The BeautifulSoup module

(1) Basic elements of the BeautifulSoup module
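The original lists nothing under this heading, so the following is only a sketch based on the bs4 library itself: the objects you meet most often are Tag (with its name and attrs), NavigableString (the text inside a tag) and Comment. The one-line HTML snippet is made up for the example, and the built-in "html.parser" is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '<p class="title">The demo Python Project.</p><b><!--a comment--></b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                      # first <p> in the document, a Tag object
print(type(tag).__name__, tag.name)
print(tag.attrs)                  # attributes as a dict; class is a list

text = tag.string                 # the text inside the tag, a NavigableString
print(type(text).__name__, text)

comment = soup.b.string           # a comment is a special kind of string
print(type(comment).__name__, comment)
```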

(2) The tag tree

While parsing a web document, the BeautifulSoup module is used to traverse the HTML content.

Consider the following HTML document:

<html>
  <head>
    ....
  </head>
  <body>
    <p class="title"> The demo Python Project.</p>
    <p class="course"> Python is a programming language.
      <a href="http://www.icourse163.com"> Basic Python </a>
      <a href="http://www.python.org"> Advanced Python </a>
    </p>
  </body>
</html>

(3) Upward-traversal attributes of the BeautifulSoup tag tree
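As a rough illustration of moving upward, bs4 offers `.parent` (the direct parent) and `.parents` (an iterator over all ancestors). The sketch uses a condensed version of the demo document, written on one line so that no whitespace nodes get in the way.

```python
from bs4 import BeautifulSoup

HTML = (
    '<html><body><p class="course">Python is a programming language.'
    '<a href="http://www.icourse163.com">Basic Python</a></p></body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

link = soup.a
print(link.parent.name)           # p

# Walk from the <a> tag all the way up to the document root
ancestors = [node.name for node in link.parents]
print(ancestors)                  # ['p', 'body', 'html', '[document]']
```

The last entry, `[document]`, is the name of the BeautifulSoup object itself, which sits at the top of the tree.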

(4) Downward-traversal attributes of the BeautifulSoup tag tree
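For moving downward, bs4 provides `.contents` (a list of direct children), `.children` (an iterator over the same nodes) and `.descendants` (every node below the tag). Here is a small sketch on a condensed copy of the demo document. Note that strings count as nodes too, which is why the exercises later filter by type.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = (
    '<body><p class="course">Python is a programming language.'
    '<a href="http://www.icourse163.com">Basic Python</a>'
    '<a href="http://www.python.org">Advanced Python</a></p></body>'
)
soup = BeautifulSoup(HTML, "html.parser")
p = soup.p

# Direct children: one string and two <a> tags
print(len(p.contents))

# Keep only the children that are real tags
child_tags = [child.name for child in p.children if isinstance(child, Tag)]
print(child_tags)

# All nodes below <p>, including the text inside each <a>
all_nodes = list(p.descendants)
print(len(all_nodes))
```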

(5) Information-extraction methods of BeautifulSoup objects
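The methods used most in the exercises below are `find`, `find_all`, `get` and `get_text`. A minimal sketch on the demo document (with the elided head section left out) might look like this.

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<p class="title">The demo Python Project.</p>
<p class="course">Python is a programming language.
<a href="http://www.icourse163.com">Basic Python</a>
<a href="http://www.python.org">Advanced Python</a>
</p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

# find returns the first match; class_ avoids the Python keyword "class"
title = soup.find("p", class_="title").get_text(strip=True)
print(title)

# find_all returns every match; get reads an attribute value
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
for text, href in links:
    print(text, href)
```

`find` returns None when nothing matches, so on real pages it is worth checking the result before calling a method on it.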

III. Practice Exercises

1. Fetch basic information from the Hubei Normal University website

import urllib.request

# Open the university home page and inspect the response
response = urllib.request.urlopen("http://www.hbnu.edu.cn/")
print(response.info())       # response headers
print('\n*************************************************************\n')
print(response.getcode())    # HTTP status code, 200 on success
print('\n*************************************************************\n')
print(response.read())       # page content as raw bytes
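The last line above prints raw bytes. To get readable text you need to decode them, and the right encoding is often announced in the Content-Type header. The helper below is a sketch of that idea. The name `decode_body` and the UTF-8 fallback are assumptions, and a hand-built header object stands in for `response.headers` so the example runs offline.

```python
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode bytes using the charset from Content-Type, else a default."""
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors="replace")


# Stand-in for response.headers, which offers the same method
headers = Message()
headers["Content-Type"] = "text/html; charset=gbk"
print(decode_body("湖北师范大学".encode("gbk"), headers))
```

In the exercise this would be called as `decode_body(response.read(), response.headers)`.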

2. Scrape the university ranking from the Zuihao Daxue (Best Universities) website

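import bs4
from urllib import request, error
from bs4 import BeautifulSoup

def getHTMLText(url):
    '''Fetch the page and return its HTML, or "" on failure.'''
    try:
        resp = request.urlopen(url)
        html_data = resp.read().decode('utf-8')
        return html_data
    except (error.URLError, UnicodeDecodeError):
        return ""

def fillUnivList(ulist, html):
    '''Extract three columns from every row of the ranking table.'''
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:              # empty page or unexpected layout
        return
    for tr in tbody.children:      # children of <tbody> are <tr> rows and whitespace strings
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')         # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    '''Print the first num rows as an aligned table.'''
    # chr(12288) is the full-width space, used as the fill character
    # so that columns of Chinese text line up
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("Rank", "University", "Type", chr(12288)))
    for u in ulist[:num]:          # slicing avoids an IndexError on short lists
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

if __name__ == '__main__':
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2020.html' # 2020 ranking
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)           # print the top 20 universities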
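The `{1:{3}^10}` part of the template is the least obvious bit: it centers the text in 10 characters and takes the fill character from another argument. Padding with the full-width space `chr(12288)` keeps Chinese names aligned, because an ordinary space is only half as wide as a Chinese character in most fonts. A tiny sketch of the same trick (the helper name `center_cjk` is made up for this example):

```python
FULL_WIDTH_SPACE = chr(12288)   # U+3000 IDEOGRAPHIC SPACE


def center_cjk(text, width):
    """Center text in `width` characters, padded with full-width spaces."""
    # Nested fields: {1} supplies the fill character, {2} the width
    return "{0:{1}^{2}}".format(text, FULL_WIDTH_SPACE, width)


for name in ["清华大学", "北京大学", "中国科学技术大学"]:
    print("|" + center_cjk(name, 10) + "|")
```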

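3. Scrape the online edition of the novel Dream of the Red Chamber (红楼梦)

This exercise scrapes the online edition of Dream of the Red Chamber from a website. The novel's table-of-contents page looks as shown in the figure.

Press F12 to open the browser's developer tools and locate the elements that hold the chapters.

First, scrape the URL of each chapter: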

from urllib import request
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Table-of-contents page
    url = 'http://www.136book.com/hongloumeng/'
    # Send a browser-like User-Agent header with the request
    head = {}
    head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'
    req = request.Request(url, headers = head)
    response = request.urlopen(req)
    html = response.read()
    # Parse the table-of-contents page (the 'lxml' parser needs the lxml package)
    soup = BeautifulSoup(html, 'lxml')
    # find_next returns the next <div> in document order after the book_detail box
    soup_texts = soup.find('div', id = 'book_detail', class_= 'box1').find_next('div')
    # Iterate over the children of <ol> and print each chapter title with its link
    for link in soup_texts.ol.children:
        if link != '\n':
            print(link.text + ':  ', link.a.get('href'))

Next, scrape the content of a chapter:

from urllib import request
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # URL of chapter 1
    url = 'http://www.136book.com/hongloumeng/qlxecbzt/'
    head = {}
    # Optional browser-like User-Agent header (disabled here)
    #head['User-Agent'] = 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'
    req = request.Request(url, headers = head)
    response = request.urlopen(req)
    html = response.read()
    # Parse the chapter page
    soup = BeautifulSoup(html, 'lxml')
    # Find the <div> that holds the chapter text
    soup_text = soup.find('div', id = 'content')
    # Print the text it contains
    print(soup_text.text)

Printed this way the text is hard to read, so let's write the whole novel to a text file, hongloumeng.txt, saved on my D: drive.

from urllib import request
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'http://www.136book.com/hongloumeng/'
    head = {}
    req = request.Request(url, headers = head)
    response = request.urlopen(req)
    html = response.read()
    soup = BeautifulSoup(html, 'lxml')
    soup_texts = soup.find('div', id = 'book_detail', class_= 'box1').find_next('div')
    # Open the output file; "with" closes it even if a request fails midway
    with open('D:/hongloumeng.txt', 'w', encoding='utf-8') as f:
        # Loop over the chapter links
        for link in soup_texts.ol.children:
            if link != '\n':
                download_url = link.a.get('href')
                download_req = request.Request(download_url, headers = head)
                download_response = request.urlopen(download_req)
                download_html = download_response.read()
                download_soup = BeautifulSoup(download_html, 'lxml')
                download_soup_texts = download_soup.find('div', id = 'content')
                # Extract the chapter text
                download_soup_texts = download_soup_texts.text
                # Write the chapter title
                f.write(link.text + '\n\n')
                # Write the chapter content
                f.write(download_soup_texts)
                f.write('\n\n')
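One weakness of the loop above is that a single failed request stops the whole download. A possible variation, sketched below under my own assumptions, skips chapters that fail and pauses briefly between requests. The function name `save_chapters`, the `fetch` callback and the one-second delay are not from the original script. Passing the download step in as a function also makes the loop easy to try without touching the network.

```python
import time


def save_chapters(chapters, fetch, out_path, delay=1.0):
    """Write each (title, url) chapter to out_path; return the titles that failed."""
    failed = []
    with open(out_path, "w", encoding="utf-8") as f:
        for title, url in chapters:
            try:
                text = fetch(url)
            except OSError as exc:   # urllib.error.URLError is a subclass of OSError
                print("skipping", title, "-", exc)
                failed.append(title)
                continue
            f.write(title + "\n\n" + text + "\n\n")
            time.sleep(delay)        # short pause between requests
    return failed
```

Here `fetch` would be a function that downloads one chapter page and returns its text, for example the request and `find('div', id = 'content')` steps from the script above wrapped into a single function.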

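Takeaway: the result is quite good. From now on I won't run short of novels to read: I can scrape a txt file myself and load it onto my phone to read for free (or paste it into Word, which wraps the lines automatically). I had also seen a tutorial on 52pj about scraping photo galleries, which was great fun.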

Original article: https://www.cnblogs.com/wangzheming35/p/12926310.html