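A Simple Python Crawler

Lightweight crawlers

No login required
Static pages only: the data is not loaded asynchronously

Crawler: a program that automatically fetches information from the internet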

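URL manager

What it manages

URLs waiting to be crawled
URLs that have already been crawled

Purpose

Prevent crawling the same page twice
Prevent crawling in a loop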
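
A minimal sketch of why the two sets matter, assuming a made-up link graph (the page names A, B, C are purely illustrative). Pages A and B link to each other, so without the "already crawled" set the loop below would never end:

```python
# Hypothetical link graph: each page maps to the pages it links to.
# A and B link to each other, so a naive crawler would loop forever.
links = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A"],
}

new_urls = {"A"}      # URLs waiting to be crawled
old_urls = set()      # URLs already crawled
visit_order = []

while new_urls:
    url = new_urls.pop()
    old_urls.add(url)
    visit_order.append(url)
    for linked in links[url]:
        # Only queue a URL that is in neither set.
        if linked not in new_urls and linked not in old_urls:
            new_urls.add(linked)

print(sorted(visit_order))
```

Each page is visited exactly once, even though the graph contains cycles.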

Implementation options:

1. In memory

Python process memory

URLs to crawl: set()

Crawled URLs: set()

2. Relational database

MySQL

Table urls(url, is_crawled)

3. Cache database

redis

URLs to crawl: set

Crawled URLs: set
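
The relational option can be sketched roughly as follows. This is only an illustration: sqlite3 stands in for MySQL so the example runs without a server, the function names are made up, and `INSERT OR IGNORE` is SQLite syntax (MySQL spells it `INSERT IGNORE`). The table layout follows the urls(url, is_crawled) design above.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (url TEXT PRIMARY KEY, is_crawled INTEGER NOT NULL DEFAULT 0)"
)


def add_new_url(url):
    # The primary key makes re-adding a known URL a no-op.
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))


def get_new_url():
    # Pick any uncrawled URL and mark it as crawled; None when nothing is left.
    row = conn.execute("SELECT url FROM urls WHERE is_crawled = 0 LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE urls SET is_crawled = 1 WHERE url = ?", (row[0],))
    return row[0]


add_new_url("/item/Python")
add_new_url("/item/Python")   # duplicate, ignored
add_new_url("/item/Guido")

crawled = []
while True:
    url = get_new_url()
    if url is None:
        break
    crawled.append(url)

print(sorted(crawled))
```

Compared with the in-memory sets, the crawl state here survives a restart if the database is stored on disk.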

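Web page downloader

A tool that downloads the page behind a URL to the local machine for analysis

Types

1. urllib2

Python's official standard-library module (Python 2; in Python 3 it became urllib.request)

2. requests

Third-party package, more powerful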

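Downloading pages with urllib2

1. Method 1: the simplest approach

import urllib2

# Request the URL directly
response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means the request succeeded
print response.getcode()

# Read the content
cont = response.read()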

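2. Method 2: add data and HTTP headers

import urllib2

url = 'http://www.baidu.com'

# Create a Request object
request = urllib2.Request(url)

# Attach form data (add_data takes a single, already-encoded string and turns the request into a POST)
request.add_data('a=1')

# Add an HTTP header that imitates a Mozilla browser
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request
response = urllib2.urlopen(request)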
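
For comparison, here is a rough Python 3 counterpart using urllib.request. It only builds the request and inspects it without sending anything, so it runs offline; the form field a=1 simply mirrors the example above.

```python
import urllib.parse
import urllib.request

# Hypothetical form field, mirroring the a=1 example in the text.
form = urllib.parse.urlencode({"a": "1"}).encode("ascii")

# Passing data makes the request a POST; headers can be supplied up front.
request = urllib.request.Request(
    "http://www.baidu.com",
    data=form,
    headers={"User-Agent": "Mozilla/5.0"},
)

print(request.get_method())
print(request.data)
# Request stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```

Passing the finished object to urllib.request.urlopen(request) would then send it.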

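3. Method 3: add handlers for special scenarios

HTTPCookieProcessor: for pages that require the user to log in
ProxyHandler: for pages that can only be reached through a proxy
HTTPSHandler: for pages served over HTTPS
HTTPRedirectHandler: for pages that redirect automatically

import urllib2, cookielib

# Create a cookie container
cj = cookielib.CookieJar()

# Create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Install the opener into urllib2
urllib2.install_opener(opener)

# Access the page with the cookie-aware urllib2
response = urllib2.urlopen("http://www.baidu.com")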
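
Several handlers can be combined in one opener. The sketch below uses the Python 3 module names and a made-up local proxy address; it only builds the opener and lists its handlers, so no request is sent.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
# Hypothetical local proxy address, for illustration only.
proxy = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"})

opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar),
    proxy,
)

# build_opener also keeps the default handlers, such as HTTPRedirectHandler.
handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)
```

Calling opener.open(url) uses these handlers for a single request, whereas install_opener makes them the default for every later urlopen call.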

Example code

# coding:utf8
import urllib2, cookielib

url = "http://www.baidu.com"

print("Method 1:")
response1 = urllib2.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

print('Method 2:')
request = urllib2.Request(url)
request.add_header("user-agent", 'Mozilla/5.0')
# Pass the Request object so the custom header is actually sent
response2 = urllib2.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

print('Method 3:')
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(request)
print(response3.getcode())
print(cj)
print(response3.read())

Note: the code above is written for Python 2; the Python 3 version follows.

# coding:utf8
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"

print("Method 1:")
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

print('Method 2:')
request = urllib.request.Request(url)
request.add_header("user-agent", 'Mozilla/5.0')
# Pass the Request object so the custom header is actually sent
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

print('Method 3:')
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(request)
print(response3.getcode())
print(cj)
print(response3.read())

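Web page parser

A tool that parses a page and extracts the valuable data from it

Web page parser (BeautifulSoup)

Types

1. Regular expressions (fuzzy matching)

2. html.parser (structured parsing)

3. Beautiful Soup (structured parsing)

4. lxml (structured parsing)

Structured parsing: the page is loaded as a DOM (Document Object Model) tree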
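
To give a feel for structured parsing, here is a small sketch using only the standard library's html.parser. It reports tags as events rather than building a full tree, but it still works on the document structure instead of raw text patterns. The class name LinkCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> that has an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.links.append((self._current_href, "".join(self._text_parts)))
            self._current_href = None


collector = LinkCollector()
collector.feed('<p>See <a href="/item/Python">Python</a> and <a href="/item/Guido">Guido</a>.</p>')
print(collector.links)
```

Beautiful Soup wraps this kind of work behind find and find_all, which is why the rest of these notes use it.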

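Installing and using Beautiful Soup 4

1. Installation

pip install beautifulsoup4

2. Usage

Create a BeautifulSoup object
Search for nodes (by node name, attribute, or text)
- find_all
- find
Access a node's
- name
- attributes
- text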

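(1) Create the BeautifulSoup object

from bs4 import BeautifulSoup

# Create a BeautifulSoup object from the HTML document string
soup = BeautifulSoup(
    html_doc,               # HTML document string
    'html.parser',          # HTML parser
    from_encoding='utf8'    # encoding of the HTML document (only used when html_doc is bytes)
)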

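(2) Search for nodes (find_all, find)

import re

# Signature: find_all(name, attrs, string)

# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link has the form /view/123.htm
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

# Find all div nodes whose class is abc and whose text is Python
soup.find_all('div', class_='abc', string='Python')

class_ is used as the keyword argument for the class attribute because class is a reserved word in Python; the trailing underscore avoids the clash.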
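
The href pattern can be checked on its own with the re module before handing it to find_all. In the sketch below `\d+` matches one or more digits and `\.` matches a literal dot; the sample href values are made up.

```python
import re

pattern = re.compile(r"/view/\d+\.htm")

# Hypothetical href values used only to exercise the pattern.
hrefs = ["/view/123.htm", "/view/4567.html", "/view/abc.htm", "/item/Python"]

# search() matches anywhere in the string, which is also how
# Beautiful Soup applies a compiled pattern to an attribute value.
matching = [href for href in hrefs if pattern.search(href)]
print(matching)
```

Note that /view/4567.html also matches, because the pattern is not anchored at the end; append `$` if only .htm links are wanted.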

(3) Access node information

# Given the node: <a href="1.html">Python</a>

# Get the tag name of the node
node.name

# Get the href attribute of the a node
node['href']

# Get the link text of the a node
node.get_text()

3. Example

# coding:utf8
import re

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print('All links:')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('The link to lacie:')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())


print('Regex match:')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Text of the title paragraph:')
p_node = soup.find('p', class_='title')
print(p_node.name, p_node.get_text())

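Building the crawler

Analyze the target

URL format
Data format
Page encoding

1. Target: the Baidu Baike entry for Python and the entry pages it links to; extract each page's title and summary

2. Entry page

https://baike.baidu.com/item/Python/407313

3. URL format:

Entry page URL: /item/****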
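
Because the links on a page are relative (/item/...), they need to be joined with the page URL before they can be fetched. A minimal sketch with urljoin, using made-up href values:

```python
from urllib.parse import urljoin

page_url = "https://baike.baidu.com/item/Python/407313"

# Hypothetical href values of the kind found in <a> tags on an entry page.
hrefs = ["/item/Guido", "/item/Python/407313", "/help", "https://example.com/item/x"]

# Keep only entry-page links and turn them into absolute URLs.
entry_urls = sorted(
    {urljoin(page_url, href) for href in hrefs if href.startswith("/item/")}
)
print(entry_urls)
```

The startswith check here is an assumption and is stricter than searching for /item/ anywhere in the href; it drops links to other sites that merely contain that path.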

4. Data format:

Title:

<dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1></dd>

Summary:

<div class="lemma-summary" label-module="lemmaSummary">...</div>

5. Page encoding: UTF-8

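Project directory structure

The code lives in a package named baike_spider with one module per component: url_manager, html_downloader, html_parser and html_outputer, plus the main scheduler script.

Main scheduler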

# coding:utf8
from baike_spider import url_manager, html_downloader, html_parser, html_outputer


class SpiderMain(object):
    def __init__(self):
        # URL manager
        self.urls = url_manager.UrlManager()
        # Downloader
        self.downloader = html_downloader.HtmlDownloader()
        # Parser
        self.parser = html_parser.HtmlParser()
        # Outputer
        self.outputer = html_outputer.HtmlOutputer()

    # Crawl loop: the scheduler that drives the other components
    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                if count == 1000:
                    break

                new_url = self.urls.get_new_url()

                print('craw %d : %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)

                count = count + 1
            except Exception as e:
                # Report the reason instead of silently swallowing every error
                print('craw failed: %s' % e)

        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "https://baike.baidu.com/item/Python/407313"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)

URL manager

# coding:utf8
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return

        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)

        return new_url

Web page downloader

# coding:utf8

import urllib.request


class HtmlDownloader(object):

    def download(self, url):
        if url is None:
            return None

        # request = urllib.request.Request(url)
        # request.add_header("user-agent", 'Mozilla/5.0')
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None

        return response.read()

Web page parser

# coding:utf8

import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


class HtmlParser(object):

    def _get_new_urls(self, page_url, soup):
        new_urls = set()

        links = soup.find_all('a', href=re.compile(r"/item/"))
        for link in links:
            new_url = link['href']
            new_full_url = urljoin(page_url, new_url)
            new_urls.add(new_full_url)

        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}

        res_data['url'] = page_url

        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        res_data['title'] = title_node.get_text()

        summary_node = soup.find('div', class_='lemma-summary')
        res_data['summary'] = summary_node.get_text()

        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            # Return an empty pair so the caller can still unpack two values
            return set(), None

        soup = BeautifulSoup(html_cont, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)

        return new_urls, new_data

HTML outputer

# coding:utf8
class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # Write UTF-8 text; in Python 3 the values are already str, so no .encode() is needed
        with open('output.html', 'w', encoding='utf-8') as fout:
            fout.write('<html>')
            fout.write('<head><meta charset="utf-8"></head>')
            fout.write('<body>')
            fout.write('<table>')

            for data in self.datas:
                fout.write('<tr>')
                fout.write('<td>%s</td>' % data['url'])
                fout.write('<td>%s</td>' % data['title'])
                fout.write('<td>%s</td>' % data['summary'])
                fout.write('</tr>')

            fout.write('</table>')
            fout.write('</body>')
            fout.write('</html>')
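
Scraped text may contain characters such as < or & that would break the generated table. One possible refinement, sketched here with a made-up helper render_table and a made-up record, is to pass each value through html.escape before writing it:

```python
import html
import os
import tempfile


def render_table(rows):
    """Return an HTML table for a list of dicts with url, title and summary keys."""
    parts = ['<html><head><meta charset="utf-8"></head><body><table>']
    for row in rows:
        cells = "".join(
            "<td>%s</td>" % html.escape(row[key]) for key in ("url", "title", "summary")
        )
        parts.append("<tr>%s</tr>" % cells)
    parts.append("</table></body></html>")
    return "\n".join(parts)


# Hypothetical record, shaped like the dict built by the parser.
rows = [{
    "url": "https://baike.baidu.com/item/Python/407313",
    "title": "Python",
    "summary": "a < b & c",
}]

# Write to a temporary directory so the sketch leaves no file behind in the project.
path = os.path.join(tempfile.mkdtemp(), "output.html")
with open(path, "w", encoding="utf-8") as fout:
    fout.write(render_table(rows))

with open(path, encoding="utf-8") as fin:
    print(fin.read())
```

Building the markup as a string first also makes the output easy to check without touching the file system.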

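Advanced crawlers:

Login
CAPTCHAs
Ajax
Server-side anti-crawler measures
Multithreading
Distributed crawling

Learning resource: imooc (慕课网), "Developing a Simple Crawler in Python"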

Original article: https://www.cnblogs.com/zqunor/p/11155756.html