An In-Depth Look at CAPTCHA Recognition in Python

Introduction

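Crawlers play an important role in practical Python work, and CAPTCHA recognition is one of the key problems a crawler has to solve. A CAPTCHA guards a website: if the crawler cannot get past it, none of the later crawling steps can run. This article walks through the details of CAPTCHA recognition in Python. Note: this is a technical discussion only. Do not use it for anything illegal; anyone who does bears the legal consequences, and the author accepts no responsibility.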

Preparation: Setting Up the CAPTCHA Recognition Environment

Installing Tesseract

tesserocr is a Python OCR library, but it is really just a Python API wrapper around Tesseract. Tesseract is the core, so it must be installed before tesserocr.

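Download site: https://digi.bib.uni-mannheim.de/tesseract/

Choose a version:

Version 4.0.0 is used here because, as of this writing (2020-2-28), it is the newest version supported by the corresponding Python library.

For details, check the versions listed at https://github.com/simonflueckiger/tesserocr-windows_build/releases; the Tesseract version that each tesserocr build supports is shown in parentheses.

During installation you can tick the additional language packs (this makes the whole installation much slower):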

After installation, set the environment variable: add C:\Program Files\Tesseract-OCR to Path (use your own installation path).

Check that it is set correctly:
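One way to run this check from Python is sketched below. It is only a minimal sketch: the helper name `tesseract_version` is made up for this example, and it assumes the executable is called `tesseract`, as in a default installation.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # Some builds print the version banner to stderr, so merge both streams.
    done = subprocess.run([path, "--version"], stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    lines = done.stdout.splitlines()
    return lines[0] if lines else ""


print(tesseract_version() or "tesseract was not found on PATH")
```

If the function returns None, the Path entry is most likely missing or points to the wrong folder.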

Installing tesserocr

Install it directly with pip:
```
 pip install tesserocr pillow 
```
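If the installation fails, try the following:
- 1. Download the tesserocr wheel (.whl) file.
A .whl file is essentially a zip archive that contains the .py files together with the compiled .pyd files.

URL: https://github.com/simonflueckiger/tesserocr-windows_build/releases
- 2. Check which wheel tags your local Python supports:
Create a file named test2.py and run it: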
```
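# Works with the older pip releases that were current when this was written.
# Newer pip releases no longer provide pip._internal.pep425tags;
# run "python -m pip debug --verbose" there to list the compatible tags.
import pip._internal.pep425tags
print(pip._internal.pep425tags.get_supported())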
```
Output:

[('cp37', 'cp37m', 'win_amd64'), ('cp37', 'none', 'win_amd64'), ('py3', 'none', 'win_amd64'), ('cp37', 'none', 'any'), ('cp3', 'none', 'any'), ('py37', 'none', 'any'), ('py3', 'none', 'any'), ('py36', 'none', 'any'), ('py35', 'none', 'any'), ('py34', 'none', 'any'), ('py33', 'none', 'any'), ('py32', 'none', 'any'), ('py31', 'none', 'any'), ('py30', 'none', 'any')]

This means the matching build is 'cp37', 'cp37m', 'win_amd64'.
- 3. Find the matching build:
- 4. After downloading, install the .whl file with pip (use your own path):
```
 pip install C:\tesserocr-2.4.0-cp37-cp37m-win_amd64.whl 
```
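As a rough cross-check for step 2 that does not rely on pip's private modules, the two tags that matter most when picking a wheel can be derived from the standard library. This is a sketch under assumptions: `wheel_tag_hint` is a hypothetical helper, and the "cp" prefix assumes the interpreter is CPython.

```python
import sys
import sysconfig


def wheel_tag_hint():
    """Return (python_tag, platform_tag) for the running interpreter."""
    # Assumes CPython, whose wheels use the "cp" prefix, e.g. cp37 for Python 3.7.
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    # sysconfig reports e.g. "win-amd64"; wheel file names use "_" instead of "-" and ".".
    platform_tag = sysconfig.get_platform().replace("-", "_").replace(".", "_")
    return python_tag, platform_tag


print(wheel_tag_hint())
```

On the setup described above (64-bit Python 3.7 on Windows) the two values should correspond to the cp37 and win_amd64 parts of the wheel file name.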
A First Try: Recognizing a Simple CAPTCHA

First install the dependency:
```
 pip install pillow 
```
If the installation fails, upgrade pip:
```
 python -m pip install --upgrade pip 
```
When that finishes, run the install command again.

Recognizing a CAPTCHA with tesserocr

Pick a fairly simple CAPTCHA (test.jpg):

Recognize it (test3.py):
```
import tesserocr
from PIL import Image
image=Image.open('test.jpg')
image.show()  # opens the image so it can be previewed
print(tesserocr.image_to_text(image))
```
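If this error is raised while running:

Failed to init API, possibly an invalid tessdata path: C:\Users\XXXXX\AppData\Local\Programs\Python\Python37/tessdata/

copy the tessdata folder from the Tesseract installation directory into the Python root directory, that is, the directory shown in the error message.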
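An alternative to copying the folder is to tell tesserocr where tessdata lives. The sketch below is an assumption-laden illustration: `find_tessdata` and `read_captcha` are hypothetical helpers, the candidate folders are examples, and it relies on the `path` argument that `tesserocr.image_to_text` accepts for the tessdata location.

```python
import os


def find_tessdata(candidates):
    """Return the first candidate folder that contains eng.traineddata, else None."""
    for folder in candidates:
        if os.path.isfile(os.path.join(folder, "eng.traineddata")):
            return folder
    return None


def read_captcha(image_path, tessdata_dir):
    # Imported lazily so that find_tessdata works without tesserocr installed.
    import tesserocr
    from PIL import Image
    # `path` points tesserocr at the tessdata folder instead of its default.
    return tesserocr.image_to_text(Image.open(image_path), path=tessdata_dir)
```

For example, `find_tessdata([r"C:\Program Files\Tesseract-OCR\tessdata"])` returns that folder if the English model is present, and the result can then be passed to `read_captcha("test.jpg", ...)`.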

Recognizing a CAPTCHA with pytesseract

The example above uses tesserocr.image_to_text(), but in my tests its recognition performance was poor, so pytesseract is recommended instead. pytesseract is another library built on top of Tesseract-OCR, and it gave better results.

From the official description: Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.

First install pytesseract:
```
 pip install pytesseract 
```
Use pytesseract's image_to_string() function:
```
from PIL import Image
from pytesseract import image_to_string

result = image_to_string(Image.open("test.jpg"), lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(result)
```
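lang is the language to recognize.
psm (page segmentation mode) is an important setting for CAPTCHA recognition; choosing a suitable value can noticeably raise the success rate (the official list of values is given below).
oem selects the OCR engine mode: 0 is the legacy engine only, 1 is the LSTM neural-network engine only, 2 is legacy plus LSTM, and 3 (the value used in the official examples) is the default, which picks based on what is available.
tessedit_char_whitelist is a whitelist that restricts the recognized result to the listed characters (in testing, its effect was limited).

psm values: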

Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
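Because several of these options are combined into one config string, a small builder can make experiments with different psm values less error-prone. This is a minimal sketch; `build_config` is a hypothetical helper, and the default of psm 7 (single text line) is only a guess at a sensible starting point for a one-line CAPTCHA.

```python
def build_config(psm=7, oem=3, whitelist=None):
    """Build the config string that is passed to pytesseract.image_to_string."""
    parts = ["--psm {}".format(psm), "--oem {}".format(oem)]
    if whitelist:
        # Restrict the output to the given characters.
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    return " ".join(parts)


print(build_config(psm=10, whitelist="0123456789"))
# prints: --psm 10 --oem 3 -c tessedit_char_whitelist=0123456789
```

The printed string is the same config used in the image_to_string() call above, so trying psm 7 or 8 instead of 10 only requires changing one argument.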

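Taking More Effort: Recognizing Complex CAPTCHAs

The CAPTCHA above is about as simple as they come; an OCR library can recognize it almost out of the box. Most of the time, though, we face something like this:

Or this:

Or even this:

An OCR library on its own has a very low success rate on these and can barely recognize them. This is where a new tool comes in: image processing.

Different CAPTCHA images need different processing, so the treatment has to fit the problem. The first kind, for example, has a very thin border and a large number of background interference lines, so two operations are needed:

1. Pixel-level noise removal

2. Border removal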

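An image is made of pixels, which becomes obvious once it is enlarged. Some of these pixels make up the CAPTCHA characters, while most of the rest only interfere with recognition.

Pixels do not exist in isolation: every pixel has 8 neighbours (except along the image edge). As in the figure below, if the centre pixel's RGB value differs from that of the vast majority of its 8 neighbours, then, like a pimple on a face, this lone pixel breaks the overall RGB uniformity and becomes a point we must remove: a noise point.

In the figure above, the pixels that form the four letters MABC are connected, whereas noise points are scattered at random. We can use this property to decide whether a pixel is noise.

Of course, a centre pixel whose RGB value differs from all of its neighbours is a special case. In practice we more often see something like this:

Here some of the surrounding pixels share the centre pixel's RGB value and some do not. To handle this we set a threshold N: the critical number of neighbours that must match the centre pixel when deciding whether it is noise.

When fewer than N of the surrounding pixels have the same RGB value as the centre pixel, the centre pixel is noise.

In the figure above, 2 neighbours match the centre pixel. With N set to 3 the centre pixel is treated as noise; with N set to 1 it is not. N has to be tuned case by case.

Following this logic, we test every pixel and set it to white if it is noise.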
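The threshold rule can be tried on a tiny 3x3 patch. The sketch below is illustrative only: `same_neighbours`, `is_noise`, and the patch are made up for this example, with 0 standing for a dark pixel and 1 for a white one.

```python
def same_neighbours(grid, x, y):
    """Count how many of the 8 neighbours of grid[y][x] share its value."""
    centre = grid[y][x]
    count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dx, dy) != (0, 0) and grid[y + dy][x + dx] == centre:
                count += 1
    return count


def is_noise(grid, x, y, n):
    """A pixel is noise when fewer than n neighbours match it."""
    return same_neighbours(grid, x, y) < n


# The centre pixel is dark and exactly 2 of its neighbours are dark too.
patch = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
print(same_neighbours(patch, 1, 1))  # prints 2
print(is_noise(patch, 1, 1, 3))      # prints True
print(is_noise(patch, 1, 1, 1))      # prints False
```

With 2 matching neighbours, the centre pixel counts as noise for N = 3 but not for N = 1, which is the behaviour described above.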

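In practice, however, the noise may be so dense that some points slip through. This leads to another idea: multiple denoising passes.

That is, after every pixel has been tested once, the image is scanned again several times so that as much noise as possible is removed.

Repeated passes can also eat into the CAPTCHA's own pixels, though, so use them with care.

Following this approach, the denoising code is shown below. (image is the image whose pixels have already been binarized into the global t2val dictionary, N is the noise threshold, and K is the number of denoising passes.)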
```
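def clearNoise(image, N, K):
    # t2val is the global dict {(x, y): 0 or 1} filled by twoValue(); 1 means white.
    for i in range(0, K):
        # Force two opposite corner pixels to white.
        t2val[(0, 0)] = 1
        t2val[(image.size[0] - 1, image.size[1] - 1)] = 1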

        for x in range(1, image.size[0] - 1):
            for y in range(1, image.size[1] - 1):
                nearDots = 0
                L = t2val[(x, y)]
                if L == t2val[(x - 1, y - 1)]:
                    nearDots += 1
                if L == t2val[(x - 1, y)]:
                    nearDots += 1
                if L == t2val[(x - 1, y + 1)]:
                    nearDots += 1
                if L == t2val[(x, y - 1)]:
                    nearDots += 1
                if L == t2val[(x, y + 1)]:
                    nearDots += 1
                if L == t2val[(x + 1, y - 1)]:
                    nearDots += 1
                if L == t2val[(x + 1, y)]:
                    nearDots += 1
                if L == t2val[(x + 1, y + 1)]:
                    nearDots += 1

                # Fewer than N matching neighbours: treat as noise and paint it white.
                if nearDots < N:
                    t2val[(x, y)] = 1
```
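After processing we get this image:

As you can see, the background is very "clean" after denoising. Apart from the border, this CAPTCHA is now fairly easy to recognize.

The border is itself a run of connected pixels, much like the characters, and it lies on the image edge, so the denoising step cannot remove it.

The second step is border removal, which is simpler: set the pixels along the border to white.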
```
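def clear_border(img_name):
    # cv_imread, cv_imwrite and path_extends come from the complete listing;
    # os must be imported as well.
    images_dir = os.path.join(path_extends.get_absolute_path(), "images")
    img = cv_imread(os.path.join(images_dir, img_name))
    filename = os.path.join(images_dir, img_name.split('-')[0] + '-clearBorder.jpg')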
    h, w = img.shape[:2]
    for y in range(0, w):
        for x in range(0, h):
            if y < 2 or y > w - 2:
                img[x, y] = 255
            if x < 2 or x > h - 2:
                img[x, y] = 255

    cv_imwrite(filename, img)
    return img
```
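The same idea can be shown without OpenCV or any file access. The following is a minimal sketch, not the listing's implementation: `whiten_border` is a hypothetical helper that works on a plain list of rows and clears the same number of pixels on all four sides.

```python
def whiten_border(grid, width=2, white=255):
    """Return a copy of grid with the outer `width` pixels on every side set to white."""
    height = len(grid)
    cols = len(grid[0]) if height else 0
    result = [row[:] for row in grid]  # copy, so the input stays untouched
    for y in range(height):
        for x in range(cols):
            if x < width or x >= cols - width or y < width or y >= height - width:
                result[y][x] = white
    return result


dark = [[0] * 5 for _ in range(5)]
for row in whiten_border(dark, width=1):
    print(row)
```

Note the `>=` comparisons here: the `y > w - 2` and `x > h - 2` tests in the listing above clear only the last column and the last row, so the right and bottom borders it removes are one pixel thinner than the left and top ones.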
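After this series of steps, the result is:

The complete code (call image_to_text() to run the recognition; the original CAPTCHA image must be placed in the images folder and named test.png):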
```
# coding:utf-8
import sys, os
from PIL import Image, ImageDraw
from pytesseract import *
import cv2
from tools import path_extends  # project-local helper, not shown in this article
import numpy as np


# Binarized pixels: {(x, y): 0 or 1}, where 1 is white
t2val = {}
def twoValue(image, G):
    for y in range(0, image.size[1]):
        for x in range(0, image.size[0]):
            g = image.getpixel((x, y))
            if g > G:
                t2val[(x, y)] = 1
            else:
                t2val[(x, y)] = 0


def clear_border(img_name):
    images_dir = os.path.join(path_extends.get_absolute_path(), "images")
    img = cv_imread(os.path.join(images_dir, img_name))
    filename = os.path.join(images_dir, img_name.split('-')[0] + '-clearBorder.jpg')
    h, w = img.shape[:2]
    for y in range(0, w):
        for x in range(0, h):
            if y < 2 or y > w - 2:
                img[x, y] = 255
            if x < 2 or x > h - 2:
                img[x, y] = 255

    cv_imwrite(filename, img)
    return img

def clearNoise(image, N, K):
    for i in range(0, K):
        t2val[(0, 0)] = 1
        t2val[(image.size[0] - 1, image.size[1] - 1)] = 1

        for x in range(1, image.size[0] - 1):
            for y in range(1, image.size[1] - 1):
                nearDots = 0
                L = t2val[(x, y)]
                if L == t2val[(x - 1, y - 1)]:
                    nearDots += 1
                if L == t2val[(x - 1, y)]:
                    nearDots += 1
                if L == t2val[(x - 1, y + 1)]:
                    nearDots += 1
                if L == t2val[(x, y - 1)]:
                    nearDots += 1
                if L == t2val[(x, y + 1)]:
                    nearDots += 1
                if L == t2val[(x + 1, y - 1)]:
                    nearDots += 1
                if L == t2val[(x + 1, y)]:
                    nearDots += 1
                if L == t2val[(x + 1, y + 1)]:
                    nearDots += 1

                if nearDots < N:
                    t2val[(x, y)] = 1

def cv_imread(filePath):
    cv_img = cv2.imdecode(np.fromfile(filePath, dtype=np.uint8), -1)
    return cv_img

def cv_imwrite(filePath, features):
    cv2.imencode('.jpg', features)[1].tofile(filePath)

def saveImage(filename, size):
    image = Image.new("1", size)
    draw = ImageDraw.Draw(image)

    for x in range(0, size[0]):
        for y in range(0, size[1]):
            draw.point((x, y), t2val[(x, y)])

    image.save(filename)
 

def image_to_text():
    images_dir = os.path.join(path_extends.get_absolute_path(), "images")
    image = Image.open(os.path.join(images_dir, "test.png")).convert("L")
    twoValue(image, 100)
    clearNoise(image, 2, 1)
    path1 = os.path.join(images_dir, "test-clearNoise.jpg")
    saveImage(path1, image.size)
    # Pass the file that was just saved, so clear_border writes test-clearBorder.jpg.
    clear_border("test-clearNoise.jpg")
    result = image_to_string(Image.open(os.path.join(images_dir, "test-clearBorder.jpg")),
                             lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=QWERTYUIOPLKHJHGFDSAZXCVBNM')

    return result
```
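Ultimate Difficulty: Time to Train on Samples

The CAPTCHAs above are still not the hardest to recognize. We have all seen ones like this (image from Baidu):

The text is warped, slanted, and crowded together. Even a person has to look twice at these, let alone a program. The approach described so far can no longer cope, and a new idea is needed.

Computers are faster and more exact than people, but as soon as a letter or symbol is slightly deformed the program fails to recognize it, so this literal-mindedness becomes a weakness. If we could tell the program that a differently drawn m is still m, and that a distorted m is also m, the problem would be solved.

This calls for a new concept: sample training.

Before training we need to collect samples, either by taking screenshots by hand or by splitting images programmatically. As a simple example, to train the digits 0 to 9 we first collect sample images of those 10 digits and then move on to the next step.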

Download jTessBoxEditor:

Official download (slow): https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

Mirror in China: https://www.jb51.net/softs/676483.html#downintro2

Download the library:

Training library download: https://sourceforge.net/projects/tess4j/files/tess4j/

Prepare the samples:

Convert PNG to TIFF

Conversion site: https://cloudconvert.com/png-to-tiff
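The conversion can probably also be done locally with Pillow, which is already installed for the earlier steps. This is a minimal sketch under that assumption; `tiff_path_for` and `convert_png_to_tiff` are hypothetical helper names.

```python
import os


def tiff_path_for(png_path):
    """Return the path of the .tif file that sits next to the given .png file."""
    root, _ext = os.path.splitext(png_path)
    return root + ".tif"


def convert_png_to_tiff(png_path):
    # Pillow is imported lazily; it chooses the TIFF writer from the ".tif" extension.
    from PIL import Image
    target = tiff_path_for(png_path)
    with Image.open(png_path) as img:
        img.save(target)
    return target


print(tiff_path_for("samples/7.png"))  # prints samples/7.tif
```

Calling `convert_png_to_tiff` on each sample image produces the TIFF files that are imported in the next step.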

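Import the training samples

Select the training images:

After you select them, another dialog asks for a directory in which to save the merged TIFF.

Name the file xl.normal.exp0.tif

Run the following command to start training; it generates the box file for the merged TIFF: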
```
tesseract xl.normal.exp0.tif xl.normal.exp0 -l eng batch.nochop makebox
```
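The file name follows the [lang].[fontname].exp[num].tif pattern, so the command can be assembled from those parts. The sketch below is a hedged illustration: `makebox_command` and `run_makebox` are hypothetical helpers, and the defaults simply reproduce the command shown above.

```python
import subprocess


def makebox_command(lang="xl", font="normal", exp=0, base_lang="eng"):
    """Build the tesseract makebox command for <lang>.<font>.exp<exp>.tif."""
    stem = "{}.{}.exp{}".format(lang, font, exp)
    return ["tesseract", stem + ".tif", stem, "-l", base_lang, "batch.nochop", "makebox"]


def run_makebox(**kwargs):
    # Runs in the current directory and raises CalledProcessError if tesseract fails.
    subprocess.run(makebox_command(**kwargs), check=True)


print(" ".join(makebox_command()))
# prints: tesseract xl.normal.exp0.tif xl.normal.exp0 -l eng batch.nochop makebox
```

Passing the arguments as a list avoids shell quoting problems if the directory or file names contain spaces.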
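With the sample training done, the next key step is splitting the CAPTCHA so that the program can compare each piece against the samples.

The splitting logic is much the same everywhere, so the code here is quoted directly from shaomine's blog post: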
```
# coding: utf8
import os
from PIL import Image


class pictureIdenti:

    # rownum: number of rows to cut; colnum: number of columns to cut;
    # dstpath: output directory; img_name: the image file to split
    def splitimage(self, rownum=1, colnum=4, dstpath=r"D:\work\python36_crawl\Veriycode",
                   img_name=r"D:\work\python36_crawl\Veriycode\mode_5246.png"):
        img = Image.open(img_name)
        w, h = img.size
        if rownum <= h and colnum <= w:
            print('Original image info: %sx%s, %s, %s' % (w, h, img.format, img.mode))
            print('Splitting the image, please wait...')

            s = os.path.split(img_name)
            if dstpath == '':
                dstpath = s[0]
            fn = s[1].split('.')
            basename = fn[0]
            ext = fn[-1]

            num = 1
            rowheight = h // rownum
            colwidth = w // colnum
            file_list = []
            for r in range(rownum):
                index = 0
                for c in range(colnum):
                    # (left, upper, right, lower)
                    # box = (c * colwidth, r * rowheight, (c + 1) * colwidth, (r + 1) * rowheight)
                    # Hand-tuned column widths for this particular CAPTCHA style.
                    if index < 1:
                        colwid = colwidth + 6
                    elif index < 2:
                        colwid = colwidth + 1
                    else:
                        colwid = colwidth

                    box = (c * colwid, r * rowheight, (c + 1) * colwid, (r + 1) * rowheight)
                    newfile = os.path.join(dstpath, basename + '_' + str(num) + '.' + ext)
                    file_list.append(newfile)
                    img.crop(box).save(newfile, ext)
                    num = num + 1
                    index += 1
            for f in file_list:
                print(f)
            # num started at 1, so the number of files written is num - 1.
            print('Done: %s small images were generated.' % (num - 1))
```
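Stripped of the hand-tuned offsets and the file handling, the core of the splitting step is computing one crop box per cell. The following is a simplified sketch, not the quoted author's code: `split_boxes` is a hypothetical helper that assumes equally wide characters.

```python
def split_boxes(width, height, rows=1, cols=4):
    """Return (left, upper, right, lower) crop boxes in row-major order."""
    cell_w = width // cols
    cell_h = height // rows
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h))
    return boxes


print(split_boxes(100, 40))
# prints: [(0, 0, 25, 40), (25, 0, 50, 40), (50, 0, 75, 40), (75, 0, 100, 40)]
```

With Pillow, each box can be passed to `img.crop(box)` to obtain one character image. Real CAPTCHAs rarely space their characters evenly, which is why the quoted code adjusts the width per column.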
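The Arch-Enemy: Logic CAPTCHAs

In fact, a logic CAPTCHA is no longer really a "code" but a reasoning task. For example (image from Baidu):

And the one we know best:

This is no longer the one-to-one matching described above: the viewer has to recognize the content, reason about it, and then enter the result. The methods described so far can hardly handle it, and concrete solutions are beyond the scope of this article.

Conclusion

CAPTCHAs guard websites and applications, and their role keeps growing. Even if you are not studying Python crawlers but are a website administrator, you still need a solid understanding of how CAPTCHAs are recognized, because it matters a great deal to your site's security.

We study CAPTCHA recognition in order to strengthen network security. For anyone using crawler technology, using it safely and non-destructively is both the bottom line and a personal standard. Before crawling, find out whether the content may be crawled, follow the rules in robots.txt, and wait as long as practical between requests rather than hammering the server with unrestrained scraping.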
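These two habits, checking robots.txt and pausing between requests, can be sketched with the standard library alone. The helper names `make_robot_rules` and `polite_fetch`, the 2-second delay, and the example URLs are assumptions made for this illustration; `fetch` stands for whatever download function the crawler uses.

```python
import time
import urllib.robotparser


def make_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch(urls, rules, fetch, user_agent="*", delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) only for allowed URLs and pause after each request."""
    results = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # robots.txt disallows this URL, so skip it
        results[url] = fetch(url)
        sleep(delay_seconds)
    return results


rules = make_robot_rules("User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch("*", "https://example.com/private/page"))  # prints False
```

The `sleep` parameter exists only so that the pause can be replaced in tests; in normal use the default `time.sleep` applies.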

Partial references:

https://www.cnblogs.com/shaosks/p/9700610.html

https://blog.csdn.net/dream_people/article/details/83393134
Original article: https://www.cnblogs.com/JHelius/p/14318880.html