k-Nearest Neighbors: A Handwriting Recognition System

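The handwritten digits are 32x32 black-and-white images. To use the kNN classifier, each 32x32 binary image must first be converted into a 1x1024 vector.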

1. Converting an image to a vector

from numpy import *
# Import the scientific computing package numpy and the operator module
import operator
from os import listdir

def img2vector(filename):
    """
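    Convert image data to a vector.
    :param filename: path to the image file; the input images are 32 x 32
    :return: a 1 x 1024 NumPy array
    The function creates a 1 x 1024 NumPy array, opens the given file,
    reads the first 32 lines, stores the first 32 characters of each line
    in the array, and returns the array.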
    """
    returnVect = zeros((1, 1024))
    # Use a context manager so the file is closed after reading
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

Test:

testVector = img2vector('F:/迅雷下载/machinelearninginaction/Ch02/testDigits/0_13.txt')
testVector[0, 0:31]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
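
As an alternative sketch, the same conversion can be written with a list comprehension and reshape instead of nested index arithmetic. The helper name img2vector_np and the temporary file built here are illustrative assumptions and not part of the original code; the file layout (32 lines of 32 '0'/'1' characters) follows the description above.

```python
import os
import tempfile

import numpy as np


def img2vector_np(filename):
    """Read a 32x32 text image of '0'/'1' characters into a 1x1024 array.

    Hypothetical alternative to img2vector; assumes the file has at least
    32 lines with at least 32 digit characters each.
    """
    with open(filename) as fr:
        # Keep the first 32 characters of each of the first 32 lines.
        rows = [[int(ch) for ch in fr.readline()[:32]] for _ in range(32)]
    return np.array(rows, dtype=float).reshape(1, 1024)


# Build a small synthetic image: all zeros except a one at row 1, column 2.
lines = ["0" * 32 for _ in range(32)]
lines[1] = "00" + "1" + "0" * 29
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(lines) + "\n")
    tmp_path = tmp.name

vec = img2vector_np(tmp_path)
os.remove(tmp_path)

print(vec.shape)            # (1, 1024)
print(vec[0, 32 * 1 + 2])   # 1.0
```

Row i, column j of the image lands at index 32 * i + j of the flattened vector, which matches the indexing used in img2vector.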

2. The kNN classifier

def classify0(inX, dataSet, labels, k):
    """
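    inX: the input vector to classify
    dataSet: the training sample set
    labels: the label vector
    k: the number of nearest neighbors to use
    Note: labels must have as many elements as dataSet has rows; the function uses Euclidean distance.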
    """
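    # Number of rows in the data set
    dataSetSize = dataSet.shape[0]
    # Use tile to build a matrix matching the training samples, then take the difference
    """
    tile(inX, (3, 1)): the first number (3) is how many rows of copies to make;
    the second number (e.g. 1 or 2) is how many times inX is repeated within each row
    Example: In []: inX = [1, 2, 3]
                    tile(inX, (3, 1))

             Out[]: array([[1, 2, 3],
                           [1, 2, 3],
                           [1, 2, 3]])
    """
    # Build a matrix with the same shape as dataSet from inX (the input vector), then subtract dataSet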
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
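    # Square each difference
    sqDiffMat = diffMat ** 2
    # Sum each row of the matrix
    sqDistances = sqDiffMat.sum(axis=1)
    # Take the square root
    distances = sqDistances ** 0.5
    # Sort the distances in ascending order and return the corresponding indices
    # argsort() returns the indices that would sort the elements of x in ascending order.
    """
    In [] : y = argsort([3, 0, 2, -1, 4, 5])
            print(y[0])
            print(y[5])
    Out[] : 3
            5
    The smallest value is -1 at index 3, so y[0] = 3; the largest value is 5 at index 5, so y[5] = 5
    """
    sortedDistIndicies = distances.argsort()
    # Select the k points with the smallest distances and count their labels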
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
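
The distance step in classify0 can also be written with NumPy broadcasting instead of tile, and the vote with collections.Counter. The following is a minimal sketch under those assumptions; classify_knn and the toy data are hypothetical names introduced for illustration, not part of the original code.

```python
from collections import Counter

import numpy as np


def classify_knn(in_x, data_set, labels, k):
    """Return the majority label among the k nearest training rows."""
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts in_x from every row, so tile() is not needed.
    diff = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first.
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two points near (1, 1) labeled 'A', two near (0, 0) labeled 'B'.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']

print(classify_knn([0.1, 0.2], group, group_labels, 3))  # B
print(classify_knn([0.9, 1.0], group, group_labels, 3))  # A
```

For the first query the three nearest rows carry the labels B, B, A, so the vote returns B; for the second they are A, A, B, so it returns A.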

3. Test code for the handwriting recognition system

def handwritingClassTest():
    # 1. Load the training data
    hwLabels = []
    trainingFileList = listdir('F:/迅雷下载/machinelearninginaction/Ch02/trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    # hwLabels stores the digit label (0-9) of each file; row i of trainingMat holds the vector of the i-th image
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        # Convert the 32*32 matrix into a 1*1024 vector
        trainingMat[i, :] = img2vector('F:/迅雷下载/machinelearninginaction/Ch02/trainingDigits/%s' % fileNameStr)

    # 2. Load the test data
    testFileList = listdir('F:/迅雷下载/machinelearninginaction/Ch02/testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('F:/迅雷下载/machinelearninginaction/Ch02/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if (classifierResult != classNumStr): errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

handwritingClassTest()

the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
...

the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9

the total number of errors is: 10

the total error rate is: 0.010571
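Using k-nearest neighbors to recognize the handwritten digits gives an error rate of about 1.1%. Changing the value of k, modifying handwritingClassTest to select training samples at random, or changing the number of training samples will all affect the error rate.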
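
To see how k and the training-set size move the error rate, one can sweep both on a small data set. The sketch below uses synthetic two-class Gaussian data rather than the digit files, so the numbers it prints say nothing about the 1.1% figure; knn_predict, knn_error_rate, make_data, and all parameter values are illustrative assumptions.

```python
from collections import Counter

import numpy as np


def knn_predict(x, train_x, train_y, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    distances = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]


def knn_error_rate(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label is wrong."""
    errors = sum(
        knn_predict(x, train_x, train_y, k) != y
        for x, y in zip(test_x, test_y)
    )
    return errors / len(test_y)


def make_data(rng, n_per_class):
    """Two overlapping 2-D Gaussian blobs labeled 0 and 1 (synthetic)."""
    x0 = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 2))
    x1 = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 2))
    x = np.vstack([x0, x1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return x, y


rng = np.random.default_rng(0)
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 100)

results = {}
for n_train in (50, 400):
    # Randomly pick n_train training samples without replacement.
    idx = rng.choice(len(train_y), size=n_train, replace=False)
    for k in (1, 3, 7):
        rate = knn_error_rate(train_x[idx], train_y[idx], test_x, test_y, k)
        results[(n_train, k)] = rate
        print("n_train=%d k=%d error rate=%.3f" % (n_train, k, rate))
```

The same loop structure could be wrapped around handwritingClassTest by passing k and a random subset of the training file list as parameters.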
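In practice, this algorithm is not very efficient. For each test vector it performs about 2,000 distance calculations, each involving floating-point operations over 1,024 dimensions, and this is repeated for each of about 900 test vectors.
The k-d tree is an optimization of k-nearest neighbors that can reduce the number of distance calculations.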
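
The cost described here can be estimated with a few lines of arithmetic. The figures used (2,000 training vectors, 1,024 dimensions, 900 test vectors) are the approximate values quoted in the text, and counting about three floating-point operations per dimension (subtract, square, add) is an assumption made for illustration.

```python
# Approximate figures quoted in the text.
n_train = 2000      # distance calculations per test vector
n_dims = 1024       # dimensions per distance calculation
n_test = 900        # number of test vectors

# Assumption: about 3 floating-point operations per dimension
# (one subtraction, one squaring, one addition into the running sum).
ops_per_dim = 3

distance_calcs = n_train * n_test
total_ops = distance_calcs * n_dims * ops_per_dim

print(distance_calcs)  # 1800000
print(total_ops)       # 5529600000
```

Under these assumptions a full test run needs about 1.8 million distance calculations and on the order of 5.5 billion floating-point operations.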

4. Summary

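Characteristics of the k-nearest neighbors algorithm:

1. It is one of the simplest and most effective algorithms for classifying data

2. It must keep the entire data set, which uses a large amount of storage

3. It must compute a distance for every data point, which is very time-consuming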
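
The storage cost in point 2 can be made concrete with a rough estimate. The sketch below assumes a training set of about 2,000 images (the approximate figure quoted earlier) and assumes every pixel is strictly 0 or 1, which allows bit-packing with np.packbits; this packing step is an illustration and not something the original code does.

```python
import numpy as np

n_train = 2000  # approximate training-set size quoted in the text

# zeros((m, 1024)) defaults to float64: 8 bytes per pixel.
as_float = np.zeros((n_train, 1024))
print(as_float.nbytes)  # 16384000

# Assumption: pixels are strictly 0 or 1, so each fits in a single bit.
as_bits = np.packbits(as_float.astype(np.uint8), axis=1)
print(as_bits.shape, as_bits.nbytes)  # (2000, 128) 256000

# unpackbits restores the 0/1 matrix when distances need to be computed.
restored = np.unpackbits(as_bits, axis=1)
print(np.array_equal(restored, as_float))  # True
```

Packing trades a 64x reduction in memory for an extra unpacking step before each distance computation.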

Original article: https://www.cnblogs.com/gezhuangzhuang/p/9979770.html