• 《机器学习实战》KNN算法实现



    本系列都是参考《机器学习实战》这本书,只对学习过程一个记录,不做详细的描述!

    注释:看了一段时间Ng的机器学习视频,感觉不能光看不练,现在一边练习再一边去学习理论!

    KNN很早就之前就看过也记录过,在此不做更多说明,这是k-means之前的记录,感觉差不多:http://www.cnblogs.com/wjy-lulu/p/7002688.html

    1.简单的分类

    代码:

     1 import numpy as np
     2 import operator
     3 import KNN
     4 
     5 def classify0(inX,dataSet,labels,k):
     6      dataSetSize = dataSet.shape[0] #样本个数
     7      diffMat = np.tile(inX,(dataSetSize,1)) - dataSet#样本每个值和测试数据做差
     8      sqDiffMat = diffMat**2#平方
     9      sqDistances = sqDiffMat.sum(axis=1)#第二维度求和,也就是列
    10      distances = sqDistances**0.5#平方根
    11      sortedDistIndicies = distances.argsort()#下标排序
    12      classCount = {}
    13 
    14      for i in range(k):
    15           voteIlabel = labels[sortedDistIndicies[i]]#得到距离最近的几个数
    16           classCount[voteIlabel] = classCount.get(voteIlabel,0)+1#标签计数
    17      sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)#按照数值排序operator.itemgetter(1)代表第二个域
    18      #上面排序之后就不是字典了,而是一个列表里面包含的元组[('c',2),('a',3)]
    19      return sortedClassCount[0][0]
    20 
    21 if __name__ == '__main__':
    22      group,labels = KNN.createDataSet()
    23      result = classify0([0,0.5],group,labels,1)
    24      print (result)

    KNN.Py文件

    1 import numpy as np
    2 import operator
    3 
    4 
    5 def createDataSet():
    6     group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    7     labels = ['A', 'B', 'C', 'D']
    8     return group, labels

    2.约会网站的预测

      下面给出每个部分的代码和注释:

    A.文本文件转换为可用数据

    上面的文本中有空格和换行,而且样本和标签都在一起,必须的分开处理成矩阵才可以进行下一步操作。

     1 def file2matrix(filename):#把文件转化为可操作数据
     2     fr = open(filename)#打开文件
     3     arrayOLines = fr.readlines()#读取每行文件
     4     numberOfLines = len(arrayOLines)#行数量
     5     returnMat = np.zeros([numberOfLines,3])#存储数据
     6     classLabelVector = []
     7     index = 0
     8     for line in arrayOLines:
     9         line = line.strip()#去除换行符
    10         listFromLine = line.split('	')#按照空格去分割
    11         returnMat[index,:] = listFromLine[0:3]#样本
    12         classLabelVector.append(int(listFromLine[-1]))#labels
    13         index += 1
    14     return returnMat,classLabelVector#返回数据和标签

     B.归一化

    数据大小差异太明显,比如有三个特征:a=[1,2,3],b=[1000,2000,3000],c=[0.1,0.2,0.3],我们发现c和a根本没啥作用,因为b的值太大了,或者说b的权重太大了,Ng中可以用惩罚系数去操作,或者正则化都可以处理这类数据,当然这是题外话。

     1 def autoNorm(dataSet):#归一化函数
     2     #每列的最值
     3     minValue = dataSet.min(0)
     4     maxValue = dataSet.max(0)
     5     range = maxValue - minValue
     6     #创建最小值矩阵
     7     midData =  np.tile(minValue,[dataSet.shape[0],1])
     8     dataSet = dataSet - midData
     9     #创建range矩阵
    10     range = np.tile(range,[dataSet.shape[0],1])
    11     dataSet = dataSet / range #直接相除不是矩阵相除
    12     return dataSet,minValue,maxValue

    C.预测

    KNN的方法就是距离,计算K个距离,然后排序看哪个占得比重大就选哪个类。

     1 def classify0(inX, dataSet, labels, k):#核心分类程序
     2     dataSetSize = dataSet.shape[0]  # 样本个数
     3     diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # 样本每个值和测试数据做差
     4     sqDiffMat = diffMat ** 2  # 平方
     5     sqDistances = sqDiffMat.sum(axis=1)  # 第二维度求和,也就是列
     6     distances = sqDistances ** 0.5  # 平方根
     7     sortedDistIndicies = distances.argsort()  # 下标排序
     8     classCount = {}
     9 
    10     for i in range(k):
    11         voteIlabel = labels[sortedDistIndicies[i]]  # 得到距离最近的几个数
    12         classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # 标签计数
    13     sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
    14                               reverse=True)  # 按照数值排序operator.itemgetter(1)代表第二个域
    15     # 上面排序之后就不是字典了,而是一个列表里面包含的元组[('c',2),('a',3)]
    16     return sortedClassCount[0][0]

    D.性能测试

    比如1000个数据,900个用做样本,100用做测试,看看精确度是多少?

     1 def datingClassTest():
     2     hoRatio = 0.2
     3     datingDataMat , datingLabels = file2matrix('datingTestSet2.txt')
     4     normMat = autoNorm(datingDataMat)
     5     n = normMat.shape[0]
     6     numTestVecs = int(n*hoRatio)#测试数据和样本数据的分割点
     7     erroCount = 0.0
     8     #numTestVecs:n样本,[i,numTestVecs]测试
     9     for i in range(numTestVecs):
    10         classfiResult = classify0(normMat[i,:],normMat[numTestVecs:n,:],
    11                                   datingLabels[numTestVecs:n],3)
    12         if (classfiResult!=datingLabels[i]): erroCount+=1.0
    13     print ("the totle error os: %f" %(erroCount/float(numTestVecs)))

     E.实战分类

    注意输入的数据也得归一化

     1 def classfiPerson():
     2     resultList = ['not at all','in small doses','in large doses']
     3     personTats = float(input('please input video game 
    '))
     4     ffMiles = float(input('please input flier miles 
    '))
     5     iceCream = float(input('please input ice cream 
    '))
     6     datingData,datingLabels = file2matrix('datingTestSet2.txt')
     7     normData,minData,maxData = autoNorm(datingData)
     8     inputData = np.array([personTats,ffMiles,iceCream])#转化为矩阵
     9     inputData = (inputData - minData)/(maxData - minData)#输入归一化
    10     result = classify0(inputData,normData,datingLabels,3)
    11     print('等级是:',result)

    F.可视化显示

     1      datingDatas, datingLabels = KNN.file2matrix('datingTestSet2.txt')
     2      #可视化样本数据显示
     3      fig = plt.figure('data_show')
     4      ax = fig.add_subplot(111)
     5      for i in range(datingDatas.shape[0]):
     6           if datingLabels[i]==1:
     7                ax.scatter(datingDatas[i, 0], datingDatas[i, 1], marker="*",c='r')  # 用后两个特征绘图
     8 
     9           if datingLabels[i]==2:
    10                ax.scatter(datingDatas[i, 0], datingDatas[i, 1], marker="s", c='g')  # 用后两个特征绘图
    11 
    12           if datingLabels[i]==3:
    13                ax.scatter(datingDatas[i, 0], datingDatas[i, 1], marker="^", c='b')  # 用后两个特征绘图
    14      plt.show()

    G.完整代码

     1 import numpy as np
     2 import operator
     3 #from numpy import *
     4 
     5 def createDataSet():#创建简单测试的几个数
     6     group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
     7     labels = ['A', 'B', 'C', 'D']
     8     return group, labels
     9 
    10 def autoNorm(dataSet):#归一化函数
    11     #每列的最值
    12     minValue = dataSet.min(0)
    13     maxValue = dataSet.max(0)
    14     range = maxValue - minValue
    15     #创建最小值矩阵
    16     midData =  np.tile(minValue,[dataSet.shape[0],1])
    17     dataSet = dataSet - midData
    18     #创建range矩阵
    19     range = np.tile(range,[dataSet.shape[0],1])
    20     dataSet = dataSet / range #直接相除不是矩阵相除
    21     return dataSet,minValue,maxValue
    22 
    23 def file2matrix(filename):#把文件转化为可操作数据
    24     fr = open(filename)#打开文件
    25     arrayOLines = fr.readlines()#读取每行文件
    26     numberOfLines = len(arrayOLines)#行数量
    27     returnMat = np.zeros([numberOfLines,3])#存储数据
    28     classLabelVector = []
    29     index = 0
    30     for line in arrayOLines:
    31         line = line.strip()#去除换行符
    32         listFromLine = line.split('	')#按照空格去分割
    33         returnMat[index,:] = listFromLine[0:3]#样本
    34         classLabelVector.append(int(listFromLine[-1]))#labels
    35         index += 1
    36     return returnMat,classLabelVector#返回数据和标签
    37 
    38 def classify0(inX, dataSet, labels, k):#核心分类程序
    39     dataSetSize = dataSet.shape[0]  # 样本个数
    40     diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # 样本每个值和测试数据做差
    41     sqDiffMat = diffMat ** 2  # 平方
    42     sqDistances = sqDiffMat.sum(axis=1)  # 第二维度求和,也就是列
    43     distances = sqDistances ** 0.5  # 平方根
    44     sortedDistIndicies = distances.argsort()  # 下标排序
    45     classCount = {}
    46 
    47     for i in range(k):
    48         voteIlabel = labels[sortedDistIndicies[i]]  # 得到距离最近的几个数
    49         classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # 标签计数
    50     sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
    51                               reverse=True)  # 按照数值排序operator.itemgetter(1)代表第二个域
    52     # 上面排序之后就不是字典了,而是一个列表里面包含的元组[('c',2),('a',3)]
    53     return sortedClassCount[0][0]
    54 
    55 def datingClassTest():
    56     hoRatio = 0.2
    57     datingDataMat , datingLabels = file2matrix('datingTestSet2.txt')
    58     normMat = autoNorm(datingDataMat)
    59     n = normMat.shape[0]
    60     numTestVecs = int(n*hoRatio)#测试数据和样本数据的分割点
    61     erroCount = 0.0
    62     #numTestVecs:n样本,[i,numTestVecs]测试
    63     for i in range(numTestVecs):
    64         classfiResult = classify0(normMat[i,:],normMat[numTestVecs:n,:],
    65                                   datingLabels[numTestVecs:n],3)
    66         if (classfiResult!=datingLabels[i]): erroCount+=1.0
    67     print ("the totle error os: %f" %(erroCount/float(numTestVecs)))
    68 
    69 def classfiPerson():
    70     resultList = ['not at all','in small doses','in large doses']
    71     personTats = float(input('please input video game 
    '))
    72     ffMiles = float(input('please input flier miles 
    '))
    73     iceCream = float(input('please input ice cream 
    '))
    74     datingData,datingLabels = file2matrix('datingTestSet2.txt')
    75     normData,minData,maxData = autoNorm(datingData)
    76     inputData = np.array([personTats,ffMiles,iceCream])#转化为矩阵
    77     inputData = (inputData - minData)/(maxData - minData)#输入归一化
    78     result = classify0(inputData,normData,datingLabels,3)
    79     print('等级是:',result)

    3.手写数字识别

     A.转换文件

     1 def img2vector(filename):
     2     returnVector = np.zeros([32,32])
     3     fr = open(filename)
     4     lineData = fr.readlines()
     5     count = 0
     6     for line in lineData:
     7         line = line.strip()#去除换行符
     8         for j in range(len(line)):
     9             returnVector[count,j] = line[j]
    10         count += 1
    11     returnVector = returnVector.reshape(1,1024).astype(int)#转化为1X1024
    12     return returnVector

    B.识别分类

     1 def handWriteringClassTest():
     2     #--------------------------读取数据---------------------------------
     3     hwLabels = []
     4     trainingFileList = os.listdir('trainingDigits')#获取文件目录
     5     m = len(trainingFileList)#获取目录个数
     6     trainingMat = np.zeros([m,1024])#全部样本
     7     for i in range(m):
     8         fileNameStr = trainingFileList[i]
     9         fileStr = fileNameStr.split('.')[0]#得到不带格式的文件名
    10         classNumStr = int(fileStr.split('_')[0])#得到最前面的数字类别0-9
    11         hwLabels.append(classNumStr)#存储
    12         dirList = 'trainingDigits/' + fileNameStr#绝对目录信息
    13         vectorUnderTest = img2vector(dirList)#读取第i个数据信息
    14         trainingMat[i,:] = vectorUnderTest #存储
    15     #--------------------------测试数据--------------------------------
    16     testFileList = os.listdir('testDigits')
    17     errorCount = 0.0
    18     m = len(testFileList)
    19     for i in range(m):
    20         fileNameStr = testFileList[i]
    21         fileInt = fileNameStr.split('.')[0].split('_')[0]
    22         dirList = 'testDigits/' + fileNameStr  # 绝对目录信息
    23         vectorUnderTest = img2vector(dirList)  # 读取第i个数据信息
    24         if int(fileInt) != int(classify0(vectorUnderTest,trainingMat,hwLabels,3)):
    25             errorCount += 1
    26     print('error count is : ',errorCount)
    27     print('error Rate is : ', (errorCount/m))

    C.完整代码

      1 import numpy as np
      2 import operator
      3 import os
      4 #from numpy import *
      5 
      6 def createDataSet():#创建简单测试的几个数
      7     group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
      8     labels = ['A', 'B', 'C', 'D']
      9     return group, labels
     10 
     11 def autoNorm(dataSet):#归一化函数
     12     #每列的最值
     13     minValue = dataSet.min(0)
     14     maxValue = dataSet.max(0)
     15     range = maxValue - minValue
     16     #创建最小值矩阵
     17     midData =  np.tile(minValue,[dataSet.shape[0],1])
     18     dataSet = dataSet - midData
     19     #创建range矩阵
     20     range = np.tile(range,[dataSet.shape[0],1])
     21     dataSet = dataSet / range #直接相除不是矩阵相除
     22     return dataSet,minValue,maxValue
     23 
     24 def file2matrix(filename):#把文件转化为可操作数据
     25     fr = open(filename)#打开文件
     26     arrayOLines = fr.readlines()#读取每行文件
     27     numberOfLines = len(arrayOLines)#行数量
     28     returnMat = np.zeros([numberOfLines,3])#存储数据
     29     classLabelVector = []
     30     index = 0
     31     for line in arrayOLines:
     32         line = line.strip()#去除换行符
     33         listFromLine = line.split('	')#按照空格去分割
     34         returnMat[index,:] = listFromLine[0:3]#样本
     35         classLabelVector.append(int(listFromLine[-1]))#labels
     36         index += 1
     37     return returnMat,classLabelVector#返回数据和标签
     38 
     39 def classify0(inX, dataSet, labels, k):#核心分类程序
     40     dataSetSize = dataSet.shape[0]  # 样本个数
     41     diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # 样本每个值和测试数据做差
     42     sqDiffMat = diffMat ** 2  # 平方
     43     sqDistances = sqDiffMat.sum(axis=1)  # 第二维度求和,也就是列
     44     distances = sqDistances ** 0.5  # 平方根
     45     sortedDistIndicies = distances.argsort()  # 下标排序
     46     classCount = {}
     47 
     48     for i in range(k):
     49         voteIlabel = labels[sortedDistIndicies[i]]  # 得到距离最近的几个数
     50         classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # 标签计数
     51     sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
     52                               reverse=True)  # 按照数值排序operator.itemgetter(1)代表第二个域
     53     # 上面排序之后就不是字典了,而是一个列表里面包含的元组[('c',2),('a',3)]
     54     return sortedClassCount[0][0]
     55 
     56 def datingClassTest():
     57     hoRatio = 0.2
     58     datingDataMat , datingLabels = file2matrix('datingTestSet2.txt')
     59     normMat = autoNorm(datingDataMat)
     60     n = normMat.shape[0]
     61     numTestVecs = int(n*hoRatio)#测试数据和样本数据的分割点
     62     erroCount = 0.0
     63     #numTestVecs:n样本,[i,numTestVecs]测试
     64     for i in range(numTestVecs):
     65         classfiResult = classify0(normMat[i,:],normMat[numTestVecs:n,:],
     66                                   datingLabels[numTestVecs:n],3)
     67         if (classfiResult!=datingLabels[i]): erroCount+=1.0
     68     print ("the totle error os: %f" %(erroCount/float(numTestVecs)))
     69 
     70 def classfiPerson():
     71     resultList = ['not at all','in small doses','in large doses']
     72     personTats = float(input('please input video game 
    '))
     73     ffMiles = float(input('please input flier miles 
    '))
     74     iceCream = float(input('please input ice cream 
    '))
     75     datingData,datingLabels = file2matrix('datingTestSet2.txt')
     76     normData,minData,maxData = autoNorm(datingData)
     77     inputData = np.array([personTats,ffMiles,iceCream])#转化为矩阵
     78     inputData = (inputData - minData)/(maxData - minData)#输入归一化
     79     result = classify0(inputData,normData,datingLabels,3)
     80     print('等级是:',result)
     81 
     82 def img2vector(filename):
     83     returnVector = np.zeros([32,32])
     84     fr = open(filename)
     85     lineData = fr.readlines()
     86     count = 0
     87     for line in lineData:
     88         line = line.strip()#去除换行符
     89         for j in range(len(line)):
     90             returnVector[count,j] = line[j]
     91         count += 1
     92     returnVector = returnVector.reshape(1,1024).astype(int)#转化为1X1024
     93     return returnVector
     94 
     95 def img2vector2(filename):
     96     returnVect = np.zeros([1,1024])
     97     fr = open(filename)
     98     for i in range(32):
     99         lineStr = fr.readline()
    100         for j in range(32):
    101             returnVect[0,32*i+j] = int(lineStr[j])
    102     return returnVect
    103 
    104 def handWriteringClassTest():
    105     #--------------------------读取数据---------------------------------
    106     hwLabels = []
    107     trainingFileList = os.listdir('trainingDigits')#获取文件目录
    108     m = len(trainingFileList)#获取目录个数
    109     trainingMat = np.zeros([m,1024])#全部样本
    110     for i in range(m):
    111         fileNameStr = trainingFileList[i]
    112         fileStr = fileNameStr.split('.')[0]#得到不带格式的文件名
    113         classNumStr = int(fileStr.split('_')[0])#得到最前面的数字类别0-9
    114         hwLabels.append(classNumStr)#存储
    115         dirList = 'trainingDigits/' + fileNameStr#绝对目录信息
    116         vectorUnderTest = img2vector(dirList)#读取第i个数据信息
    117         trainingMat[i,:] = vectorUnderTest #存储
    118     #--------------------------测试数据--------------------------------
    119     testFileList = os.listdir('testDigits')
    120     errorCount = 0.0
    121     m = len(testFileList)
    122     for i in range(m):
    123         fileNameStr = testFileList[i]
    124         fileInt = fileNameStr.split('.')[0].split('_')[0]
    125         dirList = 'testDigits/' + fileNameStr  # 绝对目录信息
    126         vectorUnderTest = img2vector(dirList)  # 读取第i个数据信息
    127         if int(fileInt) != int(classify0(vectorUnderTest,trainingMat,hwLabels,3)):
    128             errorCount += 1
    129     print('error count is : ',errorCount)
    130     print('error Rate is : ', (errorCount/m))

  • 相关阅读:
    Java 第一章 初识Java
    Tomcat基础教程(三)
    Tomcat基础教程(二)
    Web Service相关工具的配置
    分布式版本控制系统Git的安装与使用
    个人项目小学四则运算 “软件”之初版
    结对项目四则运算 “软件”之升级版
    第一篇作业准备
    Linux常用命令入门文件、网络、系统及其他操作命令
    MySql5.7默认生成的密码无法正常登陆
  • 原文地址:https://www.cnblogs.com/wjy-lulu/p/7853260.html
Copyright © 2020-2023  润新知