    KNN: k-Nearest Neighbors


    KNN (k-Nearest Neighbors) is conceptually simple and uses almost no advanced mathematics, which makes it a very practical entry point into machine learning. Working through it end to end also clarifies many of the details that come up when applying machine learning algorithms, and gives a fairly complete picture of a typical machine learning workflow.

    Here is the basic idea behind KNN. Suppose we have two classes of training data, one shown as red points and the other as blue points. When a new data point (shown in green) arrives, how do we decide which class it belongs to? We first choose a value for k. When the new point arrives, we compute the distance between it and every point in the training set, usually the Euclidean distance. We then take the k closest training points (k is usually chosen to be odd so that the vote below cannot end in a tie) and assign the new point to whichever class holds the majority among those k neighbors.

    [Figure: two classes of training points and a new green point to be classified]
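
    For reference, the Euclidean distance between two n-dimensional points x and y, used throughout this post, is

    $$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$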

    1. KNN Basics

    Step 1. Create the training set x_train and y_train together with a new sample x_new, and visualize them with matplotlib.
    import numpy as np
    import matplotlib.pyplot as plt
    
    raw_data_x = [[3.3935, 2.3313],
                  [3.1101, 1.7815],
                  [1.3438, 3.3684],
                  [3.5823, 4.6792],
                  [2.2804, 2.8670],
                  [7.4234, 4.6965],
                  [5.7451, 3.5340],
                  [9.1722, 2.5111],
                  [7.7928, 3.4241],
                  [7.9398, 0.7916]]
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    x_train = np.array(raw_data_x)
    y_train = np.array(raw_data_y)
    
    x_new = np.array([8.0936, 3.3657])
    
    plt.scatter(x_train[y_train==0,0], x_train[y_train==0,1], color='g')
    plt.scatter(x_train[y_train==1,0], x_train[y_train==1,1], color='r')
    plt.scatter(x_new[0], x_new[1], color='b')
    plt.show()
    
    Step 2. The kNN procedure
    • Compute the distance from x_new to every training sample
    from math import sqrt
    distances = []
    for x in x_train:
        # Euclidean distance between one training sample and the new point
        d = sqrt(np.sum((x - x_new) ** 2))
        distances.append(d)
        
    # The loop above can also be written in a single line:
    # distances = [sqrt(np.sum((x - x_new) ** 2)) for x in x_train]
    

    Output (distances rounded to four decimal places):

    [4.8126,
     5.2292,
     6.7498,
     4.6986,
     5.8346,
     1.4900,
     2.3545,
     1.3761,
     0.3064,
     2.5787]
    
    • Sort the distances; np.argsort returns the indices that would sort the array
    nearest = np.argsort(distances)
    

    Output: array([8, 7, 5, 6, 9, 3, 0, 1, 4, 2], dtype=int64)

    • Take the k nearest points, say k = 5
    k = 5
    topk_y = [y_train[i] for i in nearest[:k]]
    topk_y
    

    Output: [1, 1, 1, 1, 1]

    Looking at the output, all five of the new point's nearest neighbors belong to class 1, so by majority vote the new point is assigned to class 1. In general the k neighbors may be split between classes, which is exactly why we take a vote rather than rely on a single neighbor.

    • Vote
    from collections import Counter
    Counter(topk_y)
    

    Output: Counter({1: 5})

    votes = Counter(topk_y)
    votes.most_common(1)
    y_new = votes.most_common(1)[0][0]
    

    Output: 1

    And with that, we have implemented a basic kNN classifier!

    2. Writing a kNN Function by Hand

    kNN is a machine learning algorithm with no explicit training step: the training set itself can be regarded as the model.

    import numpy as np
    from math import sqrt
    from collections import Counter
    
    def kNN_classifier(k, x_train, y_train, x):
    
        assert 1 <= k <= x_train.shape[0], "k must be valid"
        assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to the size of y_train"
        assert x_train.shape[1] == x.shape[0], "the feature number of x must be equal to x_train"
    
        # note: the loop variable must not shadow the new sample x
        distances = [sqrt(np.sum((x_i - x) ** 2)) for x_i in x_train]
        nearest = np.argsort(distances)
    
        topk_y = [y_train[i] for i in nearest[:k]]
        votes = Counter(topk_y)
        
        return votes.most_common(1)[0][0]
    

    Test it:

    raw_data_x = [[3.3935, 2.3313],
                  [3.1101, 1.7815],
                  [1.3438, 3.3684],
                  [3.5823, 4.6792],
                  [2.2804, 2.8670],
                  [7.4234, 4.6965],
                  [5.7451, 3.5340],
                  [9.1722, 2.5111],
                  [7.7928, 3.4241],
                  [7.9398, 0.7916]]
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    x_train = np.array(raw_data_x)
    y_train = np.array(raw_data_y)
    
    x_new = np.array([8.0936, 3.3657])
    
    y_new = kNN_classifier(5, x_train, y_train, x_new)
    print(y_new)
    
    

    3. Using KNN from sklearn

    from sklearn.neighbors import KNeighborsClassifier
    import numpy as np
    
    raw_data_x = [[3.3935, 2.3313],
                  [3.1101, 1.7815],
                  [1.3438, 3.3684],
                  [3.5823, 4.6792],
                  [2.2804, 2.8670],
                  [7.4234, 4.6965],
                  [5.7451, 3.5340],
                  [9.1722, 2.5111],
                  [7.7928, 3.4241],
                  [7.9398, 0.7916]]
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    x_train = np.array(raw_data_x)
    y_train = np.array(raw_data_y)
    
    x_new = np.array([8.0936, 3.3657])
    
    knn_classifier = KNeighborsClassifier(n_neighbors=5)
    knn_classifier.fit(x_train, y_train)
    
    y_new = knn_classifier.predict(x_new.reshape(1, -1))
    print(y_new[0])
    
    

    4. A Class-Based KNN of Our Own

    import numpy as np
    from math import sqrt
    from collections import Counter
    
    class KNNClassifier():
    
        def __init__(self, k):
            assert 1 <= k, "k must be valid"
            self.k = k
            self._x_train = None
            self._y_train = None
    
        def fit(self, x_train, y_train):
            assert x_train.shape[0] == y_train.shape[0], \
                "the size of x_train must be equal to the size of y_train"
            assert self.k <= x_train.shape[0], \
                "the size of x_train must be at least k"
    
            self._x_train = x_train
            self._y_train = y_train
            return self
    
        def predict(self, x_new):
            x_new = x_new.reshape(1, -1)
            assert self._x_train is not None and self._y_train is not None, \
                "must fit before predict"
            assert x_new.shape[1] == self._x_train.shape[1], \
                "the feature number of x must be equal to x_train"
    
            y_new = [self._predict(x) for x in x_new]
            return np.array(y_new)
    
        def _predict(self, x):
            assert x.shape[0] == self._x_train.shape[1], \
                "the feature number of x must be equal to x_train"
    
            distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._x_train]
            nearest = np.argsort(distances)
    
            topk_y = [self._y_train[i] for i in nearest[:self.k]]
            votes = Counter(topk_y)
    
            return votes.most_common(1)[0][0]
    
        def __repr__(self):
            return "KNN(k=%d)" % self.k
    
    

    Test it:

    raw_data_x = [[3.3935, 2.3313],
                  [3.1101, 1.7815],
                  [1.3438, 3.3684],
                  [3.5823, 4.6792],
                  [2.2804, 2.8670],
                  [7.4234, 4.6965],
                  [5.7451, 3.5340],
                  [9.1722, 2.5111],
                  [7.7928, 3.4241],
                  [7.9398, 0.7916]]
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    x_train = np.array(raw_data_x)
    y_train = np.array(raw_data_y)
    
    x_new = np.array([8.0936, 3.3657])
    
    knn_clf = KNNClassifier(6)
    knn_clf.fit(x_train, y_train)
    y_new = knn_clf.predict(x_new)
    print(y_new[0])
    
    

    5. Splitting the Dataset

    import numpy as np
    from sklearn import datasets
    
    
    def train_test_split(x, y, test_ratio=0.2, seed=None):
    
        assert x.shape[0] == y.shape[0], "the size of x must be equal to the size of y"
        assert 0.0 <= test_ratio <= 1.0, "test_ratio must be valid"
    
        if seed is not None:
            np.random.seed(seed)
    
        shuffle_idx = np.random.permutation(len(x))
    
        test_size = int(len(x) * test_ratio)
        test_idx = shuffle_idx[:test_size]
        train_idx = shuffle_idx[test_size:]
    
        x_train = x[train_idx]
        y_train = y[train_idx]
    
        x_test = x[test_idx]
        y_test = y[test_idx]
    
        return x_train, y_train, x_test, y_test
    
    

    6. Testing KNN on the sklearn Iris Dataset

    import numpy as np
    from sklearn import datasets
    from knn_clf import KNNClassifier
    from shuffle_dataset import train_test_split  # the split function written in the previous section
    
    iris = datasets.load_iris()
    x = iris.data
    y = iris.target
    
    x_train, y_train, x_test, y_test = train_test_split(x, y)
    my_knn_clf = KNNClassifier(k=3)
    my_knn_clf.fit(x_train, y_train)
    
    y_predict = my_knn_clf.predict(x_test)
    print(sum(y_predict == y_test))
    print(sum(y_predict == y_test) / len(y_test))
    
    
    # The data can also be split with sklearn's built-in train_test_split
    from sklearn.model_selection import train_test_split
    import numpy as np
    from sklearn import datasets
    from knn_clf import KNNClassifier
    
    iris = datasets.load_iris()
    x = iris.data
    y = iris.target
    # note: sklearn's train_test_split returns (x_train, x_test, y_train, y_test)
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                         test_size=0.2, random_state=666)
    my_knn_clf = KNNClassifier(k=3)
    my_knn_clf.fit(x_train, y_train)
    y_predict = my_knn_clf.predict(x_test)
    print(sum(y_predict == y_test))
    print(sum(y_predict == y_test) / len(y_test))
    
    

    7. Testing KNN on the sklearn Handwritten Digits Dataset

    First, let's get to know the handwritten digits dataset.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    digits = datasets.load_digits()
    digits.keys()
    print(digits.DESCR)
    
    x = digits.data          # x and y must be defined before inspecting them
    y = digits.target
    y.shape
    digits.target_names
    y[:100]
    x[:10]
    
    # display one sample as an 8x8 grayscale image
    some_digit = x[666]
    y[666]
    some_digit_image = some_digit.reshape(8, 8)
    plt.imshow(some_digit_image, cmap=plt.cm.binary)
    plt.show()
    
    

    Next, let's try it out.

    from sklearn import datasets
    from shuffle_dataset import train_test_split
    from knn_clf import KNNClassifier
    
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target
    
    x_train, y_train, x_test, y_test = train_test_split(x, y, test_ratio=0.2)
    my_knn_clf = KNNClassifier(k=3)
    my_knn_clf.fit(x_train, y_train)
    y_predict = my_knn_clf.predict(x_test)
    
    print(sum(y_predict == y_test) / len(y_test))
    
    

    Wrap the accuracy calculation in a function so it is easy to reuse.

    def accuracy_score(y_true, y_predict):
        assert y_true.shape[0] == y_predict.shape[0], \
            "the size of y_true must be equal to the size of y_predict"
    
        return sum(y_true == y_predict) / len(y_true)
    
    

    Next, fold it into the KNNClassifier class as a score method.

    import numpy as np
    from math import sqrt
    from collections import Counter
    from metrics import accuracy_score
    
    class KNNClassifier():
    
        def __init__(self, k):
            assert 1 <= k, "k must be valid"
            self.k = k
            self._x_train = None
            self._y_train = None
    
        def fit(self, x_train, y_train):
            assert x_train.shape[0] == y_train.shape[0], \
                "the size of x_train must be equal to the size of y_train"
            assert self.k <= x_train.shape[0], \
                "the size of x_train must be at least k"
    
            self._x_train = x_train
            self._y_train = y_train
            return self
    
        def predict(self, x_new):
            # x_new = x_new.reshape(1, -1)
            assert self._x_train is not None and self._y_train is not None, \
                "must fit before predict"
            assert x_new.shape[1] == self._x_train.shape[1], \
                "the feature number of x must be equal to x_train"
    
            y_new = [self._predict(x) for x in x_new]
            return np.array(y_new)
    
        def _predict(self, x):
            assert x.shape[0] == self._x_train.shape[1], \
                "the feature number of x must be equal to x_train"
    
            distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._x_train]
            nearest = np.argsort(distances)
    
            topk_y = [self._y_train[i] for i in nearest[:self.k]]
            votes = Counter(topk_y)
    
            return votes.most_common(1)[0][0]
    
        def score(self, x_test, y_test):
            y_predict = self.predict(x_test)
            return accuracy_score(y_test, y_predict)
    
        def __repr__(self):
            return "KNN(k=%d)" % self.k
    
    

    In fact, all of this is already built into sklearn.

    from sklearn import datasets
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    knn_classifier = KNeighborsClassifier(n_neighbors=3)
    knn_classifier.fit(x_train, y_train)
    knn_classifier.score(x_test, y_test)
    
    

    8. Hyperparameters

    • k

    Which value of the hyperparameter k works best for kNN? We can simply try a range of values and keep the best one:

    from sklearn import datasets
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    
    best_score = 0.0
    best_k = -1
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k)
        knn_clf.fit(x_train, y_train)
        score = knn_clf.score(x_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
    print("best k=", best_k)
    print("best score=", best_score)
    
    
    • Voting scheme (weights)

    [Figure: a green ball whose three nearest neighbors are red ball No. 1, purple ball No. 3, and blue ball No. 4]

    In the figure above, the three balls nearest to the green one are the red ball (No. 1), the purple ball (No. 3) and the blue ball (No. 4). If we only take a plain majority vote among the green ball's k nearest neighbors, the result here is a tie; and even when there is no tie, the red ball is still the one closest to the green ball. In such cases we can give each neighbor's vote a weight, usually the inverse of its distance. Suppose the three distances are 1, 3 and 4 respectively:

    Red ball: 1        Purple + blue: 1/3 + 1/4 = 7/12

    The purple and blue weights together are still smaller than the red weight, so the green ball is ultimately assigned to the red class. Weighting votes by inverse distance therefore resolves ties effectively, and the voting scheme is another hyperparameter of kNN.
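
    To make the arithmetic concrete, here is a minimal sketch (not code from the original post) that tallies inverse-distance weights for the three neighbors above, treating the purple and blue balls as members of the same class:

    from collections import Counter
    
    # hypothetical neighbours taken from the figure: (class label, distance to the green ball)
    neighbors = [("red", 1.0), ("other", 3.0), ("other", 4.0)]
    
    weights = Counter()
    for label, dist in neighbors:
        weights[label] += 1.0 / dist   # inverse-distance weight
    
    print(weights)                       # Counter({'red': 1.0, 'other': 0.5833...})
    print(weights.most_common(1)[0][0])  # 'red' wins even though it is outnumbered 2 to 1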

    sklearn's kNN already takes care of this: KNeighborsClassifier(n_neighbors=k, weights=...) has a weights parameter with two common options, "uniform" and "distance".

    from sklearn import datasets
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    
    best_method = ""
    best_score = 0.0
    best_k = -1
    for method in ["uniform", "distance"]:
        for k in range(1, 11):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
            knn_clf.fit(x_train, y_train)
            score = knn_clf.score(x_test, y_test)
            if score > best_score:
                best_method = method
                best_k = k
                best_score = score
    print("best_method=", best_method)
    print("best k=", best_k)
    print("best score=", best_score)
    
    
    • p

      If we weight by distance, there are many distance metrics to choose from: the Euclidean distance, the Manhattan distance, and more generally the Minkowski distance.
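
    The Minkowski distance with exponent p contains both as special cases: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance; this is exactly the p parameter of sklearn's KNeighborsClassifier.

    $$ d_p(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p} $$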

    from sklearn import datasets
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    
    digits = datasets.load_digits()
    x = digits.data
    y = digits.target
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    
    best_p = -1
    best_score = 0.0
    best_k = -1
    for p in range(1, 6):
        for k in range(1, 11):
            knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
            knn_clf.fit(x_train, y_train)
            score = knn_clf.score(x_test, y_test)
            if score > best_score:
                best_p = p
                best_k = k
                best_score = score
    print("best_p=", best_p)
    print("best k=", best_k)
    print("best score=", best_score)
    
    

    More systematic hyperparameter search for kNN (for example with grid search) will be covered in a later post.
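
    As a preview, here is a minimal sketch (not from the original post) that uses sklearn's GridSearchCV to search n_neighbors, weights and p together on the same digits data:

    from sklearn import datasets
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    
    digits = datasets.load_digits()
    x_train, x_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=666)
    
    # candidate grid; p (the Minkowski exponent) is only searched together with distance weighting here
    param_grid = [
        {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
        {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
    ]
    
    grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=-1)
    grid_search.fit(x_train, y_train)
    
    print(grid_search.best_params_)
    print(grid_search.best_estimator_.score(x_test, y_test))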

    9. Further Thoughts on k-Nearest Neighbors

    Besides classification, k-nearest neighbors can also be used for regression: the value predicted for the green point is simply the average (or the distance-weighted average) of the values of its k nearest neighbors. sklearn ships this as KNeighborsRegressor (a small sketch follows the link); for details see

    http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
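
    A minimal sketch of the regression variant on made-up one-dimensional data (none of this comes from the original post):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    
    # toy data: y is a noisy sine of x
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, size=50)).reshape(-1, 1)
    y = np.sin(x).ravel() + 0.1 * rng.normal(size=50)
    
    # prediction = average of the 3 nearest neighbours; weights="distance" would use a weighted average
    knn_reg = KNeighborsRegressor(n_neighbors=3)
    knn_reg.fit(x, y)
    
    print(knn_reg.predict(np.array([[2.5], [7.0]])))
    print(knn_reg.score(x, y))  # R^2 on the training data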

    [Figure: kNN regression: the green point takes the (weighted) average of its k nearest neighbors' values]

    Drawback 1: the biggest weakness of k-nearest neighbors is efficiency. With m training samples and n features, predicting a single new sample takes O(m*n) time with a brute-force scan. Tree structures such as KD-Tree and Ball-Tree can be used to speed up the neighbor search.
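
    In sklearn the search structure is selected through the algorithm parameter of KNeighborsClassifier; a minimal sketch (the parameter values are sklearn's own, the comparison is only illustrative):

    from sklearn import datasets
    from sklearn.neighbors import KNeighborsClassifier
    
    digits = datasets.load_digits()
    
    # "brute" is the O(m*n) scan; "kd_tree" and "ball_tree" build spatial index structures instead
    for algo in ["brute", "kd_tree", "ball_tree"]:
        clf = KNeighborsClassifier(n_neighbors=3, algorithm=algo)
        clf.fit(digits.data, digits.target)
        print(algo, clf.score(digits.data, digits.target))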

    Drawback 2: it is highly data-dependent; outliers in the training set can have a large effect on predictions.

    Drawback 3: its predictions are not interpretable; there is no explicit model to inspect.

    There is one more problem, arguably the biggest one for k-nearest neighbors: the curse of dimensionality.

    Curse of dimensionality: as the number of dimensions grows, the distance between two points that "look close" keeps growing, which makes nearest-neighbor distances less and less meaningful. The small demonstration below illustrates this.
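
    A quick hand-rolled illustration (not from the original post): the distance between the all-zeros point and the all-ones point in d dimensions grows like sqrt(d).

    import numpy as np
    
    # distance from the origin to the opposite corner of the unit hypercube in d dimensions
    for d in [1, 2, 10, 100, 1000, 10000]:
        dist = np.sqrt(np.sum((np.ones(d) - np.zeros(d)) ** 2))
        print(d, dist)   # 1.0, 1.41..., 3.16..., 10.0, 31.6..., 100.0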


    The remedy is dimensionality reduction, which will be covered in a later post.

    Closing words

    If someone asked me how I made it through those difficult years,

    I think I would have only one answer: a powerful inner force kept me going, and its name was "wanting to die but not daring to."

    This post's recommendation:

    XMind

    Intermittently bursting with ambition, persistently muddling along!
