• Introductory Machine Learning Tutorial: k-means


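    How the k-means algorithm works

    As mentioned earlier, one of the core tasks in machine learning is sorting data into groups, and many different algorithms exist for it. As the saying goes, birds of a feather flock together. People from the same region tend to share habits and customs, which is what we mean by "one kind of people". Each group forms its own center, and the closer someone is to that center, the more typical of the group they are. The k-means algorithm looks for these center points and treats each one as the key point of its group. When new data needs to be assigned to a group, we check which center it is closest to, and it belongs to that group.

    Suppose we have the following data, where each row is the geographic position of one person:

    x coordinate  y coordinate  province (label)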
    4.035615117 4.920529835 0
    4.665299994 4.702897321 0
    1.711128297 1.031989236 1

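    Plotting these coordinates on a chart:

    The two blue points lie close together, so their attributes should be similar. The red point is some distance away from both blue points and probably belongs to a different cluster.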
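
The "close together" claim can be checked numerically. The following is a minimal sketch that computes all pairwise Euclidean distances between the three rows of the table; the helper name `pairwise_distances` is an assumption made for illustration and does not appear in the original article.

```python
import numpy as np

# The three sample rows from the table above: (x, y) coordinates.
points = np.array([
    [4.035615117, 4.920529835],  # label 0
    [4.665299994, 4.702897321],  # label 0
    [1.711128297, 1.031989236],  # label 1
])

def pairwise_distances(pts):
    """Return the matrix of Euclidean distances between all rows of pts.

    Hypothetical helper, written only for this illustration.
    """
    diff = pts[:, np.newaxis, :] - pts[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

dist = pairwise_distances(points)
print(np.round(dist, 3))
```

The two label-0 points are roughly 0.67 apart, while each of them is more than 4.5 away from the label-1 point.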

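    Here we load a data set that contains three classes. Each class is one cluster, so there are three centers. The data and the plot are produced as follows: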

    import numpy as np
    import matplotlib.pyplot as plt
    
    def read_clusters(clustersfile):
        # Each non-empty line holds "x y label", separated by whitespace
        cl = []  # coordinates
        tl = []  # labels
        with open(clustersfile, 'r') as f:
            for line in f:
                line = line.strip()
                if line != '':
                    line = line.split()
                    constraint = [float(line[0]), float(line[1])]
    
                    cl.append(constraint)
                    tl.append(int(line[2]))
        return cl,tl
    
    train_data,train_labels = read_clusters('clusters3.txt')
    train_data = np.array(train_data)
    key_name = {0:'red',1:'blue',2:'orange'}
    
    # Draw every point in the color that belongs to its label
    for i in range(train_data.shape[0]):
        plt.scatter(train_data[i:i + 1, 0:1], train_data[i:i + 1, 1:2], c=key_name[train_labels[i]], marker='o',s=20)
    
    plt.savefig('clusters.png')
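
The file `clusters3.txt` is not reproduced in the article. To experiment with the code, a stand-in file in the same "x y label" format can be generated. This is only a sketch: the file name `clusters_demo.txt`, the cluster centers, the spread and the function name `write_demo_clusters` are all assumptions chosen for illustration, not the author's data.

```python
import os
import tempfile

import numpy as np

def write_demo_clusters(path, points_per_cluster=50, seed=0):
    """Write a whitespace-separated 'x y label' file with three blobs.

    Hypothetical helper; the centers below are made up for the demo.
    """
    rng = np.random.default_rng(seed)
    centers = [(1.5, 1.0), (4.5, 4.8), (8.0, 1.5)]
    with open(path, 'w') as f:
        for label, (cx, cy) in enumerate(centers):
            # Gaussian scatter around each assumed center
            xy = rng.normal(loc=(cx, cy), scale=0.4, size=(points_per_cluster, 2))
            for x, y in xy:
                f.write(f"{x:.9f} {y:.9f} {label}\n")

demo_path = os.path.join(tempfile.gettempdir(), 'clusters_demo.txt')
write_demo_clusters(demo_path)
```

Passing `demo_path` to `read_clusters` in place of `'clusters3.txt'` should then let the rest of the code run on synthetic data.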
    

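    Steps of the k-means algorithm

    The general steps of k-means are:

    1. Randomly generate several centers. The number of centers is determined by how many clusters need to be formed.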

    def _init_random_centroids(self, data):
        # Pick k distinct samples as the starting centroids. Drawing without
        # replacement avoids two centroids starting on the same point.
        n_samples = np.shape(data)[0]
        indices = np.random.choice(n_samples, self.k, replace=False)
        centroids = np.array(data[indices], dtype=float)
        return centroids
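
Because the starting centers are random, two runs can give different results. A minimal sketch of a seeded variant follows; the standalone function `init_centroids` and its `seed` parameter are assumptions for illustration and are not part of the article's class.

```python
import numpy as np

def init_centroids(data, k, seed=None):
    """Pick k distinct rows of data as starting centroids.

    Hypothetical standalone helper. With a fixed seed the same rows are
    chosen on every run, which makes experiments repeatable.
    """
    rng = np.random.default_rng(seed)
    indices = rng.choice(len(data), size=k, replace=False)
    return data[indices].astype(float)

data = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
first = init_centroids(data, k=2, seed=42)
second = init_centroids(data, k=2, seed=42)
print(np.array_equal(first, second))  # True: same seed, same starting points
```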
    

    2. Compare every data point with these centers. A data point belongs to the cluster of the center it is closest to.

    The distance is computed as follows:

    def euclidean_distance(vec_1, vec_2):
        # Both vectors must have the same number of components
        if len(vec_1) != len(vec_2):
            raise Exception("The two vectors do NOT have equal length")
    
        # Sum the squared differences, then take the square root
        distance = 0
        for i in range(len(vec_1)):
            distance += pow((vec_1[i] - vec_2[i]), 2)
    
        return np.sqrt(distance)
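
The loop above computes the square root of the sum of squared differences. The same quantity can be obtained with `np.linalg.norm`; the sketch below shows this under the assumption that both inputs are numeric sequences, and the name `euclidean_distance_np` is introduced here only for illustration.

```python
import numpy as np

def euclidean_distance_np(vec_1, vec_2):
    """Euclidean distance via np.linalg.norm (illustrative alternative)."""
    a = np.asarray(vec_1, dtype=float)
    b = np.asarray(vec_2, dtype=float)
    if a.shape != b.shape:
        raise ValueError("The two vectors do NOT have equal length")
    return np.linalg.norm(a - b)

print(euclidean_distance_np([0, 0], [3, 4]))  # 3-4-5 triangle: prints 5.0
```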
    

    Use the distance to find which center a point belongs to.

    def _closest_centroid(self, sample, centroids):
        # Return the index of the centroid nearest to the sample
        closest_i = None
        closest_distance = float("inf")
        for i, centroid in enumerate(centroids):
            distance = euclidean_distance(sample, centroid)
            if distance < closest_distance:
                closest_i = i
                closest_distance = distance
        return closest_i
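
For larger data sets the per-sample loop can be replaced by one array operation. The following sketch assigns all samples at once with NumPy broadcasting; `assign_to_centroids` is a hypothetical name, and the sample points are made up for the demo.

```python
import numpy as np

def assign_to_centroids(data, centroids):
    """Return, for each row of data, the index of its nearest centroid."""
    # Shape (n_samples, k): distance from every sample to every centroid
    distances = np.linalg.norm(
        data[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    return distances.argmin(axis=1)

data = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_to_centroids(data, centroids)
print(labels)  # [0 0 1 1]
```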
    

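    3. The centers determine the clusters, and the clusters are then used to update the centers. The center of a cluster is the mean of all its points, so the new center is obtained by computing that mean.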

    def _calculate_centroids(self, clusters, data):
        n_features = np.shape(data)[1]
        centroids = np.zeros((self.k, n_features))
        for i, cluster in enumerate(clusters):
            centroid = np.mean(data[cluster], axis=0)
            centroids[i] = centroid
        return centroids
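
One edge case is worth noting: if no sample is assigned to a centroid, the mean of an empty cluster is undefined and `np.mean` returns NaN. A possible way to guard against this, sketched below, is to keep the previous centroid for an empty cluster. This policy and the function `update_centroids` are assumptions for illustration, not the article's implementation.

```python
import numpy as np

def update_centroids(data, labels, old_centroids):
    """Mean of each cluster; an empty cluster keeps its previous centroid."""
    new_centroids = old_centroids.astype(float).copy()
    for i in range(len(old_centroids)):
        members = data[labels == i]
        if len(members) > 0:
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

data = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
old = np.array([[1.0, 1.0], [9.0, 9.0], [50.0, 50.0]])  # third cluster is empty
updated = update_centroids(data, labels, old)
print(updated)
```

Here the first centroid moves to (1, 0), the second to (10, 10), and the third stays at (50, 50) because it has no members.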
    

    4. Repeat this process until the centers no longer change.
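
The four steps can be sketched as one compact function. This is a simplified illustration under stated assumptions, not the article's class: the function name `kmeans`, the `tol` and `seed` parameters, and the tolerance-based stopping test with `np.allclose` are all choices made here.

```python
import numpy as np

def kmeans(data, k, max_iterations=500, tol=1e-9, seed=0):
    """Minimal k-means loop: init, assign, update, stop when centers settle."""
    rng = np.random.default_rng(seed)
    # Step 1: k distinct samples as starting centers
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for iteration in range(1, max_iterations + 1):
        # Step 2: assign every sample to its nearest center
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each center to the mean of its members
        new_centroids = centroids.copy()
        for i in range(k):
            members = data[labels == i]
            if len(members) > 0:
                new_centroids[i] = members.mean(axis=0)
        # Step 4: stop once the centers no longer move
        converged = np.allclose(new_centroids, centroids, atol=tol)
        centroids = new_centroids
        if converged:
            break
    return centroids, labels, iteration

data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
centroids, labels, n_iter = kmeans(data, k=2)
print(centroids, labels, n_iter)
```

Comparing with a small tolerance instead of testing for an exactly zero difference is a design choice; it avoids extra iterations caused by floating-point noise.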

    The complete program is as follows:

    import numpy as np
    import matplotlib.pyplot as plt
    
    def euclidean_distance(vec_1, vec_2):
        # Both vectors must have the same number of components
        if len(vec_1) != len(vec_2):
            raise Exception("The two vectors do NOT have equal length")
    
        # Sum the squared differences, then take the square root
        distance = 0
        for i in range(len(vec_1)):
            distance += pow((vec_1[i] - vec_2[i]), 2)
    
        return np.sqrt(distance)
    	
    def read_clusters(clustersfile):
        cl = []
        tl = []
        with open(clustersfile, 'r') as f:
            for line in f:
                line = line.strip()
                if line != '':
                    line = line.split()
                    constraint = [float(line[0]), float(line[1])]
    
                    cl.append(constraint)
                    tl.append(int(line[2]))
        return cl,tl
    
    
    class KMeans():
        def __init__(self, k=2, max_iterations=500):
            self.k = k
            self.max_iterations = max_iterations
            self.kmeans_centroids = []
    
        def _init_random_centroids(self, data):
            # Pick k distinct samples as the starting centroids. Drawing without
            # replacement avoids two centroids starting on the same point.
            n_samples = np.shape(data)[0]
            indices = np.random.choice(n_samples, self.k, replace=False)
            centroids = np.array(data[indices], dtype=float)
            return centroids
    
        def _closest_centroid(self, sample, centroids):
            closest_i = None
            closest_distance = float("inf")
            for i, centroid in enumerate(centroids):
                distance = euclidean_distance(sample, centroid)
                if distance < closest_distance:
                    closest_i = i
                    closest_distance = distance
            return closest_i
    
        def _create_clusters(self, centroids, data):
            # clusters[i] collects the indices of the samples nearest to centroid i
            clusters = [[] for _ in range(self.k)]
            for sample_i, sample in enumerate(data):
                centroid_i = self._closest_centroid(sample, centroids)
                clusters[centroid_i].append(sample_i)
            return clusters
    
        def _calculate_centroids(self, clusters, data):
            n_features = np.shape(data)[1]
            centroids = np.zeros((self.k, n_features))
            for i, cluster in enumerate(clusters):
                centroid = np.mean(data[cluster], axis=0)
                centroids[i] = centroid
            return centroids
    
        def _get_cluster_labels(self, clusters, data):
            y_pred = np.zeros(np.shape(data)[0])
            for cluster_i, cluster in enumerate(clusters):
                for sample_i in cluster:
                    y_pred[sample_i] = cluster_i
            return y_pred
    
        def fit(self, data):
            centroids = self._init_random_centroids(data)
    
            for iteration in range(self.max_iterations):
    
    
                clusters = self._create_clusters(centroids, data)
    
                prev_centroids = centroids
    
                centroids = self._calculate_centroids(clusters, data)
    
                diff = centroids - prev_centroids
                if not diff.any():
                    break
    
            self.kmeans_centroids = centroids
            return centroids
    
        def predict(self, data):
            # fit() must have stored the centroids before predict() is called
            if len(self.kmeans_centroids) == 0:
                raise Exception("K-Means centroids have not yet been determined. "
                                "Run the K-Means 'fit' function first.")
    
            clusters = self._create_clusters(self.kmeans_centroids, data)
    
            predicted_labels = self._get_cluster_labels(clusters, data)
    
            return predicted_labels
    
    
    
    key_name = {0:'red',1:'blue',2:'orange'}
    
    
    
    
    clf = KMeans(k=3, max_iterations=3000)
    
    train_data,train_labels = read_clusters('clusters3.txt')
    train_data = np.array(train_data)
    centroids = clf.fit(train_data)
    print(centroids)
    

    The centers are updated step by step as follows:

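    Estimating the algorithm's error

    A simple way to check how well the algorithm works is to use one part of the data for training and another part for testing, and then see how far the algorithm's results differ from the expected labels.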
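
Such a split can be sketched with NumPy alone. The function `train_test_split_indices`, the 30% test fraction and the fixed seed below are assumptions for illustration; the article does not say how the data should be divided.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the sample indices and cut them into a train and a test part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

train_idx, test_idx = train_test_split_indices(10, test_fraction=0.3)
print(len(train_idx), len(test_idx))  # 7 3
```

With a NumPy array `data`, `data[train_idx]` would be used for fitting and `data[test_idx]` for checking the predictions.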

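    The following code estimates how well the algorithm did, here by comparing its predictions with the labels of the training data: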

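    # Assumed: the predictions come from the model fitted above.
    # Cluster indices are assigned in arbitrary order, so the comparison below
    # is only meaningful when cluster i happens to coincide with label i.
    predicted_labels = clf.predict(train_data)
    
    Accuracy = 0
    for index in range(len(train_labels)):
        current_label = train_labels[index]
        predicted_label = predicted_labels[index]
    
        if current_label == int(predicted_label):
            Accuracy += 1
    
    # Fraction of samples whose cluster index equals the label
    Accuracy = Accuracy / float(len(train_labels))
    
    print(Accuracy)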
    

    The output is

    1
    

    The accuracy reaches 100%.
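
A direct comparison only works when the cluster numbering happens to match the label numbering, because k-means numbers its clusters arbitrarily. One possible remedy, sketched here, is to map each cluster to the most common true label among its members before counting. The function names and the toy labels are assumptions for illustration, and the labels are assumed to be non-negative integers.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster index to the majority true label of its members."""
    cluster_ids = np.asarray(cluster_ids, dtype=int)
    true_labels = np.asarray(true_labels, dtype=int)
    mapping = {}
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        mapping[int(c)] = int(np.bincount(members).argmax())
    return mapping

def accuracy_after_mapping(cluster_ids, true_labels):
    """Accuracy once cluster indices are translated into label values."""
    mapping = map_clusters_to_labels(cluster_ids, true_labels)
    mapped = np.array([mapping[int(c)] for c in cluster_ids])
    return float(np.mean(mapped == np.asarray(true_labels)))

true_labels = [0, 0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 2, 0, 0, 1, 1]   # same grouping, different numbering
print(accuracy_after_mapping(cluster_ids, true_labels))  # 1.0
```

In this toy example a direct comparison of `cluster_ids` with `true_labels` would report 0% even though the grouping is perfect.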

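    k-means in sklearn

    When learning an algorithm, it helps to understand the principle and then implement it in your own code, which is usually the easiest way to learn. Once you know how the algorithm is written, you can use an existing open-source library instead, and sklearn makes this very convenient.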

    from sklearn import cluster
    
    clf = cluster.KMeans(n_clusters=3, max_iter=3000, n_init=10)
    kmeans = clf.fit(train_data)
    
    # Predict the cluster of every sample in one call
    predicted_labels = kmeans.predict(train_data)
    
    # As before, cluster indices are arbitrary; the comparison assumes that
    # cluster i coincides with label i.
    Accuracy = 0
    for index in range(len(train_labels)):
        current_label = train_labels[index]
        predicted_label = predicted_labels[index]
        if current_label == predicted_label:
            Accuracy += 1
    
    Accuracy = Accuracy / float(len(train_labels))
    

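    Applications of the algorithm

    The k-means algorithm finds center points. It can also be used for deduplication: points that lie close together, and are therefore near-duplicates, are all approximated by their center point.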
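
The deduplication idea can be sketched as follows: every point is replaced by its nearest center, and the distinct centers that remain form the deduplicated set. The centers are assumed to be known already (for example from a previous k-means run); `snap_to_centroids` and the sample values are hypothetical.

```python
import numpy as np

def snap_to_centroids(data, centroids):
    """Replace every point by its nearest centroid."""
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    nearest = distances.argmin(axis=1)
    return centroids[nearest], nearest

data = np.array([[1.00, 1.00], [1.01, 0.99], [0.99, 1.02],
                 [5.00, 5.00], [5.02, 4.98]])
centroids = np.array([[1.0, 1.0], [5.0, 5.0]])  # assumed to come from k-means
snapped, nearest = snap_to_centroids(data, centroids)
unique_points = np.unique(snapped, axis=0)
print(unique_points)
```

Five near-duplicate points collapse into two representative points.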

    Please credit the source when reposting: http://www.bugingcode.com/

    More tutorials: 阿猫学编程

  • Original article: https://www.cnblogs.com/bugingcode/p/8560242.html