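k-means: an unsupervised clustering algorithm
k is the number of clusters, i.e. the data is grouped into k clusters; "means" refers to the mean, which is the rule used to update the cluster centers at each iteration.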
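The idea behind the k-means algorithm:
(1) Randomly choose k cluster centers (usually picked from the sample set, but they can also be generated at random);
(2) Compute the distance from each sample to the k cluster centers, and assign the sample to the cluster whose center is nearest;
(3) Update the centers: for each of the k clusters, take the mean of the samples assigned to it as the new center;
(4) Repeat (2) and (3) until the cluster centers stop changing, or the change in center position falls below a threshold, or a maximum number of iterations is reached.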
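Steps (2) and (3) can also be written in symbols. The notation here is introduced only for illustration: x_i is the i-th sample, \mu_j is the j-th center, c_i is the cluster index assigned to sample i, and C_j is the set of samples currently assigned to cluster j.

```latex
c_i = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert_2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{i \in C_j} x_i,
\quad C_j = \{\, i : c_i = j \,\}
```

Neither step can increase the total squared distance from samples to their assigned centers, which is why iterating them eventually stops changing the assignments.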
Python implementation:
A simple k-means example:
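import numpy as np

# 30 random 2-D points; cast to float so the mean update is not truncated to integers
data = np.random.randint(1, 10, (30, 2)).astype(float)

# number of clusters
k = 4

# initial centers: shuffle the samples and take the first k
# (copy them, otherwise updating the centers would overwrite rows of data)
np.random.shuffle(data)
cent = data[0:k, :].copy()

# distance[i, j] = Euclidean distance from sample i to center j
distance = np.zeros((data.shape[0], k))
last_near = np.zeros(data.shape[0])
n = 0
while True:
    n = n + 1
    print(n)  # iteration counter
    for i in range(data.shape[0]):
        for j in range(cent.shape[0]):
            dist = np.sqrt(np.sum((data[i] - cent[j]) ** 2))
            distance[i, j] = dist
    nearst = np.argmin(distance, axis=1)
    # stop when no sample changes cluster
    # (an iteration cap such as n >= 1000 could also be checked here)
    if (last_near == nearst).all():
        break
    # update centers, skipping any cluster that received no samples
    for ele_cen in range(k):
        members = data[nearst == ele_cen]
        if len(members) > 0:
            cent[ele_cen] = np.mean(members, axis=0)
    last_near = nearst
print(cent)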
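The double loop over samples and centers in the example above can be replaced by NumPy broadcasting. The following is a minimal sketch of one assign-and-update pass, not a drop-in replacement; the function name `assign_and_update` is made up for this illustration, and keeping the old center for an empty cluster is an assumption about how that case should be handled.

```python
import numpy as np

def assign_and_update(data, centers):
    """One k-means pass: assign samples to the nearest center, then recompute centers."""
    # Pairwise Euclidean distances via broadcasting, shape (n_samples, k)
    diffs = data[:, None, :] - centers[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    labels = distances.argmin(axis=1)

    new_centers = centers.astype(float).copy()
    for j in range(centers.shape[0]):
        members = data[labels == j]
        if len(members) > 0:  # assumption: an empty cluster keeps its old center
            new_centers[j] = members.mean(axis=0)
    return labels, new_centers

# Small hand-checkable input: two obvious groups
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, new_centers = assign_and_update(data, centers)
print(labels)       # the first two samples go to cluster 0, the last two to cluster 1
print(new_centers)  # the centers move to (0, 1) and (10, 11)
```

Calling this function in a loop until `labels` stops changing gives the same procedure as the explicit loops.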
The next example adapts k-means to the distance metric needed when choosing anchor boxes for YOLOv3:
import numpy as np


def iou(box, clusters):
    """
    Calculates the Intersection over Union (IoU) between a box and k clusters.
    :param box: tuple or array, shifted to the origin (i. e. width and height)
    :param clusters: numpy array of shape (k, 2) where k is the number of clusters
    :return: numpy array of shape (k,) where k is the number of clusters
    """
    x = np.minimum(clusters[:, 0], box[0])
    y = np.minimum(clusters[:, 1], box[1])
    if np.count_nonzero(x == 0) > 0 or np.count_nonzero(y == 0) > 0:
        raise ValueError("Box has no area")

    intersection = x * y
    box_area = box[0] * box[1]
    cluster_area = clusters[:, 0] * clusters[:, 1]

    iou_ = intersection / (box_area + cluster_area - intersection)
    return iou_


def kmeans(boxes, k, dist=np.median):
    """
    Calculates k-means clustering with the Intersection over Union (IoU) metric.
    :param boxes: numpy array of shape (r, 2), where r is the number of rows
    :param k: number of clusters
    :param dist: function used to update each cluster center from its members
                 (np.median by default; pass np.mean for the classic mean update)
    :return: numpy array of shape (k, 2)
    """
    rows = boxes.shape[0]

    # distances[i, j] holds the distance from sample i to cluster center j
    distances = np.empty((rows, k))
    # cluster index each sample was assigned to in the previous iteration
    last_clusters = np.zeros((rows,))

    np.random.seed()

    # the Forgy method will fail if the whole array contains the same rows
    # pick k distinct samples at random as the initial cluster centers
    clusters = boxes[np.random.choice(rows, k, replace=False)]

    while True:
        for row in range(rows):
            # distance is 1 - IoU, the metric suited to choosing YOLOv3 anchor boxes
            distances[row] = 1 - iou(boxes[row], clusters)

        # index of the nearest cluster for each sample
        nearest_clusters = np.argmin(distances, axis=1)

        # stop when no sample changes cluster
        if (last_clusters == nearest_clusters).all():
            break

        # update each center with dist() over its members (median by default)
        for cluster in range(k):
            clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)

        last_clusters = nearest_clusters

    return clusters
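One way to judge a set of anchors produced this way is the average of the best IoU each box reaches against any anchor. The sketch below is only an illustration under stated assumptions: the helper names `iou_wh` and `avg_iou` and the sample boxes are made up here, and the IoU computation is repeated in compact form so the block runs on its own.

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and k (w, h) clusters, all anchored at the origin."""
    inter = np.minimum(clusters[:, 0], box[0]) * np.minimum(clusters[:, 1], box[1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def avg_iou(boxes, clusters):
    """Mean over all boxes of the best IoU achieved by any cluster."""
    return np.mean([np.max(iou_wh(b, clusters)) for b in boxes])

# Hypothetical (width, height) boxes and two candidate anchors
boxes = np.array([[10.0, 10.0], [10.0, 20.0], [40.0, 40.0]])
anchors = np.array([[10.0, 10.0], [40.0, 40.0]])

# 1 - IoU distances from the 10x20 box to each anchor: 0.5 and 0.875
print(1 - iou_wh(boxes[1], anchors))
# Best IoUs are 1.0, 0.5 and 1.0, so the average is about 0.833
print(avg_iou(boxes, anchors))
```

Under this metric the 10x20 box is closer to the 10x10 anchor than to the 40x40 one. A higher average IoU means the anchors fit the boxes better, so the value can be used to compare different choices of k.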
I hope this gives some ideas to anyone who is puzzling over the same problem!