• K-Means 聚类算法


    K-Means 概念定义:

    K-Means 是一种基于距离的排他的聚类划分方法。

    上面的 K-Means 描述中包含了几个概念:

    • 聚类(Clustering):K-Means 是一种聚类分析(Cluster Analysis)方法。聚类就是将数据对象分组成为多个类或者簇 (Cluster),使得在同一个簇中的对象之间具有较高的相似度,而不同簇中的对象差别较大。
    • 划分(Partitioning):聚类可以基于划分,也可以基于分层。划分即将对象划分成不同的簇,而分层是将对象分等级。
    • 排他(Exclusive):对于一个数据对象,只能被划分到一个簇中。如果一个数据对象可以被划分到多个簇中,则称为可重叠的(Overlapping)。
    • 距离(Distance):基于距离的聚类是将距离近的相似的对象聚在一起。基于概率分布模型的聚类是在一组对象中,找到能符合特定分布模型的对象的集合,他们不一定是距离最近的或者最相似的,而是能完美的呈现出概率分布模型所描述的模型。

    K-Means 问题描述:

    给定一个 n 个对象的数据集,它可以构建数据的 k 个划分,每个划分就是一个簇,并且 k ≤ n。同时还需满足:

    1. 每个组至少包含一个对象。
    2. 每个对象必须属于且仅属于一个簇。

    Simply speaking, K-Means clustering is an algorithm to classify or to group your objects based on attributes/features, into K number of groups. K is a positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus, the purpose of K-means clustering is to classify the data.

    例如,有如下包含 10 条数据的集合。集合中每项描述了一个人的身高(Height: inches)和体重(Weight: kilograms)。

    Height Weight
    -------------
    (73.0, 72.6) 
    (61.0, 54.4) 
    (67.0, 99.9) 
    (68.0, 97.3) 
    (62.0, 59.0) 
    (75.0, 81.6) 
    (74.0, 77.1) 
    (66.0, 97.3) 
    (68.0, 93.3) 
    (61.0, 59.0)

    通过按照身高和体重的聚类,可以将上述 10 条数据分组成 3 类。

    Height Weight
    -------------
    (67.0, 99.9) 
    (68.0, 97.3) 
    (66.0, 97.3) 
    (68.0, 93.3)
    
    (73.0, 72.6) 
    (75.0, 81.6) 
    (74.0, 77.1)
    
    (61.0, 54.4) 
    (62.0, 59.0) 
    (61.0, 59.0)

    分类结果可以描述为:中等身高并且很重、很高并且中等体重、矮并且轻。如果用图形来观察分组状况则结果一目了然。

    K-Means 算法实现:

    由于 K-Means 算法值针对给定的完整数据集进行操作,不需要任何特殊的训练数据,所以 K-Means 是一种无监督的机器学习方法(Unsupervised Machine Learning Technique)。

    K-Means 算法最常见的实现方式是使用迭代式精化启发法的 Lloyd's algorithm

    • 给定划分数量 k。创建一个初始划分,从数据集中随机地选择 k 个对象,每个对象初始地代表了一个簇中心(Cluster Centroid)。对于其他对象,计算其与各个簇中心的距离,将它们划入距离最近的簇。
    • 采用迭代的重定位技术,尝试通过对象在划分间移动来改进划分。所谓重定位技术,就是当有新的对象加入簇或者已有对象离开簇的时候,重新计算簇的平均值,然后对对象进行重新分配。这个过程不断重复,直到各簇中对象不再变化为止。
    randomly assign all data items to a cluster 
    loop until no change in cluster assignments 
      compute centroids for each cluster 
      reassign each data item to cluster of closest centroid 
    end

    简洁点儿的表述即为:

    initialize clustering 
    loop 
      update centroids 
      update clustering 
    end loop

    应用 K-Means 算法到上述身高与体重的示例,聚类过程如下图所示。

    K-Means 优缺点:

    当结果簇是密集的,而且簇和簇之间的区别比较明显时,K-Means 的效果较好。对于大数据集,K-Means 是相对可伸缩的和高效的,它的复杂度是 O(nkt),n 是对象的个数,k 是簇的数目,t 是迭代的次数,通常 k << n,且 t << n,所以算法经常以局部最优结束。

    K-Means 的最大问题是要求先给出 k 的个数。k 的选择一般基于经验值和多次实验结果,对于不同的数据集,k 的取值没有可借鉴性。另外,K-Means 对孤立点数据是敏感的,少量噪声数据就能对平均值造成极大的影响。

    Basic K-Means - Lloyd's algorithm C# 代码实现:

    Code below referenced from Machine Learning Using C# Succinctly by James McCaffrey, and article K-Means Data Clustering Using C#.

      1 using System;
      2 
      3 namespace ClusterNumeric
      4 {
      5   class ClusterNumProgram
      6   {
      7     static void Main(string[] args)
      8     {
      9       Console.WriteLine("
    Begin k-means clustering demo
    ");
     10 
     11       double[][] rawData = new double[10][];
     12       rawData[0] = new double[] { 73, 72.6 };
     13       rawData[1] = new double[] { 61, 54.4 };
     14       rawData[2] = new double[] { 67, 99.9 };
     15       rawData[3] = new double[] { 68, 97.3 };
     16       rawData[4] = new double[] { 62, 59.0 };
     17       rawData[5] = new double[] { 75, 81.6 };
     18       rawData[6] = new double[] { 74, 77.1 };
     19       rawData[7] = new double[] { 66, 97.3 };
     20       rawData[8] = new double[] { 68, 93.3 };
     21       rawData[9] = new double[] { 61, 59.0 };
     22 
     23       Console.WriteLine("Raw unclustered height (in.) weight (kg.) data:
    ");
     24       Console.WriteLine(" ID Height Weight");
     25       Console.WriteLine("---------------------");
     26       ShowData(rawData, 1, true, true);
     27 
     28       int numClusters = 3;
     29       Console.WriteLine("
    Setting numClusters to " + numClusters);
     30 
     31       Console.WriteLine("Starting clustering using k-means algorithm");
     32       Clusterer c = new Clusterer(numClusters);
     33       int[] clustering = c.Cluster(rawData);
     34       Console.WriteLine("Clustering complete
    ");
     35 
     36       Console.WriteLine("Final clustering in internal form:
    ");
     37       ShowVector(clustering, true);
     38 
     39       Console.WriteLine("Raw data by cluster:
    ");
     40       Console.WriteLine(" ID Height Weight");
     41       ShowClustered(rawData, clustering, numClusters, 1);
     42 
     43       Console.WriteLine("
    End k-means clustering demo
    ");
     44       Console.ReadLine();
     45     }
     46 
     47     static void ShowData(
     48       double[][] data, int decimals,
     49       bool indices, bool newLine)
     50     {
     51       for (int i = 0; i < data.Length; ++i)
     52       {
     53         if (indices == true)
     54           Console.Write(i.ToString().PadLeft(3) + " ");
     55 
     56         for (int j = 0; j < data[i].Length; ++j)
     57         {
     58           double v = data[i][j];
     59           Console.Write(v.ToString("F" + decimals) + "   ");
     60         }
     61 
     62         Console.WriteLine("");
     63       }
     64 
     65       if (newLine == true)
     66         Console.WriteLine("");
     67     }
     68 
     69     static void ShowVector(int[] vector, bool newLine)
     70     {
     71       for (int i = 0; i < vector.Length; ++i)
     72         Console.Write(vector[i] + " ");
     73 
     74       if (newLine == true)
     75         Console.WriteLine("
    ");
     76     }
     77 
     78     static void ShowClustered(
     79       double[][] data, int[] clustering,
     80       int numClusters, int decimals)
     81     {
     82       for (int k = 0; k < numClusters; ++k)
     83       {
     84         Console.WriteLine("===================");
     85         for (int i = 0; i < data.Length; ++i)
     86         {
     87           int clusterID = clustering[i];
     88           if (clusterID != k) continue;
     89           Console.Write(i.ToString().PadLeft(3) + " ");
     90           for (int j = 0; j < data[i].Length; ++j)
     91           {
     92             double v = data[i][j];
     93             Console.Write(v.ToString("F" + decimals) + "   ");
     94           }
     95           Console.WriteLine("");
     96         }
     97         Console.WriteLine("===================");
     98       }
     99     }
    100   }
    101 
    102   public class Clusterer
    103   {
    104     private int numClusters; // number of clusters 
    105     private int[] clustering; // index = a tuple, value = cluster ID 
    106     private double[][] centroids; // mean (vector) of each cluster 
    107     private Random rnd; // for initialization 
    108 
    109     public Clusterer(int numClusters)
    110     {
    111       this.numClusters = numClusters;
    112       this.centroids = new double[numClusters][];
    113       this.rnd = new Random(0); // arbitrary seed 
    114     }
    115 
    116     public int[] Cluster(double[][] data)
    117     {
    118       int numTuples = data.Length;
    119       int numValues = data[0].Length;
    120       this.clustering = new int[numTuples];
    121 
    122       for (int k = 0; k < numClusters; ++k) // allocate each centroid 
    123         this.centroids[k] = new double[numValues];
    124 
    125       InitRandom(data);
    126 
    127       Console.WriteLine("
    Initial random clustering:");
    128       for (int i = 0; i < clustering.Length; ++i)
    129         Console.Write(clustering[i] + " ");
    130       Console.WriteLine("
    ");
    131 
    132       bool changed = true; // change in clustering? 
    133       int maxCount = numTuples * 10; // sanity check 
    134       int ct = 0;
    135       while (changed == true && ct <= maxCount)
    136       {
    137         ++ct; // k-means typically converges very quickly 
    138         UpdateCentroids(data); // no effect if fail 
    139         changed = UpdateClustering(data); // no effect if fail 
    140       }
    141 
    142       int[] result = new int[numTuples];
    143       Array.Copy(this.clustering, result, clustering.Length);
    144       return result;
    145     }
    146 
    147     private void InitRandom(double[][] data)
    148     {
    149       int numTuples = data.Length;
    150 
    151       int clusterID = 0;
    152       for (int i = 0; i < numTuples; ++i)
    153       {
    154         clustering[i] = clusterID++;
    155         if (clusterID == numClusters)
    156           clusterID = 0;
    157       }
    158       for (int i = 0; i < numTuples; ++i)
    159       {
    160         int r = rnd.Next(i, clustering.Length);
    161         int tmp = clustering[r];
    162         clustering[r] = clustering[i];
    163         clustering[i] = tmp;
    164       }
    165     }
    166 
    167     private void UpdateCentroids(double[][] data)
    168     {
    169       int[] clusterCounts = new int[numClusters];
    170       for (int i = 0; i < data.Length; ++i)
    171       {
    172         int clusterID = clustering[i];
    173         ++clusterCounts[clusterID];
    174       }
    175 
    176       // zero-out this.centroids so it can be used as scratch 
    177       for (int k = 0; k < centroids.Length; ++k)
    178         for (int j = 0; j < centroids[k].Length; ++j)
    179           centroids[k][j] = 0.0;
    180 
    181       for (int i = 0; i < data.Length; ++i)
    182       {
    183         int clusterID = clustering[i];
    184         for (int j = 0; j < data[i].Length; ++j)
    185           centroids[clusterID][j] += data[i][j]; // accumulate sum 
    186       }
    187 
    188       for (int k = 0; k < centroids.Length; ++k)
    189         for (int j = 0; j < centroids[k].Length; ++j)
    190           centroids[k][j] /= clusterCounts[k]; // danger? 
    191     }
    192 
    193     private bool UpdateClustering(double[][] data)
    194     {
    195       // (re)assign each tuple to a cluster (closest centroid) 
    196       // returns false if no tuple assignments change OR 
    197       // if the reassignment would result in a clustering where 
    198       // one or more clusters have no tuples. 
    199 
    200       bool changed = false; // did any tuple change cluster? 
    201 
    202       int[] newClustering = new int[clustering.Length]; // proposed result 
    203       Array.Copy(clustering, newClustering, clustering.Length);
    204 
    205       double[] distances = new double[numClusters]; // from tuple to centroids
    206 
    207       for (int i = 0; i < data.Length; ++i) // walk through each tuple 
    208       {
    209         for (int k = 0; k < numClusters; ++k)
    210           distances[k] = Distance(data[i], centroids[k]);
    211 
    212         int newClusterID = MinIndex(distances); // find closest centroid 
    213         if (newClusterID != newClustering[i])
    214         {
    215           changed = true; // note a new clustering 
    216           newClustering[i] = newClusterID; // accept update 
    217         }
    218       }
    219 
    220       if (changed == false)
    221         return false; // no change so bail 
    222 
    223       // check proposed clustering cluster counts 
    224       int[] clusterCounts = new int[numClusters];
    225       for (int i = 0; i < data.Length; ++i)
    226       {
    227         int clusterID = newClustering[i];
    228         ++clusterCounts[clusterID];
    229       }
    230 
    231       for (int k = 0; k < numClusters; ++k)
    232         if (clusterCounts[k] == 0)
    233           return false; // bad clustering 
    234 
    235       Array.Copy(newClustering, clustering, newClustering.Length); // update 
    236       return true; // good clustering and at least one change 
    237     }
    238 
    239     // Euclidean distance between two vectors for UpdateClustering() 
    240     private static double Distance(double[] tuple, double[] centroid)
    241     {
    242       double sumSquaredDiffs = 0.0;
    243       for (int j = 0; j < tuple.Length; ++j)
    244         sumSquaredDiffs += (tuple[j] - centroid[j]) * (tuple[j] - centroid[j]);
    245       return Math.Sqrt(sumSquaredDiffs);
    246     }
    247 
    248     // helper for UpdateClustering() to find closest centroid 
    249     private static int MinIndex(double[] distances)
    250     {
    251       int indexOfMin = 0;
    252       double smallDist = distances[0];
    253       for (int k = 1; k < distances.Length; ++k)
    254       {
    255         if (distances[k] < smallDist)
    256         {
    257           smallDist = distances[k];
    258           indexOfMin = k;
    259         }
    260       }
    261       return indexOfMin;
    262     }
    263   }
    264 }

    运行结果如下:

    参考资料

    本篇文章《K-Means 聚类算法》由 Dennis Gao 发表自博客园个人博客,未经作者本人同意禁止以任何的形式转载,任何自动的或人为的爬虫转载行为均为耍流氓。

  • 相关阅读:
    python网络爬虫之使用scrapy自动爬取多个网页
    流畅的python学习笔记:第五章
    python cookbook第三版学习笔记 一
    python网络爬虫之scrapy 调试以及爬取网页
    python网络爬虫之scrapy 工程创建以及原理介绍
    流畅的python学习笔记:第三章
    python网络爬虫之LXML与HTMLParser
    python 解析docx文档的方法,以及利用Python从docx文档提取插入的文本对象和图片
    【原创】大数据基础之Drill(2)Drill1.14+Hive2.1.1运行
    【原创】大叔问题定位分享(12)Spark保存文本类型文件(text、csv、json等)到hdfs时为什么是压缩格式的
  • 原文地址:https://www.cnblogs.com/gaochundong/p/kmeans_clustering.html
Copyright © 2020-2023  润新知