KMeans|| in Spark MLLib

KMeans|| in Spark MLLib
算法跟传统的kmeans的区别主要在于：kmeans||的k个中心的不是随机初始化的。而是选择了k个彼此“足够”分离的中心。
```
org.apache.spark.mllib.clustering.KMeans
private[org.apache.spark.mllib.clustering] 
def initKMeansParallel(data: RDD[VectorWithNorm]): Array[VectorWithNorm]

Initialize a set of cluster centers using the k-means|| algorithm by 
Bahmani et al. (Bahmani et al., Scalable K-Means++, VLDB 2012). 
```
This is a variant of k-means++ that tries to find dissimilar cluster centers
by starting with a random center and then doing passes where more centers
are chosen with probability proportional to their squared distance to the
current cluster set. It results in a provable approximation to an optimal
clustering. The original paper can be found at
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf.

初始中心的选择

通过几次循环来实现：
- 随机选择一个点D_j作为初始化中心，centers={D_j}; 每个点的代价向量costs={cost_1,...}, cost_i表示第i个点的代价（距离当前最近center的距离）,初始cost_i=正无穷；
- 计算每个点到当前中心的代价：
  cost_i := min(cost_i, cost_of(Di, newCenters))
  def: cost_of 某个点到当前最近中心的距离。
  -- sum_cost = sum_i{c_i}
  -- 更新costs={cost_1,...}
- 选择候选的中心点，对某个点Di，及其cost_i,该点被选中的概率是：
  P_i=2 * cost_i * k / sum_cost
  选择之后，形成新的newCenters.
循环执行上述2,3步骤（参数配置循环次数，默认2次）。得到一组候选点。在此基础上执行本地（非分布式）Kmeans算法，最终得到k个点作为初始化的中心点。

然后再次基础上运行传统的KMeams算法.

P_i=2 * cost_i * k / sum_cost的解释：

每个点被选中的概率正比于它跟当前最近的中心点的距离，距离越远被选中的概率越大，也就是倾向于选中更离散的点。
每次循环后选中的点的数量期望是2 * k，假设循环10次，那么期望选中20k个候选点，然后在此基础上运行local的kmeans算法选择其中k个点作为后续分布式kmeans的初始中心点集合。
相关阅读:
Scanner类
 16 String类
 15_面向对象_01
14_面向对象_01_2 完整格式
 14_面向对象_01_1 成员变量成员方法
 13_面向对象辅助看懂实例
 面向对象的基本概念
 Servlet_03 进阶随笔两种域的运用调用方式
 Servlet_03 资源跳转关键字
 Servlet_02 资源跳转
原文地址：https://www.cnblogs.com/luweiseu/p/7766652.html

KMeans|| in Spark MLLib

初始中心的选择

P_i=2 * cost_i * k / sum_cost的解释：