• Hadoop and Spark processing techniques (6): clustering algorithms (2), K-means


    The K-means algorithm attempts to partition a set of samples into K distinct clusters, where K is an input parameter of the model.

    K-means

    K-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method called kmeans||. The implementation in spark.mllib has the following parameters:

    • k is the number of desired clusters. Note that it is possible for fewer than k clusters to be returned, for example, if there are fewer than k distinct points to cluster.
    • maxIterations is the maximum number of iterations to run.
    • initializationMode specifies either random initialization or initialization via k-means||.
    • runs has no effect since Spark 2.0.0.
    • initializationSteps determines the number of steps in the k-means|| algorithm.
    • epsilon determines the distance threshold within which we consider k-means to have converged.
    • initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
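    To make the roles of k, maxIterations, epsilon and initialModel concrete, here is a minimal single-machine sketch of the basic k-means iteration in plain Python. It is not the spark.mllib implementation: the function name `kmeans`, the `seed` argument, and the decision to leave the center of an empty cluster where it is are assumptions made for illustration only.

```python
import math
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=20, epsilon=1e-4, initial_centers=None, seed=0):
    """Illustrative k-means loop; returns (centers, iterations_run)."""
    points = [tuple(p) for p in points]
    if initial_centers is not None:
        # Plays the role of initialModel: start from the supplied centers.
        centers = [tuple(c) for c in initial_centers]
    else:
        # Random initialization over distinct points, so fewer than k
        # distinct points yields fewer than k centers.
        distinct = sorted(set(points))
        centers = random.Random(seed).sample(distinct, min(k, len(distinct)))
    if not centers:
        return [], 0

    iterations_run = 0
    for _ in range(max_iterations):
        iterations_run += 1

        # Assignment step: put every point in the group of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)

        # Update step: move each center to the mean of its group.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                dim = len(center)
                new_centers.append(
                    tuple(sum(p[d] for p in group) / len(group) for d in range(dim)))
            else:
                new_centers.append(center)  # assumption: empty cluster keeps its center

        # Converged once no center moved farther than epsilon.
        moved = max(math.sqrt(squared_distance(a, b))
                    for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= epsilon:
            break

    return centers, iterations_run
```

    Called with initial_centers=[(0, 0), (10, 10)] on two well-separated groups of points, the loop stops as soon as a pass leaves every center within epsilon of its previous position; lowering max_iterations simply cuts the loop short before that happens.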

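    (1) k: the number of clusters. Fewer than k clusters may be returned, for example when there are fewer than k distinct points.

    (2) maxIterations: the maximum number of iterations.

    (3) initializationMode: either random initialization or initialization via k-means||.

    (4) runs: has no effect since Spark 2.0.0.

    (5) initializationSteps: the number of steps in the k-means|| algorithm (a count of rounds, not a step size).

    (6) epsilon: the distance threshold within which k-means is considered to have converged.

    (7) initialModel: an optional set of cluster centers used for initialization. If it is supplied, only one run is performed.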
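    The initializationMode and initializationSteps settings refer to k-means||, which gathers candidate centers over a fixed number of rounds and then reduces them to k. The following is a simplified single-machine sketch of that idea in plain Python. It is not a transcription of Spark's code: the function name `kmeans_parallel_init`, the oversampling factor of 2 * k, and the weighted k-means++ style reduction at the end are assumptions chosen for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost_to_nearest(p, centers):
    """Squared distance from p to its closest center."""
    return min(squared_distance(p, c) for c in centers)


def kmeans_parallel_init(points, k, initialization_steps=2, seed=0):
    """Pick up to k initial centers by oversampling, then reducing."""
    rng = random.Random(seed)
    points = [tuple(p) for p in points]
    candidates = [rng.choice(points)]  # start from one uniformly chosen point
    oversampling = 2 * k               # assumed oversampling factor

    for _ in range(initialization_steps):
        costs = [cost_to_nearest(p, candidates) for p in points]
        total = sum(costs)
        if total == 0:  # every point already coincides with a candidate
            break
        # Each point is kept independently with probability proportional to
        # its cost, so a whole round needs only one pass over the data.
        picked = [p for p, c in zip(points, costs)
                  if rng.random() < oversampling * c / total]
        candidates.extend(picked)

    candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    if len(candidates) <= k:
        return candidates

    # Weight each candidate by the number of points it is nearest to.
    weights = [0] * len(candidates)
    for p in points:
        nearest = min(range(len(candidates)),
                      key=lambda i: squared_distance(p, candidates[i]))
        weights[nearest] += 1

    # Reduce the candidates to k centers with weighted k-means++ style seeding.
    centers = [rng.choices(candidates, weights=weights)[0]]
    while len(centers) < k:
        scores = [w * cost_to_nearest(c, centers)
                  for c, w in zip(candidates, weights)]
        if sum(scores) == 0:
            break
        centers.append(rng.choices(candidates, weights=scores)[0])
    return centers
```

    In this sketch, raising initialization_steps only adds more sampling rounds and therefore more candidates to choose from; the final number of centers is still capped at k.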

  • Original post: https://www.cnblogs.com/gaohuajie/p/10234303.html