• kmeans: notes from a first pass


    I have known about the kmeans algorithm for quite a while, but never really understood how it works. Here are a few good links.

    http://coolshell.cn/articles/7779.html

    http://blog.csdn.net/zouxy09/article/details/9982495

    http://www.360doc.com/content/13/1122/14/10724725_331295214.shtml

    A minimal program using the MATLAB function:

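    yangben = load('F:iris.txt');   % yangben: the samples, one row per observation
    s = size(yangben);
    hang = s(1);                    % hang: number of rows (observations)
    lie = s(2);                     % lie: number of columns
    x = yangben(:,1:4);             % cluster on the first four columns
    opts = statset('Display','final');
    k = 3;
    [idx,ctrs] = kmeans(x,k,'Distance','cityblock','Replicates',5,'Options',opts);
    % Plot the first two columns by cluster; centroids are drawn as black crosses
    plot(x(idx==1,1),x(idx==1,2),'r.',...
         x(idx==2,1),x(idx==2,2),'b.',...
         x(idx==3,1),x(idx==3,2),'g.',...
         ctrs(:,1),ctrs(:,2),'kx');
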

    Overall the clustering looks reasonable, probably because iris is a well-established benchmark data set.
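
    One way to put a number on "reasonable" is to cross-tabulate the cluster indices against the known species labels. The following is only a sketch under an assumption: it uses the fisheriris sample data that ships with the Statistics Toolbox (variables meas and species) rather than the iris.txt file above, whose layout is not shown here.

```matlab
% Assumption: fisheriris is available and provides
% meas (150-by-4 features) and species (150-by-1 cell array of names).
load fisheriris
rng(1);                                   % fix the seed so runs are repeatable
idx = kmeans(meas, 3, 'Distance','cityblock', 'Replicates',5);

% Count how many observations of each species land in each cluster.
% labels reports which row/column corresponds to which group.
[tbl, ~, ~, labels] = crosstab(idx, species);
disp(tbl);
disp(labels);
```

    A table with one dominant entry per row means each cluster corresponds mostly to one species. Cluster numbers are arbitrary, so the dominant entries need not lie on the diagonal.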

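    Below is the help text for kmeans, pasted here to study later. For now it is enough to know how to call the function. With this out of the way, I can finally take a proper look at feature extraction with sparse coding.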

    help kmeans
    kmeans K-means clustering.
    IDX = kmeans(X, K) partitions the points in the N-by-P data matrix X
    into K clusters. This partition minimizes the sum, over all clusters, of
    the within-cluster sums of point-to-cluster-centroid distances. Rows of X
    correspond to points, columns correspond to variables. Note: when X is a
    vector, kmeans treats it as an N-by-1 data matrix, regardless of its
    orientation. kmeans returns an N-by-1 vector IDX containing the cluster
    indices of each point. By default, kmeans uses squared Euclidean
    distances.

    kmeans treats NaNs as missing data, and ignores any rows of X that
    contain NaNs.

    [IDX, C] = kmeans(X, K) returns the K cluster centroid locations in
    the K-by-P matrix C.

    [IDX, C, SUMD] = kmeans(X, K) returns the within-cluster sums of
    point-to-centroid distances in the 1-by-K vector sumD.

    [IDX, C, SUMD, D] = kmeans(X, K) returns distances from each point
    to every centroid in the N-by-K matrix D.

    [ ... ] = kmeans(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies
    optional parameter name/value pairs to control the iterative algorithm
    used by kmeans. Parameters are:

    'Distance' - Distance measure, in P-dimensional space, that kmeans
       should minimize with respect to. Choices are:
           'sqEuclidean'  - Squared Euclidean distance (the default)
           'cityblock'    - Sum of absolute differences, a.k.a. L1 distance
           'cosine'       - One minus the cosine of the included angle
                            between points (treated as vectors)
           'correlation'  - One minus the sample correlation between points
                            (treated as sequences of values)
           'Hamming'      - Percentage of bits that differ (only suitable
                            for binary data)

    'Start' - Method used to choose initial cluster centroid positions,
       sometimes known as "seeds". Choices are:
           'sample'   - Select K observations from X at random (the default)
           'uniform'  - Select K points uniformly at random from the range
                        of X. Not valid for Hamming distance.
           'cluster'  - Perform preliminary clustering phase on random 10%
                        subsample of X. This preliminary phase is itself
                        initialized using 'sample'.
           matrix     - A K-by-P matrix of starting locations. In this case,
                        you can pass in [] for K, and kmeans infers K from
                        the first dimension of the matrix. You can also
                        supply a 3D array, implying a value for 'Replicates'
                        from the array's third dimension.

    'Replicates' - Number of times to repeat the clustering, each with a
    new set of initial centroids. A positive integer, default is 1.

    'EmptyAction' - Action to take if a cluster loses all of its member
       observations. Choices are:
           'error'      - Treat an empty cluster as an error (the default)
           'drop'       - Remove any clusters that become empty, and set
                          the corresponding values in C and D to NaN.
           'singleton'  - Create a new cluster consisting of the one
                          observation furthest from its centroid.

    'Options' - Options for the iterative algorithm used to minimize the
       fitting criterion, as created by STATSET. Choices of STATSET
       parameters are:

           'Display'  - Level of display output. Choices are 'off', (the
                        default), 'iter', and 'final'.
           'MaxIter'  - Maximum number of iterations allowed. Default is 100.

    'OnlinePhase' - Flag indicating whether kmeans should perform an "on-line
    update" phase in addition to a "batch update" phase. The on-line phase
    can be time consuming for large data sets, but guarantees a solution
    that is a local minimum of the distance criterion, i.e., a partition of
    the data where moving any single point to a different cluster increases
    the total sum of distances. 'on' (the default) or 'off'.

    Example:

    X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
    opts = statset('Display','final');
    [cidx, ctrs] = kmeans(X, 2, 'Distance','city', ...
                          'Replicates',5, 'Options',opts);
    plot(X(cidx==1,1),X(cidx==1,2),'r.', ...
         X(cidx==2,1),X(cidx==2,2),'b.', ctrs(:,1),ctrs(:,2),'kx');
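
    To make the "batch update" phase described in the help text more concrete, here is a hand-written sketch of it on the same kind of data as the example above. It is not the toolbox implementation: it handles only squared Euclidean distance, starts from K random rows (in the spirit of 'sample'), skips the on-line phase, and leaves an empty cluster's centroid unchanged. The variable names (perm, newIdx, sumd) are my own.

```matlab
% Hand-written batch k-means, squared Euclidean distance only.
% Illustrative sketch, not the code behind the kmeans function.
rng(0);                                    % repeatable random start
X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
K = 2;
n = size(X,1);
perm = randperm(n);
C = X(perm(1:K), :);                       % start from K random observations
idx = zeros(n,1);
for iter = 1:100                           % same cap as the MaxIter default
    % Assignment step: squared distance from every point to every centroid
    D = zeros(n,K);
    for j = 1:K
        D(:,j) = sum(bsxfun(@minus, X, C(j,:)).^2, 2);
    end
    [~, newIdx] = min(D, [], 2);
    if isequal(newIdx, idx)
        break;                             % assignments did not change: stop
    end
    idx = newIdx;
    % Update step: move each centroid to the mean of its members
    for j = 1:K
        if any(idx == j)
            C(j,:) = mean(X(idx == j, :), 1);
        end
    end
end
% Within-cluster sums of squared distances, like the SUMD output
sumd = zeros(1,K);
for j = 1:K
    sumd(j) = sum(sum(bsxfun(@minus, X(idx == j, :), C(j,:)).^2, 2));
end
```

    Each pass of the loop cannot increase sum(sumd), which is why the iteration settles down. The result still depends on the random start, and that is what the 'Replicates' parameter is meant to guard against.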

  • Original post: https://www.cnblogs.com/natalie/p/4794946.html