MATLAB Example: Cluster Initialization Methods and Data Normalization Methods
Author: 凯鲁嘎吉 (kailugaji) - 博客园 (cnblogs) http://www.cnblogs.com/kailugaji/
Initialization methods include: random initialization, K-means initialization, FCM initialization, and so on.
Normalization methods include: no normalization, z-score normalization, min-max normalization, and so on.
1. Cluster Initialization Methods
init_methods.m
function label=init_methods(data, K, choose)
% Input:  unlabeled data matrix (one sample per row), number of clusters K, method selector
% Output: cluster label of every sample
if choose==1
    % Random initialization: randomly pick K rows as cluster centers, assign every
    % point to its nearest center by Euclidean distance, and return the K-way labels
    [X_num, ~]=size(data);
    rand_array=randperm(X_num);            % random permutation of the integers 1..X_num
    para_miu=data(rand_array(1:K), :);     % the rows indexed by the first K entries become the initial centers
    % Squared Euclidean distance: (X-para_miu)^2 = X^2 + para_miu^2 - 2*X*para_miu', an X_num-by-K matrix
    distant=repmat(sum(data.*data,2),1,K)+repmat(sum(para_miu.*para_miu,2)',X_num,1)-2*data*para_miu';
    % index of the minimum of each row of distant
    [~,label]=min(distant,[],2);
elseif choose==2
    % K-means initialization: cluster the data into K groups and return the labels
    label=kmeans(data, K);
elseif choose==3
    % FCM initialization
    options=[NaN, NaN, NaN, 0];
    [~, responsivity]=fcm(data, K, options);   % membership matrix produced by FCM
    [~, label]=max(responsivity', [], 2);      % assign each sample to the cluster with the largest membership
elseif choose==4
    % litekmeans initialization with 20 replicates, keeping the best run
    label = litekmeans(data, K,'Replicates',20);
end
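The following is a minimal usage sketch, not part of the original files: it clusters a made-up random data matrix with each of the four choose options. The variable names and the rand data are assumptions for illustration only; choose==2 needs the Statistics and Machine Learning Toolbox (kmeans) and choose==3 the Fuzzy Logic Toolbox (fcm).

% Illustrative sketch only: made-up random data, K=3 clusters
data = rand(100, 2);                        % hypothetical unlabeled data, one sample per row
K = 3;
label_rand   = init_methods(data, K, 1);    % random centers + nearest-center assignment
label_kmeans = init_methods(data, K, 2);    % kmeans initialization
label_fcm    = init_methods(data, K, 3);    % FCM initialization (Fuzzy Logic Toolbox)
label_lkm    = init_methods(data, K, 4);    % litekmeans with 20 replicates
tabulate(label_rand)                        % inspect cluster sizes (Statistics Toolbox)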
litekmeans.m
function [label, center, bCon, sumD, D] = litekmeans(X, k, varargin)
% [IDX, C] = litekmeans(data, K,'Replicates',20);
%LITEKMEANS K-means clustering, accelerated by matlab matrix operations.
%
%   label = LITEKMEANS(X, K) partitions the points in the N-by-P data matrix
%   X into K clusters. This partition minimizes the sum, over all
%   clusters, of the within-cluster sums of point-to-cluster-centroid
%   distances. Rows of X correspond to points, columns correspond to
%   variables. KMEANS returns an N-by-1 vector label containing the
%   cluster indices of each point.
%
%   [label, center] = LITEKMEANS(X, K) returns the K cluster centroid
%   locations in the K-by-P matrix center.
%
%   [label, center, bCon] = LITEKMEANS(X, K) returns the bool value bCon to
%   indicate whether the iteration is converged.
%
%   [label, center, bCon, SUMD] = LITEKMEANS(X, K) returns the
%   within-cluster sums of point-to-centroid distances in the 1-by-K vector
%   sumD.
%
%   [label, center, bCon, SUMD, D] = LITEKMEANS(X, K) returns
%   distances from each point to every centroid in the N-by-K matrix D.
%
%   [ ... ] = LITEKMEANS(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies
%   optional parameter name/value pairs to control the iterative algorithm
%   used by KMEANS. Parameters are:
%
%   'Distance' - Distance measure, in P-dimensional space, that KMEANS
%      should minimize with respect to. Choices are:
%          {'sqEuclidean'} - Squared Euclidean distance (the default)
%           'cosine'       - One minus the cosine of the included angle
%                            between points (treated as vectors). Each
%                            row of X SHOULD be normalized to unit. If
%                            the intial center matrix is provided, it
%                            SHOULD also be normalized.
%
%   'Start' - Method used to choose initial cluster centroid positions,
%      sometimes known as "seeds". Choices are:
%         {'sample'}  - Select K observations from X at random (the default)
%          'cluster'  - Perform preliminary clustering phase on random 10%
%                       subsample of X. This preliminary phase is itself
%                       initialized using 'sample'. An additional parameter
%                       clusterMaxIter can be used to control the maximum
%                       number of iterations in each preliminary clustering
%                       problem.
%           matrix    - A K-by-P matrix of starting locations; or a K-by-1
%                       indicate vector indicating which K points in X
%                       should be used as the initial center. In this case,
%                       you can pass in [] for K, and KMEANS infers K from
%                       the first dimension of the matrix.
%
%   'MaxIter' - Maximum number of iterations allowed. Default is 100.
%
%   'Replicates' - Number of times to repeat the clustering, each with a
%                  new set of initial centroids. Default is 1. If the
%                  initial centroids are provided, the replicate will be
%                  automatically set to be 1.
%
%   'clusterMaxIter' - Only useful when 'Start' is 'cluster'. Maximum number
%                      of iterations of the preliminary clustering phase.
%                      Default is 10.
%
%
%    Examples:
%
%       fea = rand(500,10);
%       [label, center] = litekmeans(fea, 5, 'MaxIter', 50);
%
%       fea = rand(500,10);
%       [label, center] = litekmeans(fea, 5, 'MaxIter', 50, 'Replicates', 10);
%
%       fea = rand(500,10);
%       [label, center, bCon, sumD, D] = litekmeans(fea, 5, 'MaxIter', 50);
%       TSD = sum(sumD);
%
%       fea = rand(500,10);
%       initcenter = rand(5,10);
%       [label, center] = litekmeans(fea, 5, 'MaxIter', 50, 'Start', initcenter);
%
%       fea = rand(500,10);
%       idx=randperm(500);
%       [label, center] = litekmeans(fea, 5, 'MaxIter', 50, 'Start', idx(1:5));
%
%
%   See also KMEANS
%
%   [Cite] Deng Cai, "Litekmeans: the fastest matlab implementation of
%   kmeans," Available at:
%   http://www.zjucadcg.cn/dengcai/Data/Clustering.html, 2011.
%
%   version 2.0 --December/2011
%   version 1.0 --November/2011
%
%   Written by Deng Cai (dengcai AT gmail.com)

if nargin < 2
    error('litekmeans:TooFewInputs','At least two input arguments required.');
end

[n, p] = size(X);

pnames = {   'distance' 'start'   'maxiter'  'replicates' 'onlinephase' 'clustermaxiter'};
dflts =  {'sqeuclidean' 'sample'       []        []          'off'            []        };

[eid,errmsg,distance,start,maxit,reps,online,clustermaxit] = getargs(pnames, dflts, varargin{:});
if ~isempty(eid)
    error(sprintf('litekmeans:%s',eid),errmsg);
end

if ischar(distance)
    distNames = {'sqeuclidean','cosine'};
    j = strcmpi(distance, distNames);
    j = find(j);
    if length(j) > 1
        error('litekmeans:AmbiguousDistance', ...
            'Ambiguous ''Distance'' parameter value: %s.', distance);
    elseif isempty(j)
        error('litekmeans:UnknownDistance', ...
            'Unknown ''Distance'' parameter value: %s.', distance);
    end
    distance = distNames{j};
else
    error('litekmeans:InvalidDistance', ...
        'The ''Distance'' parameter value must be a string.');
end

center = [];
if ischar(start)
    startNames = {'sample','cluster'};
    j = find(strncmpi(start,startNames,length(start)));
    if length(j) > 1
        error(message('litekmeans:AmbiguousStart', start));
    elseif isempty(j)
        error(message('litekmeans:UnknownStart', start));
    elseif isempty(k)
        error('litekmeans:MissingK', ...
            'You must specify the number of clusters, K.');
    end
    if j == 2
        if floor(.1*n) < 5*k
            j = 1;
        end
    end
    start = startNames{j};
elseif isnumeric(start)
    if size(start,2) == p
        center = start;
    elseif (size(start,2) == 1 || size(start,1) == 1)
        center = X(start,:);
    else
        error('litekmeans:MisshapedStart', ...
            'The ''Start'' matrix must have the same number of columns as X.');
    end
    if isempty(k)
        k = size(center,1);
    elseif (k ~= size(center,1))
        error('litekmeans:MisshapedStart', ...
            'The ''Start'' matrix must have K rows.');
    end
    start = 'numeric';
else
    error('litekmeans:InvalidStart', ...
        'The ''Start'' parameter value must be a string or a numeric matrix or array.');
end

% The maximum iteration number is default 100
if isempty(maxit)
    maxit = 100;
end

% The maximum iteration number for preliminary clustering phase on random
% 10% subsamples is default 10
if isempty(clustermaxit)
    clustermaxit = 10;
end

% Assume one replicate
if isempty(reps) || ~isempty(center)
    reps = 1;
end

if ~(isscalar(k) && isnumeric(k) && isreal(k) && k > 0 && (round(k)==k))
    error('litekmeans:InvalidK', ...
        'X must be a positive integer value.');
elseif n < k
    error('litekmeans:TooManyClusters', ...
        'X must have more rows than the number of clusters.');
end

bestlabel = [];
sumD = zeros(1,k);
bCon = false;

for t=1:reps
    switch start
        case 'sample'
            center = X(randsample(n,k),:);
        case 'cluster'
            Xsubset = X(randsample(n,floor(.1*n)),:);
            [dump, center] = litekmeans(Xsubset, k, varargin{:}, 'start','sample', 'replicates',1 ,'MaxIter',clustermaxit);
        case 'numeric'
    end

    last = 0;label=1;
    it=0;

    switch distance
        case 'sqeuclidean'
            while any(label ~= last) && it<maxit
                last = label;

                bb = full(sum(center.*center,2)');
                ab = full(X*center');
                D = bb(ones(1,n),:) - 2*ab;

                [val,label] = min(D,[],2); % assign samples to the nearest centers
                ll = unique(label);
                if length(ll) < k
                    %disp([num2str(k-length(ll)),' clusters dropped at iter ',num2str(it)]);
                    missCluster = 1:k;
                    missCluster(ll) = [];
                    missNum = length(missCluster);

                    aa = sum(X.*X,2);
                    val = aa + val;
                    [dump,idx] = sort(val,1,'descend');
                    label(idx(1:missNum)) = missCluster;
                end
                E = sparse(1:n,label,1,n,k,n);  % transform label into indicator matrix
                center = full((E*spdiags(1./sum(E,1)',0,k,k))'*X);    % compute center of each cluster

                it=it+1;
            end
            if it<maxit
                bCon = true;
            end
            if isempty(bestlabel)
                bestlabel = label;
                bestcenter = center;
                if reps>1
                    if it>=maxit
                        aa = full(sum(X.*X,2));
                        bb = full(sum(center.*center,2));
                        ab = full(X*center');
                        D = bsxfun(@plus,aa,bb') - 2*ab;
                        D(D<0) = 0;
                    else
                        aa = full(sum(X.*X,2));
                        D = aa(:,ones(1,k)) + D;
                        D(D<0) = 0;
                    end
                    D = sqrt(D);
                    for j = 1:k
                        sumD(j) = sum(D(label==j,j));
                    end
                    bestsumD = sumD;
                    bestD = D;
                end
            else
                if it>=maxit
                    aa = full(sum(X.*X,2));
                    bb = full(sum(center.*center,2));
                    ab = full(X*center');
                    D = bsxfun(@plus,aa,bb') - 2*ab;
                    D(D<0) = 0;
                else
                    aa = full(sum(X.*X,2));
                    D = aa(:,ones(1,k)) + D;
                    D(D<0) = 0;
                end
                D = sqrt(D);
                for j = 1:k
                    sumD(j) = sum(D(label==j,j));
                end
                if sum(sumD) < sum(bestsumD)
                    bestlabel = label;
                    bestcenter = center;
                    bestsumD = sumD;
                    bestD = D;
                end
            end
        case 'cosine'
            while any(label ~= last) && it<maxit
                last = label;
                W=full(X*center');
                [val,label] = max(W,[],2); % assign samples to the nearest centers
                ll = unique(label);
                if length(ll) < k
                    missCluster = 1:k;
                    missCluster(ll) = [];
                    missNum = length(missCluster);
                    [dump,idx] = sort(val);
                    label(idx(1:missNum)) = missCluster;
                end
                E = sparse(1:n,label,1,n,k,n); % transform label into indicator matrix
                center = full((E*spdiags(1./sum(E,1)',0,k,k))'*X);    % compute center of each cluster
                centernorm = sqrt(sum(center.^2, 2));
                center = center ./ centernorm(:,ones(1,p));

                it=it+1;
            end
            if it<maxit
                bCon = true;
            end
            if isempty(bestlabel)
                bestlabel = label;
                bestcenter = center;
                if reps>1
                    if any(label ~= last)
                        W=full(X*center');
                    end
                    D = 1-W;
                    for j = 1:k
                        sumD(j) = sum(D(label==j,j));
                    end
                    bestsumD = sumD;
                    bestD = D;
                end
            else
                if any(label ~= last)
                    W=full(X*center');
                end
                D = 1-W;
                for j = 1:k
                    sumD(j) = sum(D(label==j,j));
                end
                if sum(sumD) < sum(bestsumD)
                    bestlabel = label;
                    bestcenter = center;
                    bestsumD = sumD;
                    bestD = D;
                end
            end
    end
end

label = bestlabel;
center = bestcenter;
if reps>1
    sumD = bestsumD;
    D = bestD;
elseif nargout > 3
    switch distance
        case 'sqeuclidean'
            if it>=maxit
                aa = full(sum(X.*X,2));
                bb = full(sum(center.*center,2));
                ab = full(X*center');
                D = bsxfun(@plus,aa,bb') - 2*ab;
                D(D<0) = 0;
            else
                aa = full(sum(X.*X,2));
                D = aa(:,ones(1,k)) + D;
                D(D<0) = 0;
            end
            D = sqrt(D);
        case 'cosine'
            if it>=maxit
                W=full(X*center');
            end
            D = 1-W;
    end
    for j = 1:k
        sumD(j) = sum(D(label==j,j));
    end
end


function [eid,emsg,varargout]=getargs(pnames,dflts,varargin)
%GETARGS Process parameter name/value pairs
% [EID,EMSG,A,B,...]=GETARGS(PNAMES,DFLTS,'NAME1',VAL1,'NAME2',VAL2,...)
% accepts a cell array PNAMES of valid parameter names, a cell array
% DFLTS of default values for the parameters named in PNAMES, and
% additional parameter name/value pairs. Returns parameter values A,B,...
% in the same order as the names in PNAMES. Outputs corresponding to
% entries in PNAMES that are not specified in the name/value pairs are
% set to the corresponding value from DFLTS. If nargout is equal to
% length(PNAMES)+1, then unrecognized name/value pairs are an error. If
% nargout is equal to length(PNAMES)+2, then all unrecognized name/value
% pairs are returned in a single cell array following any other outputs.
%
% EID and EMSG are empty if the arguments are valid. If an error occurs,
% EMSG is the text of an error message and EID is the final component
% of an error message id. GETARGS does not actually throw any errors,
% but rather returns EID and EMSG so that the caller may throw the error.
% Outputs will be partially processed after an error occurs.
%
%   This utility can be used for processing name/value pair arguments.
%
%   Example:
%       pnames = {'color' 'linestyle', 'linewidth'}
%       dflts  = {    'r'         '_'          '1'}
%       varargin = {{'linew' 2 'nonesuch' [1 2 3] 'linestyle' ':'}
%       [eid,emsg,c,ls,lw] = statgetargs(pnames,dflts,varargin{:})    % error
%       [eid,emsg,c,ls,lw,ur] = statgetargs(pnames,dflts,varargin{:}) % ok

% We always create (nparams+2) outputs:
%   one each for emsg and eid
%   nparams varargs for values corresponding to names in pnames
% If they ask for one more (nargout == nparams+3), it's for unrecognized
% names/values

%   Original Copyright 1993-2008 The MathWorks, Inc.
%   Modified by Deng Cai (dengcai@gmail.com) 2011.11.27

% Initialize some variables
emsg = '';
eid = '';
nparams = length(pnames);
varargout = dflts;
unrecog = {};
nargs = length(varargin);

% Must have name/value pairs
if mod(nargs,2)~=0
    eid = 'WrongNumberArgs';
    emsg = 'Wrong number of arguments.';
else
    % Process name/value pairs
    for j=1:2:nargs
        pname = varargin{j};
        if ~ischar(pname)
            eid = 'BadParamName';
            emsg = 'Parameter name must be text.';
            break;
        end
        i = strcmpi(pname,pnames);
        i = find(i);
        if isempty(i)
            % if they've asked to get back unrecognized names/values, add this
            % one to the list
            if nargout > nparams+2
                unrecog((end+1):(end+2)) = {varargin{j} varargin{j+1}};
                % otherwise, it's an error
            else
                eid = 'BadParamName';
                emsg = sprintf('Invalid parameter name: %s.',pname);
                break;
            end
        elseif length(i)>1
            eid = 'BadParamName';
            emsg = sprintf('Ambiguous parameter name: %s.',pname);
            break;
        else
            varargout{i} = varargin{j+1};
        end
    end
end

varargout{nparams+1} = unrecog;
2. Data Normalization Methods: normlization.m
function data = normlization(data, choose)
% Data normalization
if choose==0
    % No normalization
    data = data;
elseif choose==1
    % Z-score normalization: each column gets zero mean and unit standard deviation
    data = bsxfun(@minus, data, mean(data));
    data = bsxfun(@rdivide, data, std(data));
elseif choose==2
    % Min-max normalization: each column is scaled to [0, 1]
    [data_num,~]=size(data);
    data=(data-ones(data_num,1)*min(data))./(ones(data_num,1)*(max(data)-min(data)));
end
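A minimal usage sketch, again not from the original post, showing the two normalization options on a small made-up matrix and feeding the z-scored result into init_methods; the numbers are purely illustrative.

% Illustrative sketch only: features on very different scales
data = [1 200; 2 180; 3 220; 4 210];    % made-up raw data
data_z  = normlization(data, 1);        % z-score: each column gets mean 0, std 1
data_mm = normlization(data, 2);        % min-max: each column scaled to [0, 1]
label   = init_methods(data_z, 2, 2);   % cluster the normalized data into 2 groups via kmeans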
Note: you can add your own method as an extra elseif branch; a sketch of one possible branch follows.
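For example, here is a sketch of one possible extra branch (choose==3, scaling every sample to unit L2 norm). It is not part of the original normlization.m and only shows where a new method would be inserted.

elseif choose==3
    % user-added method (illustrative): scale each sample (row) to unit L2 norm
    row_norm = sqrt(sum(data.^2, 2));
    row_norm(row_norm==0) = 1;          % avoid division by zero for all-zero rows
    data = bsxfun(@rdivide, data, row_norm);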