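Note: the code used in this article comes from the PRTools pattern recognition toolbox (http://www.prtools.org), and the experiments were run in MATLAB.
The confusion matrix is a common tool in pattern recognition, and PRTools provides a ready-made function for it, confmat. Its usage is as follows: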
[C,NE,LABLIST] = CONFMAT(LAB1,LAB2,METHOD,FID)

INPUT
  LAB1     Set of labels
  LAB2     Set of labels
  METHOD   'count' (default) to count number of co-occurences in LAB1 and LAB2,
           'disagreement' to count relative non-co-occurrence.
  FID      Write text result to file
OUTPUT
  C        Confusion matrix
  NE       Total number of errors (empty labels are neglected)
  LABLIST  Unique labels in LAB1 and LAB2
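First, a few terms:
TP (True Positive): positive tuples that the classifier labelled correctly.
TN (True Negative): negative tuples that the classifier labelled correctly.
FP (False Positive): negative tuples that were incorrectly labelled as positive.
FN (False Negative): positive tuples that were incorrectly labelled as negative.
TP and TN tell us when the classifier gets things right; FP and FN tell us when it gets things wrong.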
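The four counts can be tallied directly from paired actual and predicted labels. Below is a minimal Python sketch (Python is used here only so that the snippet runs without MATLAB or PRTools). The function name `binary_counts`, the choice of 1 as the positive label, and the sample vectors are all assumptions made for illustration.

```python
def binary_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for two equal-length label sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # positive tuple, labelled positive
        elif a != positive and p != positive:
            tn += 1          # negative tuple, labelled negative
        elif a != positive and p == positive:
            fp += 1          # negative tuple, wrongly labelled positive
        else:
            fn += 1          # positive tuple, wrongly labelled negative
    return tp, tn, fp, fn


# Hypothetical labels: 1 = positive, 0 = negative
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, tn, fp, fn = binary_counts(actual, predicted)
print(tp, tn, fp, fn)
```

Note that TP + FN always equals the number of positive tuples and TN + FP the number of negative tuples, which is a quick sanity check on any such tally.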
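For a dataset with M classes, the confusion matrix is a table of size at least M×M, in which the entry in row i, column j is the number of tuples of class i that were labelled as class j.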
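To make the definition concrete, here is a small Python sketch that builds the M×M table with actual classes as rows and predicted classes as columns. The function name `confusion_matrix` and the three-class sample labels are hypothetical. The error count is derived as the total minus the diagonal sum, which is one plausible reading of "total number of errors" and not necessarily how confmat computes NE.

```python
def confusion_matrix(actual, predicted, classes=None):
    """Rows index the actual class, columns index the predicted class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if classes is None:
        classes = sorted(set(actual) | set(predicted))
    index = {c: i for i, c in enumerate(classes)}
    m = len(classes)
    matrix = [[0] * m for _ in range(m)]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return classes, matrix


# Hypothetical three-class example
actual    = ["a", "a", "b", "b", "c", "c", "c"]
predicted = ["a", "b", "b", "b", "c", "a", "c"]
classes, cm = confusion_matrix(actual, predicted)

correct = sum(cm[i][i] for i in range(len(classes)))  # diagonal = agreements
errors = len(actual) - correct                        # off-diagonal total
for name, row in zip(classes, cm):
    print(name, row)
print("errors:", errors)
```

Each row sums to the number of tuples that truly belong to that class, so a perfect classifier yields a purely diagonal matrix.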
As an example, take the Ionosphere dataset from the UCI repository and call the confusion-matrix function in PRTools:
(1) First, initialise the ionosphere dataset:
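data = load('ionosphere.txt');        % assumes a purely numeric file; the last column is the class label
[m, k] = size(data);
data1 = ones(m, k-1);
% Min-max normalise each feature column to [0, 1]
for i = 1:k-1
    lo = min(data(:,i));
    hi = max(data(:,i));
    if hi > lo
        data1(:,i) = (data(:,i) - lo) / (hi - lo);
    else
        data1(:,i) = 0;               % constant column: avoid 0/0 = NaN
    end
end
label = data(:,k);
% Shift the labels so that they start at 1 instead of 0
if min(label) == 0
    label = label + 1;
end
a = dataset(data1, label);            % PRTools dataset (newer PRTools releases may name this prdataset)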
(2) Then call the confmat.m function:
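[train, test] = gendat(a, 0.5);   % random split: half for training, half for testing
w = treec(train);                 % train a decision-tree classifier
conf = confmat(test*w)            % classify the test set and build the confusion matrix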
After running this, conf holds the confusion matrix; its entries have the meaning described above.
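The preprocessing in step (1) does two things: it rescales every feature column to [0, 1], and it shifts the labels so that they start at 1. The Python sketch below illustrates both on a tiny made-up table. The helper names and the data are hypothetical, and mapping a constant column to 0 is an assumption of this sketch (the formula itself would give 0/0 for such a column).

```python
def min_max_normalise(rows):
    """Scale every column of a list-of-rows table to the range [0, 1]."""
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    scaled = []
    for r in rows:
        scaled.append([
            # A constant column has zero range; map it to 0.0 (assumption)
            (r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ])
    return scaled


def shift_labels_to_one(labels):
    """If the smallest label is 0, add 1 to every label."""
    return [l + 1 for l in labels] if min(labels) == 0 else list(labels)


# Hypothetical data: three samples, three features (the middle one constant)
features = [[1.0, 0.0, 10.0],
            [3.0, 0.0, 20.0],
            [2.0, 0.0, 40.0]]
labels = [0, 1, 0]

scaled = min_max_normalise(features)
shifted = shift_labels_to_one(labels)
print(scaled)
print(shifted)
```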
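If you would rather not use the confusion-matrix function in PRTools, you can write your own, as in the function below, which can be called directly. Note that this function puts predicted classes in rows and actual classes in columns, the transpose of the row-i-is-the-actual-class convention described earlier.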
function [confmatrix] = cfmatrix(actual, predict, classlist, per)
% CFMATRIX calculates the confusion matrix for any prediction
% algorithm that generates a list of classes to which the test
% feature vectors are assigned
%
% Outputs: confusion matrix
%
%                    Actual Classes
%                      p      n
%                 ___|_____|______|
%    Predicted  p'|     |      |
%      Classes  n'|     |      |
%
% Inputs:
% 1. actual / 2. predict
% The inputs provided are the 'actual' classes vector
% and the 'predict'ed classes vector. The actual classes are the classes
% to which the input feature vectors belong. The predicted classes are the
% class to which the input feature vectors are predicted to belong to,
% based on a prediction algorithm.
% The length of actual class vector and the predicted class vector need to
% be the same. If they are not the same, an error message is displayed.
% 3. classlist
% The third input provides the list of all the classes {p,n,...} for which
% the classification is being done. All classes are numbers.
% 4. per = 1/0 (default = 0)
% This parameter when set to 1 provides the values in the confusion matrix
% as percentages. The default provides the values in numbers.
%
% Example:
% >> a = [ 1 2 3 1 2 3 1 1 2 3 2 1 1 2 3];
% >> b = [ 1 2 3 1 2 3 1 1 1 2 2 1 2 1 3];
% >> Cf = cfmatrix(a, b);
%
% [Avinash Uppuluri: avinash_uv@yahoo.com: Last modified: 08/21/08]

% If classlist not entered: make classlist equal to all
% unique elements of actual
if (nargin < 2)
    error('Not enough input arguments.');
elseif (nargin == 2)
    classlist = unique(actual); % default values from actual
    per = 0; % default is numbers and input 1 for percentage
elseif (nargin == 3)
    per = 0; % default is numbers and input 1 for percentage
end

if (length(actual) ~= length(predict))
    error('First two inputs need to be vectors with equal size.');
elseif ((size(actual,1) ~= 1) && (size(actual,2) ~= 1))
    error('First input needs to be a vector and not a matrix');
elseif ((size(predict,1) ~= 1) && (size(predict,2) ~= 1))
    error('Second input needs to be a vector and not a matrix');
end

format short g;
n_class = length(classlist);
line_two = '----------';
line_three = '_________|';

for i = 1:n_class
    obind_class_i = find(actual == classlist(i));
    prind_class_i = find(predict == classlist(i));
    confmatrix(i,i) = length(intersect(obind_class_i,prind_class_i));
    for j = 1:n_class
        %if (j ~= i)
        if (j < i)
            % observed j predicted i
            confmatrix(i,j) = length(find(actual(prind_class_i) == classlist(j)));
            % observed i predicted j
            confmatrix(j,i) = length(find(predict(obind_class_i) == classlist(j)));
        end
    end
    line_two = strcat(line_two,'---',num2str(classlist(i)),'-----');
    line_three = strcat(line_three,'__________');
end

if (per == 1)
    confmatrix = (confmatrix ./ length(actual)).*100;
end

% output to screen
disp('------------------------------------------');
disp(' Actual Classes');
disp(line_two);
disp('Predicted| ');
disp(' Classes| ');
disp(line_three);
for i = 1:n_class
    temps = sprintf(' %d ',i);
    for j = 1:n_class
        temps = strcat(temps,sprintf(' | %2.1f ',confmatrix(i,j)));
    end
    disp(temps);
    clear temps
end
disp('------------------------------------------');
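As a cross-check on the example in the function header, the following Python sketch mirrors the counting logic of cfmatrix: predicted classes in rows, actual classes in columns, and an optional percentage mode. It is an illustrative re-implementation rather than a port of the author's code; the name `cfmatrix_sketch` is hypothetical, and the screen output of the MATLAB version is omitted.

```python
def cfmatrix_sketch(actual, predict, classlist=None, per=False):
    """Confusion matrix with predicted classes as rows, actual as columns."""
    if len(actual) != len(predict):
        raise ValueError("actual and predict must have the same length")
    if classlist is None:
        classlist = sorted(set(actual))   # default: unique values of actual
    index = {c: i for i, c in enumerate(classlist)}
    n = len(classlist)
    matrix = [[0] * n for _ in range(n)]
    for a, p in zip(actual, predict):
        # Pairs with a label outside classlist are not counted
        if a in index and p in index:
            matrix[index[p]][index[a]] += 1
    if per:
        total = len(actual)
        matrix = [[100.0 * v / total for v in row] for row in matrix]
    return matrix


# The vectors from the example in the function header
a = [1, 2, 3, 1, 2, 3, 1, 1, 2, 3, 2, 1, 1, 2, 3]
b = [1, 2, 3, 1, 2, 3, 1, 1, 1, 2, 2, 1, 2, 1, 3]
cf = cfmatrix_sketch(a, b)
cf_percent = cfmatrix_sketch(a, b, per=True)
print(cf)   # [[5, 2, 0], [1, 3, 1], [0, 0, 3]]
```

Reading the first row: of the tuples predicted as class 1, five really are class 1 and two are actually class 2. With `per=True` every entry is divided by the total number of tuples (15 here), so the entries sum to 100.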
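The confusion matrix itself is easy to understand. A few equally simple measures follow from it (P: number of positive tuples, N: number of negative tuples):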
Accuracy: (TP + TN) / (P + N)
Error rate: (FP + FN) / (P + N)
Sensitivity (recall): TP / P
Precision: TP / (TP + FP)
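These four measures can be computed from the counts in a few lines. The Python sketch below uses P = TP + FN and N = TN + FP; the function name `metrics` and the counts passed to it are hypothetical values chosen only to show the arithmetic.

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, error rate, recall and precision from the four counts."""
    p = tp + fn   # number of positive tuples
    n = tn + fp   # number of negative tuples
    return {
        "accuracy":   (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "recall":     tp / p,            # also called sensitivity
        "precision":  tp / (tp + fp),
    }


# Hypothetical counts
m = metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and error rate always sum to 1, since every tuple is either classified correctly or not.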
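This article has introduced the concept of the confusion matrix mainly through the use of the confusion-matrix function in PRTools. Corrections are welcome if anything here is wrong.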