[MATLAB] Simple TFIDF implementation

The TF-IDF (term frequency-inverse document frequency) word weighting scheme is one of the most widely used for normalizing document-term matrices in text mining and information retrieval.

See Wikipedia for details.
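
For quick reference, one common variant of the weighting can be written as below. The symbols are introduced here for illustration only: $n_{t,d}$ is the count of term $t$ in document $d$, and $N$ is the number of documents in the collection. Other variants (smoothed or differently scaled) exist, so treat this as a sketch rather than the single definition.

$$
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\left|\{d : n_{t,d} > 0\}\right|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

A term that occurs in every document gets $\mathrm{idf}(t) = \log 1 = 0$, so its weight vanishes.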

function Y = tfidf( X )
% FUNCTION computes TF-IDF weighted word histograms.
%
%   Y = tfidf( X );
%
% INPUT :
%   X        - document-term matrix (documents in rows, terms in columns)
%
% OUTPUT :
%   Y        - TF-IDF weighted document-term matrix
%

% get term frequencies
X = tf(X);

% get inverse document frequencies
I = idf(X);

% apply weights for each term
for j=1:size(X, 2)
    X(:, j) = X(:, j)*I(j);
end

Y = X;


function X = tf(X)
% SUBFUNCTION computes term frequencies within each document

% for every document
for i=1:size(X, 1)

    % get the counts of all terms in document i
    x = X(i, :);

    % sum all term occurrences in document i
    sumX = sum( x );

    % compute the relative frequency of each term in document i
    if sumX ~= 0
        X(i, :) = x / sumX;
    else
        % avoiding NaNs : empty documents stay all-zero
        X(i, :) = 0;
    end

end


function I = idf(X)
% SUBFUNCTION computes inverse document frequencies

% m - number of documents
% n - number of terms
[m, n]=size(X);

% allocate space for the terms' idf's
I = zeros(n, 1);

% for every term
for j=1:n

    % count the documents in which term j occurs
    nz = nnz( X(:, j) );

    % if the term occurs at all, assign a weight:
    if nz
        I(j) = log( m / nz );
    end

end
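
As a cross-check, here is a minimal vectorized sketch of the same kind of weighting on a toy matrix. It assumes documents are in rows and terms in columns. The toy data and the variable names (`TF`, `IDF`, `df`, `rowSums`) are made up for illustration and are not part of the original function.

```matlab
% Toy document-term matrix: 3 documents (rows) x 4 terms (columns).
% Hypothetical data, chosen only to illustrate the weighting.
X = [2 0 1 0;
     0 3 1 0;
     1 1 1 0];

% Term frequencies: divide each row by the total count in that document.
rowSums = sum(X, 2);
rowSums(rowSums == 0) = 1;                 % empty documents stay all-zero
TF = X ./ repmat(rowSums, 1, size(X, 2));

% Inverse document frequencies:
% log(number of documents / number of documents containing the term).
nDocs = size(X, 1);
df = sum(X > 0, 1);                        % document frequency per term
IDF = zeros(1, size(X, 2));
IDF(df > 0) = log(nDocs ./ df(df > 0));    % never-seen terms keep weight 0

% Weight every term column by its IDF.
Y = TF .* repmat(IDF, size(X, 1), 1);
```

In this toy example the third term occurs in every document and the fourth in none, so both columns of `Y` come out as zeros; only terms that discriminate between documents keep a non-zero weight.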

Original article: https://www.cnblogs.com/youth0826/p/2633688.html