• [MATLAB] Simple TFIDF implementation


    The TF-IDF (term frequency-inverse document frequency) weighting scheme is one of the most widely used for normalizing document-term matrices in text mining and information retrieval.

    See Wikipedia for details.
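
    For reference, one common variant of the weighting can be written as below. The notation is an assumption introduced here, not taken from the original post: n_{d,t} is the raw count of term t in document d, N is the number of documents, and df(t) is the number of documents that contain term t. The logarithm is the natural one, and a term that occurs in no document is simply given weight zero. Other variants (smoothed idf, log-scaled tf) exist.

    ```latex
    \mathrm{tf}(d,t) = \frac{n_{d,t}}{\sum_{t'} n_{d,t'}}, \qquad
    \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
    \mathrm{tfidf}(d,t) = \mathrm{tf}(d,t)\,\mathrm{idf}(t)
    ```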

    tfidf

    function Y = tfidf( X )
    % FUNCTION computes TF-IDF weighted word histograms.
    %
    %   Y = tfidf( X );
    %
    % INPUT :
    %   X        - document-term matrix of raw counts
    %              (documents in rows, terms in columns)
    %
    % OUTPUT :
    %   Y        - TF-IDF weighted document-term matrix
    %

    % get term frequencies
    X = tf(X);

    % get inverse document frequencies (one weight per term)
    I = idf(X);

    % scale each term's column by its idf weight
    for j=1:size(X, 2)
        X(:, j) = X(:, j)*I(j);
    end

    Y = X;


    function X = tf(X)
    % SUBFUNCTION computes term frequencies within each document

    % for every document
    for i=1:size(X, 1)

        % get the counts of all terms in document i
        x = X(i, :);

        % total number of term occurrences in document i
        sumX = sum( x );

        % relative frequency of each term within document i
        if sumX ~= 0
            X(i, :) = x / sumX;
        else
            % avoiding NaNs : empty documents stay all-zero
            X(i, :) = 0;
        end

    end


    function I = idf(X)
    % SUBFUNCTION computes inverse document frequencies

    % m - number of documents
    % n - number of terms
    [m, n]=size(X);

    % allocate space for the terms' idf weights
    I = zeros(n, 1);

    % for every term
    for j=1:n

        % count the documents in which term j occurs
        nz = nnz( X(:, j) );

        % if the term occurs at all, assign a weight:
        if nz
            I(j) = log( m / nz );
        end

    end
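
    As a quick sanity check, the same computation can be sketched in vectorized form on a toy matrix. This is a minimal sketch, assuming documents are stored in rows and terms in columns; the matrix and the variable names (docLen, df, TF, IDF) are illustrative and not part of the original code.

    ```matlab
    % Toy document-term matrix: 3 documents (rows) x 4 terms (columns).
    X = [2 1 0 0;
         0 1 1 0;
         1 1 0 0];

    nDocs = size(X, 1);

    % Term frequency: divide each row by its document length.
    docLen = sum(X, 2);
    docLen(docLen == 0) = 1;            % empty documents stay all-zero
    TF = bsxfun(@rdivide, X, docLen);

    % Inverse document frequency: log(N / df) per term.
    % Terms that occur in no document keep weight 0.
    df  = sum(X ~= 0, 1);
    IDF = zeros(1, size(X, 2));
    IDF(df > 0) = log(nDocs ./ df(df > 0));

    % Scale every term's column by its idf weight.
    Y = bsxfun(@times, TF, IDF);
    ```

    The second term occurs in all three documents, so its idf is log(3/3) = 0 and its whole column in Y is zero. The fourth term never occurs and also ends up with weight zero.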
  • Original article: https://www.cnblogs.com/youth0826/p/2633688.html