• [MATLAB] Simple TFIDF implementation


    TF-IDF (term frequency-inverse document frequency) is one of the most widely used weighting schemes for normalizing document-term matrices in text mining and information retrieval.

    See Wikipedia for details.
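
    For reference, here is the quantity the listing in this post computes, read directly off its loops (each row is normalized to sum to one, then each column gets one logarithmic weight). The symbols are my own notation, not the author's: x_{d,t} is the count of term t in document d, with documents as rows and terms as columns, N is the number of documents, and n_t is the number of documents containing term t. This is only one common TF-IDF variant; others smooth or rescale the logarithm.

```latex
\[
\begin{aligned}
\mathrm{tf}_{d,t}  &= \frac{x_{d,t}}{\sum_{t'} x_{d,t'}}, \\
\mathrm{idf}_{t}   &= \log\frac{N}{n_t}, \qquad
                      n_t = \bigl|\{\, d : x_{d,t} \neq 0 \,\}\bigr|, \\
y_{d,t}            &= \mathrm{tf}_{d,t}\,\mathrm{idf}_{t}.
\end{aligned}
\]
```

    The logarithm is the natural one (MATLAB's log). The code sets tf to zero for an all-zero row and idf to zero when n_t = 0, so no NaN or Inf values appear.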

    tfidf

    function Y = tfidf( X )
    % FUNCTION computes TF-IDF weighted word histograms.
    %
    %   Y = tfidf( X );
    %
    % INPUT :
    %   X        - document-term matrix of word counts (documents in rows, terms in columns)
    %
    % OUTPUT :
    %   Y        - TF-IDF weighted document-term matrix
    %
     
    % get term frequencies
    X = tf(X);
     
    % get inverse document frequencies
    I = idf(X);
     
    % scale each term (column) by its IDF weight
    for j=1:size(X, 2)
        X(:, j) = X(:, j)*I(j);
    end
     
    Y = X;
     
     
    function X = tf(X)
    % SUBFUNCTION computes term frequencies (each row normalized to sum to 1)
     
    % for every document (row)
    for i=1:size(X, 1)
        
        % get the counts of all terms in document i
        x = X(i, :);
        
        % sum all term occurrences in document i
        sumX = sum( x );
        
        % compute the relative frequency of each term within document i
        if sumX ~= 0
            X(i, :) = x / sumX;
        else
            % avoiding NaNs : an empty document keeps all-zero frequencies
            X(i, :) = 0;
        end
        
    end
     
     
    function I = idf(X)
    % SUBFUNCTION computes inverse document frequencies
     
    % m - number of documents
    % n - number of terms or words
    [m, n]=size(X);
     
    % allocate space for term idf's
    I = zeros(n, 1);
     
    % for every term (column)
    for j=1:n
        
        % count the documents in which term j occurs
        nz = nnz( X(:, j) );
        
        % if the term occurs at all, assign a weight:
        if nz
            I(j) = log( m / nz );
        end
        
    end
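
    The loops above can also be written without explicit iteration. The following is a minimal sketch rather than the author's code: it assumes non-negative counts with documents in rows and terms in columns (the layout the loops operate on), and the toy matrix and the names rowSums, TF, df, IDF and nDocs are made up for illustration. bsxfun is used so the sketch also runs on older MATLAB releases.

```matlab
% Toy document-term matrix: 3 documents (rows) x 4 terms (columns).
% The last term never occurs, so its weight must stay zero.
X = [2 1 0 0;
     0 1 1 0;
     1 1 0 0];

% Term frequencies: divide each row by its total count.
% Empty documents get a divisor of 1, so their rows remain all zeros.
rowSums = sum(X, 2);
rowSums(rowSums == 0) = 1;
TF = bsxfun(@rdivide, X, rowSums);

% Inverse document frequencies: log(#documents / #documents containing the term).
nDocs = size(X, 1);
df = sum(X ~= 0, 1);
IDF = zeros(1, size(X, 2));
IDF(df > 0) = log(nDocs ./ df(df > 0));

% Weight every term column by its IDF.
Y = bsxfun(@times, TF, IDF);
```

    On this input the second term occurs in every document, so its column in Y is zero; Y should match the result of tfidf(X) from the listing up to floating-point rounding.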
  • Original source: https://www.cnblogs.com/youth0826/p/2633688.html