• How much do you know about tf-idf?


    1. The most complete explanation

    TF-IDF is a statistical method for evaluating how important a word is to one particular document in a corpus.

    -------------- That is: given a corpus (meaning the corpus's statistics are already known) and given a term, we compute that term's importance for a document (i.e., compute a score); the document is the variable part;

    That way we can compute the term's score against each of several documents and simply rank them.

    An intuitive observation: the more often a word appears in the current document, the more important it is; but such a word might just be a stop word,

    so we also have to account for how many documents in the corpus contain the word;

    So a word's importance to a document depends on two things: how many times the word occurs in that document (the more, the more important), and how many documents in the corpus contain it (the fewer, the more important).

    In short, this measures how important a word is to a document. How is that importance quantified? As described below. Here x is the term and y is the document; for a query made of several terms, we sum the scores over the x's.
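
    To make this concrete, here is a minimal Python sketch of that scoring, using the classic formula tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The toy corpus and the query term are illustrative assumptions, not from the original post:

    import math

    # Toy corpus: each document is a list of tokens (illustrative data).
    corpus = [
        "the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "dogs and cats are pets".split(),
    ]

    def tf_idf(term, doc, corpus):
        """Score one term against one document: tf(t, d) * log(N / df(t))."""
        tf = doc.count(term)                    # occurrences in this document
        df = sum(term in d for d in corpus)     # documents containing the term
        return tf * math.log(len(corpus) / df) if df else 0.0

    # Score the term against every document and rank, as described above.
    ranked = sorted(
        ((tf_idf("cat", doc, corpus), i) for i, doc in enumerate(corpus)),
        reverse=True,
    )
    print(ranked)   # highest-scoring documents first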

    2. About normalization

    There are two kinds of normalization operations, called instance-wise and feature-wise normalization:

    > Try normalizing each instance (instance-wise normalization)

    > Try aligning the scales across features (feature-wise normalization)

    Lately I've been doing text classification, which uses tf-idf; in that setting one usually only does instance-wise normalization, i.e., normalizing over the features of each individual sample (each row of the feature matrix), as sketched below.
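
    A minimal sketch with scikit-learn (assuming sklearn is available; norm='l2' is TfidfVectorizer's per-sample, i.e. instance-wise, normalization switch):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "the dog chased the cat"]

    # norm='l2' divides each sample's tf-idf vector by that sample's own
    # 2-norm, so every row of X ends up with unit length (instance-wise).
    X = TfidfVectorizer(norm="l2").fit_transform(docs)

    print(np.linalg.norm(X.toarray(), axis=1))  # -> [1. 1.]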

    Why exactly should we do instance-wise normalization? The passage below explains it (item 3 in particular); a short code sketch of the full recipe follows the quote.


    1. Transforming infrequent features into a special tag. Conceptually, infrequent
    features should contain very little or even no information, so it
    should be very hard for a model to extract that information. In fact, these
    features can be very noisy. We got a significant improvement (more than 0.0005)
    by transforming these features into a special tag.

    2. Transforming numerical features (I1-I13) to categorical features.
    Empirically we observe using categorical features is always better than
    using numerical features. However, too many features are generated if
    numerical features are directly transformed into categorical features, so we
    use

    v <- floor(log(v)^2)

    to reduce the number of features generated.

    3. Data normalization. According to our empirical experience, instance-wise
    data normalization makes the optimization problem easier to be solved.
    For example, for the following vector x

    (0, 0, 1, 1, 0, 1, 0, 1, 0),

    the standard data normalization divides each element of x by the 2-norm
    of x. The normalized x becomes

    (0, 0, 0.5, 0.5, 0, 0.5, 0, 0.5, 0).

    The hashing trick does not contribute any improvement on the leaderboard. We
    apply the hashing trick only because it makes it easier for us to generate
    features. :)
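
    A minimal Python sketch of the whole recipe above (the rare-value threshold and the hash-space size are illustrative assumptions; this is not the original competition code):

    import math

    HASH_BITS = 18          # illustrative: 2^18 hash buckets
    RARE_THRESHOLD = 10     # illustrative cutoff for "infrequent" values

    def bin_numerical(v):
        """Step 2: compress a numerical value with v <- floor(log(v)^2)."""
        if v is None or v <= 1:     # guard: log(v) needs v > 0; small
            return v                # values are left as-is here
        return math.floor(math.log(v) ** 2)

    def encode(feature, value, counts):
        """Step 1 + hashing trick: map rare values to a special tag,
        then hash 'feature=value' into a fixed-size index space."""
        if counts.get((feature, value), 0) < RARE_THRESHOLD:
            value = "RARE"          # step 1: special tag for rare values
        return hash(f"{feature}={value}") % (1 << HASH_BITS)

    def l2_normalize(x):
        """Step 3: instance-wise normalization by the sample's own 2-norm."""
        norm = math.sqrt(sum(v * v for v in x))
        return [v / norm for v in x] if norm else x

    # Reproduces the worked example from the quoted passage:
    print(l2_normalize([0, 0, 1, 1, 0, 1, 0, 1, 0]))
    # -> [0.0, 0.0, 0.5, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0]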

  • Original post: https://www.cnblogs.com/amazement/p/6053944.html