• How tree-based models handle categorical variables


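    Categorical variables are either ordered (ordinal) or unordered (nominal). An ordered variable can simply be integer-encoded, for example with sklearn.preprocessing.LabelEncoder, provided the resulting codes follow the intended order. For an unordered variable, the main question is whether to use one-hot encoding. This Quora answer (author: Clem Wang) lists some other approaches: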

    One can try a few other approaches:

    • look at how the response variable responds to the categorical values and try to group them.
    • Find another ML algorithm that works better with categorical features or with one-hot encoding and use that to train a submodel that just uses the categorical features. Then replace the categorical feature with a probability score. For instance, use a Logistic Regression on the hot-encoded values.
    • Try to combine the categorical feature with some other features.
    • Build N xgboost classifiers, one for each category.

    This may require playing around with the data a bit. Plotting the data may help you see patterns that you didn't know that were there.
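
    The second suggestion in the quoted answer (train a submodel on the categorical feature alone and replace the feature with a probability score) could be sketched roughly as follows. This is only an illustration under assumptions: the function name categorical_probability_score, the toy "city" column and its response rates are hypothetical, and the out-of-fold scoring via cross_val_predict is a choice made here to limit target leakage, not something the answer prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def categorical_probability_score(categories, y, n_splits=5):
    """Turn one categorical column into an out-of-fold probability score.

    A logistic regression is fit on the one-hot encoded column only.
    Each row is scored by a model that was not trained on that row.
    (Hypothetical helper, binary target assumed.)
    """
    X_cat = np.asarray(categories).reshape(-1, 1)
    submodel = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),  # tolerate categories unseen in a fold
        LogisticRegression(max_iter=1000),
    )
    proba = cross_val_predict(
        submodel, X_cat, y, cv=n_splits, method="predict_proba"
    )
    return proba[:, 1]  # probability of the positive class


# Toy data (made up): the positive rate depends on the category.
rng = np.random.default_rng(0)
city = rng.choice(["a", "b", "c"], size=200)
rate = {"a": 0.9, "b": 0.5, "c": 0.1}
y = (rng.random(200) < np.array([rate[c] for c in city])).astype(int)

# This single numeric column can then be fed to the tree model
# in place of the raw categorical feature.
city_score = categorical_probability_score(city, y)
```

    The resulting column is numeric, so a tree-based model can split on it directly instead of on many sparse one-hot columns.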

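    This blog post gives an overall conclusion on using one-hot encoding with xgboost:

    In summary, there are roughly two points:

    • 1. For categorical variables whose categories are ordered, such as age, treating them as numeric variables is fine. For categorical variables whose categories are not ordered, one-hot encoding is recommended, although it increases memory usage and training time.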
    • 2. One-hot encoding is recommended when the number of categories is small (tqchen gives a range of [10, 100]).
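
    The two encodings discussed above can be sketched with scikit-learn as follows. The toy columns "size" and "color" are hypothetical. One assumption to flag: LabelEncoder assigns codes in sorted label order, which need not match the real order of an ordered variable, so this sketch swaps in OrdinalEncoder with an explicit category list for the ordered column and shows LabelEncoder only for comparison.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Hypothetical toy columns: "size" has a natural order, "color" does not.
size = np.array([["small"], ["large"], ["medium"], ["small"]])
color = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordered variable: a single integer-valued column, with the order
# stated explicitly so that small < medium < large.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ordinal.fit_transform(size)  # shape (4, 1)

# For comparison: LabelEncoder sorts the labels alphabetically
# (large, medium, small), so its codes do not follow the size order.
label_codes = LabelEncoder().fit_transform(size.ravel())

# Unordered variable: one 0/1 column per category.
# The default output is a sparse matrix; densify it for inspection.
onehot = OneHotEncoder()
color_encoded = onehot.fit_transform(color).toarray()  # shape (4, 3)
```

    The one-hot result has one column per category, which is where the extra memory and training time mentioned in point 1 come from.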

    Other related material:

    comment:re sklearn -- integer encoding vs 1-hot

  • Original post: https://www.cnblogs.com/ZeroTensor/p/10097069.html