Data Leakage in Machine Learning

Reference: https://www.kaggle.com/dansbecker/data-leakage

There are two main types of leakage: Leaky Predictors and Leaky Validation Strategies.

Leaky Predictors

This occurs when your predictors include data that will not be available at the time you make predictions.

A model trained on such features or data scores high accuracy in validation but low accuracy once deployed in a real environment, because that data cannot be obtained there.

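For example, when predicting pneumonia, using "took antibiotics" as a feature is exactly this case. Patients generally take antibiotics because they already have pneumonia, so a pneumonia prediction model should not use "took antibiotics" as a feature.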
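The pneumonia case can be illustrated with a small simulation. This is only a sketch on synthetic data, not real clinical records: the 30% pneumonia rate, the 95% prescription rate, and the names make_patient and leaky_model are assumptions made up for the example. It contrasts how a rule built on took_antibiotics scores on historical records with how it behaves at prediction time, when nobody has been prescribed anything yet.

```python
import random

random.seed(0)


def make_patient(at_prediction_time):
    """Return (took_antibiotics, has_pneumonia) for one synthetic patient."""
    has_pneumonia = random.random() < 0.3  # assumed base rate
    if at_prediction_time:
        # No diagnosis exists yet, so the field is still empty for everyone.
        took_antibiotics = False
    else:
        # Historical records: antibiotics were prescribed after diagnosis,
        # so the field almost mirrors the target (assumed 95% of cases).
        took_antibiotics = has_pneumonia and random.random() < 0.95
    return took_antibiotics, has_pneumonia


def leaky_model(took_antibiotics):
    """Hypothetical rule a model would learn from the leaky feature."""
    return took_antibiotics


def accuracy(records):
    return sum(leaky_model(x) == y for x, y in records) / len(records)


def recall(records):
    # Share of true pneumonia cases that the rule flags.
    positives = [x for x, y in records if y]
    return sum(leaky_model(x) for x in positives) / len(positives)


historical = [make_patient(at_prediction_time=False) for _ in range(1000)]
deployed = [make_patient(at_prediction_time=True) for _ in range(1000)]

historical_accuracy = accuracy(historical)
deployed_accuracy = accuracy(deployed)
historical_recall = recall(historical)
deployed_recall = recall(deployed)

print("historical accuracy:", historical_accuracy)
print("deployed accuracy:  ", deployed_accuracy)
print("historical recall:  ", historical_recall)
print("deployed recall:    ", deployed_recall)
```

On the historical records the rule looks almost perfect, while on the deployed records it never flags a single pneumonia case, because the feature it relies on has not been recorded yet.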

Leaky Validation Strategies

This occurs when the validation data is allowed to influence the model's parameters at some point during the modeling process.

For example, this happens if you run preprocessing (like fitting the Imputer for missing values) before calling train_test_split. The preprocessing step is then fitted on data that includes the rows that become validation data after the split.
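The difference between the leaky order and the safe order can be sketched with scikit-learn. This is a minimal example under stated assumptions: the data is random and synthetic, the variable names are made up for illustration, and SimpleImputer is used as the current scikit-learn counterpart of the older Imputer class mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 200 rows, 3 features, about 10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=200)

# Leaky order: the imputer is fitted on ALL rows before the split,
# so its column means already contain information from validation rows.
leaky_imputer = SimpleImputer(strategy="mean").fit(X)

# Safe order: split first, then fit the imputer on the training rows only.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
clean_imputer = SimpleImputer(strategy="mean").fit(X_train)

# The validation rows are only transformed, never used for fitting.
X_train_imputed = clean_imputer.transform(X_train)
X_valid_imputed = clean_imputer.transform(X_valid)

print("means fitted on all rows:     ", leaky_imputer.statistics_)
print("means fitted on training rows:", clean_imputer.statistics_)
```

The two sets of column means differ, which shows that the leaky imputer's parameters depend on rows that later serve as validation data. Wrapping the imputer and the model in a scikit-learn Pipeline is a common way to keep this order correct automatically, including under cross-validation.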
Original article: https://www.cnblogs.com/xbit/p/10124742.html