refer to: https://www.kaggle.com/dansbecker/data-leakage
There are two main types of leakage: Leaky Predictors and a Leaky Validation Strategies.
Leaky Predictors
This occurs when your predictors include data that will not be available at the time you make predictions.
模型中用了预测前不可用的feature/data,这会导致在validation中accuracy很高,而在实际环境中部署后,accuracy很低,因为得不到这样的数据。
如,预测肺炎,如果使用“服用抗生素”作为feature,就是这种情况,因为一般是得了肺炎自然会服用抗生素,在预测肺炎这格模型中,不应该使用“服用抗生素”这个feature。
Leaky Validation Strategies
在模型处理过程中,让Validation Data影响到了模型的参数。
For example, this happens if you run preprocessing (like fitting the Imputer for missing values) before calling train_test_split.
例如,当你在调用train_test_split之前,对数据进行了预处理(如Imputer),而预处理所用数据包含了spit之后的validation data。