Day 14 Part 2 Kaggle's 30 Days of ML
Course Step 7 of Intermediate Machine Learning.
Data Leakage:
Leakage data will cause the model have good prediction result on validation data, but cannot be applied to other real-life cases.
- Target Leakage:
Let's say our target var is A, and we conclude some rules based on A and make var column B.
If we use B as features to predict the behavior of A, the result will be super ideal. This is target leakage. - Train-Test Contamination:
If we do the preprocessing before sperating the data into training data and validation data, the model would have a super good prediction, but will perform bad in real-life cases.
Example:
Everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures.
This is a target leakage.
The solution is to simply drop those leak columns.