Day 14 Part 2 Kaggle's 30 Days of ML

Course Step 7 of Intermediate Machine Learning.


Data Leakage:

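Data leakage makes the model score well on validation data, but the model then performs poorly when applied to real-life cases. It happens when the training data contains information about the target that will not be available at prediction time.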

  1. Target Leakage:
    Let's say our target variable is A, and we build a column B from rules that depend on A (so B is only known once A is known).
    If we use B as a feature to predict A, the validation result will look unrealistically good, because B will not be available when we make real predictions. This is target leakage.
  2. Train-Test Contamination:
    If we do the preprocessing (for example, fitting an imputer) before separating the data into training data and validation data, information from the validation rows leaks into training. The model will show a very good validation score, but will perform badly in real-life cases.
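
To make the train-test contamination point concrete, here is a minimal sketch using NumPy only. The data is randomly generated for illustration, and the helper name `impute` is hypothetical (it is not from the course). It shows the safe order: split first, then compute the fill value from the training rows only.

```python
import numpy as np

# Synthetic single-feature data, invented for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))
X[rng.random(100) < 0.2, 0] = np.nan  # knock out roughly 20% of the values

# Split FIRST, then preprocess.
X_train, X_valid = X[:80], X[80:]

# Contaminated: the fill value is computed from all rows, validation included.
leaky_fill = np.nanmean(X, axis=0)

# Clean: the fill value is computed from the training rows only.
clean_fill = np.nanmean(X_train, axis=0)


def impute(data, fill):
    """Return a copy of `data` with NaNs replaced by the per-column `fill`."""
    return np.where(np.isnan(data), fill, data)


# The same training-only statistic is applied to both splits.
X_train_clean = impute(X_train, clean_fill)
X_valid_clean = impute(X_valid, clean_fill)
```

Using `leaky_fill` instead would let validation rows influence the preprocessing. In scikit-learn, putting the preprocessing inside a `Pipeline` and cross-validating the whole pipeline achieves the same separation, since the preprocessing is refit on each training fold.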

Example:

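In the course's credit card example, the target is whether a person received a card. Everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures.
This is target leakage: expenditure most likely means spending on that card, so it is only known after the target has been decided.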

The solution is simply to drop the leaky columns.
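
A rough sketch of that check and fix with pandas is below. The eight-row table is toy data invented for illustration (not the course dataset), and the column names `card`, `expenditure`, and `income` are assumptions chosen to mirror the example above.

```python
import pandas as pd

# Toy data invented for illustration; 1 = received a card, 0 = did not.
df = pd.DataFrame({
    "card":        [1, 1, 1, 1, 0, 0, 0, 0],
    "expenditure": [120.5, 0.0, 33.2, 80.0, 0.0, 0.0, 0.0, 0.0],
    "income":      [4.5, 3.2, 2.8, 5.1, 2.0, 3.9, 2.5, 4.1],
})

y = df["card"]

# Share of people with zero expenditure in each group.
no_card_zero = (df.loc[df["card"] == 0, "expenditure"] == 0).mean()
card_zero = (df.loc[df["card"] == 1, "expenditure"] == 0).mean()
print(f"No card, zero expenditure: {no_card_zero:.0%}")
print(f"Card, zero expenditure:    {card_zero:.0%}")

# Expenditure almost perfectly separates the groups, so treat it as leaky
# and drop it (together with the target) from the feature table.
leaky_columns = ["expenditure"]
X = df.drop(columns=["card"] + leaky_columns)
```

A gap this large between the two groups is a hint rather than proof. Whether a column really leaks depends on when its value becomes known relative to the target.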