Data Leakage: What It Is and Why It Causes Our Predictive Systems to Fail

Data leakage represents, together with over/underfitting, the main cause of failure of machine learning projects that go into production

Data leakage is undoubtedly a threat that preys on data scientists, regardless of the level of seniority.

It can affect everyone, even professionals with years of experience in the sector.

Data leakage occurs when information from the evaluation set (whether validation or test set), or any other information that will not be available at prediction time, leaks into the training process.

But why does data leakage claim so many victims?

Because even after many experiments and evaluations in the development phase, our models can fail spectacularly in a production scenario.

Avoiding data leakage is not easy. I hope that with this article you’ll understand why and how to avoid it in your projects!

Examples of data leakage

Here’s an example to help you understand what data leakage is.

Imagine that we are applied AI developers employed by a company that mass-produces children’s toys.

Our task is to create a machine learning model that predicts whether a toy will be subject to a refund request within 3 days of its sale.

We receive the data from the factory, in the form of images capturing each toy before packaging.

We use these images to train our model, which performs very well in cross-validation and on the test set.

We deliver the model, and for the first month the customer reports that only 5% of refund requests concern defective toys.

In the second month we prepare for the retraining of the model. The factory sends us more photographs, which we use to expand the initial training dataset.

Again, the model performs well in cross-validation and testing.

This time, however, we are told that customers are requesting refunds and that 90% of these requests concern a defective toy.

We take a look at the photos and notice that those sent by the factory in the last batch show the toys that were returned for a refund during the first month.

The new photos therefore inadvertently encode the very thing we are trying to predict: whether a refund was requested within 3 days of the sale of the toy.

Basically, the factory sent us images taken after the refund requests were made, so they capture specific characteristics of the returned toys. These may include visible damage or obvious defects that buyers discovered and that led to the refund request.

Consequently, the model has become very good at spotting these specific defects, which explains its high performance in development. In production, however, it only sees photos taken before packaging, where such signs are not yet visible, so it fails.

We have to call the customer and explain the situation and possibly clean up the training dataset.

Despite its relevance, data leakage is mentioned far less often than overfitting. This makes it even more dangerous for the emerging data scientist.

Common causes of data leakage

Now let’s see some of the most frequent causes that lead to data leakage.

1. Time-dependent data split randomly rather than by time

We are usually taught that randomly splitting training, validation, and test datasets is the correct choice.

However, very often our data can be correlated on a time basis, which means that the time variable influences the distribution of our labels.

Let’s take for example a time series that comes from the stock market.

If we randomly split this dataset, data leakage would be practically guaranteed.

This is because data from later days would be randomly mixed into the training dataset. Our model would be exposed to the “correct” labels without having to learn them.

It’s a bit like a kid in school who has the correct answers to the test he’s taking. High performance, but very little knowledge.

To avoid the problem, it is useful to split the data based on time: for example if we have data for a month, we train the model on the first 20 days, and we test on the remaining 10, sequentially.
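The time-based split described above can be sketched roughly as follows. This is only a minimal illustration: the `time_based_split` helper, the column names, and the 30 days of randomly generated prices are all assumptions made for the example, not data from a real project.

```python
import numpy as np
import pandas as pd


def time_based_split(df, time_col, n_train):
    """Return (train, test): the n_train earliest rows, then all later rows."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    return ordered.iloc[:n_train], ordered.iloc[n_train:]


# Hypothetical month of daily closing prices (random values, for illustration only).
rng = np.random.default_rng(0)
prices = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "close": rng.normal(100, 5, size=30),
})

# Shuffle the rows on purpose: the helper sorts by date, so row order does not matter.
shuffled = prices.sample(frac=1.0, random_state=0)

# Train on the first 20 days, test on the remaining 10.
train, test = time_based_split(shuffled, "date", n_train=20)
```

Every date in `train` precedes every date in `test`, so the model never sees the future during training. For cross-validation on ordered data, scikit-learn's `TimeSeriesSplit` follows the same idea.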

2. Transform the data before splitting it

This is one of the most common causes among data science newcomers.

This is how this error manifests itself:

from sklearn import model_selection
from sklearn import preprocessing

# ...

scaler = preprocessing.StandardScaler()
data = scaler.fit_transform(data)

train, test = model_selection.train_test_split(
  data, 
  test_size=0.2, 
  random_state=42
)

# ...

The mistake here is that we applied scaling to all the data before splitting it into training and test sets.

In this case, the scaler object needs to know the mean and standard deviation of the dataset to which it is applied.

By providing it with the entire dataset, it will also have stored and used information from the test set, which will shift the mean and standard deviation.

To fix this, always split your data before applying a transformation like scaling: fit the scaler on the training set only, then use it to transform both the training and test sets.
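A corrected version of the snippet above might look like the following sketch. The `data` array here is randomly generated and purely hypothetical, since the original snippet does not show where the data comes from.

```python
import numpy as np
from sklearn import model_selection
from sklearn import preprocessing

# Hypothetical numeric dataset: 100 rows, 3 features (illustration only).
rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=3.0, size=(100, 3))

# 1. Split first, so the test set stays unseen.
train, test = model_selection.train_test_split(
    data,
    test_size=0.2,
    random_state=42,
)

# 2. Fit the scaler on the training set only...
scaler = preprocessing.StandardScaler()
train_scaled = scaler.fit_transform(train)

# 3. ...and reuse the training mean and standard deviation on the test set.
test_scaled = scaler.transform(test)
```

Note that the test set is transformed with `transform`, not `fit_transform`, so its own statistics never reach the scaler.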

3. Fill in the missing data with information from the test set

This is similar to the point above, seen from another angle.

A common method for imputing missing data in a column is to fill the cells with the mean or median of all the data in the column.

If the mean is also calculated on the values belonging to the test set, information from the test set leaks into the training set.

Again, we split our data before applying imputation, and we compute the fill value on the training set only.
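As a minimal sketch of leakage-free imputation, assuming mean imputation with scikit-learn's `SimpleImputer` and a made-up feature matrix `X` with a few missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix with missing values in the first column.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[::7, 0] = np.nan  # every 7th row is missing its first feature

# Split before computing any statistic.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# The column means are learned from the training rows only...
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)

# ...and the same training means fill the gaps in the test rows.
X_test_filled = imputer.transform(X_test)
```

When cross-validation is involved, wrapping the imputer and the model in a scikit-learn `Pipeline` is one way to have the fill value recomputed inside each fold.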

4. Failure to remove duplicates

If we have duplicate records in our dataset, there is a risk that some of these may appear in both training and test sets.

Hence, the leakage. Our model will clearly be good at predicting the value of such duplicates, reducing the prediction error considerably (but incorrectly).

To avoid this error, you must remove the duplicates before performing the train-test split.

If we do oversampling, i.e. create artificial duplicates to train a model on an unbalanced dataset, then we need to do it after splitting the data, and only on the training set.
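The order of operations could be sketched like this. The tiny DataFrame is invented for the example, and the oversampling is plain random resampling with pandas, chosen only for simplicity; it is one possible approach, not the only one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset whose last four rows repeat the first four.
df = pd.DataFrame({
    "feature": list(range(40)) + [0, 1, 2, 3],
    "label": [1 if i % 8 == 0 else 0 for i in range(40)] + [1, 0, 0, 0],
})

# 1. Remove exact duplicates BEFORE splitting.
deduped = df.drop_duplicates().reset_index(drop=True)

# 2. Split, keeping the class ratio similar in both sets.
train, test = train_test_split(
    deduped, test_size=0.25, stratify=deduped["label"], random_state=0
)

# 3. Oversample the minority class AFTER splitting, on the training set only.
minority = train[train["label"] == 1]
majority = train[train["label"] == 0]
extra = minority.sample(
    n=len(majority) - len(minority), replace=True, random_state=0
)
train_balanced = pd.concat([train, extra], ignore_index=True)
```

The artificial copies exist only in `train_balanced`, so no duplicated row can end up in `test`.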

5. Wrong data generation process

This is what happened in the toy factory example above.

Sometimes the leakage comes from how the data we need to train the model on is generated and delivered.

There is no way around this scenario. We just have to be vigilant, ask questions and take nothing for granted, especially when we are not in control of the data generation process just like in the case of the toy factory.

In general, the advice is to always check the data yourself when it has not been processed or delivered by a data science team.

Conclusions

Thank you for reading this far.

You learned what data leakage is and why it’s so difficult to manage, even for experienced data scientists.

In this case the devil is in the details which, especially for new programmers, are easily overlooked when coding.

The examples I’ve shown you should help you assess whether your project is experiencing leakage.

As a rule of thumb, if you notice suspiciously high performance in development, check for leakage.

We never want our model to fail in production!

Until next time,
Andrea D’Agostino