Cross-Validation Gone Wrong

Every time you want to use machine learning to solve a challenge you face a problem: choosing the input features, the classifier type, ... and then estimating your new model's performance. At the same time it is one of the most important steps, as performance can differ considerably between models, so optimising your choices matters.

A universally used technique to help with this task, called model selection, is cross-validation. Using cross-validation is an excellent idea, but as with everything there are some traps waiting for you.
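
If cross-validation is new to you: the idea is to split your data into several folds, train on all but one fold, evaluate on the held-out fold, and rotate through the folds so every example gets used for testing exactly once. Here is a rough sketch of what happens under the hood, in plain numpy (the classifier clf and the arrays X and y are placeholders here, not code from this post):

import numpy as np

def kfold_scores(clf, X, y, n_folds=5):
    # Shuffle the example indices and cut them into n_folds pieces
    indices = np.random.permutation(len(y))
    folds = np.array_split(indices, n_folds)
    scores = []
    for k in range(n_folds):
        # Hold out fold k for testing, train on all the other folds
        test_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return np.array(scores)

scikit-learn's cross_val_score, used throughout this post, does all of this (and more) for you.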

This post is about some of these traps and how to avoid them.

As usual, first a few imports:

In [1]:
%matplotlib inline
In [2]:
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import SelectKBest, f_regression


np.random.seed(6450345)

Feature Selection

Often selecting which features/variables to include in your model is part of the model selection step. Let's try it.

One strategy to select a subset of features is to evaluate how correlated each one is with the class labels and then choose the $N$ features with the largest correlation.
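
To make the strategy concrete, here is a hand-rolled sketch of "keep the $N$ features most correlated with the labels" (purely illustrative, the helper name is made up; the SelectKBest transformer used below does this job for us):

def top_n_by_correlation(X, y, n=20):
    # Correlation of every feature column with the class labels
    corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
    # Keep the n features with the strongest absolute correlation
    best = np.argsort(np.abs(corr))[-n:]
    return X[:, best], best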

Let's create a dataset where the class labels are chosen at random:

In [3]:
def make_data(N=1000, n_vars=10,
              n_classes=2):
    # Features are pure Gaussian noise and the labels are drawn at
    # random, so there is no relationship between X and y.
    X = np.random.normal(size=(N, n_vars))
    y = np.random.choice(n_classes, N)

    return X, y

What does the correlation matrix between the different features and the class labels look like:

In [4]:
X, y = make_data(n_vars=5)

sns.corrplot(np.c_[X, y],
             names=["var%i"%n for n in range(X.shape[1])]+['y'])
Out[4]:
[correlation matrix plot of the five features and the class labels]

As you can see, the correlations are not huge, but some variables are more strongly correlated with the labels than others.

Let's select the 20 features with the largest correlation and then use cross-validation to evaluate the performance of our choice.

In [5]:
X, y = make_data(N=2000, n_vars=50000)

select20 = SelectKBest(f_regression, k=20)
X_selected = select20.fit_transform(X, y)

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X_selected, y, cv=5, n_jobs=8)

print "Scores on each subset:"
print scores
avg = (100*np.mean(scores), 100*np.std(scores)/np.sqrt(scores.shape[0]))
print "Average score and uncertainty: (%.2f +- %.3f)%%"%avg
Scores on each subset:
[ 0.6275  0.605   0.5625  0.6125  0.635 ]
Average score and uncertainty: (60.85 +- 1.133)%

Huzzah!

Machine learning is magic! Using features which have no bearing whatsoever on the class labels we managed to predict the correct class in about 60% of cases, 10 percentage points better than you would expect from random guessing.

Can you spot where we cheated?

Doing it the Right Way

There is no way to beat logic. None of the features contains any information about the class labels; after all, they were chosen at random. So how did we manage to outperform guessing? We cheated when we selected the features used to train our classifier: we used the full dataset to find the 20 features with the highest correlation with the labels and kept those.

If you want cross-validation to give you an unbiased estimate of your true performance, any and all choices about features, hyper-parameters and the like have to be made within the cross-validation loop.
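
To see what that means in practice, here is what a hand-written version of the honest procedure could look like: the feature selection is fitted on the training fold only and merely applied to the held-out fold (the helper name is made up, everything else uses the same objects as above):

from sklearn.cross_validation import KFold

def honest_scores(X, y, n_folds=5):
    scores = []
    for train_idx, test_idx in KFold(len(y), n_folds=n_folds):
        # Fit the feature selection on the training fold only ...
        select = SelectKBest(f_regression, k=20).fit(X[train_idx], y[train_idx])
        clf = GradientBoostingClassifier()
        clf.fit(select.transform(X[train_idx]), y[train_idx])
        # ... and only apply it to the held-out fold
        scores.append(clf.score(select.transform(X[test_idx]), y[test_idx]))
    return np.array(scores)

Luckily scikit-learn's Pipeline takes care of exactly this bookkeeping for us.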

Here is how to do it right:

In [6]:
from sklearn.pipeline import make_pipeline

clf = make_pipeline(SelectKBest(f_regression, k=20),
                    GradientBoostingClassifier())

scores = cross_val_score(clf, X, y, cv=5)

print "Scores on each subset:"
print scores
avg = (100*np.mean(scores), 100*np.std(scores)/np.sqrt(scores.shape[0]))
print "Average score and uncertainty: (%.2f +- %.3f)%%"%avg
Scores on each subset:
[ 0.485   0.5     0.5375  0.5175  0.5075]
Average score and uncertainty: (50.95 +- 0.785)%

Once you include your choice of the 20 best variables in the cross-validation pipeline, everything is back to normal.

No one, not even super-smart ensembles of decision trees, can perform miracles. If your labels are unrelated to the features, you might as well just guess.

Take Home Message

Choosing your input features is just one of the many choices you have to make when building your machine-learning application. Remember to make all of these decisions inside the cross-validation loop, otherwise you are in for a rude awakening when your model is confronted with unseen data for the first time.
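
The same goes for hyper-parameter tuning: tune inside the loop, for example by wrapping a grid search around the pipeline clf from above and cross-validating the whole thing. A sketch, with a made-up parameter grid (be warned that running it on the 50000-feature toy data would take a while):

from sklearn.grid_search import GridSearchCV

# Tune the number of selected features inside each training fold;
# the outer cross_val_score then gives an unbiased performance estimate.
param_grid = {'selectkbest__k': [10, 20, 50]}
tuned_clf = GridSearchCV(clf, param_grid, cv=3)
scores = cross_val_score(tuned_clf, X, y, cv=5)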

If you find a mistake or want to tell me something else, get in touch on Twitter: @betatim

This post started life as an IPython notebook, download it or view it online.
