How do you split a dataset into train and test in Python?

In this tutorial, we discuss train, dev, and test splits of machine learning datasets. Large publicly available datasets often ship with such a split. The usual assumption is that you develop your system on the train and dev data and then evaluate it on the test data.

Many datasets that you study will have this kind of split. You work on the train and dev sets and never look at the test set, even if you have it. Just before publication, once all the systems are finished, you run a single test evaluation. For the same reason, error analysis is done on the dev set rather than the test set.
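
When a dataset does not come pre-split, one way to produce a train/dev/test division is to call scikit-learn's train_test_split twice. The sketch below is only an illustration: the toy arrays and the 60/20/20 proportions are assumptions, not something the datasets discussed here prescribe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 examples, 2 features, binary labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Step 1: set aside 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: split the remainder into train and dev.
# 0.25 of the remaining 80% is 20% of the full dataset.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_dev), len(X_test))
```

After this, all development happens on X_train and X_dev, and X_test is touched only for the final evaluation.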

No Fixed Rules for Split

If you work in health care, or in any field where every example is hard-won, you might be reluctant to hold out a fixed test set, because doing so means giving up a sizable portion of data that could otherwise be used and studied in an interesting way.

Random splits

In this case, shuffle your data, hold out 10% of it for testing and use the rest for training, and run an experiment. Then make another random split, run another experiment, and so forth. You can do that as many times as you want, and you might want to do it a lot to get some insight into how much variance there is in your system’s performance.

When you do this, you probably want to make sure that the distribution of labels is the same across the splits. sklearn makes this very easy: pass the labels to the stratify parameter of train_test_split.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=42)
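
The repeated-split procedure described above can be sketched as a loop over random seeds. This is a minimal illustration, not a recommended setup: the synthetic data from make_classification, the LogisticRegression model, and the choice of 20 repetitions are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = []
for seed in range(20):
    # A fresh random 90/10 split each time; stratify=y keeps the
    # label distribution the same in the train and test portions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

scores = np.array(scores)
print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The standard deviation across runs gives a rough sense of how much the reported performance depends on the particular split.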

Advantages of Random Splits

The advantage is that you can create as many splits as you want without affecting the ratio of train to test examples, so you can run as many of these experiments as you like. Even for a modestly sized dataset, the number of possible divisions is astronomically large.

The Disadvantage of Random Splits

There’s actually no guarantee that every example will be used the same number of times for training and for testing. So, depending on the splits you happen to draw, you might end up with somewhat distorted evaluations when you combine and average them. Still, the approach has really nice flexibility. In situations where your model is fast to train and test, being able to do lots of splits this way without affecting the ratio of train to test examples is very powerful.
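
To see this unevenness concretely, the sketch below counts how often each example lands in the test portion over several random splits. It uses scikit-learn's ShuffleSplit, which generates independent random splits; the dataset size, number of splits, and test fraction are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 50
X = np.zeros((n_samples, 1))  # placeholder features; only the indices matter here

# Ten independent random 80/20 splits.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

test_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in splitter.split(X):
    test_counts[test_idx] += 1  # each index appears at most once per split

# Entry k of the bincount is the number of examples tested exactly k times.
print("examples tested 0, 1, 2, ... times:", np.bincount(test_counts))
```

With independent random splits the counts will typically be uneven: some examples may never be tested while others are tested several times.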

Train-Test split for TensorFlow Keras

scikit-learn is, as usual, wonderful for helping you do this kind of thing. The arrays it returns can be passed directly to a Keras model.

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)

There is also a boolean parameter called shuffle, which is True by default, so if you don’t want your data to be shuffled you can set it to False. Note that stratified splitting requires shuffling, so stratify cannot be combined with shuffle=False.
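
A small sketch of what shuffle=False does, using toy arrays that are assumptions made purely for illustration: the data is cut in its original order, so the first 60% becomes the training set and the last 40% becomes the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten ordered examples (illustrative only).
X = np.arange(10).reshape(10, 1)
y = np.arange(10)

# With shuffle=False the split preserves the original order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, shuffle=False)

print(y_train)  # the first six labels, in order
print(y_test)   # the last four labels, in order
```

This can be useful when the order of the examples carries meaning, for instance when they are sorted by time.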