Imbalanced data typically refers to classification problems in which the classes are not represented equally. For example, you may have a two-class (binary) classification problem with 10000 instances, of which 2000 are labeled Class-1 and the remaining 8000 are labeled Class-2.

This is an imbalanced dataset, and the ratio of Class-1 to Class-2 instances is 20:80 (that is, 1:4). Class imbalance can occur in both two-class and multi-class classification problems, and most techniques can be used on either.
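To make the multi-class case concrete, here is a minimal sketch that measures class proportions and a simple imbalance ratio. The three-class label array `y_multi` and its 10:30:60 split are made up for illustration, and the "largest class over smallest class" ratio is just one convenient summary, not a standard definition.

```python
import numpy as np

# Hypothetical three-class labels with a 10:30:60 split (illustrative only)
y_multi = np.repeat([0, 1, 2], [100, 300, 600])

# Count how many samples fall into each class
classes, counts = np.unique(y_multi, return_counts=True)
proportions = counts / counts.sum()

# One simple summary: size of the largest class relative to the smallest
imbalance_ratio = counts.max() / counts.min()

for c, n, p in zip(classes, counts, proportions):
    print(f"class {c}: {n} samples ({p:.1%})")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

The same counting works unchanged for a binary problem, since np.unique() handles any number of distinct labels.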

First, we must understand the structure of our data. It has 10000 randomly generated input data points with 10 features each, and two classes split unevenly (20% and 80%) across the data points.

import numpy as np

from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1338)

n_points = 10000
X = rng.randn(n_points, 10)

percentiles_classes = [0.2, 0.8]

y = np.hstack([[ii] * int(n_points * perc) for ii, perc in enumerate(percentiles_classes)])

unique, frequency = np.unique(y, return_counts=True)
print(frequency, unique)  # [2000 8000] [0 1]

There are many ways to split data into training and test sets. Holding out a test set lets you detect overfitting, because the model is evaluated on data it has not seen during training. Using train_test_split() from the scikit-learn library, you can split your dataset into subsets that reduce the potential for bias in your evaluation and validation process. By default the split is purely random, so the class proportions in the test set can drift slightly from those of the full dataset.

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, random_state=42)

unique, frequency = np.unique(y_test, return_counts=True)
print(["{:0.3f}".format(fre / len(y_test)) for fre in frequency])  # ['0.205', '0.795']

In a classification setting, you usually want the train and test sets to have approximately the same percentage of samples of each target class as the complete set.

Some classification problems exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases, it is recommended to use a stratified split, available in train_test_split() through the stratify parameter, to ensure that relative class frequencies are approximately preserved in each train and validation split.

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, random_state=42, stratify=y)

unique, frequency = np.unique(y_test, return_counts=True)
print(["{:0.3f}".format(fre / len(y_test)) for fre in frequency])  # ['0.200', '0.800']

The code above is an example of a stratified split on a dataset with 10000 samples from two unbalanced classes. It prints the fraction of test samples in each class so that it can be compared with the unstratified split.

We can see that train_test_split() with stratify=y preserves the class ratios in both the train and test datasets. It keeps the same percentage for each target class as in the complete set.
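The snippets in this post print only the test-set proportions. As a quick check of the train set as well, the following sketch rebuilds the same data and prints the class proportions for the full, train, and test sets side by side. The helper `class_proportions` is a name introduced here for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Rebuild the same 20:80 dataset used in this post
rng = np.random.RandomState(1338)
n_points = 10000
X = rng.randn(n_points, 10)
y = np.hstack([np.zeros(2000, dtype=int), np.ones(8000, dtype=int)])

# Stratified split: class proportions follow those of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42, stratify=y
)

def class_proportions(labels):
    """Illustrative helper: fraction of samples per class, ordered by label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / len(labels)

# Compare the three sets; all should show roughly 0.2 and 0.8
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(name, np.round(class_proportions(labels), 3))
```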

If you are performing classification, use a stratified train_test_split() so that the train and test datasets keep the same class distribution as the full dataset, imbalance included. Then set the test set aside and do not touch it again until the final evaluation. Stratification ensures that each dataset split has the same proportion of observations with a given label.
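To see why stratification keeps the proportions fixed, it can help to write the idea by hand: shuffle and split each class separately, then combine the pieces. The sketch below uses only NumPy. The function `manual_stratified_split` is a hypothetical helper written for this post and is not how scikit-learn implements stratification internally.

```python
import numpy as np

def manual_stratified_split(y, test_size=0.2, seed=0):
    """Hypothetical helper: return (train_idx, test_idx), sampled per class."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Indices of all samples belonging to this class
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # Send the same fraction of every class to the test set
        n_test = int(round(len(idx) * test_size))
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

# Same 20:80 label distribution as in this post
y = np.array([0] * 2000 + [1] * 8000)
train_idx, test_idx = manual_stratified_split(y)

# Test set: 400 samples of class 0 and 1600 of class 1
print(np.bincount(y[test_idx]))
# Train set: 1600 samples of class 0 and 6400 of class 1
print(np.bincount(y[train_idx]))
```

Because every class contributes the same fraction to the test set, the 20:80 ratio carries over to both splits. In practice, passing stratify=y to train_test_split() is the simpler choice.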
