An imbalanced dataset is one in which the classes are not represented equally, which is quite common in practice: fraud detection and the prediction of rare adverse drug reactions are typical examples. There are two common methods of dealing with imbalanced datasets: oversampling and class weights.
Oversampling: alter the dataset itself to remove the imbalance, for example by increasing the number of minority observations until the dataset is balanced.
Class weights: give each class a weight that places more emphasis on the minority classes, so that the resulting classifier learns equally from all classes. The weights are incorporated into the loss function. In this section, I’ll discuss oversampling.
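For comparison, the class-weight idea can be sketched in a few lines. This is only a minimal illustration, assuming inverse-frequency weights and PyTorch's nn.CrossEntropyLoss, whose weight argument takes one value per class; the class counts (800 and 200), the random logits and the labels are made up for the example.

```python
import torch
import torch.nn as nn

# Hypothetical class counts for an 8:2 imbalance (illustrative values).
class_count = torch.tensor([800.0, 200.0])

# Inverse-frequency weights: the rarer class gets the larger weight.
class_weight = 1.0 / class_count
class_weight = class_weight / class_weight.sum()  # optional normalisation

# CrossEntropyLoss accepts one weight per class through `weight`.
criterion = nn.CrossEntropyLoss(weight=class_weight)

# Dummy logits and labels, only to show the call.
torch.manual_seed(0)
logits = torch.randn(8, 2)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
loss = criterion(logits, labels)
print(class_weight, loss.item())
```

The dataset itself is left untouched here; only the contribution of each class to the loss changes.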
We use samplers for oversampling. Even when we do not use samplers explicitly, PyTorch uses them for us internally. When we set shuffle=False, the DataLoader uses a SequentialSampler, which yields indices from zero to the length of the dataset minus one. When we set shuffle=True, it uses a RandomSampler.
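One way to check this is to inspect the sampler attribute of a DataLoader. The snippet is a small sketch on a made-up ten-element dataset; the names seq_loader and rand_loader are just for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset with ten elements (illustrative only).
dataset = TensorDataset(torch.arange(10).float())

# shuffle=False -> the DataLoader builds a SequentialSampler internally.
seq_loader = DataLoader(dataset, batch_size=5, shuffle=False)
# shuffle=True -> the DataLoader builds a RandomSampler internally.
rand_loader = DataLoader(dataset, batch_size=5, shuffle=True)

print(type(seq_loader.sampler).__name__)
print(type(rand_loader.sampler).__name__)
```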
To see what the SequentialSampler does, we can simply create one and collect all the indices it returns: a sequence running from zero up to the dataset size minus one.
The RandomSampler gives the indices between 0 and the length of the dataset minus one in random order. By default it does not repeat an index within one pass. These are the two samplers that we end up using by default.
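Both samplers can be called directly to see the indices they produce. The following is a minimal sketch on a made-up ten-element dataset; the random order will differ from run to run.

```python
import torch
from torch.utils.data import RandomSampler, SequentialSampler, TensorDataset

# Toy dataset with ten elements (illustrative only).
dataset = TensorDataset(torch.arange(10).float())

# SequentialSampler yields 0, 1, ..., len(dataset) - 1 in order.
seq_indices = list(SequentialSampler(dataset))

# RandomSampler (default replacement=False) yields a random permutation
# of the same indices, so no index repeats within one pass.
rand_indices = list(RandomSampler(dataset))

print(seq_indices)
print(rand_indices)
```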
If you have a class imbalance, use a WeightedRandomSampler so that every class is drawn with equal probability. In effect, each class gets an equal share of the total weight.
I created a dummy dataset with a target imbalance ratio of 8:2.
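import numpy as np
import torch
from collections import Counter
from sklearn.model_selection import train_test_split
from torch.utils.data import WeightedRandomSampler

numSample = 1000
batch_size = 100
# random features (torch.FloatTensor(n, 10) would be uninitialized memory)
sample = torch.randn(numSample, 10)
zero = np.zeros(int(numSample * 0.8), dtype=np.int32)
one = np.ones(int(numSample * 0.2), dtype=np.int32)
target = np.hstack((zero, one))
dataset = sample.numpy()
# split dataset into train and test set
x_train, x_test, y_train, y_test = train_test_split(
    dataset, target, test_size=0.25, random_state=42, stratify=target, shuffle=True)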
Now that we have a dataset, we’re going to use the WeightedRandomSampler. The first step is to compute a weight for each class.
count = Counter(y_train)
# samples per class (0 and 1)
class_count = np.array([count[0], count[1]])
weight = 1. / class_count
print(weight)
So far we have only the class weights; next we create the sample weights. For each sample in the training set, the sample weight is set to the weight of that sample’s class.
samples_weight = np.array([weight[t] for t in y_train])
samples_weight = torch.from_numpy(samples_weight)
The weights must have the same length as the number of samples: WeightedRandomSampler samples elements based on the passed weights, so you should provide one weight value for each sample in your Dataset.
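A quick sanity check shows why one inverse-frequency weight per sample balances the classes. The sketch below assumes a 600:150 training split (what a stratified 75% split of an 800:200 dataset gives) and normalises the weights by hand to mimic the sampling probabilities; it does not call the sampler itself.

```python
from collections import Counter

import numpy as np

# Hypothetical training labels with a 600:150 split.
y_train = np.array([0] * 600 + [1] * 150)

count = Counter(y_train.tolist())
class_count = np.array([count[0], count[1]])
weight = 1.0 / class_count

# One weight per sample: each sample inherits the weight of its class.
samples_weight = weight[y_train]

# Probability mass per class after normalising the sample weights.
prob = samples_weight / samples_weight.sum()
class_prob = np.array([prob[y_train == 0].sum(), prob[y_train == 1].sum()])
print(len(samples_weight), class_prob)
```

Each class ends up with half of the probability mass, even though class 0 has four times as many samples.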
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))
Now that we have the sample weights, we create the sampler. This is our WeightedRandomSampler, to which we pass the sample weights and num_samples, here equal to the length of our dataset. We can also set replacement to True or False. If we set it to False, each example is drawn at most once as we iterate through the dataset.
When we’re dealing with an imbalanced dataset and using oversampling, we want replacement=True, because the minority samples have to be drawn more than once.
By default, WeightedRandomSampler uses replacement=True, in which case the samples in a batch are not necessarily unique.
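The effect of the replacement flag can be seen on a toy example. This is a sketch with ten made-up weights (eight majority samples, two minority samples), not the dataset used in this section; the exact share printed depends on the seed.

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Toy weights: 8 majority samples and 2 minority samples, weighted so
# both classes carry the same total probability.
weights = torch.tensor([1 / 8] * 8 + [1 / 2] * 2, dtype=torch.double)

# replacement=True: an index may be drawn several times, which is what
# lets the minority samples (indices 8 and 9) be oversampled.
with_repl = list(WeightedRandomSampler(weights, num_samples=1000, replacement=True))

# replacement=False: each index is drawn at most once, so a full pass
# is only a reordering of the dataset and the imbalance remains.
without_repl = list(WeightedRandomSampler(weights, num_samples=10, replacement=False))

minority_share = sum(i >= 8 for i in with_repl) / len(with_repl)
print(minority_share)
print(sorted(without_repl))
```

With replacement, roughly half of the draws are minority indices; without replacement, a full pass simply returns every index once.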
trainDataset = torch.utils.data.TensorDataset(
    torch.FloatTensor(x_train), torch.LongTensor(y_train.astype(int)))
validDataset = torch.utils.data.TensorDataset(
    torch.FloatTensor(x_test), torch.LongTensor(y_test.astype(int)))
trainLoader = torch.utils.data.DataLoader(
    dataset=trainDataset, batch_size=batch_size, num_workers=1, sampler=sampler)
testLoader = torch.utils.data.DataLoader(
    dataset=validDataset, batch_size=batch_size, shuffle=False, num_workers=1)
Now we have our loaders, each of which is just a DataLoader over its dataset. The difference for the training loader is that we specify a sampler, in this case our WeightedRandomSampler.
A for loop over trainLoader should yield as many samples as there are in the training set, with each batch containing approximately the same number of zeros and ones.
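Such a loop might look like the sketch below. It rebuilds a stand-in training set (600 zeros, 150 ones, random features) so that it runs on its own; the names batch_counts and train_loader are only illustrative, np.bincount is used here as a shortcut for the class counts, and num_workers=0 is an assumption made to keep the example single-process.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical stand-in for the training split: 600 zeros, 150 ones.
y_train = np.array([0] * 600 + [1] * 150)
x_train = torch.randn(len(y_train), 10)

# Inverse-frequency class weights, then one weight per sample.
weight = 1.0 / np.bincount(y_train)
samples_weight = torch.from_numpy(weight[y_train])
sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

train_dataset = TensorDataset(x_train, torch.from_numpy(y_train).long())
# num_workers=0 keeps the sketch single-process.
train_loader = DataLoader(train_dataset, batch_size=100, sampler=sampler, num_workers=0)

# Count the zeros and ones in every batch.
batch_counts = []
for i, (data, target) in enumerate(train_loader):
    zeros = int((target == 0).sum())
    ones = int((target == 1).sum())
    batch_counts.append((zeros, ones))
    print(f"batch {i}: 0s = {zeros}, 1s = {ones}")
```

The counts vary a little from batch to batch because the sampling is random, but they should stay close to an even split.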
It is essential to ensure your datasets share approximately the same ratio of examples from each class so that you can achieve consistent predictive performance scores. Some classification algorithms are also extremely sensitive to the class ratio of the data that they are trained on.