A loss function computes a single numerical value that the learning process attempts to minimize. The loss is typically calculated from the difference between the desired outputs for some training samples and the outputs the model produces when fed those samples, i.e. the difference between the output predicted by our model and the actual output.

Based on the mathematical expression (the loss function) we choose, we can emphasize certain errors. Conceptually, a loss function is a way of prioritizing which errors to fix among our training samples, so that parameter updates adjust the outputs for the highly weighted samples rather than for other samples that had a smaller loss.

In this tutorial, we first define what multi-class and multi-label classification mean, and then discuss how to select a loss function to train a neural network in Keras and PyTorch.

Multi-Class Classification Loss Function

The goal of a multi-class classification problem is to predict a value that can be one of three or more possible discrete values, such as “red,” “yellow,” or “green” for a traffic signal. The problem is often framed as predicting an integer value, where each class is assigned a unique integer from 0 to (num_classes – 1). In practice, the model is set up to predict the probability of the example belonging to each known class.

For classification problems like this, cross-entropy loss is the obvious choice.

[Figure: Softmax classifier and cross-entropy loss]

Here is how we calculate cross-entropy loss in a simple multi-class classification case where the target labels are mutually exclusive. During the loss computation, we only care about the logit corresponding to the true target label and how large it is compared to the logits of the other labels. Softmax makes all the predicted probabilities sum to 1, so there cannot be several correct answers.
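
To make this concrete, here is a small framework-neutral sketch; the logit values and the 3-class setup are made up for illustration:

import numpy as np

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw model outputs for a 3-class problem
true_class = 0                      # e.g. class 0 = "red"

probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax: probabilities that sum to 1
loss = -np.log(probs[true_class])                # cross-entropy only uses the true class's probability
print(probs, loss)                               # higher probability for the true class -> lower loss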

Keras Multi-Class Classification Loss Function

Sparse cross-entropy is the default loss function to use for multi-class classification problems. It is intended for multi-class classification where the target values are integers in the set {0, 1, 2, …, num_classes − 1}, i.e. each class is assigned a unique integer value.

Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for all classes in the problem. The score is minimized and a perfect cross-entropy value is 0.

Sparse cross-entropy can be used in Keras for multi-class classification by passing 'sparse_categorical_crossentropy' as the loss when calling the compile() function.

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

The function requires that the output layer is configured with n nodes (one for each class), in this case three nodes, and a 'softmax' activation in order to predict the probability for each class.

model.add(Dense(3, activation='softmax'))
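
Putting both pieces together, a minimal sketch might look like the following; the hidden layer size and the input shape of 8 features are illustrative assumptions, not from the original example:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(8,)))  # assumed input of 8 features
model.add(Dense(3, activation='softmax'))                  # one node per class
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# y_train would hold integer labels 0, 1 or 2, not one-hot vectors:
# model.fit(X_train, y_train, epochs=10)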

PyTorch Multi-Class Classification Loss Function

PyTorch has standard loss functions that we can use: for example, nn.BCEWithLogitsLoss() for a binary classification problem, and nn.CrossEntropyLoss() for a multi-class classification problem like MNIST.

criterion = nn.CrossEntropyLoss()

It takes the logit prediction and ground truth as parameters, and returns the loss. Two things to keep in mind for this function:

  1. Loss functions like this usually take the logits as a parameter, rather than the post-softmax probability distribution. This is for numerical stability.
  2. This loss function also takes the ground-truth integer class index as a parameter, rather than a one-hot vector.
loss = criterion(y, torch.tensor([8]))  # y holds the model's logits; the target is the integer class index for digit 8, not a one-hot vector
print(loss)
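
The same criterion works on a whole batch of logits; here is a fully runnable sketch with stand-in values, where the batch size of 4 and the 10 classes are assumptions for an MNIST-like problem:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 10)            # raw logits for a batch of 4 samples, 10 classes
targets = torch.tensor([8, 0, 3, 7])   # one integer class index per sample
loss = criterion(logits, targets)      # per-sample cross-entropy averaged over the batch
print(loss.item())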

Multi-Label Classification Loss Function

In multi-class classification, the true label usually corresponds to a single integer. In multi-label classification, however, inputs can be associated with multiple classes. For example, a movie poster can have multiple genres.

In a multi-label classification problem, any number of classes can be associated with an input. Since the labels are not mutually exclusive, instead of one-hot encoding we use multi-label binarization. Here the label is transformed into a binary vector in which all values are zero except the indexes associated with the classes in that label, which are marked with a 1.
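
As an illustration, scikit-learn's MultiLabelBinarizer performs this transformation; the genre names below are made up:

from sklearn.preprocessing import MultiLabelBinarizer

labels = [('action', 'comedy'), ('drama',), ('action', 'drama', 'thriller')]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)
print(mlb.classes_)  # ['action' 'comedy' 'drama' 'thriller']
print(y)             # each row has a 1 at the index of every class present in that label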

[Figure: Sigmoid classifier]

We treat each prediction independently, for example by using the sigmoid function as a normalizer for each logit value separately.
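
A quick sketch with made-up logit values:

import torch

logits = torch.tensor([2.0, -1.0, 0.5])
probs = torch.sigmoid(logits)  # each logit is squashed to (0, 1) independently
print(probs)                   # the values do not need to sum to 1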

Keras Multi-Label Classification Loss Function

For a multi-label classification problem, we use the sigmoid activation function for the output layer. Sigmoid allows the model to assign a high probability to all of the classes, some of them, or none of them.

model.add(Dense(y_train.shape[1], activation='sigmoid'))  # note the sigmoid activation for the output layer

In multi-label classification, each of the y_train.shape[1] neurons in the output layer acts as a binary one-vs-all classifier for its class. binary_crossentropy is suited to binary classification and is therefore used for multi-label classification.

model.compile(loss='binary_crossentropy', optimizer='adam')
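
Tying the output layer and the loss together, a minimal sketch; the hidden layer size and the input shape of 100 features are assumptions for illustration:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(100,)))  # assumed input of 100 features
model.add(Dense(y_train.shape[1], activation='sigmoid'))     # one sigmoid unit per label
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])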

PyTorch Multi-Label Classification Loss Function

Here we have several correct labels and a predicted probability for each label. We can compare these probabilities with the binarized ground-truth labels using binary cross-entropy loss.

[Figure: Sigmoid classifier and binary cross-entropy loss]

Binary cross-entropy loss is defined in PyTorch simply as:

criterion = nn.BCELoss()
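
A small sketch with assumed shapes, comparing sigmoid outputs against a binarized label vector; as noted earlier, nn.BCEWithLogitsLoss() folds the sigmoid into the loss and is the more numerically stable option:

import torch
import torch.nn as nn

criterion = nn.BCELoss()
probs = torch.sigmoid(torch.randn(1, 4))   # sigmoid outputs for 4 possible labels
target = torch.tensor([[1., 0., 1., 0.]])  # multi-label binarized ground truth
loss = criterion(probs, target)
print(loss.item())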

At inference time we specify a threshold value, for example 0.5; all labels with predicted probabilities higher than the threshold are considered predicted labels, and the rest are skipped.
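
For example, with an assumed threshold of 0.5:

import torch

probs = torch.tensor([0.9, 0.2, 0.7, 0.4])  # predicted probability for each label
predicted = (probs > 0.5).int()             # 1 for every label above the threshold
print(predicted)                            # tensor([1, 0, 1, 0], dtype=torch.int32)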
