How to choose a cross-entropy loss function in Keras

In deep learning, the error of the model's current state must be estimated repeatedly during training. This requires choosing an error function, or loss function, which is used to estimate the loss of the model so that the weights can be updated to reduce the loss on the next evaluation.

The choice of loss function must be specific to the problem, such as binary, multi-class, or multi-label classification. Further, the configuration of the output layer must also be appropriate for the chosen loss function.

In this tutorial, you will discover three cross-entropy loss functions and how to choose among them for your deep learning model.

Binary cross-entropy

It is intended for use with binary classification, where the target value is 0 or 1. It measures the difference between the actual and predicted probability distributions for class 1. The loss is minimized during training, and a perfect value is 0.

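It calculates the loss of an example by computing the following average:

$$\text{Loss} = -\frac{1}{\text{output size}} \sum_{i=1}^{\text{output size}} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

where $\hat{y}_i$ is the i-th scalar value in the model output, $y_i$ is the corresponding target value, and output size is the number of scalar values in the model output.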
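The average described above can be sketched in plain Python to make the arithmetic concrete. This is a minimal illustration, not the Keras implementation: the function name `binary_crossentropy` and the clipping constant `eps` are assumptions chosen for the example, and clipping is only there so that log(0) never occurs.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy over all scalar outputs of one example.

    `eps` is an illustrative clipping constant (an assumption of this sketch).
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Clip the prediction away from 0 and 1 so the logarithm is defined.
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    # Negate and average over the number of scalar outputs.
    return -total / len(y_true)

# A single sigmoid output with true class 1.
confident_right = binary_crossentropy([1], [0.9])
confident_wrong = binary_crossentropy([1], [0.1])
print(confident_right, confident_wrong)
```

A prediction close to the true label gives a loss near 0, while a confident wrong prediction is penalized much more heavily.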

The output layer needs to be configured with a single node and a “sigmoid” activation in order to predict the probability for class 1. An example of binary cross-entropy loss for a binary classification problem is listed below.

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

Categorical cross-entropy

It is the default loss function to use for multi-class classification problems where each class is assigned a unique integer value from 0 to (num_classes – 1).

It will calculate the average difference between the actual and predicted probability distributions for all classes in the problem. The score is minimized and a perfect cross-entropy value is 0.

The targets need to be one-hot encoded, which makes them directly suitable for use with the categorical cross-entropy loss function.

The output layer is configured with n nodes (one for each class), for example 10 nodes for MNIST, and a “softmax” activation in order to predict the probability for each class.

model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
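To see what the one-hot target and the softmax output contribute to the loss, here is a small standard-library sketch. The helper names `one_hot`, `softmax`, and `categorical_crossentropy` are hypothetical and written for this example only; they mirror the idea rather than the Keras code, and the 4-class setup and logit values are made up.

```python
import math

def one_hot(label, num_classes):
    """Turn an integer class index into a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

target = one_hot(2, 4)                  # class 2 out of 4 classes
probs = softmax([1.0, 2.0, 3.0, 0.5])   # made-up output scores
loss = categorical_crossentropy(target, probs)
print(target, probs, loss)
```

Because the target is zero everywhere except at the true class, only the predicted probability of that class ends up in the sum.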

Differences between binary cross-entropy and categorical cross-entropy

Binary cross-entropy is for binary classification and categorical cross-entropy is for multi-class classification, but both work for binary classification. To use categorical cross-entropy on a binary problem, you need to one-hot encode the targets and use two output nodes with a softmax activation.

Categorical cross-entropy is based on the assumption that only one class is correct out of all possible ones (the target should be [0,0,0,0,1,0] if the sample belongs to the fifth class), while binary cross-entropy works on each individual output separately, which means that each sample can belong to multiple classes (multi-label). For instance, when tagging music with labels like Happy, Hopeful, Laidback, Relaxing, etc., several labels can apply to the same sample, so an output like [0,1,0,1,0,1] is a valid one if you are using binary cross-entropy.
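The multi-label case can be illustrated with independent sigmoid outputs, one per label. This sketch uses only the standard library. The four tag names are taken from the example above, while the logit values, the 0.5 decision threshold, and the helper names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Squash one raw score into an independent probability."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy, computed separately per output."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() defined
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

tags = ["Happy", "Hopeful", "Laidback", "Relaxing"]
target = [0, 1, 0, 1]            # multi-hot: two tags apply at once
logits = [-2.0, 3.0, -1.5, 2.5]  # made-up raw scores, one per tag

probs = [sigmoid(z) for z in logits]
predicted = [tag for tag, p in zip(tags, probs) if p >= 0.5]
loss = binary_crossentropy(target, probs)
print(predicted, sum(probs), loss)
```

Unlike softmax outputs, the sigmoid probabilities are not forced to sum to 1, which is what allows more than one label to be predicted for the same sample.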

Sparse categorical cross-entropy

One-hot encoding becomes cumbersome for classification problems with a large number of labels, such as 1000 classes. The target of each training example may then require a one-hot encoded vector with hundreds or thousands of zero values, which takes significant memory.

Sparse cross-entropy addresses this by performing the same cross-entropy calculation of error, without requiring that the target variable be one-hot encoded prior to training.

Sparse cross-entropy can be used in Keras for multi-class classification as follows:

model.add(Dense(10, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
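With integer targets, the per-sample calculation reduces to looking up the predicted probability of the true class. The sketch below shows that idea in plain Python. The function name `sparse_categorical_crossentropy` is reused here only for readability and is a hypothetical helper, not the Keras function, and the probabilities are made up.

```python
import math

def sparse_categorical_crossentropy(label, y_pred, eps=1e-7):
    """Cross-entropy for one sample whose target is an integer class index."""
    # Only the probability predicted for the true class matters.
    return -math.log(max(y_pred[label], eps))

probs = [0.1, 0.2, 0.6, 0.1]   # made-up softmax output over 4 classes
loss = sparse_categorical_crossentropy(2, probs)
print(loss)
```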

Difference between sparse categorical cross-entropy and categorical cross-entropy

If you use categorical cross-entropy, the targets must be one-hot encoded; if you use sparse categorical cross-entropy, the targets are plain integer class indices.

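Both losses assume that the classes are mutually exclusive (each sample belongs to exactly one class). Use sparse categorical cross-entropy when the targets are integer class indices, and categorical cross-entropy when the targets are one-hot vectors or soft probabilities over the classes. When one sample can have multiple classes or labels, use binary cross-entropy with a sigmoid output instead.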

Using the sparse variant conserves time and memory. Consider the case of 1000 mutually exclusive classes: just 1 log instead of summing up 1000 terms for each sample, and just one integer instead of 1000 floats.

The formula is the same in both cases, so there should be no impact on accuracy; sparse categorical cross-entropy is simply likely to be cheaper in terms of computation.
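The claim that both variants give the same value can be checked with a short standard-library sketch for the 1000-class case. The random probabilities and the chosen label index (417) are arbitrary assumptions made for this illustration.

```python
import math
import random

NUM_CLASSES = 1000
random.seed(0)

# Made-up predicted distribution over 1000 classes (normalized to sum to 1).
raw = [random.random() + 1e-3 for _ in range(NUM_CLASSES)]
total = sum(raw)
probs = [r / total for r in raw]

label = 417                               # integer target: a single value
one_hot_target = [0.0] * NUM_CLASSES      # one-hot target: 1000 values
one_hot_target[label] = 1.0

# Categorical form: a sum over all 1000 classes.
dense_loss = -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))
# Sparse form: a single logarithm at the true class index.
sparse_loss = -math.log(probs[label])

print(dense_loss, sparse_loss)
```

The two losses agree because every term with a zero target vanishes; the sparse form simply skips computing those terms and storing those zeros.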