In the previous post, we discussed the differences between the ReLU, sigmoid, and tanh activation functions. ReLU is mainly used in the hidden layers of a neural network to add non-linearity to the model. Others, such as sigmoid (for binary classification) and softmax (for multiclass classification), are added at the output layer so that the model outputs class probabilities. If the sigmoid or softmax activation is not included at the output layer, the model computes logits instead of class probabilities.

In classification problems, the appropriate loss function depends on the type of problem (binary or multiclass) and on the type of model output (logits or probabilities), so we should choose the loss function used to train our model accordingly.


Binary cross-entropy is the loss function for binary classification with a single output unit, and categorical cross-entropy is the loss function for multiclass classification. In PyTorch, the categorical cross-entropy loss (CrossEntropyLoss) takes the ground-truth labels as integer class indices, for example y=2 out of the three classes 0, 1, and 2.
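To make the integer-label format concrete, here is a minimal sketch. The logits and labels are made-up values chosen for illustration, and the manual computation is only there to show what the loss averages over:

```python
import torch
import torch.nn as nn

# Hypothetical logits for a batch of 2 examples and 3 classes (made-up values).
logits = torch.tensor([[1.5, 0.3, -0.8],
                       [0.2, 0.1, 2.4]])

# Ground-truth labels are integer class indices (dtype torch.long), not one-hot vectors.
targets = torch.tensor([0, 2])

# CrossEntropyLoss expects raw logits and applies log-softmax internally.
ce_loss = nn.CrossEntropyLoss()
loss = ce_loss(logits, targets)

# The same value by hand: mean negative log-probability of the true class.
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), targets].mean()

print(loss.item(), manual.item())
```

Note that the logits go into the loss directly; no softmax layer is applied beforehand.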


Binary cross-entropy with logits loss (BCEWithLogitsLoss) combines a Sigmoid layer and BCELoss in a single class. It is more numerically stable than a plain Sigmoid followed by BCELoss because, by combining the operations into one layer, it takes advantage of the log-sum-exp trick. You can read more about ‘how log function improves numeric stability’.
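The stability difference can be seen on a single confidently wrong prediction. The sketch below uses a made-up logit of 50 with true label 0, for which the true loss is about 50. The closed-form expression in `manual` is one common stable formulation, shown as an assumption about the idea rather than as PyTorch's actual source code:

```python
import torch
import torch.nn as nn

# A confidently wrong prediction: large positive logit, true label 0.
logit = torch.tensor([50.0])
target = torch.tensor([0.0])

# Naive route: in float32, sigmoid(50) saturates to 1.0, so log(1 - p)
# becomes log(0). BCELoss clamps that log term, so the result is no
# longer the true loss.
naive = nn.BCELoss()(torch.sigmoid(logit), target)

# Fused route: operates on the logit directly and never forms 1 - p.
stable = nn.BCEWithLogitsLoss()(logit, target)

# One stable closed form (illustrative, not taken from PyTorch's source):
# loss = max(x, 0) - x * t + log(1 + exp(-|x|))
manual = (logit.clamp(min=0) - logit * target
          + torch.log1p(torch.exp(-logit.abs()))).mean()

print(naive.item(), stable.item(), manual.item())
```

The fused loss and the manual formula agree with each other, while the naive sigmoid-then-BCELoss route reports a different value because the probability has already lost the information carried by the logit.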

The following code will show you how to use these loss functions with two different formats, where either the logits or class probabilities are given as inputs to the loss functions:

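import torch
import torch.nn as nn

# Random logits for 3 examples and random binary targets (0.0 or 1.0)
input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)

sigmoid = nn.Sigmoid()
bce_loss = nn.BCELoss()
bce_logits_loss = nn.BCEWithLogitsLoss()

# with class probabilities: apply the sigmoid to the logits first
probabilities = sigmoid(input)
output = bce_loss(probabilities, target)

print(output.item())  # e.g. 0.4326 (varies with the random input)

# with logits: the sigmoid is applied inside the loss
output = bce_logits_loss(input, target)

print(output.item())  # same value as above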

Computing the binary loss from the logits rather than from the class probabilities is usually preferred for numerical stability. For binary classification, we can either pass the logits to the loss function BCEWithLogitsLoss() or compute the probabilities from the logits and feed them to the loss function BCELoss().

A model trained with BCEWithLogitsLoss returns logits during inference. Logit values can be harder to interpret, so you need to apply a sigmoid to get the probabilities. If you only need the predictions and not the probabilities, you can apply a threshold to the logits (e.g. out > 0.0) for binary and multi-label classification; for multiclass classification with CrossEntropyLoss, you can apply argmax(output, dim=1).
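As a small sketch of both cases, the snippet below turns logits into predictions without computing probabilities first. The tensors are made-up values standing in for model outputs:

```python
import torch

# Hypothetical binary logits (single output unit) for 4 examples.
binary_logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
probs = torch.sigmoid(binary_logits)

# Thresholding logits at 0.0 is equivalent to thresholding probabilities at 0.5,
# because sigmoid(0) = 0.5 and the sigmoid is monotonic.
pred_from_logits = (binary_logits > 0.0).long()
pred_from_probs = (probs > 0.5).long()

# Hypothetical multiclass logits: 2 examples, 3 classes.
multi_logits = torch.tensor([[1.5, 0.3, -0.8],
                             [0.2, 0.1, 2.4]])

# Softmax preserves the ordering of the logits, so argmax over the logits
# selects the same class as argmax over the probabilities.
pred_class = torch.argmax(multi_logits, dim=1)

print(pred_from_logits, pred_class)
```

Apply the sigmoid or softmax only when the probabilities themselves are needed, for example for calibration or for reporting confidence.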

Binary classification with two outputs

A binary classification model usually returns a single output value. It is implied that P(class = 0|x) = 1 - P(class = 1|x); hence, we do not need a second output unit to obtain the probability of the negative class. However, sometimes you need two outputs for each training example and want to interpret them as the probabilities of each class: P(class = 0|x) versus P(class = 1|x). In that case, use a softmax function instead of the logistic sigmoid to normalize the outputs so that they sum to 1; categorical cross-entropy is then the appropriate loss function.
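The relationship between the two setups can be checked numerically. In this sketch (with made-up logits), the two-output model is constructed by fixing the class-0 logit at 0 and using the single-output logit for class 1; this construction is an illustrative assumption, not how a trained two-output model would necessarily behave:

```python
import torch
import torch.nn as nn

# Hypothetical single-output logits and binary labels for 3 examples.
z = torch.tensor([1.2, -0.7, 3.0])
targets = torch.tensor([1, 0, 1])

# Two-output version: class-0 logit fixed at 0, class-1 logit equal to z.
two_logits = torch.stack([torch.zeros_like(z), z], dim=1)  # shape (3, 2)

# Softmax normalizes each row so the two class probabilities sum to 1.
probs = torch.softmax(two_logits, dim=1)

# Categorical cross-entropy on two outputs (integer class labels).
ce = nn.CrossEntropyLoss()(two_logits, targets)

# Binary cross-entropy on the single output (float labels).
bce = nn.BCEWithLogitsLoss()(z, targets.float())

print(probs.sum(dim=1), ce.item(), bce.item())
```

Under this construction the softmax probability of class 1 equals sigmoid(z), so the two losses coincide. A real two-output model learns both logits freely, but only their difference affects the softmax probabilities.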
