Multi-layer neural networks end with real-valued output scores that are not conveniently scaled, which can make them difficult to work with. This is where softmax is useful: it converts the scores into a normalized probability distribution.

Many activation functions are not suitable for this step because their outputs cannot be interpreted as probabilities (i.e., they do not sum to 1). Softmax is often paired with cross-entropy for multiclass classification because it guarantees a well-behaved probability distribution.

In this post, we talk about the softmax function and the cross-entropy loss. These are among the most common functions used in neural networks, so you should know how they work. We also cover the math behind them and how to use them in Python and PyTorch.

Cross-entropy loss is used to optimize classification models. Understanding cross-entropy depends on understanding the softmax activation function, so let’s first look at softmax.

Softmax Activation function

The softmax activation function transforms a vector of K real values into K values between 0 and 1 that sum to 1, so that they can be interpreted as probabilities. The input values can be positive, negative, zero, or greater than one.

Softmax input output

The softmax function turns the logits [0.1, 0.9, 3.0] into the probabilities [0.05, 0.10, 0.85]. It does this by taking the exponential of each logit and then dividing each result by the sum of those exponentials, so the entire output vector adds up to one. The purpose of cross-entropy is then to take the output probabilities (P) and measure their distance from the true values.
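Written as a formula for a vector z of K logits, the exponentiate-then-normalize step described above is the standard softmax definition (the symbol z is introduced here only for illustration):

```latex
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```

Every numerator is positive and the denominator is the sum of all numerators, which is why each output lies between 0 and 1 and the outputs add up to 1.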

Here’s the Python code for the Softmax function.

import numpy as np
def softmax(x):
  # exponentiate each logit, then divide by the sum of the exponentials
  return np.exp(x)/np.sum(np.exp(x),axis=0)

We use numpy.exp(x) to raise Euler’s number e to the power x. We then compute the sum of all the exponentiated logits and divide each one by that sum to normalize it.
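One practical caveat is that np.exp overflows for large logits. A common remedy, sketched here, is to subtract the maximum logit before exponentiating; the result is unchanged because the shift cancels in the ratio. The name stable_softmax is a hypothetical helper introduced for this illustration, not part of the code above.

```python
import numpy as np

def stable_softmax(x):
    # Shift by the maximum so the largest exponent is exp(0) = 1 (avoids overflow).
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Made-up large logits: a naive np.exp(1000.0) would overflow to inf.
logits = np.array([1000.0, 1001.0, 1002.0])
probs = stable_softmax(logits)
print(probs, probs.sum())
```

For small logits this should give the same probabilities as the plain version, so it can be used as a drop-in replacement.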

x=np.array([0.1, 0.9, 3.0])

output=softmax(x)

print('Softmax in Python :',output)

#Softmax in Python : [0.04672966 0.10399876 0.84927158]

If one of the inputs is small or negative, the softmax turns it into a small probability, and if the input is large, then it turns it into a large probability, but it will always remain between 0 and 1.

The PyTorch softmax function rescales an n-dimensional input Tensor so that the elements of the n-dimensional output Tensor lie in the range [0,1] and sum to 1.

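Here’s the PyTorch code for the Softmax function.

import torch
x=torch.tensor([0.1, 0.9, 3.0],dtype=torch.float64) # float64 to match the NumPy example
print(torch.nn.functional.softmax(x,dim=0))

#tensor([0.0467, 0.1040, 0.8493], dtype=torch.float64)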

The function torch.nn.functional.softmax takes two main parameters: input and dim. The softmax operation is applied to all slices of input along the specified dim, rescaling them so that the elements lie in the range (0, 1) and sum to 1. In other words, dim specifies the axis along which the softmax is computed.
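The effect of dim is easiest to see on a 2-D tensor. The sketch below uses a made-up batch of two samples with three class scores each; the variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: 2 samples (rows) with 3 class scores (columns) each.
logits = torch.tensor([[0.1, 0.9, 3.0],
                       [2.0, 1.0, 0.1]])

per_row = F.softmax(logits, dim=1)  # normalize across classes: each row sums to 1
per_col = F.softmax(logits, dim=0)  # normalize across samples: each column sums to 1

print(per_row.sum(dim=1))
print(per_col.sum(dim=0))
```

For classification with one row per sample, dim=1 is usually the one you want, since the probabilities should sum to 1 over the classes of each sample.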


The softmax function is often combined with the cross-entropy loss. Cross-entropy measures the difference between two probability distributions: the true distribution and the predicted one. It can therefore be used as a loss function when optimizing classification models.

The cross entropy formula takes in two distributions, the true distribution p(y) and the estimated distribution q(y) defined over the discrete variable y.
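In the notation of the paragraph above, with p(y) the true distribution and q(y) the estimated one, the standard definition of cross-entropy is:

```latex
H(p, q) = -\sum_{y} p(y) \, \log q(y)
```

When p is one-hot, with all of its mass on the correct class c, every term but one vanishes and the loss reduces to -log q(c), the negative log of the probability assigned to the correct class.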

Cross-entropy Loss Function

Cross-entropy can be used in multi-class problems. The loss increases as the predicted probability diverges from the actual label, so the better our prediction, the lower our loss. Here are some examples.

Cross-Entropy Loss input
Y must be one-hot encoded.

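Now let’s look at how to do this in Python with NumPy. We sum the actual labels times the log of the predicted probabilities, put a minus sign in front, and normalize by the number of samples. Note that y_pre.shape[0] is the number of samples only when y_pre has one row per sample; for the single 1-D example below it equals the number of classes (3), so the printed losses are one third of the unnormalized cross-entropy.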

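#Cross Entropy Loss

def cross_entropy(y,y_pre):
  # sum of the actual labels times the log of the predicted probabilities, negated
  loss=-np.sum(y*np.log(y_pre))
  return loss/float(y_pre.shape[0])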

Then we create our Y, which, as mentioned, must be one-hot encoded, followed by our two predictions, which are now probabilities. The first is a good prediction because class 2 has the highest probability. The second is a bad prediction because class 2 gets a very low probability and a different class gets a high one. We then compute the cross-entropy for each.

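y=np.array([0,0,1]) #class #2

# only the class-2 probability affects the loss; the other entries are illustrative
l1=cross_entropy(y,np.array([0.1,0.1,0.8])) #good prediction
l2=cross_entropy(y,np.array([0.8,0.1,0.1])) #bad prediction

print('Loss 1:',l1)
print('Loss 2:',l2)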

Loss 1: 0.07438118377140324
Loss 2: 0.7675283643313485
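When the predictions are stored with one row per sample, averaging over samples means dividing by the number of rows. The following is a minimal sketch under that assumption; batch_cross_entropy and the clipping constant eps are illustrative choices, not part of the code above.

```python
import numpy as np

def batch_cross_entropy(y, y_pre, eps=1e-15):
    # y: one-hot labels, shape (n_samples, n_classes)
    # y_pre: predicted probabilities, same shape
    y_pre = np.clip(y_pre, eps, 1.0)  # avoid log(0)
    per_sample = -np.sum(y * np.log(y_pre), axis=1)
    return per_sample.mean()  # divide by the number of samples

# Made-up batch: both samples belong to class 2.
y = np.array([[0, 0, 1],
              [0, 0, 1]])
y_pre = np.array([[0.1, 0.1, 0.8],   # good prediction for class 2
                  [0.8, 0.1, 0.1]])  # bad prediction for class 2
loss = batch_cross_entropy(y, y_pre)
print('Mean loss:', loss)
```

The clipping step is there only to guard against a predicted probability of exactly 0, which would make the logarithm infinite.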

Here we see that the first prediction has a low loss and the second prediction has a high loss. Now let’s see how to do this in PyTorch. First, we create the loss.

import torch.nn as nn
loss=nn.CrossEntropyLoss()

Here we have to be careful because nn.CrossEntropyLoss already applies LogSoftmax followed by the negative log-likelihood loss (nn.LogSoftmax + nn.NLLLoss), so we must not add a softmax layer ourselves. The second thing is that our Y must not be one-hot encoded; we only pass the correct class label. The y_pred holds raw logits, so no softmax here.
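The equivalence with nn.LogSoftmax followed by nn.NLLLoss can be checked directly. This is a small sketch with made-up logits; the names logits, target, ce, and nll are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical raw logits for one sample and three classes (no softmax applied).
logits = torch.tensor([[0.5, 1.0, 2.5]])
target = torch.tensor([2])  # class index, not one-hot

# One step: cross-entropy directly on the logits.
ce = nn.CrossEntropyLoss()(logits, target)

# Two explicit steps: log-softmax, then negative log-likelihood.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, target)

print(ce.item(), nll.item())
```

Both numbers should agree up to floating-point error, which is why applying your own softmax before nn.CrossEntropyLoss would normalize twice and give a wrong loss.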




print(l1.item()) #0.3850
print(l2.item()) #2.4398

Here we see that our good prediction has the lower cross-entropy loss, so this works. To get the actual predicted class, we take the index of the largest logit for each sample.


print(predict1) #tensor([2])
print(predict2) #tensor([0])
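One way to obtain such predictions is torch.max along the class dimension, which returns both the maximum values and their indices. The logits below are made up for illustration and are not the values that produced the losses above; the names y_pred_good and y_pred_bad are likewise hypothetical.

```python
import torch

# Hypothetical logits: the first favors class 2, the second favors class 0.
y_pred_good = torch.tensor([[0.1, 1.0, 2.1]])
y_pred_bad = torch.tensor([[2.1, 1.0, 0.1]])

# torch.max along dim=1 returns (values, indices); the indices are the predicted classes.
_, predict1 = torch.max(y_pred_good, 1)
_, predict2 = torch.max(y_pred_bad, 1)

print(predict1)  # tensor([2])
print(predict2)  # tensor([0])
```

No softmax is needed for this step, because softmax preserves the ordering of the logits.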

So this is how we get the predictions. What’s also very good is that the loss in PyTorch accepts multiple samples, so let’s increase the number of samples.
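A minimal sketch with three samples and three classes might look as follows. All values are made up for illustration: the target tensor holds one class index per sample, and each prediction tensor holds one row of raw logits per sample.

```python
import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()

# Three samples, three classes; values are made up for illustration.
y = torch.tensor([2, 0, 1])  # one class index per sample
y_pred_good = torch.tensor([[0.1, 1.0, 2.1],
                            [2.0, 1.0, 0.1],
                            [0.1, 3.0, 0.1]])
y_pred_bad = torch.tensor([[2.1, 1.0, 0.1],
                           [0.1, 1.0, 2.1],
                           [0.1, 3.0, 0.1]])

l1 = loss(y_pred_good, y)  # averaged over the three samples by default
l2 = loss(y_pred_bad, y)
print(l1.item(), l2.item())

# Predicted class for each sample in the batch.
_, predictions = torch.max(y_pred_good, 1)
print(predictions)
```

With the default reduction, nn.CrossEntropyLoss returns the mean loss over the batch, so the result stays comparable as the number of samples grows.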







So this is how we can use the softmax and cross-entropy loss in PyTorch and Python.
