You are unlikely to implement cross-entropy loss from scratch in practice, but you should understand how the cross-entropy loss function works because it is used in nearly every classification model.

In this tutorial, we discuss cross-entropy loss for dependent variables with more than two categories. The cross-entropy loss function is similar to the BCELoss we discussed in the previous post, but it has two benefits:

  • It works even when our dependent variable has more than two categories. 
  • It results in faster and more reliable training. 

Cross-entropy tells us the expected number of bits per trial when we have optimized our encoding for the (incorrect) distribution q(x). Subtracting the entropy, which tells us the expected number of bits per trial given an encoding optimized for the true distribution p(x), leaves the extra cost we pay for encoding with the wrong distribution.

Cross Entropy Loss Formula

Cross-entropy provides a way of measuring how different two probability distributions are.

H(p, q) = Σ p(x) log(1/q(x)) = −Σ p(x) log q(x), where the sum runs over all outcomes x.

The term log(1/q(x)) can be interpreted as the optimal binary string length assigned to each outcome, assuming outcomes appear according to the probability distribution q(x). Note that the sum is an expectation of this length with respect to p(x).

We can therefore read the cross-entropy as the expected string length per trial given that we optimized the encoding scheme for the distribution q(x) while, in reality, the outcomes appear according to the distribution p(x).

This situation arises when we have only limited a priori information, so we assume some distribution q(x) and optimize our encoding scheme for it; as we carry out trials, we gather information that gets us closer to the true distribution p(x).
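As a small numerical sketch (the distributions p and q here are made up purely for illustration), we can compute the entropy of p and the cross-entropy between p and q and see that the cross-entropy is never smaller than the entropy:

import torch

p = torch.tensor([0.7, 0.1, 0.1, 0.1])      # true distribution over four outcomes
q = torch.tensor([0.25, 0.25, 0.25, 0.25])  # assumed (incorrect) distribution

entropy = -(p * p.log()).sum()        # expected code length with the optimal code for p
cross_entropy = -(p * q.log()).sum()  # expected code length with a code optimized for q

print(entropy.item(), cross_entropy.item())  # ~0.94 vs ~1.39 (in nats); cross-entropy >= entropy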

Cross Entropy Loss Example

One common example where cross-entropy is used is multiclass classification. The neural network’s objective is to learn a distribution over the target classes that, for any given example xi, matches the true distribution p(y|x = xi), which places all of its probability mass on the true label yi and zero probability on all other classes.

Minimizing the sum of cross-entropies between the learned distribution and the true distribution over all examples is the same as minimizing the negative log-likelihood of the data.
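To see this equivalence concretely, here is a small sketch (with random logits, so the exact value is arbitrary) showing that PyTorch’s nn.CrossEntropyLoss is the same as taking log_softmax of the raw outputs and then applying nn.NLLLoss, i.e. the negative log-likelihood of the true classes:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(5, 10)              # raw model outputs: 5 examples, 10 classes
targets = torch.tensor([0, 3, 2, 8, 2])  # true class index for each example

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, nll))  # True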

Create PyTorch Module

The PyTorch nn module provides all of the baseline functionality necessary for defining, training, and testing a model. 

import torch
import torch.nn as nn

class MNISTClassifier(nn.Module):
    def __init__(self):
        super(MNISTClassifier, self).__init__()
        self.layer1 = nn.Linear(28*28*1, 256, bias=True)  # flattened 28x28 image -> 256 hidden units
        self.layer2 = nn.Linear(256, 10, bias=True)       # 256 hidden units -> 10 class scores
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        out = self.layer2(x)  # raw logits; CrossEntropyLoss applies the softmax internally
        return out

The forward method defines how the layers initialized in your model’s constructor interact with the input to generate the model’s output.

x = torch.randn((5, 28*28*1))  # a batch of 5 random, already-flattened "images"
classifier = MNISTClassifier()
out = classifier(x)

Note that calling the classifier model as a function in the final line implicitly invokes its forward method.
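For reference, the output contains one unnormalized score (logit) per class for each of the five inputs:

print(out.shape)  # torch.Size([5, 10]): 5 examples, 10 class scores each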

Create loss function

To train the model, we need a loss metric to evaluate it. During training, once we have calculated this loss, we call backward() on the computed value.

This will store the gradient in each parameter’s grad attribute. Since we have defined a classifier model, we can use the cross-entropy loss metric from PyTorch nn:

loss = nn.CrossEntropyLoss()
target = torch.tensor([0, 3, 2, 8, 2])  # true class index (0-9) for each of the 5 examples
computed_loss = loss(out, target)
computed_loss.backward()
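After backward() runs, each parameter that contributed to the loss holds its gradient in its grad attribute. For example, as a quick check (not part of the training loop itself):

print(classifier.layer2.weight.grad.shape)  # torch.Size([10, 256]), same shape as the weight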

During gradient descent, we need to adjust the parameters based on their gradients. We could do this manually, but PyTorch has abstracted away this functionality into the torch.optim module.

This module provides implementations of optimizers, which may be more sophisticated than classic gradient descent, and handles updating the parameters of the model. You can define the optimizer as follows:

from torch import optim

lr = 1e-3
optimizer = optim.SGD(classifier.parameters(), lr=lr)

optimizer.step() # Updates parameters
optimizer.zero_grad() # Zeroes out gradients

This code works for a single minibatch. Training over the entire dataset would require shuffling the data at each epoch and splitting it into minibatches to iterate through, which is exactly what PyTorch’s DataLoader handles for us.
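Here is a sketch of what a full training loop might look like using torch.utils.data.DataLoader, assuming a train_dataset of (image, label) pairs such as torchvision’s MNIST; the dataset, batch size, and number of epochs are placeholders:

from torch.utils.data import DataLoader

# train_dataset is assumed to yield (image, label) pairs, e.g. torchvision's MNIST
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

for epoch in range(5):
    for images, labels in train_loader:
        images = images.view(images.size(0), -1)  # flatten each image to 28*28 values
        out = classifier(images)
        computed_loss = loss(out, labels)

        optimizer.zero_grad()     # clear gradients from the previous step
        computed_loss.backward()
        optimizer.step()          # update parameters

The shuffle=True flag takes care of the per-epoch shuffling, and the DataLoader yields one minibatch at a time.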
