PyTorch Loss Backward Example

We already know that PyTorch computes all the gradients with a single call to loss.backward(), but let’s explore what’s happening behind the scenes.

In this tutorial, we will look more closely at what a derivative is, how to calculate gradients, and how a neural network’s forward and backward passes are actually carried out.

We will build everything from scratch, initially using only basic plain Python (except for indexing into PyTorch tensors), and then replace the plain Python with PyTorch functionality after we’ve seen how to create it.

We’ll write a neural net from the ground up, and then implement back-propagation manually so we know exactly what’s happening in PyTorch when we call loss.backward.

What is a derivative?

The derivative is what gets computed in the one seemingly magic step where we calculate the gradients. The derivative of a function tells you how much a change in its parameters will change its result.

For any function, such as the quadratic function, we can calculate its derivative. The derivative is another function: it calculates the rate of change, rather than the value.

For example, the derivative of the quadratic function at the value 3 tells us how rapidly the function changes at the value 3. More specifically, you may recall that gradient is defined as rise/run; that is, the change in the value of the function, divided by the change in the value of the parameter.
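The rise/run definition can be checked numerically. The following is a minimal sketch in plain Python; the helper name rise_over_run and the step size 1e-6 are illustrative choices, not part of the original tutorial.

```python
def quadratic(x):
    # The quadratic function used as the running example.
    return x ** 2

def rise_over_run(f, x, run=1e-6):
    # Change in the function's value divided by the change in its parameter.
    rise = f(x + run) - f(x)
    return rise / run

slope_at_3 = rise_over_run(quadratic, 3.0)
print(slope_at_3)  # close to 6, the exact derivative 2 * x at x = 3
```

The smaller the run, the closer this ratio gets to the true derivative, until floating-point rounding starts to dominate.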

How to calculate the gradients?

You won’t have to calculate any gradients yourself: PyTorch can automatically compute the derivative of nearly any function built from its differentiable operations. Let’s see an example.

import torch
xt = torch.tensor(5.).requires_grad_()

Notice the special method requires_grad_(). It’s used to tell PyTorch that we want to calculate gradients with respect to that variable at that value.

It essentially tags the variable, so PyTorch will keep track of the calculations performed on it and can compute the gradients of their results with respect to it when you ask.
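To make the tagging concrete, here is a small sketch that compares a tagged tensor with an untagged one. The variable names plain and tracked are illustrative only.

```python
import torch

plain = torch.tensor(5.0)                    # not tagged
tracked = torch.tensor(5.0).requires_grad_() # tagged for gradient tracking

print(plain.requires_grad, tracked.requires_grad)  # False True

# Results computed from the tagged tensor remember how they were produced.
plain_result = plain ** 2
tracked_result = tracked ** 2
print(plain_result.grad_fn)                  # None
print(tracked_result.grad_fn is not None)    # True
```

Only the result derived from the tagged tensor carries a grad_fn, which is what backward later follows.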

In mathematics and physics, the “gradient” of a function is just another function (i.e., its derivative). But in deep learning, “gradient” usually means the value of a function’s derivative at a particular argument value.

Now we calculate our function with that value. Notice how PyTorch prints not just the value calculated, but also a note that it has a gradient function it’ll be using to calculate our gradients when needed:

def fun(x):
  return (x**2)

yt = fun(xt)
print(yt) #tensor(25., grad_fn=<PowBackward0>)

Finally, we tell PyTorch to calculate the gradients for us:

yt.backward()

The “backward” here refers to backpropagation, which is the name given to the process of calculating the derivative of each layer. This is called the backward pass of the network, as opposed to the forward pass, which is where the activations are calculated.

We can now view the gradients by checking the grad attribute of our tensor:

print(xt.grad) #tensor(10.)

The derivative of x**2 is 2*x, and we have x=5, so the gradients should be 2*5=10, which is what PyTorch calculated for us.
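The same check works when the input is a vector rather than a single number. The sketch below assumes a variant of the function, here called fun_sum, that sums the squares so the result is a scalar that backward can be called on; the input values are arbitrary.

```python
import torch

def fun_sum(x):
    # Sum the squares so the output is a single scalar.
    return (x ** 2).sum()

xt = torch.tensor([3.0, 4.0, 10.0], requires_grad=True)
yt = fun_sum(xt)
yt.backward()

# Each element's gradient is 2 * x, i.e. the values 6, 8 and 20.
print(xt.grad)
```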

The gradients tell us only the slope of our function; they don’t tell us exactly how far to adjust the parameters. But they do give us some idea of how far.

If the slope is very large, that may suggest that we have more adjustments to make, whereas if the slope is very small, that may suggest that we are close to the optimal value.
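As a rough illustration of this, the sketch below takes a few gradient steps on x**2, starting from x = 5, and records the slope at each step. The step size lr = 0.1 and the number of steps are assumptions chosen for the example.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)
lr = 0.1  # hypothetical step size for this illustration

slopes = []
for _ in range(3):
    y = x ** 2
    y.backward()
    slopes.append(x.grad.item())
    with torch.no_grad():
        x -= lr * x.grad   # move against the slope
    x.grad.zero_()         # clear the gradient before the next step

print(slopes)  # the slope shrinks as x approaches the minimum at 0
```

The recorded slopes start at 10 and get smaller with every step, matching the intuition that a small slope suggests we are close to the optimal value.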

End-to-End SGD Example

We’ve seen how to calculate gradients. Now it’s time to look at an SGD example. Let’s start with a simple synthetic dataset: points along the line y = -5x with a little Gaussian noise added:

X = torch.arange(-5, 5, 0.1).view(-1, 1)
func = -5 * X
Y = func + 0.4 * torch.randn(X.size())

Forward Pass

Computing all the gradients of a given loss with respect to its parameters is known as the backward pass. Before that comes the forward pass, where we compute the output of the model on a given input. In larger networks this is based on matrix products; here the model is a simple linear function with a weight w and a bias b:

# defining the function for forward pass for prediction
def forward(x):
    return w * x + b

Loss Function

The loss function will return a value based on a prediction and a target, where lower values of the function correspond to “better” predictions. For continuous data, it’s common to use mean squared error:

# learning rate, used later when we step the weights
step_size = 0.1

# evaluating data points with Mean Square Error.
def mse(preds, targets):
  return ((preds-targets)**2).mean()
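As a sanity check, the hand-written mse can be compared with PyTorch’s built-in torch.nn.functional.mse_loss, whose default reduction is also the mean. The sample predictions and targets below are made up for illustration.

```python
import torch
import torch.nn.functional as F

def mse(preds, targets):
    return ((preds - targets) ** 2).mean()

preds = torch.tensor([1.0, 2.0, 3.0])
targets = torch.tensor([1.0, 4.0, 0.0])

manual = mse(preds, targets)           # (0 + 4 + 9) / 3
builtin = F.mse_loss(preds, targets)   # default reduction is "mean"
print(manual, builtin)
```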

Initialize the parameters

First, we initialize the parameters and tell PyTorch that we want to track their gradients by passing requires_grad=True:

w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

Calculate the predictions

Next, we calculate the predictions:

Y_pred = forward(X)

loss = mse(Y_pred, Y)

print(loss) # roughly 598 with grad_fn=<MeanBackward0>; the exact value depends on the random noise in Y

Calculate the gradients

The next step is to calculate the gradients, which approximate how the parameters need to change:

loss.backward()
print(w.grad, b.grad) # roughly tensor(-81.) and tensor(-39.5); exact values depend on the random noise in Y
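For this linear model the gradients can also be worked out by hand. With predictions w * x + b and the mean squared error, the chain rule gives dL/dw = mean(2 * (pred - y) * x) and dL/db = mean(2 * (pred - y)). The sketch below compares those formulas with what loss.backward() produces. It is self-contained, so it rebuilds the data; the fixed seed and the names manual_w_grad and manual_b_grad are assumptions added for this example.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable
X = torch.arange(-5, 5, 0.1).view(-1, 1)
Y = -5 * X + 0.4 * torch.randn(X.size())

w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)

# Forward pass and loss, followed by the automatic backward pass.
Y_pred = w * X + b
loss = ((Y_pred - Y) ** 2).mean()
loss.backward()

# The same gradients written out by hand with the chain rule.
with torch.no_grad():
    error = Y_pred - Y
    manual_w_grad = (2 * error * X).mean()
    manual_b_grad = (2 * error).mean()

print(w.grad, manual_w_grad)
print(b.grad, manual_b_grad)
```

The two pairs of numbers should agree up to floating-point rounding, which is a useful way to convince yourself that backward is doing nothing more mysterious than the chain rule.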

We can use these gradients to improve our parameters.

Step the weights

Now we need to update the parameters based on the gradients we just calculated:

# update the parameters without recording the update in the autograd graph
with torch.no_grad():
    w -= step_size * w.grad
    b -= step_size * b.grad

# zeroing gradients after each iteration, because backward() accumulates into .grad
w.grad.zero_()
b.grad.zero_()

To calculate the gradients, we call backward on the loss. But this loss was itself calculated by mse, which in turn took the predictions as an input, which were calculated by forward() using the parameters w and b. Those are the tensors we originally created with requires_grad=True, and that original tagging is what now allows us to call backward on loss.
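Putting the steps together, the whole procedure can be repeated in a loop. This is a minimal sketch rather than a definitive implementation: the fixed seed, the number of iterations n_steps, and the losses list are assumptions added here. Since every step uses the whole dataset, it is strictly full-batch gradient descent rather than stochastic gradient descent.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable
X = torch.arange(-5, 5, 0.1).view(-1, 1)
Y = -5 * X + 0.4 * torch.randn(X.size())

w = torch.tensor(-10.0, requires_grad=True)
b = torch.tensor(-20.0, requires_grad=True)
step_size = 0.1
n_steps = 100  # hypothetical number of iterations

losses = []
for _ in range(n_steps):
    Y_pred = w * X + b                   # forward pass
    loss = ((Y_pred - Y) ** 2).mean()    # mean squared error
    loss.backward()                      # backward pass fills w.grad and b.grad
    with torch.no_grad():                # step the weights
        w -= step_size * w.grad
        b -= step_size * b.grad
    w.grad.zero_()                       # reset gradients for the next iteration
    b.grad.zero_()
    losses.append(loss.item())

print(w.item(), b.item(), losses[-1])
```

After the loop, w should end up close to -5 and b close to 0, the values used to generate the data, and the loss should be far below where it started.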
