Two common issues when training recurrent neural networks are vanishing gradients and exploding gradients. Exploding gradients occur when the gradient grows too large, resulting in an unstable network. Vanishing gradients occur when the gradient becomes too small for optimization to make progress, so training stalls. The training process can be stabilized by modifying the gradients, either by scaling their vector norm or by clipping gradient values to a range.

Gradient clipping is a way to prevent exploding gradients in neural networks by limiting the magnitude of the gradient. There are several ways to do this, but a common one is to rescale gradients so that their norm is at most a particular value: a pre-determined threshold is introduced, and any gradient whose norm exceeds this threshold is scaled down so that its norm matches the threshold. This guarantees that no gradient has a norm greater than the threshold, which is why we say the gradients are clipped.
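
As a rough sketch of that rule (the names grad and threshold here are just placeholders, not anything from the code later in this post), clipping a single gradient tensor by norm looks like this:

import torch

def clip_by_norm(grad, threshold):
    # Rescale the gradient so its L2 norm is at most `threshold`;
    # the direction of the gradient is left unchanged.
    norm = grad.norm(2)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad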

There are two main methods for updating the error derivative:

1. Gradient Scaling (clipping by norm): Whenever the gradient norm exceeds a chosen threshold, the gradient is rescaled so that its norm equals the threshold. The threshold is often set to 1. In practice you usually want to clip the whole gradient by its global norm, computed over all parameters together.

2. Gradient Clipping (clipping by value): Gradient values that fall outside an expected range are forced to the minimum or maximum of that range. We set a threshold value, and any gradient component whose magnitude exceeds it is clipped to the threshold.

1. Gradient Scaling

In RNNs the gradients tend to grow very large (exploding gradients), and clipping them helps to prevent this from happening. Use torch.nn.utils.clip_grad_norm_ to keep the gradient norm within a specific limit.

For example, we could specify a norm of 1.0, meaning that if the vector norm for a gradient exceeds 1.0, then the values in the vector will be rescaled so that the norm of the vector equals 1.0.
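
In PyTorch that is a single call (a sketch; model stands for whichever nn.Module you are training, and the function also returns the total norm the gradients had before clipping):

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)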

clip_grad_norm_ modifies the gradients after the entire backward pass has taken place, i.e. it is invoked once all of the gradients have been computed, between loss.backward() and optimizer.step(). During loss.backward() the gradients that are propagated backward are not clipped; they are only clipped once the backward pass completes and clip_grad_norm_() is called. optimizer.step() then uses the clipped gradients to update the parameters.

The norm is computed over all gradients together, as if they were concatenated into a single vector, and every gradient is then multiplied by the same clipping coefficient (clip_coef).
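
To make that concrete, here is a rough sketch of what the utility computes internally (variable names are illustrative, and the real clip_grad_norm_ handles extra cases such as other norm types and non-finite values):

# Illustrative sketch of global-norm clipping; not the actual PyTorch source.
max_norm = 1.0
grads = [p.grad for p in model.parameters() if p.grad is not None]
total_norm = torch.norm(torch.stack([g.detach().norm(2) for g in grads]), 2)
clip_coef = max_norm / (total_norm + 1e-6)
if clip_coef < 1:
    for g in grads:
        g.detach().mul_(clip_coef)  # every gradient is scaled by the same clip_coef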

Now we’ll define the training loop, in which the gradient calculation and the optimizer step are performed. This is also where we add our clipping instruction.

epochs = 10
for epoch in range(epochs):
    total_loss = 0.0
    total_acc = 0.0
    for i, batch in enumerate(train_iterator):
        (feature, batch_length), label = batch.comment_text, batch.toxic

        optimizer.zero_grad()  # reset gradients from the previous step

        output = model(feature, batch_length).squeeze()

        loss = loss_function(output, label)
        acc = model_accuracy(output, label)
        loss.backward()  # compute gradients
        # Clip the global gradient norm to 2.0 (L2 norm) before the optimizer step.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
        optimizer.step()  # update parameters using the clipped gradients

        total_loss += loss.item()
        total_acc += acc.item()

    print(f"loss on epoch {epoch} = {total_loss/len(train_iterator)}")
    print(f"accuracy on epoch {epoch} = {total_acc/len(train_iterator)}")

2. Gradient Clipping by value

Clipping by value forces the derivatives of the loss function to a given value whenever a gradient component is less than the negative threshold or greater than the positive threshold.

For example, we could specify a clip value of 0.5, meaning that if a gradient value is less than -0.5 it is set to -0.5, and if it is more than 0.5 it is set to 0.5.

To apply clip-by-value you can change the clipping line in the training loop above to:

nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
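
Under the hood this is roughly equivalent to clamping every gradient element-wise (a sketch, with clip_value standing in for the threshold above):

# Rough element-wise equivalent of clip_grad_value_ (illustrative only).
clip_value = 1.0
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.clamp_(min=-clip_value, max=clip_value)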

The threshold for the gradient norm, or the preferred clipping range, can be chosen by trial and error, by using values that are common in the literature, or by first observing typical gradient norms or ranges during training and then picking a sensible value.
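
One simple way to do that last step is to log the global gradient norm for a few batches before picking a threshold. As a sketch (clip_grad_norm_ returns the total norm it measured, and passing float('inf') as max_norm means nothing is actually clipped):

# Inside the training loop, between loss.backward() and optimizer.step():
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=float('inf'))
print(f"gradient norm before clipping: {total_norm.item():.4f}")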
