When a PyTorch neural network is instantiated, it's common practice to rely on the default weight and bias initialization. In the case of a Linear layer, the PyTorch documentation is not clear, and the source code is surprisingly complicated. Good weight initialization prevents activations and gradients from exploding or vanishing as signals propagate through a deep neural network.

In order to complete a single forward pass, we have to perform a matrix multiplication between the layer inputs and weights at each layer. The product of this multiplication at one layer becomes the input of the subsequent layer, and so on.
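A minimal sketch of what this looks like in code (the layer sizes and the 0.1 scale below are arbitrary choices for illustration, not PyTorch defaults):

import torch

torch.manual_seed(0)

x  = torch.randn(32, 100)        # batch of 32 inputs with 100 features
w1 = torch.randn(100, 50) * 0.1  # weights of the first layer
w2 = torch.randn(50, 10) * 0.1   # weights of the second layer

h   = torch.relu(x @ w1)  # the output of layer 1 becomes the input of layer 2
out = h @ w2              # a single forward pass through two layers
print(out.shape)          # torch.Size([32, 10])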

Weight Initialization

Weight initialization is the process of setting the initial values of a network's weights. One might assume that no matter how the weights are initialized, they will eventually be updated "well" during training. So why do we care about initialization if the weights can be updated during training?

But the reality is different. If we initialize the weights randomly without care, it can cause two problems: the vanishing gradient problem and the exploding gradient problem.

The vanishing gradient problem means the gradients shrink toward 0. Because gradients are multiplied layer by layer during the backpropagation phase, initializing the weights with very small values (<1) makes the gradients get smaller and smaller as we move backward through the hidden layers.

The exploding gradient problem means the gradients blow up toward infinity (NaN). For the same reason, initializing the weights with very large values (>1) makes the gradients get larger and larger as we move backward through the hidden layers.
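The effect is easy to reproduce with a short experiment (a toy sketch: the depth, width, and weight scales below are arbitrary, and no activation function is used):

import torch

torch.manual_seed(0)
x = torch.randn(512, 512)  # input with roughly unit standard deviation

# Push the same input through 20 layers of random weights, scaled small vs. large.
for scale, label in [(0.01, "small weights (<1)"), (1.5, "large weights (>1)")]:
    h = x
    for _ in range(20):
        h = h @ (torch.randn(512, 512) * scale)
    print(label, "-> std after 20 layers:", h.std().item())

# The small-weight stack drives the signal toward 0 (and the gradients with it),
# while the large-weight stack blows it up toward overflow.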

Kaiming He Weight Initialization

Weight initialization methods need to be compatible with the choice of activation function; a mismatch can negatively affect training.

ReLU is one of the most commonly used activation functions in deep learning and scales well to large neural networks. Its derivative is inexpensive to compute during backpropagation because ReLU is a piecewise linear function whose derivative is a step function.
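A one-line autograd check illustrates that step-function derivative (a small sketch, not taken from the PyTorch source):

import torch

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1., 1.]) -- gradient is 0 for negative inputs, 1 for positive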

"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" by He et al. (2015) presents a methodology to optimally initialize neural network layers that use a ReLU activation function. This technique lets the network start in a regime where the variance of the signal stays roughly constant across layers, in both the forward and backward passes, which empirically yields meaningful improvements in training stability and speed.

Kaiming initialization, or He initialization, is a weight initialization method for neural networks that takes the non-linearity of activation functions, such as ReLU, into account.

The Kaiming initialization method avoids exponentially reducing or magnifying the magnitude of input signals. Through a derivation, the authors work out the condition that prevents this: for ReLU networks, the weights of each layer should be drawn with variance 2/fan_in, where fan_in is the number of input connections to the layer.
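In PyTorch this condition can be applied explicitly with nn.init.kaiming_normal_ or nn.init.kaiming_uniform_ (a minimal sketch; the layer sizes are arbitrary):

import math
import torch.nn as nn

layer = nn.Linear(256, 128)

# He/Kaiming initialization tuned for ReLU: weights ~ N(0, 2 / fan_in),
# which keeps the signal variance roughly constant from layer to layer.
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

print(layer.weight.std().item())         # close to the target standard deviation
print(math.sqrt(2 / layer.in_features))  # sqrt(2 / 256) ≈ 0.088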

Default Weight Initialization For Linear Layer

Having good initial weights can place the neural network close to an optimal solution, which allows it to converge to a good solution more quickly.

PyTorch has default weight initialization behavior for every kind of layer. In the source below you can see that linear layers are initialized with kaiming_uniform_.

class Linear(Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 .........
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
        # uniform(-1/sqrt(in_features), 1/sqrt(in_features)).
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            init.uniform_(self.bias, -bound, bound)
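As a quick sanity check of the resulting bounds (a small sketch, not part of the PyTorch source; the layer sizes are arbitrary):

import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(in_features=128, out_features=64)

# With the default kaiming_uniform_(a=sqrt(5)), both weights and biases
# end up in the interval (-1/sqrt(in_features), 1/sqrt(in_features)).
bound = 1 / math.sqrt(layer.in_features)
print(layer.weight.abs().max().item() <= bound)  # True
print(layer.bias.abs().max().item() <= bound)    # True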

Default Weight Initialization For Conv Layer

The default weight initialization for the Conv layers is kaiming_uniform_() for the weights and uniform_() for the biases.

class _ConvNd(Module):
    def __init__(...):
        ......
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
        # uniform(-1/sqrt(k), 1/sqrt(k)), where k = weight.size(1) * prod(*kernel_size)
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            if fan_in != 0:
                bound = 1 / math.sqrt(fan_in)
                init.uniform_(self.bias, -bound, bound)
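The same kind of check works for a convolutional layer (again a small sketch with arbitrary layer sizes, not part of the PyTorch source):

import math
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# fan_in for a conv weight is in_channels * kernel_height * kernel_width,
# so here k = 3 * 3 * 3 = 27 and the bound is 1/sqrt(27).
bound = 1 / math.sqrt(conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1])
print(conv.weight.abs().max().item() <= bound)  # True
print(conv.bias.abs().max().item() <= bound)    # True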

Citations
(1) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He et al. (2015)
