The Adam optimizer has become a default method of choice for training feed-forward and recurrent neural networks. However, Adam has been observed to generalize worse than SGD with momentum on a diverse set of deep learning tasks, such as image classification, character-level language modeling, and constituency parsing. One possible explanation lies in its dysfunctional implementation of weight decay.
With Adam, L2 regularization limits the potential benefit of weight decay regularization because the weights do not decay multiplicatively. Adam might be outperformed by SGD with momentum because L2 regularization, or weight decay, is implemented suboptimally in common deep-learning libraries. On tasks where this regularization matters, Adam therefore leads to worse results than SGD with momentum (for which L2 regularization behaves as expected).
Weight decay and L2 regularization in Adam
Weight decay shrinks the weights θ exponentially:
$$\theta_{t+1} = (1 - \lambda)\,\theta_t - \alpha \nabla f_t(\theta_t)$$
where λ defines the rate of weight decay per step and $\nabla f_t(\theta_t)$ is the t-th batch gradient, which is multiplied by the learning rate α. For standard SGD, this is equivalent to standard L2 regularization.
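To make the "exponential" part concrete, here is a minimal sketch of the update above in plain Python. The function name `sgd_weight_decay_step` and the numeric values are illustrative assumptions, not taken from the paper or from any library. Feeding in zero gradients isolates the decay term, so the weights should shrink geometrically.

```python
def sgd_weight_decay_step(theta, grad, lr, decay):
    """One step of theta <- (1 - decay) * theta - lr * grad, element-wise."""
    return [(1.0 - decay) * w - lr * g for w, g in zip(theta, grad)]


theta0 = [1.0, -2.0]   # illustrative starting weights
decay = 0.1            # lambda in the text; illustrative value
steps = 10

theta = list(theta0)
for _ in range(steps):
    # Zero gradients isolate the decay term of the update.
    theta = sgd_weight_decay_step(theta, [0.0, 0.0], lr=0.01, decay=decay)

# With zero gradients, theta_t = (1 - lambda)^t * theta_0.
expected = [(1.0 - decay) ** steps * w for w in theta0]
print(theta, expected)
```

With a non-zero gradient, the same function simply adds the usual SGD step on top of the shrinkage.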
L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but this is not the case for Adam. Common implementations of these algorithms employ L2 regularization while often calling it "weight decay", which may be misleading given this inequivalence.
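Writing the two updates side by side may help show where the equivalence breaks. The following is a sketch in standard Adam notation: λ′ denotes the L2 coefficient, and $\hat{m}_t$, $\hat{v}_t$, and $\epsilon$ are the usual bias-corrected moment estimates and stability constant, which are assumed here rather than defined in the text.

```latex
% SGD with the penalty (\lambda'/2)\,\|\theta\|_2^2 added to the loss:
\begin{align*}
\theta_{t+1} &= \theta_t - \alpha\bigl(\nabla f_t(\theta_t) + \lambda'\theta_t\bigr) \\
             &= (1 - \alpha\lambda')\,\theta_t - \alpha\,\nabla f_t(\theta_t)
\end{align*}

% Adam with the same penalty: the penalty gradient is folded into g_t,
% and therefore into both moment estimates.
\begin{align*}
g_t          &= \nabla f_t(\theta_t) + \lambda'\theta_t \\
\theta_{t+1} &= \theta_t - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align*}
```

In the SGD case the penalty contributes a plain multiplicative factor, which matches the weight decay update with λ = αλ′. In the Adam case, λ′θ_t enters $\hat{m}_t$ and is divided by $\sqrt{\hat{v}_t}$, so the shrinkage applied to each weight depends on its gradient history, and no single λ reproduces the (1 − λ)θ_t form.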
Decoupling weight decay
AdamW is a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function.
This modification decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and it substantially improves Adam’s generalization performance, allowing it to compete with SGD with momentum on image classification datasets. Decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch.
# TensorFlow (TensorFlow Addons, tfa)
tfa.optimizers.AdamW(
    weight_decay: Union[FloatTensorLike, Callable],
    learning_rate: Union[FloatTensorLike, Callable] = 0.001,
    beta_1: Union[FloatTensorLike, Callable] = 0.9,
    beta_2: Union[FloatTensorLike, Callable] = 0.999,
    epsilon: tfa.types.FloatTensorLike = 1e-07,
    amsgrad: bool = False,
    name: str = 'AdamW',
    **kwargs
)

# PyTorch
torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, *, maximize=False, foreach=None, capturable=False)
Adam generalizes substantially better with decoupled weight decay than with L2 regularization. This holds for several image recognition datasets (CIFAR-10 and ImageNet32x32). In addition, decoupled weight decay renders the optimal settings of the learning rate and the weight decay factor much more independent, thereby easing hyperparameter optimization.
One fact that is often overlooked, even in the simple case of SGD, is that for the equivalence to hold, the L2 regularization coefficient λ′ has to be set to λ′ = λ/α. In other words, if there is an overall best weight decay value λ, the best value of λ′ is tightly coupled with the learning rate α. To decouple the effects of these two hyperparameters, AdamW decouples the weight decay step from the gradient-based update.
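The coupling described above can be checked numerically. The sketch below compares plain SGD with weight decay against SGD with an L2 penalty on a hypothetical one-dimensional quadratic loss. The helper names, the toy loss, and all constants are assumptions chosen for illustration. With λ′ = λ/α the two trajectories coincide, whereas an L2 coefficient tuned for one learning rate stops matching once the learning rate changes.

```python
def sgd_decay_step(theta, grad, lr, decay):
    # Weight decay form: theta <- (1 - lambda) * theta - lr * grad
    return (1.0 - decay) * theta - lr * grad


def sgd_l2_step(theta, grad, lr, l2):
    # L2 form: the gradient of (l2 / 2) * theta**2 is l2 * theta.
    return theta - lr * (grad + l2 * theta)


def toy_grad(theta):
    # Gradient of the hypothetical loss 0.5 * (theta - 3)**2.
    return theta - 3.0


decay = 0.01          # lambda: the weight decay rate we want per step
reference_lr = 0.1    # learning rate at which the fixed L2 coefficient was chosen
results = {}
for lr in (0.1, 0.05):
    w_decay = w_l2_rescaled = w_l2_fixed = 1.0
    for _ in range(50):
        w_decay = sgd_decay_step(w_decay, toy_grad(w_decay), lr, decay)
        # lambda' = lambda / alpha, recomputed for the current learning rate
        w_l2_rescaled = sgd_l2_step(w_l2_rescaled, toy_grad(w_l2_rescaled), lr, decay / lr)
        # lambda' kept at the value that was correct for reference_lr only
        w_l2_fixed = sgd_l2_step(w_l2_fixed, toy_grad(w_l2_fixed), lr, decay / reference_lr)
    results[lr] = (w_decay, w_l2_rescaled, w_l2_fixed)

for lr, (a, b, c) in results.items():
    print(f"lr={lr}: decay={a:.6f}  l2_rescaled={b:.6f}  l2_fixed={c:.6f}")
```

At the reference learning rate all three columns agree. At the other learning rate only the rescaled L2 coefficient tracks the weight decay run, which is the coupling between λ′ and α that decoupled weight decay removes.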
For Adam the two approaches differ. With L2 regularization, the sums of the gradient of the loss function and the gradient of the regularizer (i.e., the squared L2 norm of the weights) are adapted, whereas with decoupled weight decay, only the gradients of the loss function are adapted (with the weight decay step separated from the adaptive gradient mechanism).
With L2 regularization, both types of gradients are normalized by their magnitudes, and therefore weights x with a large typical gradient magnitude s are regularized by a smaller relative amount than other weights.
In contrast, decoupled weight decay regularizes all weights with the same rate λ, effectively regularizing weights x with large s more than standard L2 regularization does.
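The contrast between the two schemes can be illustrated with a scalar Adam loop. This is a minimal sketch under stated assumptions, not a reference implementation: the function `adam_toy_run` is hypothetical, the loss gradient is an artificial stream that alternates between +s and −s so that the loss has no net pull on the weight, and the decay is applied as a plain per-step rate λ as in the update rule quoted earlier (library implementations may scale the decay term differently).

```python
import math


def adam_toy_run(grad_scale, l2=0.0, decay=0.0, steps=1000,
                 lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam on a toy gradient stream and return the final weight."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        # Toy loss gradient: +s, -s, +s, ... with magnitude s = grad_scale.
        g_loss = grad_scale if t % 2 == 1 else -grad_scale
        # L2 regularization: the penalty gradient is added before adaptation.
        g = g_loss + l2 * theta
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
        adaptive_step = lr * m_hat / (math.sqrt(v_hat) + eps)
        # Decoupled weight decay: shrink theta outside the adaptive scaling.
        theta = theta - adaptive_step - decay * theta
    return theta


# Same L2 coefficient, different typical gradient magnitudes s.
l2_small_s = adam_toy_run(grad_scale=0.1, l2=0.1)
l2_large_s = adam_toy_run(grad_scale=10.0, l2=0.1)

# Same decoupled decay rate, different typical gradient magnitudes s.
wd_small_s = adam_toy_run(grad_scale=0.1, decay=1e-4)
wd_large_s = adam_toy_run(grad_scale=10.0, decay=1e-4)

print("L2:       ", l2_small_s, l2_large_s)
print("decoupled:", wd_small_s, wd_large_s)
```

In this toy setting, the L2 runs end at clearly different values: the weight with the large gradient magnitude is shrunk much less, because the penalty gradient is divided by a larger normalizer. The two decoupled runs end at essentially the same value, since the shrinkage does not pass through the adaptive scaling.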
The AdamW optimizer is described in “Decoupled Weight Decay Regularization” by Loshchilov & Hutter (https://arxiv.org/pdf/1711.05101.pdf).