How to assign num_workers to PyTorch DataLoader?
Choosing the best value for the num_workers argument depends on your hardware, characteristics of your training data (such as its size and shape), the cost of your transform function, and what other processing is happening on the CPU at the same time. A simple heuristic is to use the number of available CPU cores.
Differences between Learning Rate and Weight Decay Hyperparameters in Neural networks.
The amount of regularization must be balanced for each dataset and architecture. Recognition of this principle permits the general use of super-convergence. Reducing other forms of regularization and regularizing with very large learning rates makes training significantly more efficient.
Weight Decay parameter for SGD optimizer in PyTorch
L2 regularization is also referred to as weight decay. The reason for this name is that thinking about SGD and backpropagation, the negative gradient of the L2 regularization term with respect to a parameter w_i is – 2 * lambda * w_i, where lambda is the aforementioned hyperparameter, simply named weight decay in PyTorch.
How loss.backward(), optimizer.step() and optimizer.zero_grad() related in PyTorch
When we call loss.backward(), PyTorch traverses this graph in the reverse direction to compute the gradients and accumulate their values in the grad attribute of those tensors (the leaf nodes of the graph).