Category Archives: Keras
Loss function for multi-class and multi-label classification in Keras and PyTorch
In multi-label classification, we use a stack of binary classifiers: each neuron in the output layer (one per label, i.e., y_train.shape[1] neurons) performs one-vs-all classification for its label. Because each output is an independent binary decision, binary_crossentropy, the loss suited for binary classification, is also the right loss for multi-label classification.
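A minimal sketch of such a multi-label setup in Keras; the layer sizes and data shapes here are assumptions for illustration:

```python
import tensorflow as tf

# Hypothetical sizes: y_train would be a multi-hot matrix of
# shape (n_samples, n_labels).
n_features, n_labels = 20, 5

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    # One sigmoid neuron per label: each performs one-vs-all classification.
    tf.keras.layers.Dense(n_labels, activation="sigmoid"),
])

# binary_crossentropy treats every output neuron as an independent
# binary classifier, which is what multi-label classification needs.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["binary_accuracy"])
```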
Activation function for Output Layer in Regression, Binary, Multi-Class, and Multi-Label Classification
The ReLU activation function is the default choice for the hidden layers. For the output layer, you will generally want the logistic (sigmoid) activation function for binary classification, the softmax activation function for multi-class classification, and no activation function for regression.
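A short sketch of the three output-layer choices in Keras; the unit counts are placeholders:

```python
import tensorflow as tf

# Binary classification: one logistic (sigmoid) unit.
binary_out = tf.keras.layers.Dense(1, activation="sigmoid")

# Multi-class classification: one softmax unit per class (10 here).
multiclass_out = tf.keras.layers.Dense(10, activation="softmax")

# Regression: no activation, so the output can take any real value.
regression_out = tf.keras.layers.Dense(1, activation=None)
```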
Adam optimizer with weight decay using AdamW in Keras
Common deep learning libraries implement L2 regularization rather than the original weight decay. As a result, on popular image classification datasets where L2 regularization is beneficial for SGD, Adam leads to worse results than SGD with momentum, for which L2 regularization behaves as expected.
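A minimal sketch of switching to decoupled weight decay in Keras, assuming TensorFlow 2.11+ where tf.keras.optimizers.AdamW is built in (older versions shipped an AdamW in tensorflow-addons); the model and hyperparameters are placeholders:

```python
import tensorflow as tf

# Decoupled weight decay: the decay is applied directly to the weights
# instead of being folded into the gradient as an L2 penalty.
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```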
Split a dataset into train and test sets using Keras image_dataset_from_directory
The validation_split argument is the fraction of the training data to be used as validation data. Keras sets apart this fraction of the training data; with model.fit(), the validation data is selected from the last samples in the x and y data provided.
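A sketch of the directory-based split, assuming a hypothetical data/ folder with one subdirectory per class; the same validation_split and seed must be passed to both calls so the two subsets are disjoint:

```python
import tensorflow as tf

# Hypothetical layout: data/class_a/*.jpg, data/class_b/*.jpg, ...
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data",
    validation_split=0.2,
    subset="training",
    seed=42,
    image_size=(180, 180),
    batch_size=32,
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data",
    validation_split=0.2,
    subset="validation",
    seed=42,
    image_size=(180, 180),
    batch_size=32,
)
```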
Split an imbalanced dataset using sklearn's stratified train_test_split()
Use a stratified train_test_split to preserve the class imbalance, so that the train and test sets have the same label distribution, then never touch the test set again. Stratification ensures that each split has the same proportion of observations with a given label.
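A minimal sketch with scikit-learn; the synthetic 90/10 dataset is only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: ~90% negatives, ~10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the 90/10 class ratio in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```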
Concatenate two layers using keras.layers.concatenate() example
Concatenation allows the neural network to learn both deep patterns through the deep path and simple rules through the short path. In contrast, a regular MLP forces all the data to flow through the entire stack of layers, where simple patterns in the data may end up being distorted by the sequence of transformations.
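A sketch of such a wide-and-deep model with the Keras functional API; the input width and layer sizes are assumptions:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(8,))

# Deep path: two hidden layers that can learn complex patterns.
hidden = tf.keras.layers.Dense(30, activation="relu")(inputs)
hidden = tf.keras.layers.Dense(30, activation="relu")(hidden)

# Short (wide) path: the raw inputs skip the hidden stack, so simple
# rules are not distorted by the deep transformations.
concat = tf.keras.layers.concatenate([inputs, hidden])
output = tf.keras.layers.Dense(1)(concat)

model = tf.keras.Model(inputs=inputs, outputs=output)
```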
PyTorch AdamW and Adam with weight decay optimizers
Adam does not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks such as image classification, character-level language modeling, and constituency parsing. The problem with Adam lies in its dysfunctional implementation of weight decay.
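A minimal sketch contrasting the two PyTorch optimizers; the tiny model and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # hypothetical tiny model

# Adam's weight_decay is implemented as L2 regularization: the decay term
# is added to the gradient and then rescaled by Adam's adaptive step size.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW decouples weight decay from the gradient update, the fix proposed
# in "Decoupled Weight Decay Regularization" (Loshchilov & Hutter).
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```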
Differences between Learning Rate and Weight Decay Hyperparameters in Neural Networks
The amount of regularization must be balanced for each dataset and architecture; recognizing this principle permits the general use of super-convergence. Reducing other forms of regularization and instead regularizing with very large learning rates makes training significantly more efficient.
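One way to act on this in code is the one-cycle learning-rate policy, which drives the large learning rates behind super-convergence; a sketch in PyTorch with deliberately reduced weight decay (the model, schedule length, and hyperparameters are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # hypothetical tiny model

# Lower weight decay to compensate for the regularizing effect of the
# very large peak learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-5)

# One-cycle policy: ramp the learning rate up to a large maximum and
# back down over the course of training.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1.0, total_steps=1000
)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```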