A neural network is composed of one input layer, one or more hidden layers, and a final layer called the output layer. The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

The output layer is the final layer in the network, where the desired predictions are obtained. It has its own set of weights and biases that are applied before the final output is produced.

The activation function for the output layer may differ from the one used in the hidden layers, depending on the problem. For example, the softmax activation function is used to derive the final classes in a classification problem.

The output is a vector of values that may need further post-processing to convert it into business-relevant values. For example, in a classification problem, the output is a set of probabilities that must be mapped to the corresponding business classes. After completing this tutorial, you will know:

  • How many neurons do you need in the output layer? 
  • What activation function should you use in the output layer?

Regression

If you want to predict a single value (e.g., the price of a house, given many of its features), then you just need a single output neuron: its output is the predicted value.

The output layer should have exactly one node in a regression problem. Here we are not trying to map inputs to a variety of class labels, but rather to predict a single continuous target value for each sample. Therefore, the network should have one output node to return one, and exactly one, output prediction for each sample in our dataset.

For example, the following model ends in a single output neuron with no activation function, so it outputs the predicted value directly.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential()
    model.add(Dense(10, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(40, activation='relu'))
    model.add(Dense(1))  # single output neuron for the predicted value

The activation function for a regression problem is linear. This can be defined by using activation='linear' or by leaving it unspecified, since the default value activation=None applies no activation, which is equivalent to a linear output.
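As a quick check, these two output layers behave identically, since Dense applies no activation by default:

    from tensorflow.keras.layers import Dense

    out_a = Dense(1)                        # activation defaults to None (linear)
    out_b = Dense(1, activation='linear')   # explicit linear activation; same behavior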

For multivariate regression (i.e., to predict multiple values at once), you need one output neuron per output dimension. For example, to locate the center of an object in an image, you need to predict 2D coordinates, so you need two output neurons.
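A minimal sketch of such a network, assuming a hypothetical 64-dimensional feature vector as input and the (x, y) center coordinates as the two outputs:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential()
    model.add(Dense(32, input_dim=64, activation='relu'))
    model.add(Dense(2))  # one output neuron per predicted coordinate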

Classification

There are three main types of classification problems to consider when training neural networks, each with its own output layer configuration.

Binary Classification

For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. The estimated probability of the negative class is equal to one minus that number.

If we were doing binary classification (with one or more binary labels), then we would use the "sigmoid" (i.e., logistic) activation function in the output layer.

To classify IMDB sentiment as positive or negative, you just need one neuron in the output layer of the network, outputting, for example, the probability that the review is positive. You would typically use the logistic activation function in the output layer when estimating a probability.

    from tensorflow.keras import models, layers, regularizers

    model = models.Sequential()
    model.add(layers.Dense(16, kernel_regularizer=regularizers.l1(0.001), activation='relu', input_shape=(10000,)))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(16, kernel_regularizer=regularizers.l1(0.001), activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation='sigmoid'))  # probability of a positive review
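The model's output is a probability; to turn it into a class label, a common convention is to threshold at 0.5. A minimal usage sketch (X_test stands in for your own prepared review vectors):

    probs = model.predict(X_test)             # probability of the positive class per review
    labels = (probs > 0.5).astype('int32')    # 1 = positive, 0 = negative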

Multi-Class Classification

If each instance can belong only to a single class, out of three or more possible classes (e.g., classes 0 through 9 for MNIST image classification), then you need to have one output neuron per class, and you should use the softmax activation function for the whole output layer.

The softmax function will ensure that all the estimated probabilities are between 0 and 1 and that they add up to 1 (which is required if the classes are exclusive). This is called multiclass classification.
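For intuition, here is the softmax computation done by hand with NumPy; every output lands in (0, 1) and the outputs sum to 1:

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])             # raw output-layer scores
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax
    print(probs)        # [0.659 0.242 0.099] (rounded)
    print(probs.sum())  # 1.0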

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='softmax'))  # one neuron per class; outputs sum to 1

We add a Dense output layer with 10 neurons (one per class), using the softmax activation function (because the classes are exclusive).

The number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 × 28 = 784 input neurons and 10 output neurons.
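As a sketch of those counts in code, here is a dense-only MNIST classifier (the hidden layer width is an arbitrary choice):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Flatten, Dense

    model = Sequential()
    model.add(Flatten(input_shape=(28, 28)))    # 28 x 28 = 784 input values
    model.add(Dense(128, activation='relu'))    # hidden layer; width is a free choice
    model.add(Dense(10, activation='softmax'))  # 10 output neurons, one per digit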

Multi-Label Classification

Multi-label classification is similar to multi-class classification, except that the target labels are no longer mutually exclusive. In a multi-label problem, the goal is to capture all the labels that apply to a given entity as accurately as possible.

This is like fitting multiple binary classification problems at once, one for each target class. In the multi-label scenario there is a value for each class, and the values in each row do not sum to 1. Instead, a separate binary classification is performed for each of these values.

Because the multi-label problem fits a binary classification for each class in the target variable, it should follow the binary classification case closely. The number of nodes in the final layer should equal the number of classes in the multi-label problem, and the activation function of the output nodes should be sigmoid. For example, in the Planet image classification dataset there are 17 possible tags that each image can contain.

    import tensorflow as tf

    TARGET_SIZE = 224  # assumed input size; any size supported by MobileNetV2 works
    IMG_SHAPE = (TARGET_SIZE, TARGET_SIZE, 3)

    # Pre-trained MobileNetV2 base, frozen; only the new output layer is trained.
    base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                                   include_top=False,
                                                   weights='imagenet')
    base_model.trainable = False

    model = tf.keras.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(17, activation='sigmoid')  # one sigmoid per tag
    ])
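Since each of the 17 tags is an independent yes/no decision, such a model is typically compiled with binary cross-entropy, which applies the binary loss to each output separately; a minimal sketch:

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',  # one binary problem per tag
                  metrics=['accuracy'])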

Some architectures use a more exotic output layer. In LeNet-5, for example, instead of computing the matrix multiplication of the inputs and the weight vector, each output neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output then measures how much the image belongs to a particular digit class.

The ReLU activation function is a good default choice for the hidden layers. For the output layer, in general, you will want the logistic activation function for binary classification, the softmax activation function for multiclass classification, and no activation function for regression.
