Binary classification is one of the most fundamental problems in machine learning. Logistic regression is perhaps the most common technique used for it; naive Bayes and decision tree classifiers are two other alternatives. Neural networks are more powerful than these alternatives, but they are also more complex.

Binary classification looks at an input and predicts which of two possible classes it belongs to. Examples include sentiment analysis, spam detection, and credit-card fraud detection. Such models are trained on datasets labeled with 1s and 0s representing the two classes.

The differences between neural network binary classification and multi-class classification are surprisingly tricky. In this article, I explain two different approaches to implementing neural network binary classification.

The common approach in a neural network for binary classification is to encode the variable to predict using a single value, for example, dog as 0 and cat as 1. There are two ways to design a binary neural network classifier: the **two-output-neuron** technique and the **single-output-neuron** technique. Understanding the differences between these two approaches, using two output neurons or one output neuron, is the main focus of this tutorial.

## Single Output Neuron for Binary Classification

The one-node technique for neural network binary classification is shown in the diagram below. Here, dog is encoded as 0 and cat as 1 in the training data. The value of the single output node is 0.78. Because this value is closer to 1 than to 0, the neural network predicts the image is a cat.

When using the one-node technique for binary classification, the single output node value is computed using **sigmoid activation**. The sigmoid function always produces a value between 0.0 and 1.0, so the output value can be loosely interpreted as the probability of the class encoded as 1 (cat).
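To make this concrete, here is a minimal sketch of the sigmoid function in plain Python; the logit value 1.27 is a made-up example chosen so that it squashes to roughly the 0.78 output described above.

```python
import math

def sigmoid(z):
    # Logistic sigmoid: maps any real value into the open interval (0.0, 1.0).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw output (logit) from the single output node.
p_cat = sigmoid(1.27)   # ~0.78, loosely the "probability of cat" (class 1)
p_dog = 1.0 - p_cat     # the complementary probability of the 0 class (dog)
```

Because the single output already encodes the probability of class 1, the probability of class 0 is just its complement.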

```
inputs = tf.keras.Input(shape=(img_height, img_width, 3))
x = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu')(inputs)
x = tf.keras.layers.MaxPool2D()(x)
x = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu')(x)
x = tf.keras.layers.MaxPool2D()(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
# One output node with sigmoid activation: the one-node technique.
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

A single neuron with sigmoid activation in the output layer is paired with *binary_crossentropy* as the loss function. The output from the network is a probability from 0.0 to 1.0.

### Predicting the Binary Class

When using the one-node technique, if the output value is less than 0.5, the predicted class corresponds to 0; if the output value is greater than 0.5, the predicted class corresponds to 1.
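This thresholding rule can be sketched in a few lines; the output values below are hypothetical sigmoid outputs for a batch of four images, not results from the model above.

```python
# Hypothetical sigmoid outputs for a batch of four images.
outputs = [0.13, 0.78, 0.50, 0.91]

# Threshold at 0.5: class 1 (cat) if above, class 0 (dog) otherwise.
predicted = [1 if p > 0.5 else 0 for p in outputs]
# → [0, 1, 0, 1]
```

In Keras you would apply the same rule to the array returned by `model.predict`.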

## Two Output Neurons for Binary Classification

The larger of the two output node values is in the first position, so the computed output maps to class 0 and the neural network predicts the image is a dog.

```
inputs = tf.keras.Input(shape=(img_height, img_width, 3))
x = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu')(inputs)
x = tf.keras.layers.MaxPool2D()(x)
x = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu')(x)
x = tf.keras.layers.MaxPool2D()(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dense(128, activation='relu')(x)
# Two output nodes with softmax activation: the two-node technique.
outputs = tf.keras.layers.Dense(2, activation='softmax')(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])
```

Neural network classifiers that have two or more output nodes use **softmax activation** and **SparseCategoricalCrossentropy** as the loss function. The point of softmax activation is to scale the output node values so that they sum to 1.0. The output values can then be loosely interpreted as probabilities and easily mapped to one of the encoded classes.
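The scaling that softmax performs can be sketched in plain Python; the logits `[2.0, 0.5]` are a made-up example standing in for the raw values of the two output nodes.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, exponentiate, normalize.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw values of the two output nodes.
probs = softmax([2.0, 0.5])

# The scaled values sum to 1.0; the larger one is in position 0,
# so the predicted class is 0 (dog).
predicted_class = probs.index(max(probs))
```

Picking the index of the largest value (argmax) is how the softmax output is mapped back to an encoded class.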

Note that for a binary classification problem, if you use 2 neurons in the output layer you should use 'softmax' as the activation function of the output layer, and if you use 1 neuron, 'sigmoid' is the choice.

### Which design for binary classification is better: the one-node or the two-node technique?

The one-node technique would seem to be preferable because it requires **fewer weights and biases**, and therefore it should be easier to train than a neural network that uses the two-node technique. In fact, the one-node technique is the most **common approach** used for neural network binary classification.

With the two-node technique, the code is exactly the same for a binary classification problem and a multi-class classification problem: **softmax activation** plus **SparseCategoricalCrossentropy** as the loss function.

If you use the two-node technique for binary problems, the same prediction code works for either binary or multi-class classification. But if you use the one-node technique, you must add branching logic.
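The difference in prediction code can be sketched as two small helper functions; the function names are illustrative, not part of any library.

```python
def predict_class_two_node(probs):
    # Argmax over the softmax outputs: works unchanged for 2, 3, or more
    # output nodes, so binary and multi-class share the same code.
    return probs.index(max(probs))

def predict_class_one_node(p):
    # The one-node technique needs explicit branching on the 0.5 threshold.
    return 1 if p > 0.5 else 0
```

The two-node helper handles `[0.7, 0.3]` (binary) and `[0.2, 0.5, 0.3]` (three-class) identically, while the one-node helper only makes sense for binary outputs.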

When you see neural network code where the number of output nodes is set to 2, you can be fairly sure the model is using two-node binary classification, because multi-class classification would have three or more output nodes and one-node binary classification would have a single output node.


### Related Post

- What is Categorical Cross Entropy Loss Function in Keras?
- Activation function for Output Layer in Regression, Binary, Multi-Class, and Multi-Label Classification
- Loss function for multi-class and multi-label classification in Keras and PyTorch
- Understand PyTorch BCELoss and BCEWithLogitsLoss Loss functions