After training a classification model, such as an image or text classifier, you need to examine its performance on a test data set. A common approach is to compute the loss or the accuracy.

You can inspect the classifier performance more closely by plotting a ROC curve and computing performance metrics. For example, you can find the threshold that maximizes the classification accuracy, or assess how the classifier performs in the regions of high sensitivity and high specificity.
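As a minimal sketch of the threshold search mentioned above, the snippet below tries every observed score as a threshold on a binary problem and keeps the one with the highest accuracy. The function name best_accuracy_threshold and the toy labels and scores are made up for illustration; they are not part of scikit-learn.

```python
def best_accuracy_threshold(y_true, scores):
    """Return (threshold, accuracy) maximizing accuracy for a binary problem.

    A sample is predicted positive when its score is >= threshold.
    """
    best_thr, best_acc = None, -1.0
    # Only the observed scores need to be tried as candidate thresholds.
    for thr in sorted(set(scores)):
        preds = [1 if s >= thr else 0 for s in scores]
        acc = sum(p == t for p, t in zip(preds, y_true)) / len(y_true)
        if acc > best_acc:  # strict ">" keeps the lowest threshold on ties
            best_thr, best_acc = thr, acc
    return best_thr, best_acc


# Toy data (hypothetical): 1 = positive class, 0 = negative class.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
thr, acc = best_accuracy_threshold(y_true, scores)
print(f"best threshold = {thr}, accuracy = {acc:.3f}")
```

On real data the same search is usually done on the thresholds returned by a ROC routine rather than with an explicit loop, but the idea is the same.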

ROC (Receiver Operating Characteristic) Curve

A ROC curve plots the true positive rate (TPR) versus the false positive rate (FPR) for different thresholds of the classification scores. Each point on a ROC curve corresponds to a pair of TPR and FPR values for a specific threshold value. By varying the threshold you obtain different (FPR, TPR) pairs, and the ROC curve is drawn through those pairs.
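To make the threshold sweep concrete, here is a small plain-Python sketch that computes the (FPR, TPR) pair for each candidate threshold. The helper roc_points and the four-sample data set are hypothetical and only meant to illustrate the definition.

```python
def roc_points(y_true, scores):
    """Return a list of (threshold, FPR, TPR) tuples, one per candidate threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    # Sweep thresholds from high to low so the curve runs from (0, 0) to (1, 1).
    for thr in [float("inf")] + sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points


# Toy data (hypothetical): two negatives and two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for thr, fpr, tpr in roc_points(y_true, scores):
    print(f"threshold={thr:<5} FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both FPR and TPR are non-decreasing along the sweep.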

ROC curves are typically used in binary classification, where the TPR and FPR can be defined unambiguously. In the case of multiclass classification, a notion of TPR or FPR is obtained only after binarizing the output.

For a multiclass classification problem, you can use the one-versus-all design (also called one-vs-rest, OvR) and find a ROC curve for each class. The one-versus-all approach treats a multiclass classification problem as a set of binary classification problems, taking one class as positive and the rest as negative in each binary problem.

A second method of using the ROC curve for multi-class models is the one-vs-one (OvO) method. With this method, you train a separate binary classifier for every possible pair of classes.

By doing this, we reduce the multiclass classification output to a binary one, so all the known binary classification metrics can be used to evaluate it.

With the one-vs-rest method, each class is compared against all the others at the same time: one class is treated as the “positive” class, while all the others (the rest) are treated as the “negative” class.

We must repeat this for each class present in the data, so for a 3-class dataset we get 3 different One vs Rest scores. In the end, we can average them (simple or weighted average) to obtain a final One vs Rest model score.
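The per-class scoring and averaging described above can be sketched in plain Python. This is an illustration under assumptions, not scikit-learn's implementation: binary_auc uses the pairwise-ranking definition of AUC, the weighted mean uses class frequencies as weights, and the labels and probabilities are invented.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_auc(labels, proba, classes):
    """Return per-class one-vs-rest AUCs plus their simple and weighted means."""
    per_class = {}
    for k, cls in enumerate(classes):
        # cls is the positive class, every other class is negative.
        y_bin = [1 if lab == cls else 0 for lab in labels]
        per_class[cls] = binary_auc(y_bin, [row[k] for row in proba])
    macro = sum(per_class.values()) / len(classes)
    # Weighted mean: each class counts in proportion to its number of samples.
    weighted = sum(per_class[c] * labels.count(c) for c in classes) / len(labels)
    return per_class, macro, weighted


# Toy 3-class data (hypothetical); each row holds scores for classes a, b, c.
classes = ["a", "b", "c"]
labels = ["a", "a", "a", "b", "b", "c"]
proba = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.1, 0.4, 0.5],
    [0.2, 0.2, 0.6],
]
per_class, macro, weighted = ovr_auc(labels, proba, classes)
print(per_class, round(macro, 3), round(weighted, 3))
```

With unbalanced classes, as here, the simple and weighted means differ because the weighted mean gives more influence to the most frequent class.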

Plot ROC Curve For Multi-Class Classification

Scikit-learn defines a simple API for quick plotting and visual adjustments without recalculation. It provides Display classes that expose two methods for creating plots: from_estimator and from_predictions.

The from_estimator method takes a fitted estimator and some data (X and y) and creates a Display object. When we want to compute the predictions only once, we should use from_predictions instead. In the following example, we plot a ROC curve for a fitted LogisticRegression.

First, we load the Iris plants dataset, which contains 3 classes, each corresponding to a type of iris plant. Then we train a LogisticRegression on a training split.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
target_names = iris.target_names
X, y = iris.data, iris.target
y = iris.target_names[y]

Add noisy features to make the problem harder.

random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
n_classes = len(np.unique(y))
X = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

We train a LogisticRegression model which can naturally handle multiclass problems.

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

The ROC curve requires either the probabilities or the non-thresholded decision values from the estimator. Here we use the class probabilities returned by predict_proba, stored in y_score.

We use a LabelBinarizer to binarize the target by one-hot-encoding in a One vs All fashion. This means that the target of shape (n_samples,) is mapped to a target of shape (n_samples, n_classes).

from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer().fit(y_train)
y_onehot_test = label_binarizer.transform(y_test)
y_onehot_test.shape  # (n_samples, n_classes) (75, 3)
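To show what the binarization step produces, here is a hand-written sketch of one-hot encoding in plain Python. The helper one_hot is hypothetical and only mimics the shape change described above; in practice LabelBinarizer does this for us.

```python
def one_hot(labels, classes):
    """Map a list of n labels to an n x len(classes) indicator matrix."""
    index = {cls: k for k, cls in enumerate(classes)}
    rows = []
    for lab in labels:
        row = [0] * len(classes)
        row[index[lab]] = 1  # mark the column of the true class
        rows.append(row)
    return rows


classes = ["setosa", "versicolor", "virginica"]
y = ["virginica", "setosa", "versicolor", "setosa"]
y_onehot = one_hot(y, classes)

# Column 2 is the binary "virginica vs the rest" target.
virginica_vs_rest = [row[2] for row in y_onehot]
print(y_onehot)
print(virginica_vs_rest)
```

Each column of the result is the binary target of one One-vs-Rest problem, which is why a single column can be passed to a binary ROC routine.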

Plot ROC Curve

In this example, we construct a display object, RocCurveDisplay, directly from the predictions. This is an alternative to from_estimator when a model’s predictions are already computed or are expensive to compute. Note that this is advanced usage. We plot the ROC curve of one class of interest, virginica, against the rest.

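import matplotlib.pyplot as plt

from sklearn.metrics import RocCurveDisplay

# Class treated as positive, and its column in the one-hot targets.
class_of_interest = "virginica"
class_id = np.flatnonzero(label_binarizer.classes_ == class_of_interest)[0]

RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id],
    y_score[:, class_id],
    name=f"{class_of_interest} vs the rest",
)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves:\nVirginica vs (Setosa & Versicolor)")
plt.show()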

When the main interest is not the plot but the ROC-AUC score itself, we can compute it directly with roc_auc_score. Here we compute the micro-averaged One-vs-Rest score, which aggregates the contributions of all classes.

from sklearn.metrics import roc_auc_score

micro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score,
    multi_class="ovr",
    average="micro",
)

print(f"Micro-averaged One-vs-Rest ROC AUC score:\n{micro_roc_auc_ovr:.2f}")

One vs One

One vs One is very similar to One vs All, but instead of comparing each class with the rest, we compare all possible pairs of classes.

When there are three classes to distinguish, for instance, three binary comparisons are needed. For the Iris plants dataset these are setosa vs versicolor, versicolor vs virginica, and virginica vs setosa.
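As a quick illustration of how the number of comparisons grows, the sketch below enumerates the unordered pairs with itertools.combinations. The helper name ovo_pairs is made up for this example.

```python
from itertools import combinations


def ovo_pairs(classes):
    """Return every unordered pair of classes as a list of 2-tuples."""
    return list(combinations(sorted(classes), 2))


pairs = ovo_pairs(["setosa", "versicolor", "virginica"])
print(pairs)

# In general K classes give K * (K - 1) / 2 pairs, so 10 classes need 45 comparisons.
print(len(ovo_pairs(range(10))))
```

The quadratic growth is the main practical cost of One vs One compared with One vs All, which needs only one comparison per class.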

In the OvO scheme, the first step is to identify all possible unique pairs of classes. The score for a pair is computed by treating one of its elements as the positive class and the other as the negative class, then re-computing the score with the roles reversed and taking the mean of both scores.

from itertools import combinations
from sklearn.metrics import roc_curve, auc

pair_list = list(combinations(np.unique(y), 2))

pair_scores = []
mean_tpr = dict()
fpr_grid = np.linspace(0.0, 1.0, 1000)

for ix, (label_a, label_b) in enumerate(pair_list):
    a_mask = y_test == label_a
    b_mask = y_test == label_b
    ab_mask = np.logical_or(a_mask, b_mask)

    a_true = a_mask[ab_mask]
    b_true = b_mask[ab_mask]

    idx_a = np.flatnonzero(label_binarizer.classes_ == label_a)[0]
    idx_b = np.flatnonzero(label_binarizer.classes_ == label_b)[0]

    fpr_a, tpr_a, _ = roc_curve(a_true, y_score[ab_mask, idx_a])
    fpr_b, tpr_b, _ = roc_curve(b_true, y_score[ab_mask, idx_b])

    mean_tpr[ix] = np.zeros_like(fpr_grid)
    mean_tpr[ix] += np.interp(fpr_grid, fpr_a, tpr_a)
    mean_tpr[ix] += np.interp(fpr_grid, fpr_b, tpr_b)
    mean_tpr[ix] /= 2
    mean_score = auc(fpr_grid, mean_tpr[ix])
    pair_scores.append(mean_score)  # kept for the "by hand" macro-average

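    fig, ax = plt.subplots(figsize=(6, 6))
    # Mean of the two curves, plus one curve per choice of positive class.
    plt.plot(
        fpr_grid, mean_tpr[ix], linestyle=":", linewidth=4,
        label=f"Mean {label_a} vs {label_b} (AUC = {mean_score:.2f})",
    )
    RocCurveDisplay.from_predictions(
        a_true, y_score[ab_mask, idx_a], ax=ax,
        name=f"{label_a} as positive class",
    )
    RocCurveDisplay.from_predictions(
        b_true, y_score[ab_mask, idx_b], ax=ax,
        name=f"{label_b} as positive class",
    )
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(f"{label_a} vs {label_b} ROC curves")
    plt.show()

ROC curve for Multi-class One Vs One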

Area Under ROC Curve (AUC)

The AUC provides an aggregate performance measure across all possible thresholds. AUC values range from 0 to 1, and larger values indicate better classifier performance.

One can also check that the macro-average we computed “by hand” is equivalent to the average="macro" option of the roc_auc_score function.

macro_roc_auc_ovo = roc_auc_score(
    y_test,
    y_score,
    multi_class="ovo",
    average="macro",
)

print(f"Macro-averaged One-vs-One ROC AUC score:\n{macro_roc_auc_ovo:.2f}")
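For intuition about what the AUC number measures, the area can be computed by hand with the trapezoidal rule over the ROC points. This is a minimal sketch with a hypothetical helper, trapezoid_auc, and made-up curve points; it assumes the points are sorted by increasing FPR.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a piecewise-linear ROC curve given points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


# Hypothetical ROC points running from (0, 0) to (1, 1).
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
area = trapezoid_auc(fpr, tpr)
print(f"AUC = {area:.2f}")

# Reference values: the diagonal (chance level) and a perfect classifier.
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))
print(trapezoid_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))
```

The diagonal gives 0.5 and a perfect classifier gives 1.0, which is why scores are usually read relative to those two reference values.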
