If you choose the wrong metric to evaluate your models, you are likely to choose a poor model and be misled about the expected performance of your model.

Standard evaluation metrics treat all classes as equally important. For imbalanced classification problems, however, errors on the minority class typically matter more than errors on the majority class.

In the previous post, Calculate Precision, Recall and F1 score for Keras model, I explained precision, recall, and F1 score, and how to calculate them. In this post, I’ll explain another popular metric, the F1-Macro.

The F1 score is an important metric for evaluating classification models, especially with imbalanced classes, where binary accuracy can be misleading: a model that always predicts the majority class still reaches a high accuracy.

The dataset

The dataset is hosted on Kaggle and contains Wikipedia comments which have been labeled by human raters for toxic behavior.

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from tensorflow import keras  # assumed import; the callback below uses keras.callbacks.Callback

Counting the categories

Something important to notice is that the categories are not represented equally. Some of them are very infrequent, which makes them hard for any ML algorithm to learn.
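One quick way to see the imbalance is to sum each label column. The sketch below assumes the labels are stored as 0/1 columns named after the six categories. The small DataFrame is made-up toy data standing in for the Kaggle file, so the counts are not the real ones.

```python
import pandas as pd

# Hypothetical toy frame; in practice this would be read from the Kaggle CSV.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
df = pd.DataFrame(
    [
        [1, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 0, 1, 0, 0],
    ],
    columns=LABELS,
)

# Number of positive examples per category, most frequent first.
label_counts = df[LABELS].sum().sort_values(ascending=False)
# Share of comments that carry each label.
label_share = label_counts / len(df)

print(label_counts)
print(label_share)
```

A category whose share is close to zero gives the model very few positive examples to learn from.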

How to calculate the F1-macro score

First, compute the per-class precision and recall for all classes, then combine these pairs to compute the per-class F1 scores, and finally take the arithmetic mean of these per-class F1 scores as the F1-macro score. For example, if the F1 scores of “toxic”, “severe_toxic”, “obscene”, “threat”, “insult”, and “identity_hate” are 55%, 34%, 45%, 26%, 41%, and 29% respectively, the macro F1 score is:

Macro-F1 = (55% + 34% + 45% + 26% + 41% + 29%) / 6 = 38.33%
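The same steps can be sketched in a few lines of NumPy and checked against scikit-learn. The arrays below are toy multi-label data invented for illustration, and per_class_f1 is a hypothetical helper name, not part of any library.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label targets and predictions: rows are samples, columns are classes.
y_true = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]])

def per_class_f1(y_true, y_pred):
    """F1 per column from true positives, false positives and false negatives."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    # A class with no positives and no predictions gets F1 = 0 here.
    return np.divide(2 * tp, denom, out=np.zeros(len(denom)), where=denom > 0)

f1_per_class = per_class_f1(y_true, y_pred)
macro_by_hand = f1_per_class.mean()
macro_sklearn = f1_score(y_true, y_pred, average="macro")

# The article's example: arithmetic mean of the six per-class scores.
article_macro = np.mean([0.55, 0.34, 0.45, 0.26, 0.41, 0.29])

print(f1_per_class, macro_by_hand, macro_sklearn, article_macro)
```

Because every class contributes equally to the mean, a rare class with a poor F1 score pulls the macro score down as much as a frequent one.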

Keras custom callbacks

This metric is only meaningful when computed over the whole dataset rather than batch by batch, so we need a custom Keras callback that calculates F1-macro at the end of each epoch.
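To see why batch-wise values are a problem, note that F1 is not additive: the mean of per-batch F1 scores generally differs from the F1 score of the full dataset. A small sketch with made-up binary labels for a single class and an arbitrary batch size of 4:

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up binary labels for one class, split into two batches of 4.
y_true = np.array([1, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 1, 1])
batch_size = 4

# F1 computed separately on each batch, then averaged.
batch_scores = [
    f1_score(y_true[i:i + batch_size], y_pred[i:i + batch_size])
    for i in range(0, len(y_true), batch_size)
]
mean_of_batches = float(np.mean(batch_scores))

# F1 computed once over all samples.
whole_dataset = f1_score(y_true, y_pred)

print("mean of per-batch F1:", mean_of_batches)
print("F1 on whole dataset :", whole_dataset)
```

Here the first batch is predicted perfectly and inflates the batch average, while the score over all eight samples is noticeably lower.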

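class MetricsCallback(keras.callbacks.Callback):
    def __init__(self, x_val, y_val):  # assumption: validation data is passed in explicitly
        super().__init__()
        self.x_val, self.y_val = x_val, y_val
    def on_train_begin(self, logs=None):
        self.f1_macro = []  # one F1-macro value per epoch
    def on_epoch_end(self, epoch, logs=None):
        y_pred = (self.model.predict(self.x_val) > 0.5).astype(int)  # assumed 0.5 threshold on sigmoid outputs
        self.f1_macro.append(f1_score(self.y_val, y_pred, average='macro'))
        print(" F1 macro :", self.f1_macro[-1])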


on_train_begin is called once at the beginning of training. Here we initialize a list to hold the values, which are computed in on_epoch_end. Later on, we can access this list like any other instance variable.

Python’s scikit-learn (sklearn) library is one of the most widely used machine-learning packages, and it provides the sklearn.metrics.f1_score function, which computes F1-macro when called with average='macro'.

score=f1_score(y_true, y_pred, average='macro')
print(" F1 macro :",score)
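Note that f1_score expects hard labels, while a network with sigmoid outputs returns probabilities. Below is a minimal sketch, assuming a multi-label model with one sigmoid output per category and an arbitrary threshold of 0.5. The probabilities are invented and stand in for the output of model.predict.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented probabilities, shape (n_samples, n_labels); stands in for model.predict(x_test).
y_prob = np.array([[0.9, 0.2, 0.7],
                   [0.4, 0.8, 0.1],
                   [0.6, 0.3, 0.2],
                   [0.1, 0.6, 0.9]])
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 1, 1]])

threshold = 0.5  # assumption; a tuned per-label threshold may work better
y_pred = (y_prob >= threshold).astype(int)

score = f1_score(y_true, y_pred, average="macro")
print("F1 macro:", score)
```

The threshold is a design choice: lowering it for rare categories trades precision for recall and can change the macro score considerably.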

Now define the model, create an instance of the callback (named metrics here), and pass it to the fit function through the callbacks argument.

model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=batch_size, epochs=5, callbacks=[metrics])

Keras f1-macro

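Each per-class F1 score always lies somewhere between that class's precision and recall. But it behaves differently from the arithmetic mean: as a harmonic mean, F1 gives a larger weight to the lower of the two numbers, and F1-macro inherits this behavior from the per-class scores it averages.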
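Recall that each F1 score is the harmonic mean of precision and recall. The short sketch below compares it with the arithmetic mean for one arbitrary pair of values; f1_from is a hypothetical helper name used only for this illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.9, 0.1  # arbitrary illustrative values
arithmetic = (precision + recall) / 2
harmonic = f1_from(precision, recall)

print("arithmetic mean:", arithmetic)
print("F1 (harmonic)  :", harmonic)
```

With a high precision and a very low recall, the arithmetic mean still looks moderate, while F1 stays close to the weaker of the two values.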


F1-macro represents the final evaluation metric you really care about. Unlike the loss function, it should be intuitive enough to tell you how the model will perform in the real world.
