Understanding WeightedKappaLoss Using Keras

June 22, 2022 Post a Comment

I'm using Keras to try to predict a vector of scores (0-1) using a sequence of events. For example, X is a sequence of 3 vectors comprised of 6 features each, while y is a vector o

Solution 1:

Let we separate the goal to two sub-goals, we walk through the purpose, concept, mathematical details of Weighted Kappa first, after that we summarize the things to note when we try to use WeightedKappaLoss in tensorflow

PS: you can skip the understand part if you only care about usage

Weighted Kappa detailed explanation

Since the Weighted Kappa can be see as Cohen's kappa + weights, so we need to understand the Cohen's kappa first

Example of Cohen's kappa

Suppose we have two classifier (A and B) trying to classify 50 statements into two categories (True and False), the way they classify those statements wrt each other in a contingency table:

         B
         True False
A True   20   5     25 statements A think is true
  False  10   15    25 statements A think is false
         30 statements B think is true
              20 statements B think is false

Now suppose we want know: How reliable the prediction A and B made?

What we can do is simply take the percentage of classified statements which A and B agree with each other, i.e proportion of observed agreement denote as Po, so:

Po = (20 + 15) / 50 = 0.7

But this is problematic, because there have probability that A and B agree with each other by random chance, i.e proportion of expected chance agreement denote as Pe, if we use observed percentage as expect probability, then:

Pe = (probability statement A think is true) * (probability statement B think is true) +
     (probability statement A think is false) * (probability statement B think is false) 
   = (25 / 50) * (30 / 50) + 
     (25 / 50) * (20 / 50)
   = 0.5

Cohen's kappa coefficient denote as K that incorporate Po and Pe to give us more robust prediction about reliability of prediction A and B made:

K = (Po - Pe) / (1 - Pe) = 1 - (1 - Po) / (1 - Pe) = 1 - (1 - 0.7) / (1 - 0.5) = 0.4

We can see the more A and B are agree with each other (Po higher) and less they agree because of chance (Pe lower), the more Cohen's kappa "think" the result is reliable

Now assume A is the labels (ground truth) of statements, then K is telling us how reliable the B's prediction are, i.e how much prediction agree with labels when take random chance into consideration

Weights for Cohen's kappa

We define the contingency table with m classes formally:

                                    classifier 2
                       class.1  class.2  class... class.k  Sum over row
               class.1   n11      n12      ...      n1k      n1+  
               class.2   n21      n22      ...      n2k      n2+  
classifier 1   class...  ...      ...      ...      ...      ...  
               class.k   nk1      nk2      ...      nkk      nk+  
       Sum over column   n+1      n+2      ...      n+k      N   # total sum of all table cells

The table cells contain the counts of cross-classified categories denote as nij, i,j for row and column index respectively

Consider those k ordinal classes are separate from two categorical classes, e.g separate 1, 0 into five classes 1, 0.75, 0.5, 0.25, 0 which have a smooth ordered transition, we cannot say the classes are independent except the first and last class, e.g very good, good, normal, bad, very bad, the very good and good are not independent and the good should closer to bad than to very bad

Since the adjacent classes are interdependent then in order to calculate the quantity related to agreement we need define this dependency, i.e Weights denote as Wij, it assigned to each cell in the contingency table, value of weight (within range [0, 1]) depend on how close two classes are

Now let's look at Po and Pe formula in Weighted Kappa:

And Po and Pe formula in Cohen's kappa:

We can see Po and Pe formula in Cohen's kappa is special case of formula in Weighted Kappa, where weight = 1 assigned to all diagonal cells and weight = 0 elsewhere, when we calculate K (Cohen's kappa coefficient) using Po and Pe formula in Weighted Kappa we also take dependency between adjacent classes into consideration

Here are two commonly used weighting system:

Linear weight:

Quadratic weight:

Where, |i-j| is the distance between classes and k is the number of classes

Weighted Kappa Loss

This loss is use in case we mentioned before where one classifier is the labels, and the purpose of this loss is to make the model's (another classifier) prediction as reliable as possible, i.e encourage model to make more prediction agree with labels while make less random guess when take dependency between adjacent classes into consideration

The formula of Weighted Kappa Loss given by:

It just take formula of negative Cohen's kappa coefficient and get rid of constant -1 then apply natural logarithm on it, where dij = |i-j| for Linear weight, dij = (|i-j|)^2 for Quadratic weight

Following is the source code of Weighted Kappa Loss written with tensroflow, as you can see it just implement the formula of Weighted Kappa Loss above:

import warnings
from typing import Optional

import tensorflow as tf
from typeguard import typechecked

from tensorflow_addons.utils.types import Number

class WeightedKappaLoss(tf.keras.losses.Loss):
    @typechecked
    def __init__(
        self,
        num_classes: int,
        weightage: Optional[str] = "quadratic",
        name: Optional[str] = "cohen_kappa_loss",
        epsilon: Optional[Number] = 1e-6,
        dtype: Optional[tf.DType] = tf.float32,
        reduction: str = tf.keras.losses.Reduction.NONE,
    ):
        super().__init__(name=name, reduction=reduction)
        warnings.warn(
            "The data type for `WeightedKappaLoss` defaults to "
            "`tf.keras.backend.floatx()`."
            "The argument `dtype` will be removed in Addons `0.12`.",
            DeprecationWarning,
        )
        if weightage not in ("linear", "quadratic"):
            raise ValueError("Unknown kappa weighting type.")

        self.weightage = weightage
        self.num_classes = num_classes
        self.epsilon = epsilon or tf.keras.backend.epsilon()
        label_vec = tf.range(num_classes, dtype=tf.keras.backend.floatx())
        self.row_label_vec = tf.reshape(label_vec, [1, num_classes])
        self.col_label_vec = tf.reshape(label_vec, [num_classes, 1])
        col_mat = tf.tile(self.col_label_vec, [1, num_classes])
        row_mat = tf.tile(self.row_label_vec, [num_classes, 1])
        if weightage == "linear":
            self.weight_mat = tf.abs(col_mat - row_mat)
        else:
            self.weight_mat = (col_mat - row_mat) ** 2

    def call(self, y_true, y_pred):
        y_true = tf.cast(y_true, dtype=self.col_label_vec.dtype)
        y_pred = tf.cast(y_pred, dtype=self.weight_mat.dtype)
        batch_size = tf.shape(y_true)[0]
        cat_labels = tf.matmul(y_true, self.col_label_vec)
        cat_label_mat = tf.tile(cat_labels, [1, self.num_classes])
        row_label_mat = tf.tile(self.row_label_vec, [batch_size, 1])
        if self.weightage == "linear":
            weight = tf.abs(cat_label_mat - row_label_mat)
        else:
            weight = (cat_label_mat - row_label_mat) ** 2
        numerator = tf.reduce_sum(weight * y_pred)
        label_dist = tf.reduce_sum(y_true, axis=0, keepdims=True)
        pred_dist = tf.reduce_sum(y_pred, axis=0, keepdims=True)
        w_pred_dist = tf.matmul(self.weight_mat, pred_dist, transpose_b=True)
        denominator = tf.reduce_sum(tf.matmul(label_dist, w_pred_dist))
        denominator /= tf.cast(batch_size, dtype=denominator.dtype)
        loss = tf.math.divide_no_nan(numerator, denominator)
        return tf.math.log(loss + self.epsilon)

    def get_config(self):
        config = {
            "num_classes": self.num_classes,
            "weightage": self.weightage,
            "epsilon": self.epsilon,
        }
        base_config = super().get_config()
        return {**base_config, **config}

Usage of Weighted Kappa Loss

We can using Weighted Kappa Loss whenever we can form our problem to Ordinal Classification Problems, i.e the classes form a smooth ordered transition and adjacent classes are interdependent, like ranking something with very good, good, normal, bad, very bad, and the output of the model should be like Softmax results

We cannot using Weighted Kappa Loss when we try to predict the vector of scores (0-1) even if they can sum to 1, since the Weights in each elements of vector is different and this loss not ask how different is the value by subtract, but ask how many are the number by multiplication, e.g:

import tensorflow as tf
from tensorflow_addons.losses import WeightedKappaLoss

y_true = tf.constant([[0.1, 0.2, 0.6, 0.1], [0.1, 0.5, 0.3, 0.1],
                      [0.8, 0.05, 0.05, 0.1], [0.01, 0.09, 0.1, 0.8]])
y_pred_0 = tf.constant([[0.1, 0.2, 0.6, 0.1], [0.1, 0.5, 0.3, 0.1],
                      [0.8, 0.05, 0.05, 0.1], [0.01, 0.09, 0.1, 0.8]])
y_pred_1 = tf.constant([[0.0, 0.1, 0.9, 0.0], [0.1, 0.5, 0.3, 0.1],
                      [0.8, 0.05, 0.05, 0.1], [0.01, 0.09, 0.1, 0.8]])

kappa_loss = WeightedKappaLoss(weightage='linear', num_classes=4)
loss_0 = kappa_loss(y_true, y_pred_0)
loss_1 = kappa_loss(y_true, y_pred_1)
print('Loss_0: {}, loss_1: {}'.format(loss_0.numpy(), loss_1.numpy()))

Outputs:

# y_pred_0 equal to y_true yet loss_1 is smaller than loss_0
Loss_0: -0.7053321599960327, loss_1: -0.8015820980072021

Your code in Colab is working correctly in the context of Ordinal Classification Problems, since the function you form X->Y is very simple (int of X is Y index + 1), so the model learn it fairly quick and accurate, as we can see K (Cohen's kappa coefficient) up to 1.0 and Weighted Kappa Loss drop below -13.0 (which in practice usually is minimal we can expect)

In summary, you can using Weighted Kappa Loss unless you can form your problem to Ordinal Classification Problems which have labels in one-hot fashion, if you can and trying to solve the LTR (Learning to rank) problems, then you can check this tutorial of implement ListNet and this tutorial of tensorflow_ranking for better result, otherwise you shouldn't using Weighted Kappa Loss, if you can only form your problem to Regression Problems, then you should do the same as your original solution