Classification Metrics — why accuracy is inaccurate!

Saurabh
7 min read · Apr 28, 2022

It all began in prehistoric times. It was the chieftain’s daughter’s wedding. Angbo the cave dweller was tasked to separate good potatoes from spoiled ones. Angbo, being somewhat lazy, randomly put spuds in one or the other pile. This resulted in a rather disastrous dinner party. Needless to say, the chieftain was not impressed, and thus began the study of classification metrics.

Photo by Matt Atherton on Unsplash

In the modern world, we speak of artificial intelligence and machine learning, masterful chess computers and self-driving cars. These marvels are real and exist today, created solely by human ingenuity. One can reasonably imagine that these things have nothing tangible in common with what existed barely a hundred years ago. Who would have thought, in 1922, of a self-driving car, right? But if we pause and look under the hood of our modern car, we shall find Angbo’s ghost.

Let me clarify. Consider the very specific area of classification problems where we are trying to separate two (or more) classes of ‘things’ or ‘behaviors’ from each other. For the self-driving car, such a problem may manifest as the answer to a simple question — “Will I hit that pedestrian if I continue driving?”

Of course, the answer to that is paramount for the car, and perhaps even more so for the pedestrian in question. To obtain this answer, we would need a machine-learning model. We would then train this model till it can predict, with reasonable certainty, whether the pedestrian would be hit or spared on the current trajectory.

We can program the car to activate the braking system if the model predicts a collision.

At the same time, we don’t want it screeching to a halt every time a mere shadow flits across the street.

The problem we now face is almost identical to the one that tormented Angbo (or rather, his tribe). We need to know how good our predictions are. In other words, we need to evaluate our model. We can’t have pedestrians knocked about any more than we can serve bad potatoes at the chieftain’s party.

There are a number of metrics that can be used for this purpose. The easiest to understand is called “accuracy”. The formula is straightforward — add up the total number of correct predictions and divide it by the total number of predictions made.

If Angbo had a pile of a hundred potatoes to sort, of which twenty were bad, the process might have gone something like this. Angbo, since he was not really looking at the potatoes as he was sorting them, likely put quite a few bad ones into the good pile as he went along. The total numbers could have been, plausibly, as follows:

  • Good pile: 75 potatoes, of which 14 are bad and 61 are good. That is, 61 correct guesses.
  • Bad pile: 25 potatoes, of which 6 are bad and 19 are good. That is, 6 correct guesses. Note that in this pile, Angbo was supposed to put in the bad potatoes, so the good ones are the misclassified ones.

If we were to evaluate Angbo’s performance using accuracy, we would say he got 67 (61 + 6) correct out of a total of a hundred. In other words, Angbo’s accuracy was 67% (or 0.67). Also note, since Angbo is just randomly throwing potatoes into piles, he embodies what is called a no-skill model.
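If you prefer to see that arithmetic spelled out, here is a minimal sketch in Python (the counts are the ones from the story above; the variable names are simply my own labels):

```python
# Angbo's original sorting effort: 100 potatoes, of which 20 were actually bad
correct_in_good_pile = 61  # good potatoes correctly placed in the good pile
correct_in_bad_pile = 6    # bad potatoes correctly placed in the bad pile
total_potatoes = 100

# Accuracy = correct predictions / total predictions
accuracy = (correct_in_good_pile + correct_in_bad_pile) / total_potatoes
print(f"Accuracy: {accuracy:.2f}")  # 0.67
```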

So now that we understand how accuracy is calculated, let us see why it is a poor choice to evaluate model performance in some situations. To do this, let us transport Angbo to an ultra-modern Machine Learning centre. We will also need to buy him a hundred new potatoes, because as you will remember, his stock was all consumed at the chieftain’s party.

Time for a twist in the tale. Since we bought the potatoes at a modern store that puts consumer interests first, we got ninety-nine good potatoes and only one bad one.

So we have Angbo, sitting in a corner, sorting the potatoes. Now Angbo is a smart fellow, even if somewhat lazy, and he has somehow caught on to the fact that we will be measuring his performance using accuracy. So he develops a strategy and makes just one giant pile of potatoes and calls them all “good potatoes”.

Our crack team of evaluators is flabbergasted. Obviously, Angbo did not do any sorting at all. But they are duty-bound to perform an objective evaluation, so they do, and the numbers come out as follows.

  • Good pile: 100 potatoes, of which 99 are good and 1 is bad.
  • Bad pile: 0 potatoes.
  • Accuracy: 99 correct/100 total = 99%

So Angbo created a process that was able to generate 99% accuracy. Now you can see where the problem lies. It is inherent to the distribution of the potatoes. Since we had so few bad ones (just one, in fact), labelling them all as “good” was apparently a smart move.

To rephrase, when classification problems have skewed data with gross class imbalance, accuracy becomes a poor representation of model skill. Let’s consider one more example.

Imagine a machine learning model that is meant to detect fraudulent credit-card transactions. Since we believe most transactions to be genuine, the classes (“fraud”, “genuine”) are quite imbalanced. A no-skill model that classifies every transaction as genuine would appear to do very well on such data, if accuracy were the sole criterion for judging model performance.

So what can we do about it?

We need to recognize this specific limitation of accuracy and consider some other metrics. There are quite a few of these. Among the most commonly used are precision and recall.

Before moving to these measures, we need to understand a few more technical terms relating to the output of our classification models. Let’s dive right in.

Positive Class: The positive class needs to be defined for a given situation. For example, for Angbo, a bad potato may have been considered the positive class. Usually, we label the rare event, or the one we are actively looking for, as the positive class. In the credit card case, the positive class would then be the fraudulent transaction.

Negative Class: The other class. That is, the good potato and the genuine transaction. Of course, there may be more than two classes in some scenarios, but we are not going there today.

Let’s do a quick recap of what Angbo originally did.

  • Bad pile (Positive Class): 25 potatoes, of which 6 are actually bad and 19 are actually good.
  • Good pile (Negative Class): 75 potatoes, of which 14 are actually bad and 61 are actually good.

True Positives: The number of positive predictions that actually belong to the positive class. In Angbo’s original effort, this would have been 6 potatoes. Remember, our positives are bad potatoes.

True Negatives: The number of negative predictions that actually belong to the negative class. In Angbo’s original effort, this would have been 61 potatoes.

False Positives: The number of positive predictions that actually belong to the negative class, and are thus incorrect. Angbo got 19 of these.

False Negatives: The number of negative predictions that actually belong to the positive class, and are thus incorrectly predicted. Angbo got 14 of these: the bad potatoes that slipped into the good pile.

If we were to depict those numbers pictorially, with the actual classes as rows and the predicted classes as columns, we would get something like this:

  • Actually bad: 6 predicted bad (true positives), 14 predicted good (false negatives)
  • Actually good: 19 predicted bad (false positives), 61 predicted good (true negatives)

This lovely arrangement has a name: it’s called a confusion matrix. It tells us how our predictions fared when compared to the actual class labels.
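If you would rather let a library do the counting, here is a minimal sketch using scikit-learn’s confusion_matrix. The label lists are my own reconstruction of Angbo’s piles (1 for a bad potato, our positive class, and 0 for a good one); only the counts come from the story:

```python
from sklearn.metrics import confusion_matrix

# 1 = bad potato (positive class), 0 = good potato (negative class)
# Predictions reproduce Angbo's piles:
#   bad pile (predicted 1): 6 actually bad, 19 actually good
#   good pile (predicted 0): 14 actually bad, 61 actually good
y_true = [1] * 6 + [0] * 19 + [1] * 14 + [0] * 61
y_pred = [1] * 6 + [1] * 19 + [0] * 14 + [0] * 61

# With labels=[1, 0], the layout is [[TP, FN], [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[ 6 14]
#  [19 61]]
```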

Now we have the required information to talk about precision and recall.

Precision = True Positives / (True Positives + False Positives)

Precision refers to the fraction of correctly identified positives in the heap that was labelled positive. It can also be looked at as the fraction of relevant instances among the retrieved instances. For Angbo, this number would be 6/25.

Recall = True Positives/ (True Positives + False Negatives)

Recall is a measure of how many bad potatoes Angbo was able to detect, out of the total number of bad potatoes that existed. The answer to that, of course, is: not enough, as the chieftain would have us know. In numerical terms, it is 6/20. Put another way, recall is the fraction of relevant instances that were retrieved by the model.
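Here is a minimal sketch of both calculations for Angbo’s original effort, using the counts from the confusion matrix above (plain Python, nothing library-specific):

```python
# Angbo's original sorting, with the bad potato as the positive class
true_positives = 6    # bad potatoes correctly placed in the bad pile
false_positives = 19  # good potatoes wrongly placed in the bad pile
false_negatives = 14  # bad potatoes that slipped into the good pile

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision:.2f}")  # 6/25 = 0.24
print(f"Recall: {recall:.2f}")        # 6/20 = 0.30
```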

Now, we can better appreciate the value these measures bring. Let us calculate them for our imbalanced credit card problem and a no-skill model that predicts all transactions as genuine. The confusion matrix may look something like this (with the fraudulent transaction as the positive class):

  • Actually fraudulent: 0 predicted fraudulent (true positives), 1 predicted genuine (false negative)
  • Actually genuine: 0 predicted fraudulent (false positives), 99 predicted genuine (true negatives)

Let’s work the numbers. Of a hundred transactions (the sum of all four cells), there was only one fraudulent one. The remaining 99 were genuine.

Our no-skill model predicted them all to be genuine, and therefore the “predicted fraudulent” column is completely empty.

With this data, the following metrics can be calculated:

  1. Accuracy: 99/100 = 0.99
  2. Precision: 0/0 = undefined
  3. Recall: 0/1 = 0
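The same numbers can be reproduced with scikit-learn in a minimal sketch (the transaction labels below are simply made up to match the story: 1 marks the single fraudulent transaction, and the no-skill model predicts 0, i.e. genuine, for everything). Since 0/0 is undefined, precision_score is told via its zero_division argument what to return in that case:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1 = fraudulent (positive class), 0 = genuine; one fraud in a hundred transactions
y_true = [1] + [0] * 99
y_pred = [0] * 100  # the no-skill model calls everything genuine

print(accuracy_score(y_true, y_pred))                    # 0.99
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 (0/0 is undefined; reported as 0 here)
print(recall_score(y_true, y_pred))                      # 0.0
```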

The accuracy score is really high at 0.99, but this happened only because of the imbalanced classes. In this situation, recall is a better measure, and its value of 0 truly captures the lack of skill of the model.

There are other measures that can capture how skillful a model is, such as the F1-score, which combines the information available from precision and recall. Another very useful measure is the ROC-AUC (Receiver Operating Characteristics — Area Under the Curve), but I suspect you have had enough for one day.
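Still, as a small parting taste, the F1-score is just the harmonic mean of precision and recall; here is a minimal sketch using Angbo’s numbers from above:

```python
precision, recall = 6 / 25, 6 / 20  # Angbo's original precision and recall

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1-score: {f1:.2f}")  # 0.27
```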

So I shall leave you here, pondering over classification metrics, a problem eloquently captured by William Shakespeare’s “to be, or not to be” dilemma. But before we part ways, a disclaimer is necessary.

Disclaimer: Since we talk about accuracy, it would be appropriate to state here that the accuracy of Angbo’s story, or even his existence, is debatable at best.


References and Further Reading:

  1. https://machinelearningmastery.com/confusion-matrix-machine-learning/
  2. https://en.wikipedia.org/wiki/Precision_and_recall
  3. https://en.wikipedia.org/wiki/Receiver_operating_characteristic
  4. https://nosweatshakespeare.com/quotes/soliloquies/to-be-or-not-to-be/
  5. http://www.marcovanetti.com/pages/cfmatrix/
