# A Guide to Base Rate Error in Machine Learning

Machine learning models are evaluated by testing them. We rely on many statistical tests, but no statistical test is perfect. Some errors are easy to understand but hard to catch, and the base rate fallacy is one of them. The concept comes from behavioral science. In this article, we will discuss this error and how it applies to machine learning. The main points to be discussed are listed below.

## Contents

1. What is the base rate?
2. What is the base rate fallacy?
3. Base rate error in machine learning
4. Why does the base rate fallacy occur?
5. How to avoid the fallacy of the base rate?

Let’s start by understanding the base rate first.

## What is the base rate?

In statistics, the base rate is the probability of a class when it is not conditioned on any evidence from features. We can also think of base rates as prior probabilities. For example, if 2% of the world’s population are engineers, the base rate of engineers is simply 2%.
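As a minimal sketch of this definition, the base rates of a labeled dataset can be estimated by counting how often each class occurs, without looking at any features. The `base_rates` helper and the toy label list below are illustrative assumptions, not part of any library.

```python
from collections import Counter


def base_rates(labels):
    """Return the fraction of samples in each class (the empirical prior)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy data: 2 engineers among 100 people.
labels = ["engineer"] * 2 + ["other"] * 98
rates = base_rates(labels)
print(rates)
```

Nothing about an individual person is used here, which is exactly what makes these numbers priors rather than predictions.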

In many statistical analyses, results are hard to interpret without the base rate. Let’s say 2,000 people beat covid-19 using a particular treatment. This seems like a good number until we look at the entire population that underwent the same treatment. Suppose we find that the treatment was applied to 100,000 people, so the base rate of treatment success is only 1/50 (2%). That one number gives us a much clearer picture of the treatment.

The example above shows how important base rate information is when performing statistical analysis. Failing to use the base rate in a statistical analysis is termed a base rate error. Let’s see what the base rate fallacy is.

## What is the base rate fallacy?

In a general sense, a fallacy is the use of faulty reasoning or invalid moves when constructing an argument. A fallacious argument can seem stronger than it actually is.

The base rate fallacy is one such error, also known as base rate bias or base rate neglect. It arises when we are given both base rate information (general) and specific information (about the individual case): the base rate data tends to be ignored in favor of the individualized data. The fallacy can also be considered a form of extension neglect.

## Base rate error in machine learning

As discussed above, this error comes from ignoring information, and machine learning models work entirely from information (that is, data). Consider classification models, where we use the confusion matrix to describe performance.

A confusion matrix is created by testing the model on the test data; it tells us how many of the model’s predictions were right and how many were wrong. In the confusion matrix, the false negative paradox and the false positive paradox are examples of base rate error.

Let’s say there is a machine learning model for facial recognition of happy people that yields more false positive results than true positives. Suppose the model is 99% accurate and analyzes 1,000 people every day. When the target class is rare in that population, the sheer number of tests can outweigh the high accuracy, and the end result will contain many more false positives than true positives.

The probability that a positive result is correct depends on both the accuracy of the test and the composition of the sampled population. In summary, if the proportion of the population that has a given condition is lower than the test’s false positive rate, the test will produce more false positives than true positives, and ignoring this is the base rate error.

Let’s understand it with an example in which a model classifies a population of 1,000 samples. Suppose 40% of the samples belong to class A, and the model has a false positive rate of 5% and a false negative rate of zero.

Class A samples (actual positives):

1000 × (40/100) = 400, and with a zero false negative rate all of these samples receive a true positive.

Class B samples (actual negatives):

1000 × [(100 − 40)/100] × 0.05 = 30, these samples will receive a false positive.

So 1000 − (400 + 30) = 570 samples receive a true negative.

The final precision will be

400/(30 + 400) ≈ 93%

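The confusion matrix will look like this:

| | Predicted class A | Predicted class B |
|---|---|---|
| Actual class A | 400 | 0 |
| Actual class B | 30 | 570 |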

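Now say the same model is applied to a different set of 1,000 samples in which only 2% belong to class A. The confusion matrix will then look like this:

| | Predicted class A | Predicted class B |
|---|---|---|
| Actual class A | 20 | 0 |
| Actual class B | 49 | 931 |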

In this case, only 20 of the 69 samples predicted as class A are predicted correctly. Thus, the probability that a positive prediction is correct drops to about 29%, for the same model whose precision was 93% in the previous test.
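The two scenarios above can be reproduced with a few lines of arithmetic. The sketch below uses a hypothetical helper, `confusion_counts`, that takes the population size, the base rate of class A, and the model's false positive and false negative rates; the function and variable names are assumptions made for illustration.

```python
def confusion_counts(n_samples, base_rate, false_positive_rate, false_negative_rate=0.0):
    """Expected confusion-matrix counts for a binary classifier."""
    actual_pos = round(n_samples * base_rate)      # class A samples
    actual_neg = n_samples - actual_pos            # class B samples
    fn = round(actual_pos * false_negative_rate)   # class A missed by the model
    fp = round(actual_neg * false_positive_rate)   # class B flagged as class A
    tp = actual_pos - fn
    tn = actual_neg - fp
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def precision(counts):
    """Fraction of positive predictions that are actually positive."""
    return counts["tp"] / (counts["tp"] + counts["fp"])


# Same model (5% false positive rate, no false negatives), two base rates.
balanced = confusion_counts(1000, 0.40, 0.05)
rare = confusion_counts(1000, 0.02, 0.05)

print(balanced, round(precision(balanced), 2))
print(rare, round(precision(rare), 2))
```

Only the base rate changes between the two calls, yet precision falls from roughly 93% to roughly 29%.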

## Why does the base rate fallacy occur?

Studies suggest a number of reasons behind the fallacy, and they all come down to a matter of relevance: we ignore base rate information. Most of the time, base rate information is classified as irrelevant and is therefore left out before it is ever processed. Sometimes the representativeness heuristic is also the reason for the base rate error.

## How to avoid the fallacy of the base rate?

As mentioned above, ignoring base rate information leads to the base rate error, so we can avoid it by paying attention to base rate information. We may also need to recognize which pieces of sample-specific evidence are not as reliable predictors as we think.

We need to put in more effort when we measure the probability of an event occurring. Bayesian methods help us estimate the probability distribution of the quantity of interest (for example, the uplift) by explicitly combining the base rate, as a prior, with the observed evidence, which makes them a way to reduce base rate error.
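One hedged way to make this concrete is Bayes' theorem, which combines the base rate (the prior) with a model's error rates to give the probability that a positive prediction is correct. The function name `posterior_given_positive`, its parameters, and the numbers used are illustrative assumptions; this is a sketch of the general rule rather than a specific library's method.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(class A | predicted A) via Bayes' theorem."""
    true_pos = prior * sensitivity                    # P(predicted A and class A)
    false_pos = (1.0 - prior) * false_positive_rate   # P(predicted A and class B)
    return true_pos / (true_pos + false_pos)


# Assumed numbers for illustration: a rare class (1% base rate),
# a model that catches 99% of positives, and a 5% false positive rate.
posterior = posterior_given_positive(
    prior=0.01, sensitivity=0.99, false_positive_rate=0.05
)
print(f"{posterior:.3f}")
```

Because the prior appears explicitly in the calculation, it cannot be silently dropped, which is the practical appeal of the Bayesian approach here.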

## Last words

In this article, we have discussed the base rate error, which can be found in the results of models used to make predictions and which occurs when base rate information is ignored. Along with this, we have discussed how this error occurs and how we can avoid it.