Evaluating model performance: Generalization, Bias-Variance tradeoff and overfitting vs. underfitting


The main objective of machine learning models is to perform well on new, previously unseen inputs. This is known as generalization. That is, it’s essential for the model to work correctly across a whole range of new inputs, not just the examples it was trained on.

To analyze the performance of a model, the available data is divided into two disjoint subsets. One of these subsets, the “training set”, is used to train the parameters of the model. The other subset, the “validation set”, is used to estimate the generalization error during and after training. In general, between 70% and 80% of the data is used for training, while the rest is used for validation.
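A minimal sketch of such a split in pure Python (the 80/20 fraction and fixed seed here are illustrative choices):

```python
import random

def train_val_split(data, train_fraction=0.8, seed=0):
    """Shuffle the dataset and split it into disjoint training and validation subsets."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, val = train_val_split(range(100))
print(len(train), len(val))  # 80 20
```

Shuffling before splitting matters: if the data is ordered (e.g., by class), a plain slice would give training and validation sets with different distributions.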

There are two main sources of error in machine learning models: bias and variance. These concepts can help us determine whether it’s necessary to incorporate more data, as well as decide which tactics can be applied to improve the performance of the models, improving time management and efficiency in the process.

We can describe these concepts in an easy-to-understand way as:
1. Bias: the error rate of a learning algorithm on the training set.
2. Variance: how much worse the performance of an algorithm on the validation set is with respect to the training set. That is, the gap between training and validation error.
As a result, how well a machine learning algorithm works is determined by its ability to:

  • reduce training error
  • reduce the gap between training and validation error.
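These two informal definitions amount to simple arithmetic on the error rates. A tiny helper (error values here are percentages):

```python
def bias_variance(train_error, val_error):
    """Bias ~ training error; variance ~ gap between validation and training error."""
    bias = train_error
    variance = val_error - train_error
    return bias, variance

print(bias_variance(1, 11))  # (1, 10): low bias, high variance
```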

These two factors correspond in turn to the two main challenges encountered in machine learning on a daily basis: underfitting and overfitting.

Underfitting occurs when the model is not able to obtain a sufficiently low error on the training set, meaning it’s not able to find or model relevant patterns in the input data. In turn, overfitting occurs when the performance of the model on the training set is good, but it degrades considerably on the validation set, meaning the gap between the training error and the validation error is too big.

In the latter case, the model works well on already known instances (from the training set), but can’t generalize properly (make predictions on unseen examples, represented by the validation set).
In [2], Andrew Ng presents a series of examples that consider error metrics on a classifier, which help interpret the concepts of underfitting, overfitting, bias and variance:

  • Example 1: Overfitting, low bias and high variance
    Training error = 1%
    Validation error = 11%
    This algorithm has low bias (1%) but suffers from high variance (10%). The classifier has a low training error, but fails to properly generalize on the validation set.
  • Example 2: Underfitting, high bias and low variance
    Training error = 15%
    Validation error = 16%
    In this example, bias is 15% and variance is 1%. The classifier fits poorly to the training set with a 15% error (underfitting), but the error on the validation set is only slightly higher than this.
  • Example 3: High bias and high variance
    Training error = 15%
    Validation error = 30%
    In this example, bias is 15%, and variance is 15%. This classifier displays low performance on the training set (high bias), and even worse performance on the validation set (high variance).
  • Example 4: Good classifier, low bias and low variance
    Training error = 0.5%
    Validation error = 1%
    This is an example of a good classifier, with low bias and low variance.
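The four examples above can be reproduced with a small diagnostic helper. The 5% thresholds below are illustrative assumptions, not values from [2]; in practice the acceptable levels depend on the problem:

```python
def diagnose(train_error, val_error, bias_threshold=5.0, variance_threshold=5.0):
    """Classify a model's regime from its error percentages (thresholds are illustrative)."""
    bias = train_error
    variance = val_error - train_error
    issues = []
    if bias > bias_threshold:
        issues.append("high bias (underfitting)")
    if variance > variance_threshold:
        issues.append("high variance (overfitting)")
    return issues or ["looks good"]

# The four examples, in order:
for tr, va in [(1, 11), (15, 16), (15, 30), (0.5, 1)]:
    print(tr, va, diagnose(tr, va))
```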

It is important to point out that not all problems present the same degree of complexity. For instance, classifying images of cats is not the same as implementing a voice recognition system, where there may be a lot of background noise and the audio may be unintelligible even to a person.

This is why another error must be considered to determine the performance of a model: the optimal error rate, also known as Bayes error rate.

In the first of the previous cases (classification of cat images), it’s fair to say that an optimal error rate close to 0% can be expected, while in the second case a considerably higher error, close to 10%, is possible.
With this consideration in mind, the definition of bias changes slightly, being:

bias = optimal error rate + avoidable bias

Here, avoidable bias represents the difference between training error and optimal error rate, that is, it reflects how much worse the performance of the algorithm on the training set is with respect to an optimal classifier.

Let’s go back to example 2 shown above, where we had a high-bias classifier:

  • Training error = 15%
  • Validation error = 16%

If we consider that this classifier has an optimal error rate of 14%, then the error attributed specifically to the avoidable bias is 1%, as is the error attributed to the variance. We can then say that the algorithm has achieved a very good performance, with little room for improvement. In fact, its validation error is just 2% above the optimal error rate.
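The arithmetic of this example, spelled out (all values in percent):

```python
train_error, val_error, optimal_error = 15.0, 16.0, 14.0

avoidable_bias = train_error - optimal_error  # 1.0: room for improvement on the training set
variance = val_error - train_error            # 1.0: gap between validation and training error
gap_to_optimal = val_error - optimal_error    # 2.0: validation error above the optimal rate

print(avoidable_bias, variance, gap_to_optimal)
```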

In actual practice, the optimal error rate is difficult to estimate. A good way around this is to calculate the performance that can be achieved by people performing the task in question (as long as the task can be performed by humans) and use this error as an estimate of the optimal error rate. This is mentioned in [2] as human level performance.

What are some good alternatives to reduce BIAS and VARIANCE?

These are some good alternatives that may be useful when dealing with high BIAS and/or high VARIANCE algorithms:

Alternatives to reduce BIAS

  • Increase the capacity/complexity of the model (see next article, model capacity and bias-variance tradeoff) (number/type of inputs, number of neurons/layers in neural networks): this allows the model to fit the training set better, although variance might increase.
  • Reduce or eliminate regularization. Simply put, regularization is a technique that prevents the algorithm from fitting the training data too closely (overfitting), so reducing regularization could help the model better fit the training data, although it might increase variance.
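As a toy illustration of the first point, fitting quadratic data with polynomial models of increasing degree shows how added capacity lowers training error (the data, noise level, and degrees here are arbitrary choices):

```python
import numpy as np

# Synthetic data with a quadratic relationship plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 * x**2 + rng.normal(0, 0.05, size=x.shape)

def train_mse(degree):
    """Fit a polynomial of the given degree and return its mean squared error on the training set."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

# A degree-1 model underfits this data; a degree-2 model has the capacity to match it.
print(train_mse(1) > train_mse(2))  # True
```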

Alternatives to reduce VARIANCE

  • Add more training data.
  • Add or increase regularization; this might increase bias.
  • Apply early stopping: stop gradient descent early, based on the validation error.
  • Apply feature selection to reduce the number/type of input features.
  • Reduce the capacity/complexity of the model (see next article, model capacity and bias-variance tradeoff); this might increase bias.
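The early-stopping rule mentioned in the list above can be sketched as a simple function over the per-epoch validation errors. The `patience` parameter, an assumption here, controls how many epochs without improvement are tolerated:

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch to stop at: the one with the best validation error,
    once `patience` epochs have passed without improvement."""
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return best_epoch

# Validation error improves, then starts rising as the model overfits:
print(early_stopping([0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.25]))  # 3
```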

Alternatives to reduce both:


  • Modify the input features considering error analysis observations. Error analysis might encourage the creation of additional features that help the algorithm eliminate particular errors.
  • Modify the model architecture (in the case of neural networks). Different types of architectures might be more suitable to a specific problem.

Workflow to reduce BIAS and VARIANCE

The following figure attempts to represent a workflow that can be used to evaluate and improve the performance of a machine learning algorithm, based on the ideas and concepts put forward by Andrew Ng in his online courses:

As the figure shows, the performance of the algorithm on the training set is analyzed first. In case of high BIAS (underfitting), some of the techniques are applied to reduce it and the model is then retrained.

Otherwise, the performance of the algorithm on the validation set is analyzed to detect a HIGH VARIANCE (overfitting) issue. If so, any of the appropriate techniques are applied before going back to the beginning.

It is necessary to repeat this cycle from the beginning, not only to validate the applied strategy, but also to corroborate that an improvement in the VARIANCE has not affected the BIAS of the model, since, as we previously saw, there is a trade-off between BIAS and VARIANCE (techniques to deal with these issues usually reduce one at the expense of increasing the other).
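The workflow described above can be sketched as a loop: check bias on the training set first, then variance on the validation set, apply a fix, and re-evaluate. The thresholds, round limit, and the toy `evaluate`/`reduce_*` callbacks below are all illustrative assumptions:

```python
def tuning_loop(evaluate, reduce_bias, reduce_variance,
                bias_threshold=5.0, variance_threshold=5.0, max_rounds=20):
    """Repeatedly diagnose and adjust until neither bias nor variance exceeds its threshold."""
    for _ in range(max_rounds):
        train_err, val_err = evaluate()
        if train_err > bias_threshold:                   # high BIAS: underfitting
            reduce_bias()
        elif val_err - train_err > variance_threshold:   # high VARIANCE: overfitting
            reduce_variance()
        else:
            return train_err, val_err                    # both under control
    return evaluate()

# Toy stand-in for a real model: each fix shaves a few points off the relevant error.
state = {"train": 15.0, "val": 30.0}
result = tuning_loop(
    evaluate=lambda: (state["train"], state["val"]),
    reduce_bias=lambda: state.update(train=state["train"] - 4),
    reduce_variance=lambda: state.update(val=state["val"] - 5),
)
print(result)  # (3.0, 5.0)
```

Note that the loop re-checks bias on every round, mirroring the point above: a variance fix can worsen bias, so the cycle must restart from the beginning.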

Key takeaway points

As we have seen, the bias and variance of a model are strongly related to the concepts of underfitting and overfitting. These are all factors that strongly impact the performance of a model and its ability to generalize and make predictions on new inputs. This is why it’s so important to always keep in mind how to detect these problems and the available alternatives to successfully deal with them. We will develop the concepts of model capacity (presented in this article as an alternative to deal with BIAS and VARIANCE) and learning curves in the next article. Model capacity is defined as the ability of the model to adjust to a wide variety of functions, and it’s a factor that directly influences overfitting and underfitting. Learning curves are a good technique to interpret bias and variance, and also to determine whether the dataset is large enough or whether collecting more data would be beneficial.

[1] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[2] Ng, A. Machine Learning Yearning: Technical Strategy for AI Engineers in the Era of Deep Learning. 2019.
