Model Capacity and Learning Curves
In our first entry we talked about some important techniques to evaluate and improve the performance of machine learning models. Today we dive into two major concepts that are directly related to that performance: model capacity and learning curves.
The first concept directly influences whether a model overfits or underfits. The second is a technique that helps identify bias and variance issues that may be affecting a model, and helps decide whether enlarging the data set is likely to improve its performance.
What is model capacity?
The capacity of a model can be defined as its ability to fit a wide variety of functions, that is, the range of functions that the model can select from as a possible solution to a problem. For example, the candidate solutions of a linear regression model are all degree-1 polynomials of the form y = w * x + b, one for each choice of w and b.
Machine learning algorithms generally work best when their capacity is appropriate for the true complexity of the task they need to perform and for the amount of training data they are given.
In this sense, low-capacity models may have difficulty fitting to the training set, and may not be able to solve complex tasks.
On the other hand, high-capacity models can solve complex tasks, but when their capacity is greater than needed to solve the current task, they can overfit, memorizing properties of the training set that do not generalize well or are not suitable to perform predictions on the validation set.
The following figure (taken from ) shows the relation between the error and the model capacity, where the generalization error (or validation error) draws a U-shaped curve depending on the capacity of the model.
On the far left, both the training error and the generalization error are high. This region corresponds to an underfit, high-bias model. As the capacity increases, the training error decreases, but the gap between the two errors widens, especially past the optimum capacity point, where the generalization error begins to increase. This region corresponds to an overfit model (low bias and high variance), whose capacity is well above the optimum.
In other words, the graph shows that as the capacity of a model increases, the variance tends to increase and the bias to decrease.
Goodfellow et al. show a simple example that illustrates the relation between capacity, underfitting and overfitting.
In the previous figure (taken from Goodfellow et al.), a linear, a quadratic and a degree-9 model are compared, all of which attempt to fit a problem where the true underlying function is quadratic. The linear function (left) is unable to capture the curvature of the real problem, so it underfits. The degree-9 predictor (right) is able to represent the correct function, but it is also capable of representing a large number of other functions that pass exactly through the training points, since it has more parameters than there are training examples. With so many different solutions available, there is very little chance of choosing one that generalizes well. The model basically ends up memorizing each training example without extracting the correct structure of the solution, so it overfits.
Finally, the quadratic model (center) best matches the true structure of the problem (as expected in this case, since the underlying function is known in advance), which is why it will generalize well to new data.
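To make this comparison concrete, here is a minimal sketch, assuming NumPy is available. The quadratic true_fn, the noise level, the sample sizes and the random seed are all assumptions chosen for illustration; they are not taken from Goodfellow et al. The sketch fits polynomials of degree 1, 2 and 9 to ten noisy training points and reports the error on those points and on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily


def true_fn(x):
    # Hypothetical quadratic ground truth, used only for this illustration.
    return 2.0 * x**2 - x + 0.5


# Ten noisy training points. A degree-9 polynomial has ten coefficients,
# so it can pass through every one of them.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(0.0, 0.1, size=x_train.shape)

# Fresh data from the same process, used to estimate the generalization error.
x_test = rng.uniform(-1.0, 1.0, size=1000)
y_test = true_fn(x_test) + rng.normal(0.0, 0.1, size=x_test.shape)


def mse(coeffs, x, y):
    """Mean squared error of the polynomial `coeffs` on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The degree-9 fit can drive the training error to practically zero because it interpolates the ten training points. What matters for generalization is how its error on the fresh data compares with that of the quadratic fit.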
It’s important to note that increasing the size of the dataset will not always help improve the performance of a machine learning model during training.
In this sense, learning curves become a good technique, not only to interpret the bias and variance of a model and diagnose whether it’s overfitting or underfitting, but to determine whether collecting more training data would further contribute to the improvement of the model’s performance.
In particular, each “learning curve” shows how the error varies as the size of the training set (m) increases.
The purpose of these curves is to analyze the effect of m (the number of training examples) on the training error and the validation error.
It is also useful to plot the optimal error rate (or the desired performance) on the same graph, since it gives a better idea of how the model is performing in terms of avoidable bias.
In general terms, and as seen in the previous figure (taken from ), learning curves behave or respond as follows: when the training set is small, it’s easy to adjust the model to each training example, so the training set error is zero or very small.
As the training set grows, it becomes increasingly difficult to fit every training example, so the training error generally increases with m. The opposite occurs with the validation error, which decreases as m grows and approaches the desired performance.
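The behavior described above can be reproduced with a small simulation. The following is a sketch in plain Python under stated assumptions: the data come from a hypothetical linear process y = 3x + 1 plus Gaussian noise, the model is a least-squares line, and each point of the curve is averaged over several random training sets to smooth it. None of these choices come from the cited sources.

```python
import random

random.seed(0)   # fixed seed, chosen arbitrarily
NOISE_STD = 0.5  # assumed noise level; NOISE_STD**2 acts as the optimal error


def sample(n):
    """Draw n noisy points from the hypothetical process y = 3x + 1 + noise."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 1.0 + random.gauss(0.0, NOISE_STD) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, mean_y - w * mean_x


def mse(w, b, xs, ys):
    """Mean squared error of the line (w, b) on the points (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(sizes, n_trials=30, n_val=500):
    """Return (m, mean training error, mean validation error) for each m."""
    x_val, y_val = sample(n_val)  # one fixed validation set
    curve = []
    for m in sizes:
        train_errs, val_errs = [], []
        for _ in range(n_trials):
            x_tr, y_tr = sample(m)  # a fresh training set of size m
            w, b = fit_line(x_tr, y_tr)
            train_errs.append(mse(w, b, x_tr, y_tr))
            val_errs.append(mse(w, b, x_val, y_val))
        curve.append((m, sum(train_errs) / n_trials, sum(val_errs) / n_trials))
    return curve


curve = learning_curve([2, 5, 10, 50, 200])
for m, train_err, val_err in curve:
    print(f"m={m:>3}  train={train_err:.3f}  val={val_err:.3f}")
```

With m = 2 the line passes through both training points, so the training error is essentially zero. As m grows, the two averaged errors should move toward each other and toward NOISE_STD**2, which plays the role of the optimal error rate in this toy setting.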
It is possible to estimate from the graph:
- the bias (and more precisely the avoidable bias) as the gap between the training error curve and the desired performance, and
- the variance, examining the gap between the validation learning curve and the training learning curve.
The bigger these gaps, the greater the bias and the variance respectively.
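The two gaps can be written down directly. This is a minimal sketch; the function name estimate_gaps and the error values are hypothetical and only illustrate the arithmetic.

```python
def estimate_gaps(train_error, val_error, desired_error):
    """Read avoidable bias and variance off a learning curve at a given m."""
    avoidable_bias = train_error - desired_error  # training curve vs. desired performance
    variance = val_error - train_error            # validation curve vs. training curve
    return avoidable_bias, variance


# Hypothetical readings taken at the right-hand end of a learning curve.
bias, variance = estimate_gaps(train_error=0.15, val_error=0.16, desired_error=0.05)
print(f"avoidable bias: {bias:.2f}, variance: {variance:.2f}")
```

In this made-up reading the avoidable bias is about ten times larger than the variance, which points to an underfitting problem rather than an overfitting one.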
Another consideration to keep in mind when analyzing these graphs is that, overall, the algorithm performs better on the training set, which is why the validation error curve is usually strictly above the training error curve.
Below are some graphic examples (also taken from ) and their respective analyses:
This is a high bias and low variance model. The training error is well above the expected value, but the gap between the two errors is small. It is a model that suffers from underfitting.
At this point, adding more training data would not help reduce bias, given that, as we said earlier, the training error will increase (or remain practically constant at most) as “m” grows, moving away from the desired performance. In addition, the validation error curve will remain above the training error curve, so achieving the desired performance will not be possible either.
That is why, at this point, some of the bias-reduction techniques mentioned in the previous article need to be applied.
This model presents a low bias and a high variance. The training error is low (even well below the desired performance), but the gap between the two errors is big. This model is suffering from overfitting.
At this point it may be beneficial to increase the size of the dataset. There is the possibility for the validation error curve to continue to decrease and converge with the training error curve (which still has room to grow with respect to the desired performance).
This is a low bias (avoidable bias), high variance model. The training error is low (close to the desired performance), but the gap between the two errors is big.
It could also be beneficial to incorporate more training data at this point. While it is possible that the training error increases, it’s also possible that it doesn’t exceed the desired performance by too much. That is, increasing the avoidable bias is probable, but at the expense of reducing the validation error by reducing the gap between the two errors and contributing to the generalization of the model.
In any case, if the bias increased considerably, some of the techniques introduced in the previous article should be applied additionally to counteract it.
This is a high bias (avoidable bias), high variance model. The training error is well above the desired performance, and the gap between the two errors is big.
In this instance, adding more training data would not seem to be an acceptable solution. It is likely that more training data could reduce the validation error, and thus the gap between the two errors (reducing variance in the process). But this would in turn increase the training error, further increasing the bias of the model, which is already high.
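The four cases above can be summarized as a simple decision rule. The sketch below is only an illustration: the function name diagnose, the tolerance threshold and the example error values are hypothetical, and what counts as a "large" gap depends on the problem. The final branch (both gaps small) is not one of the cases discussed above and is included only so the function always returns an answer.

```python
def diagnose(train_error, val_error, desired_error, tolerance=0.03):
    """Classify a model from its errors; `tolerance` is an assumed threshold.

    Returns (label, more_data_may_help).
    """
    high_bias = (train_error - desired_error) > tolerance  # avoidable bias gap
    high_variance = (val_error - train_error) > tolerance  # variance gap

    if high_bias and high_variance:
        # More data may narrow the gap but tends to raise the training error.
        return "high bias, high variance", False
    if high_bias:
        # Underfitting: apply bias-reduction techniques instead.
        return "high bias, low variance", False
    if high_variance:
        # Overfitting: the validation error may keep decreasing with more data.
        return "low bias, high variance", True
    return "low bias, low variance", False


# Hypothetical readings, one per case, with a desired error of 0.05.
for train_err, val_err in [(0.15, 0.16), (0.01, 0.15), (0.06, 0.20), (0.15, 0.30)]:
    label, more_data = diagnose(train_err, val_err, desired_error=0.05)
    print(f"train={train_err:.2f} val={val_err:.2f} -> {label}; more data may help: {more_data}")
```

A training error below the desired performance gives a negative avoidable bias, which the rule treats as low bias, matching the second example above.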
Key takeaway points
As we have seen in Parts I and II, bias and variance are closely tied to underfitting and overfitting, and to the concept of model capacity. There is a trade-off between bias and variance that depends on the capacity of the model: a more complex model (higher capacity) is prone to overfit (low bias and high variance), while simpler models (lower capacity) tend to underfit (high bias and low variance). That is why, although it is a hard task, finding a model whose capacity matches the complexity of the problem at hand leads to a balance between bias and variance, which in turn improves the model's performance.
We also introduced learning curves, a great tool to interpret the bias and variance of a model and to determine whether collecting new training data is worthwhile. This matters in practice, since collecting data can be very demanding in both time and cost, and that effort is wasted if it does not improve the performance of the model.
In particular, as the example graphs showed (and going back to the bias and variance reduction techniques mentioned in the previous article), increasing the size of the dataset can benefit models that suffer from high variance, since it helps reduce the gap between training error and validation error. It is not as effective for models that suffer from high bias: unless other measures are taken, increasing the size of the dataset only increases the bias further.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Ng, Andrew. Machine Learning Yearning: Technical Strategy for AI Engineers in the Era of Deep Learning. 2019.
Rodríguez, Jesús. A Different Way to Think About Overfitting and Underfitting in Machine Learning Part I: Capacity. 2017.