How do L1 and L2 regularization prevent overfitting?

Last Updated : 14 May, 2024

Overfitting is a recurring problem in machine learning that harms a model's ability to generalize to new data. Regularization is a useful tactic for addressing it, since it keeps models from becoming so complex that they are tailored too closely to the training set. L1 and L2, two widely used regularization techniques, offer different solutions to this problem. In this article, we will explore how regularization prevents overfitting.

How do we avoid Overfitting?

Overfitting occurs when a machine learning model learns the training data too well, to the extent that it starts to memorize noise and random fluctuations in the data rather than capturing the underlying patterns. This can result in poor performance when the model is applied to new, unseen data. Essentially, it's like a student who memorizes the answers to specific questions without truly understanding the material, and then struggles when faced with new questions or scenarios. Avoiding overfitting is crucial in developing robust and generalizable machine learning models.

To reduce overfitting, various techniques can be applied. These include dropout, which randomly deactivates neurons during training; adaptive regularization, which adjusts the regularization strength based on the data; early stopping, which halts training when validation performance plateaus; experimenting with different architectures; and applying L1 or L2 regularization. Here, we will focus on L1 and L2 regularization.

How do L1 and L2 regularization prevent overfitting?

L1 regularization, or Lasso regularization, introduces a penalty term based on the absolute values of the weights into the model's cost function. This penalty encourages the model to prioritize a smaller set of significant features, aiding in feature selection. By reducing feature complexity, L1 regularization helps prevent overfitting.

We can represent the modified loss function as:

L_{L1} = L_{original} + \lambda \sum_{i=1}^{n}|w_i|

Here,

  • L_{L1} is the new loss function with L1 regularization.
  • L_{original} is the original loss function without regularization.
  • \lambda is the regularization parameter.
  • n is the number of features.
  • w_i are the coefficients of the features.

The term \lambda \sum_{i=1}^{n}|w_i| penalizes large coefficients by adding their absolute values to the loss function.
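
As a quick illustration, the sketch below evaluates this L1-regularized loss with NumPy for a simple linear model. The data, the weights, and the choice of mean squared error as L_{original} are illustrative assumptions, not part of the formula itself.

import numpy as np

# Illustrative data and weights (assumptions for this sketch)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.3])
lam = 0.1  # regularization parameter (lambda)

# Original loss: mean squared error of a linear model (one common choice)
l_original = np.mean((y - X @ w) ** 2)

# L1 penalty: lambda times the sum of absolute weights
l1_penalty = lam * np.sum(np.abs(w))

l_l1 = l_original + l1_penalty
print(f"L_original = {l_original:.4f}, L1 penalty = {l1_penalty:.4f}, L_L1 = {l_l1:.4f}")

Because the penalty grows with |w_i|, gradient-based training is pushed toward smaller, and often exactly zero, weights.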

L2 regularization, also known as Ridge regularization, incorporates a penalty term proportional to the square of the weights into the model's cost function. This encourages the model to evenly distribute weights across all features, preventing overreliance on any single feature and thereby reducing overfitting.

We can represent the modified loss function as:

L_{L2} = L_{original} + \lambda \sum_{i=1}^{n} w_i^{2}

Here,

  • L_{L2} is the new loss function with L2 regularization.
  • L_{original} is the original loss function without regularization.
  • \lambda is the regularization parameter.
  • n is the number of features.
  • w_i are the coefficients of the features.

The term \lambda \sum_{i=1}^{n} w_{i}^{2} penalizes large coefficients by adding their squared values to the loss function.
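
The same sketch carries over to L2 by squaring the weights in the penalty term (again assuming mean squared error as L_{original}):

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.3])
lam = 0.1  # regularization parameter (lambda)

l_original = np.mean((y - X @ w) ** 2)

# L2 penalty: lambda times the sum of squared weights
l2_penalty = lam * np.sum(w ** 2)

l_l2 = l_original + l2_penalty
print(f"L_L2 = {l_l2:.4f}")

Unlike the absolute-value penalty, the squared penalty is differentiable everywhere, which is why L2 shrinks weights smoothly rather than cutting them to exactly zero.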

In essence, both L1 and L2 regularization techniques counter overfitting by simplifying the model and promoting more balanced weight distribution across features.

L1 vs L2 Regularization

Advantages

L1 Regularization (Lasso):
  • Feature selection: Encourages sparse models by driving irrelevant feature weights to zero.
  • Robust to outliers: Due to the absolute penalty, L1 regularization is less sensitive to outliers.
  • Interpretable models: Produces simpler, more interpretable models by emphasizing important features.

L2 Regularization (Ridge):
  • Smooths the model: Encourages a more balanced weight distribution across features, reducing over-reliance on any single feature.
  • Better for multicollinear features: Handles multicollinearity well by distributing weights evenly among correlated features.
  • Generally stable: Offers more stability in the presence of correlated predictors.

Disadvantages

L1 Regularization (Lasso):
  • Non-differentiable at zero: Optimization can be harder because the penalty is non-differentiable at zero, requiring specialized techniques.
  • May shrink coefficients too much: In some cases, L1 regularization may excessively shrink coefficients, leading to underfitting.
  • Works poorly with correlated features: May arbitrarily select one feature over another when features are highly correlated.

L2 Regularization (Ridge):
  • No feature selection: Does not drive any weights exactly to zero, leading to less sparse models.
  • Not robust to outliers: Can be sensitive to outliers due to the squared penalty term, potentially affecting model performance.
  • Less interpretable models: Tends to keep all features in the model, which can make interpretation more challenging.
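
To see the contrast from the table in practice, here is a minimal scikit-learn sketch. The synthetic dataset and the alpha value (scikit-learn's name for \lambda) are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data where only a few of the 10 features are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

print("Lasso coefficients:", lasso.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
print("Exact zeros (Lasso):", int((lasso.coef_ == 0).sum()))
print("Exact zeros (Ridge):", int((ridge.coef_ == 0).sum()))

With these settings, the Lasso fit typically zeroes out most of the uninformative features (feature selection), while the Ridge fit keeps all ten coefficients small but nonzero.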


