Regularization in Machine Learning: What I Learned by Implementing It From Scratch

Introduction: Why I Decided to Implement Regularization From Scratch
After implementing linear machine learning models, I learned about overfitting, the bias-variance tradeoff, and regularization in theory. I knew that L1 and L2 penalties existed, and I could explain their formulas. But that understanding was shallow, because I had never observed how regularization actually changes a model.
Most of us learn the equations, like λ‖w‖₁ or λ‖w‖², and then jump straight to results. What we rarely see is what happens to the weights during training, why some weights shrink while others disappear entirely, or why certain datasets make regularization look useless.
So instead of relying on machine learning libraries, I decided to implement logistic regression with L1, L2, and Elastic Net regularization from scratch using only NumPy. My goal was not to get better accuracy, but to understand behavior:
When does regularization help?
When does it feel useless?
Why does L1 perform feature selection while L2 does not?
What role does optimization play in all of this?
The Problem Regularization Is Meant to Solve:
At its core, regularization exists to control model complexity.
In logistic regression, the model learns a weight for every feature. If the dataset has many features, especially noisy or redundant ones, the model can assign large weights to patterns that appear important in the training data but do not generalize. This is classic overfitting: the model performs well on seen data and poorly on unseen data.
A model can achieve high training accuracy and even decent test accuracy while still relying on unstable or overly complex decision boundaries. Without inspecting the learned weights, it is easy to assume the model has learned something meaningful. We never find out whether a few features are receiving much larger weights than the rest, which is exactly what makes a model overfit and leads to high variance.
Regularization addresses this by penalizing large or unnecessary weights during training. Instead of only minimizing the loss function, the model is forced to balance two objectives:
Fit the data
Keep the model simple
Regularization does not act after training. It directly modifies the optimization process. Every gradient update is influenced by the penalty term, which changes how the model explores the parameter space.
Without regularization, logistic regression has no reason to prefer small weights over large ones if both achieve the same loss. With regularization, the model is explicitly biased toward simpler solutions. That bias is what improves generalization when the data is noisy, high-dimensional, or limited in size.
An example:
Imagine a dataset with two features:
x1: actually related to the outcome
x2: pure noise
A logistic regression model without regularization has no way to know this in advance. During training, it may notice that x2 accidentally correlates with the labels in the training set and assign it a large weight.
The model now looks something like:
$$\hat{y} = \sigma(3.2\,x_1 - 2.8\,x_2)$$
On the training data, this works well. But on new data, the noise feature x2 no longer behaves the same way, and the model’s performance drops.
Regularization changes what the model is rewarded for during training. By penalizing large weights, it forces the model to ask: “Is this feature important enough to justify a large weight?”
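To make the two-feature story concrete, here is a minimal sketch. Everything in it is an assumption for illustration and not taken from my experiments: the synthetic data, the helper name `train_logreg`, the sample size, the learning rate, and the value of λ. It trains the same tiny model with and without an L2 penalty so the two weight vectors can be compared side by side.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lam=0.0, lr=0.1, n_iters=2000):
    """Gradient descent on the log loss, with an optional L2 term (hypothetical helper)."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / n
        grad += (lam / n) * w              # L2 term; contributes nothing when lam == 0
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
n = 30                                      # deliberately small training set
x1 = rng.normal(size=n)                     # signal feature
x2 = rng.normal(size=n)                     # pure noise, generated independently of y
y = (x1 + 0.5 * rng.normal(size=n) > 0).astype(float)
X = np.column_stack([x1, x2])

w_plain = train_logreg(X, y, lam=0.0)
w_l2 = train_logreg(X, y, lam=5.0)

print("no penalty [w1, w2]:", w_plain)
print("L2 penalty [w1, w2]:", w_l2)
```

The second entry of each vector is the weight on the noise feature. How large it gets depends on the random draw, which is the point: with few samples, chance correlations are enough to earn a weight unless something pushes back.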
Regularization (Without Math)
At a high level, regularization is about adding preference to the learning process.
In standard logistic regression, training focuses on one objective: minimizing the loss on the training data. Any set of weights that achieves a low loss is acceptable, regardless of how large or complex those weights are.
Regularization introduces a second objective: Keep the model simple.
There is an important distinction. Regularization does not clean the data, remove features, or fix overfitting automatically. It simply biases the optimization process toward solutions that use smaller or fewer weights. The model is still free to fit the data well, but it must justify complexity by overcoming the penalty.
Different regularization methods encode different definitions of simplicity:
Some prefer many small weights
Others prefer few non-zero weights
Some try to balance both
In the next sections, I break down how L2, L1, and Elastic Net regularization each define “simplicity” in their own way, and how that definition directly affects what the model learns.
Ridge (L2) Regularization: Shrinking Weights, Not Removing Features
Ridge regularization, also known as L2 regularization, penalizes the squared magnitude of the model’s weights. In practice, this means that large weights are discouraged more strongly than small ones.
The penalty term added to the loss function is:
$$\lambda \lVert w \rVert_2^2$$
This has an important consequence during training. The gradient of the L2 penalty is proportional to the weight itself, which means that:
Large weights experience a strong pull toward zero
Small weights experience a weak pull
As a result, L2 regularization tends to shrink all weights smoothly, rather than eliminating any particular feature. Features that are useful keep non-zero weights, but their influence is moderated. Features that are less useful still remain in the model, just with very small coefficients.
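The proportional pull can be seen in isolation by applying only the penalty gradient and ignoring the data entirely. This is a toy sketch, not part of my training loop: the step size, λ, and starting weights are arbitrary, and the 1/n factor used later in the implementation is dropped for simplicity.

```python
import numpy as np

lr, lam = 0.1, 0.5                          # illustrative values only
w_start = np.array([3.0, 0.3, -0.03])       # a large, a medium, and a small weight
w = w_start.copy()

for _ in range(50):
    penalty_grad = lam * w                  # gradient of (lam / 2) * w**2
    w = w - lr * penalty_grad               # each step multiplies w by (1 - lr * lam)

print("start:", w_start)
print("after:", w)
print("ratio:", w / w_start)                # the same shrink factor for every weight
```

Every weight is scaled by the same factor at each step, so the large weight loses the most in absolute terms while the small one keeps getting smaller without ever landing on zero.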
From an intuition standpoint, we can say that L2 regularization encourages the model to spread importance across features instead of relying heavily on a small subset. This makes the model more stable, especially when features are noisy or correlated.
One key observation from my experiments was that L2 regularization rarely produced zero weights, even when many features were pure noise. Instead, the overall weight norm decreased, and the decision boundary became smoother. This confirmed that L2 is best viewed as a tool for controlling magnitude, not for feature selection.
In short:
L2 reduces variance
L2 improves numerical stability
L2 keeps all features, but with restrained influence.
This behavior becomes much clearer when contrasted with L1 regularization, which takes a very different approach to enforcing simplicity.
Lasso (L1) Regularization: Why Some Weights Die Completely
L1 regularization penalizes the absolute value of the weights rather than their squared magnitude. The penalty term added to the loss function is:
$$\lambda \lVert w \rVert_1$$
At first glance, it might look like a small change compared to L2. In practice, it leads to a fundamentally different behavior.
The key difference lies in how the penalty affects weights during optimization. The gradient of the L1 penalty does not depend on the size of the weight. Whether a weight is large or small, the regularization term applies roughly the same push towards zero.
This has an important consequence: small weights are punished just as aggressively as large ones. When a feature contributes only weakly to reducing the loss, the constant pull from the L1 penalty can overpower it and drive its weight all the way to zero.
Once a weight reaches zero, the model effectively stops using that feature.
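The constant pull is easiest to see with the data gradient switched off, so that only the penalty acts on the weights. As before, this is a toy sketch with arbitrary values for the step size, λ, and starting weights, and it is not the training loop itself.

```python
import numpy as np

lr, lam = 0.1, 0.5                          # illustrative values only
w_start = np.array([3.0, 0.3, -0.03])       # a large, a medium, and a small weight
w = w_start.copy()

for _ in range(50):
    penalty_grad = lam * np.sign(w)         # same magnitude for every non-zero weight
    w = w - lr * penalty_grad               # fixed-size step of lr * lam toward zero

print("start:", w_start)
print("after:", w)
```

Each weight moves by the same fixed amount per step. The large weight is still far from zero after 50 steps, while the two small ones reach zero quickly and then stay within one step size of it. With a plain sign-based update they tend to hover around zero rather than sit on it exactly, which is why counting near-zero weights with a small threshold is a reasonable way to measure sparsity here.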
From an intuition standpoint, L1 regularization encourages the model to make hard choices:
Either a feature is important enough to survive
Or it is removed entirely
This is why L1 regularization is often described as performing implicit feature selection. In datasets with many irrelevant or weakly informative features, L1 can dramatically simplify the model by discarding noise.
In my experiments, this behavior was visible not just in accuracy metrics, but directly in the learned weights. While L2 reduced the overall magnitude of weights, L1 produced a large number of near-zero values, revealing which features the model considered unnecessary.
However, this aggressive behavior comes with trade-offs. When features are highly correlated, L1 may arbitrarily select one feature and discard the others, even if all of them carry useful information. This limitation motivates the use of Elastic Net, which combines the strengths of both L1 and L2.
Elastic Net: Balancing Shrinkage and Sparsity
Elastic Net regularization was introduced to address the limitations of using L1 or L2 alone. It combines both penalties into a single objective, allowing the model to balance weight shrinkage and feature selection.
$$\lambda \left( \alpha \lVert w \rVert_1 + (1 - \alpha) \lVert w \rVert_2^2 \right)$$
Here the parameter α controls the trade-off:
α = 1 behaves like pure L1 regularization
α = 0 behaves like pure L2 regularization
Values in between mix both behaviors
The combination leads to a more flexible notion of simplicity. The L2 component stabilizes training by shrinking weights smoothly, while the L1 component still encourages sparsity by pushing less useful features toward zero.
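As a quick check of how α moves the penalty between the two extremes, the formula above can be written directly in NumPy. The function name `elastic_net_penalty` and the example weights are made up for illustration.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """lam * (alpha * ||w||_1 + (1 - alpha) * ||w||_2^2), as in the formula above."""
    l1 = np.sum(np.abs(w))
    l2_squared = np.sum(w ** 2)
    return lam * (alpha * l1 + (1.0 - alpha) * l2_squared)

w = np.array([1.0, -2.0, 0.5])              # arbitrary example weights
lam = 0.1

for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha:.1f}  penalty={elastic_net_penalty(w, lam, alpha):.4f}")
```

With α = 1 only the absolute values count, with α = 0 only the squares count, and anything in between is a weighted blend of the two.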
One scenario where Elastic Net becomes particularly useful is when features are highly correlated. In such cases:
L1 tends to select one feature and ignore the rest
L2 keeps all correlated features but shrinks them together
Elastic Net often keeps groups of correlated features while still removing irrelevant ones.
In my experiments, Elastic Net behaved as expected only when the dataset actually stressed these properties. On simple or low-dimensional datasets, its behavior often looked indistinguishable from L2. This reinforced an important lesson: the usefulness of a regularization method depends heavily on the structure of the data.
Elastic Net is not a default replacement for L1 or L2. It is a targeted tool for situations where feature correlation and stability both matter.
How L1 and L2 Penalties Differ Mathematically (Without Going Deep into Calculus)
Regularization penalties are applied per weight, not as abstract norms. Looking at them this way makes their behavior much easier to understand.
For a single feature with weight w, the L2 penalty is:
$$\lambda \, w^2$$
If the model has multiple features, this penalty is simply summed across all weights:
$$\lambda \sum_j w_j^2$$
The key property of this formulation is that the penalty grows quadratically. As a weight becomes larger, the cost of increasing it further rises very quickly. This strongly discourages large weights but still allows small ones to exist.
In contrast, the L1 penalty for a single weight is:
$$\lambda \, |w|$$
And across all features:
$$\lambda \sum_j |w_j|$$
Here, the penalty grows linearly with the weight magnitude. Increasing a weight from 0.1 to 0.2 is penalized just as much as increasing it from 2.0 to 2.1. There is no “safe zone” for small weights.
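The arithmetic behind that comparison is small enough to run by hand. This sketch uses λ = 1 and two hypothetical helpers, `l2_cost` and `l1_cost`, to compute the extra penalty paid for the same 0.1 increase at two different starting points.

```python
lam = 1.0                                   # illustrative value

def l2_cost(w):
    return lam * w ** 2

def l1_cost(w):
    return lam * abs(w)

def extra_cost(cost, lo, hi):
    """Additional penalty for growing a weight from lo to hi."""
    return cost(hi) - cost(lo)

for lo, hi in [(0.1, 0.2), (2.0, 2.1)]:
    print(f"{lo} -> {hi}:  "
          f"L2 extra = {extra_cost(l2_cost, lo, hi):.2f},  "
          f"L1 extra = {extra_cost(l1_cost, lo, hi):.2f}")
```

Under L2 the same step costs 0.03 near zero and 0.41 near 2.0, while under L1 it costs 0.10 in both places. Small weights are cheap to keep under L2 and get no such discount under L1.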
The difference explains the core behavioral contrast:
L2 softly discourages large weights but tolerates many small ones
L1 applies constant pressure, making weak features disappear entirely.
Regularization does not know which features are meaningful. It treats every weight equally and lets the optimization process decide which ones are worth keeping despite the penalty.
This mathematical perspective clarifies why L1 leads to sparsity and L2 leads to smooth shrinkage — a distinction that becomes even more obvious when looking at gradient updates and training dynamics in the implementation.
Implementation Strategy: Building Regularized Logistic Regression From Scratch
With the intuition and mathematical behavior of L1 and L2 clear, the next step was to translate those ideas into an actual learning algorithm.
The model follows the standard structure of logistic regression. Given an input feature vector x, the prediction is computed as:
$$\hat{y} = \sigma(w^\top x + b)$$
where the sigmoid function is:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Training minimizes the average binary cross-entropy loss over the dataset:
$$\mathcal{L}_{\text{log}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$$
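In NumPy, the prediction and the loss above take only a few lines. The clipping constant `eps` is my own addition to keep `log` away from zero, and the four-row dataset is made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    """Average binary cross-entropy; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Tiny made-up dataset: 4 samples, 2 features
X = np.array([[0.5, 1.0], [-1.5, 0.2], [2.0, -0.3], [-0.7, -1.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w, b = np.zeros(2), 0.0
loss_at_zero = log_loss(y, sigmoid(X @ w + b))
print(loss_at_zero)   # all predictions are 0.5, so the loss equals log(2)

w_better = np.array([2.0, 0.0])             # puts every sample on the correct side
print(log_loss(y, sigmoid(X @ w_better + b)))
```

Starting from zero weights, every prediction is 0.5 and the loss is log 2. That is a handy sanity check for the first iteration of any from-scratch implementation.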
Regularization is incorporated by adding a penalty term to this objective and modifying the gradient accordingly. Rather than writing separate training loops for each type of regularization, a single gradient loop was used, with the regularization behavior controlled through a parameter.
For L2 regularization, the gradient update adds a term proportional to each weight (the factor of 2 from differentiating w² is absorbed into λ):
$$\nabla_w \leftarrow \nabla_w + \frac{\lambda}{n} w$$
This directly implements the idea that large weights should be penalized more heavily, causing them to shrink smoothly over time.
For L1 regularization, the update uses the sign of each weight:
$$\nabla_w \leftarrow \nabla_w + \frac{\lambda}{n} \,\text{sign}(w)$$
This constant-magnitude penalty is what drives sparsity. Weak features are not protected by having small weights; they are pushed toward zero just as strongly as large ones.
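One side note: the sign-based update pushes weights toward zero, but on its own it tends to leave them hovering near zero rather than exactly on it. A common alternative, which is not what this article's implementation uses, is a soft-thresholding step applied after the data-gradient update (the proximal operator of the L1 penalty). The sketch below shows only that operator; the function name and the threshold are my own choices for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink every entry toward zero by t and clamp anything smaller than t to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([3.0, 0.3, -0.03, -1.0])       # arbitrary example weights
shrunk = soft_threshold(w, 0.05)            # 0.05 plays the role of lr * lam / n
print(shrunk)
```

Any weight whose magnitude is below the threshold becomes exactly zero, and everything else moves toward zero by the same fixed amount. It is the same constant pull, with a hard stop at zero.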
Elastic Net combines both effects using a mixing parameter α:
$$\nabla_w \leftarrow \nabla_w + \frac{\lambda}{n} \left( \alpha\,\text{sign}(w) + (1 - \alpha)\,w \right)$$
All models were trained using the same optimizer, learning rate, and number of iterations. This ensured that any differences in behavior were due solely to the regularization method and not to training artifacts.
A key takeaway from this stage was that regularization cannot be treated as a plug-in feature. It is deeply tied to the optimization process, influencing every parameter update and shaping the solution space from the very first step.
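Because the penalty lives inside the gradient, a typo there is easy to miss. One way to catch it is a finite-difference check of the analytic gradient against the objective it is supposed to come from. The sketch below is an assumption-laden illustration and not code from my project: the function names, the random data, and the choice to leave out the bias are all mine. It uses the penalties (λ/2n)‖w‖² for L2 and (λ/n)‖w‖₁ for L1, which are the ones that match the gradient terms above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, X, y, lam, reg):
    """Average log loss plus the penalty that matches the gradient terms above."""
    n = X.shape[0]
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    if reg == "l2":
        loss += lam / (2 * n) * np.sum(w ** 2)
    elif reg == "l1":
        loss += lam / n * np.sum(np.abs(w))
    return loss

def analytic_grad(w, X, y, lam, reg):
    n = X.shape[0]
    g = X.T @ (sigmoid(X @ w) - y) / n
    if reg == "l2":
        g = g + lam / n * w
    elif reg == "l1":
        g = g + lam / n * np.sign(w)
    return g

def numeric_grad(w, X, y, lam, reg, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        g[j] = (objective(w + step, X, y, lam, reg)
                - objective(w - step, X, y, lam, reg)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (rng.random(40) < 0.5).astype(float)
w = rng.normal(size=5)                      # non-zero weights, so |w| is differentiable here

max_err = {
    reg: np.max(np.abs(analytic_grad(w, X, y, 1.0, reg) - numeric_grad(w, X, y, 1.0, reg)))
    for reg in ("l2", "l1")
}
print(max_err)
```

If the two gradients disagree by more than rounding noise, the update rule and the objective are not describing the same model. The check is only meaningful away from w = 0 for L1, since the absolute value has no derivative there.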
First Experiment: When Regularization Didn’t Help (and Why)
After implementing L1, L2 and Elastic Net regularization, the first instinct was to test them on a simple classification dataset and compare accuracy across models. The expectation was that regularization would immediately improve performance and clearly demonstrate its benefits.
That is not what happened.
In the initial experiment, I trained four models:
Logistic regression without regularization
Logistic regression with L2 regularization
Logistic regression with L1 regularization
Logistic regression with Elastic Net
All models were trained using the same optimizer settings and evaluated using training and test accuracy.
Surprisingly, the results were almost identical across all models. The training and test performance showed little variation, and the learned weights did not differ significantly either. At first this looked like a mistake in the implementation.
| Model       | Train Acc | Test Acc |
|-------------|-----------|----------|
| No Reg      | 0.58      | 0.64     |
| L2          | 0.58      | 0.64     |
| L1          | 0.58      | 0.67     |
| Elastic Net | 0.58      | 0.64     |
However, after verifying the gradients and updates, it became clear that the issue was not in the code but in the dataset design. The dataset used in the first experiment had relatively few features and limited noise. In such scenarios, logistic regression is already simple enough that overfitting is unlikely to occur. Since the model was not relying on unstable or unnecessary features, adding a penalty term had very little effect on the learned solution.
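# Using scikit-learn to generate a custom dataset
from sklearn.datasets import make_classification

# Simple, low-dimensional dataset.
# make_classification has no n_noise argument: any feature that is not
# informative, redundant, or repeated is filled with random noise,
# so this call yields 3 informative features and 2 noise features.
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=42
)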
The experiment revealed an important but often overlooked lesson: Regularization only becomes meaningful when the dataset creates pressure for overfitting.
If the model is already well-constrained by the data, regularization has little left to change. That does not mean it is ineffective; it means the experiment is not challenging enough to reveal its behavior. This realization led to the design of a more demanding test case, where the dataset intentionally contained many noisy and irrelevant features. That experiment finally exposed the differences between L1, L2, and Elastic Net in a clear and measurable way.
Feature Selection Stress Test: Where Regularization Finally Showed Its Power
To make regularization matter, I redesigned the dataset to encourage overfitting. The new setup had many features, but only a small subset actually carried useful signal. The rest were pure noise.
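# 5 informative features; the remaining 95 are noise, because
# make_classification fills every feature that is not informative,
# redundant, or repeated with random values (it has no n_noise argument).
X, y = make_classification(
    n_samples=600,
    n_features=100,
    n_informative=5,
    n_redundant=0,
    random_state=42
)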
All models were trained with the same optimizer, learning rate, and number of iterations. The unregularized model achieved high training accuracy (0.967), but its weights were large and spread across many features. This indicated reliance on noise.
L2 regularization reduced the overall weight magnitude. However, most features still had non-zero weights. The model was smoother, but not simpler.
L1 regularization produced a very different result. A large number of weights were driven close to zero. Despite discarding many features, test accuracy improved. This confirmed that L1 was performing effective feature selection rather than just shrinking weights.
Elastic Net showed behavior dependent on the mixing parameter. In this dataset, where features were mostly independent, it behaved similarly to L1 or L2 depending on the chosen balance.
MODEL COMPARISON (Feature Selection Stress Test)
no_reg
Train accuracy : 0.967
Test accuracy : 0.942
||w|| : 4.41
Near-zero wts : 0
----------------------------------------
l2
Train accuracy : 0.966
Test accuracy : 0.945
||w|| : 3.40
Near-zero wts : 3
----------------------------------------
l1
Train accuracy : 0.967
Test accuracy : 0.963
||w|| : 3.81
Near-zero wts : 74
----------------------------------------
elasticnet
Train accuracy : 0.967
Test accuracy : 0.958
||w|| : 4.41
Near-zero wts : 0
----------------------------------------
Immediate Observations
Training accuracy barely changed: All models fit the training data almost equally well (0.966 to 0.967). Regularization did not weaken learning.
L1 behaved very differently: It pushed most weights close to zero (74 out of 100), clearly performing feature selection.
L2 shrank weights but did not eliminate them: The weight norm dropped, but most features stayed active.
Elastic Net behaved like no regularization here: With the chosen hyperparameters, sparsity did not emerge.
At this point, accuracy alone is misleading. The real difference lies in how the model represents the solution, not how well it fits the training data.
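One way to look past accuracy is to sweep λ and record the weight norm and the number of near-zero weights at each setting. The sketch below is self-contained and separate from the experiment above. The synthetic data, the helper `fit_l1`, the λ grid, and all other values are assumptions for illustration, and it uses the same sign-based L1 update and the same 1e-3 near-zero threshold as the rest of this article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l1(X, y, lam, lr=0.05, n_iters=3000):
    """Gradient descent on the log loss with the sign-based L1 term (hypothetical helper)."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err / n + (lam / n) * np.sign(w))
        b -= lr * err.mean()
    return w

# Synthetic data: 3 informative features followed by 47 pure-noise features
rng = np.random.default_rng(42)
n_samples, n_features, n_informative = 300, 50, 3
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:n_informative] = [3.0, -3.0, 2.0]
y = (X @ true_w + 0.5 * rng.normal(size=n_samples) > 0).astype(float)

results = {}
for lam in [0.0, 0.5, 1.0, 2.0]:
    w = fit_l1(X, y, lam)
    norm = np.linalg.norm(w)
    near_zero = int(np.sum(np.abs(w) < 1e-3))
    results[lam] = (norm, near_zero)
    print(f"lam={lam:<4}  ||w||={norm:.2f}  near-zero weights={near_zero}")
```

Reading the two columns together says more than a single accuracy number. The norm shows how much total weight the model is spending, and the near-zero count shows how many features it has effectively stopped using.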
Why these results made sense:
The behaviors seen in the results follow directly from how each regularization term affects the weights.
No regularization: There is no cost to keeping unnecessary weights. So even noise features stay active.
L2 (Ridge): Penalizes large weights, encouraging the model to shrink all weights rather than eliminate them. This reduces over-reliance on any single feature but keeps most features in the model.
L1 (Lasso): Applies the same push regardless of weight size. Weak features are pushed all the way to (or very near) zero, which explains the large number of near-zero weights and the implicit feature selection.
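Elastic Net: Combines both penalties. In this experiment, the L1 component was not strong enough to enforce sparsity, so its behavior resembled no regularization. Because the weight norm and near-zero count match the unregularized run exactly, it is also worth confirming that the penalty was applied at all: a reg_type string that matches none of the branches in fit leaves the gradient untouched.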
Regularization does not limit learning capacity; it guides how the model distributes importance across features.
Now I would like to share the implementation code for regularized logistic regression so that readers can test it on their own.
import numpy as np


class RegularizedLogisticRegression:
    def __init__(self, lr=0.01, n_iters=1000, lam=0.01, l1_ratio=0.5, reg_type=None):
        self.lr = lr                # Learning rate
        self.n_iters = n_iters      # Number of iterations
        self.reg_type = reg_type    # None, 'l1', 'l2', or 'elasticnet'
        self.lam = lam              # Regularization strength
        self.l1_ratio = l1_ratio    # L1 share of the elastic net penalty

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0

        for _ in range(self.n_iters):
            linear = np.dot(X, self.w) + self.b
            y_hat = self._sigmoid(linear)

            # Gradients of the average log loss
            dw = (1 / n_samples) * np.dot(X.T, (y_hat - y))
            db = (1 / n_samples) * np.sum(y_hat - y)

            # Regularization (applied to the weights only, never the bias)
            if self.reg_type == 'l2':
                dw += (self.lam / n_samples) * self.w
            elif self.reg_type == 'l1':
                dw += (self.lam / n_samples) * np.sign(self.w)
            elif self.reg_type == 'elasticnet':
                dw += (self.lam / n_samples) * (
                    self.l1_ratio * np.sign(self.w)
                    + (1 - self.l1_ratio) * self.w
                )

            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        linear = np.dot(X, self.w) + self.b
        return (self._sigmoid(linear) >= 0.5).astype(int)
Final code for testing and comparing all the regularization methods.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from visualization import (
    plot_weight_distribution,
    plot_sparsity_vs_lambda,
    plot_weight_norm_vs_lambda
)
from regularized_logistic_regression import RegularizedLogisticRegression

# Dataset designed to highlight L1 behavior
X, y = make_classification(
    n_samples=500,
    n_features=100,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_clusters_per_class=1,
    class_sep=2.0,
    flip_y=0.0,
    random_state=42
)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Models
models = {
    "no_reg": RegularizedLogisticRegression(
        lr=0.05, n_iters=5000
    ),
    "l2": RegularizedLogisticRegression(
        lr=0.05, n_iters=5000,
        reg_type="l2", lam=1.0
    ),
    "l1": RegularizedLogisticRegression(
        lr=0.05, n_iters=5000,
        reg_type="l1", lam=1.0
    ),
    # reg_type must be exactly "elasticnet" to match the branch in fit();
    # any other string silently trains without a penalty.
    "elasticnet": RegularizedLogisticRegression(
        lr=0.05, n_iters=5000,
        reg_type="elasticnet", lam=1.0, l1_ratio=0.7
    ),
}

print("\nMODEL COMPARISON (Feature Selection Stress Test)\n")

# Train & evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    weight_norm = np.linalg.norm(model.w)
    zero_weights = np.sum(np.abs(model.w) < 1e-3)
    print(f"{name}")
    print(f" Train accuracy : {train_acc:.3f}")
    print(f" Test accuracy : {test_acc:.3f}")
    print(f" ||w|| : {weight_norm:.2f}")
    print(f" Near-zero wts : {zero_weights}")
    print("-" * 40)

# Plot the weight distributions of three of the trained models
plot_models = [models["no_reg"], models["l2"], models["l1"]]
labels = ["No Reg", "L2", "L1"]
plot_weight_distribution(
    plot_models,
    labels,
    title="Weight Distribution Comparison"
)
Conclusion:
Implementing regularization from scratch changed how I understand it beyond textbook definitions.
L2 regularization does not remove features. It controls weight magnitude and stabilizes learning.
L1 regularization actively performs feature selection by pushing weak features to zero.
Elastic Net is not “better by default” — its behavior depends entirely on how much L1 influence is present.
One important takeaway was that accuracy alone hides the real effect of regularization.
All models fit the training data equally well, but they arrived at very different internal representations.
Writing the code myself made the trade-offs clear:
regularization is not about improving scores blindly, but about enforcing simpler, more meaningful models.
This understanding would not have been possible by only using high-level libraries. In real-world scenarios, where data quality is one of the deciding factors, choosing the features that actually affect the outcome is an important task.
Ultimately, regularization is less about tuning hyper-parameters and more about making deliberate choices on how a model should behave under uncertainty.
Implementing it from scratch turned regularization from a formula into a concrete, observable design decision.
Thanks for reading, and I hope this breakdown helps you understand regularization beyond just formulas and library defaults.
