L1/L2 Regularization Simulator — Lasso vs Ridge

Visualize the behavior of the two regularizers that fight overfitting in machine learning: L1 (Lasso) and L2 (Ridge). Change the regularization strength λ and the original regression coefficients to compare in real time how L1 drives weights to exactly zero for a sparse model while L2 shrinks every coefficient proportionally.

Parameters

Regularization strength λ

Larger means stronger shrinkage. 0 = no regularization (OLS)

Original weight A (OLS estimate)

Unregularized coefficient A (a strong feature)

Original weight B

A small coefficient B (possibly a noise feature)

Original weight C

A negative coefficient C (a feature with opposite sign)

Results

Weight A (L1)

Weight A (L2)

Weight B (L1)

Weight B (L2)

Weights zeroed by L1

L1 mean shrinkage (%)

Shrinkage map — input weight w₀ → regularized weight

The x-axis is the original weight w₀, the y-axis is the regularized weight. L1 zeroes weights inside the [−λ,λ] band (the pulsing region); L2 is a straight line through the origin with slope 1/(1+λ). Points A, B, C are the current three weights.

Weight A vs regularization strength λ

Three-weight comparison (original / L1 / L2)

Theory & Key Formulas

$$w_{L2}=\frac{w_0}{1+\lambda}\qquad\text{(Ridge: proportional shrinkage)}$$

L2 regularization shrinks the original weight w₀ by the factor 1/(1+λ), the same proportion for every weight. For any finite λ, a nonzero w₀ is never shrunk to exactly zero.

$$w_{L1}=\operatorname{sign}(w_0)\max(|w_0|-\lambda,\,0)\qquad\text{(Lasso: soft-threshold)}$$

L1 regularization shrinks weights with the soft-threshold operator. Weights with |w₀| ≤ λ become exactly zero, giving a sparse model. L2 produces no such exact zeros.
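As a minimal sketch of the two closed forms above, the shrinkage rules can be written in plain Python. The function names `ridge_shrink` and `soft_threshold` and the value chosen for weight C are illustrative assumptions, not taken from the simulator's source.

```python
def ridge_shrink(w0: float, lam: float) -> float:
    """L2 (Ridge) closed form: proportional shrinkage by 1 / (1 + lam)."""
    return w0 / (1.0 + lam)


def soft_threshold(w0: float, lam: float) -> float:
    """L1 (Lasso) closed form: move toward zero by lam, clamp at zero."""
    magnitude = max(abs(w0) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # avoid returning -0.0 for negative inputs
    return magnitude if w0 > 0 else -magnitude


# Example weights: A = 3.0 and B = 0.6 follow the text; C = -1.5 is hypothetical.
lam = 1.0
for name, w0 in [("A", 3.0), ("B", 0.6), ("C", -1.5)]:
    l1 = soft_threshold(w0, lam)
    l2 = ridge_shrink(w0, lam)
    print(f"{name}: w0={w0:+.2f}  L1={l1:+.2f}  L2={l2:+.2f}")
```

With λ = 1, weight B is inside the [−λ, λ] band, so the L1 value is exactly zero while the L2 value is only halved.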

$$\hat{w}=\arg\min_w\;\tfrac12(w-w_0)^2+\lambda\,R(w),\quad R_{L1}=|w|,\;R_{L2}=\tfrac12 w^2$$

Both balance "closeness to the original estimate w₀" against the penalty R(w). For an orthonormal design (XᵀX = I), the regression decouples into one such scalar problem per coefficient, and each solution has the closed form given by the two equations above.
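One way to see where the two closed forms come from is to set the derivative of the scalar objective (for L1, the subgradient) to zero. The following is a derivation sketch under that scalar objective only, not a general proof for arbitrary designs.

$$\text{L2:}\quad \frac{d}{dw}\Big[\tfrac12(w-w_0)^2+\tfrac{\lambda}{2}w^2\Big]=(w-w_0)+\lambda w=0\;\Longrightarrow\;w_{L2}=\frac{w_0}{1+\lambda}$$

$$\text{L1:}\quad 0\in(w-w_0)+\lambda\,\partial|w|,\qquad \partial|w|=\begin{cases}\{\operatorname{sign}(w)\}, & w\neq 0\\ [-1,\,1], & w=0\end{cases}$$

$$w>0:\; w=w_0-\lambda\ (\text{requires } w_0>\lambda),\qquad w<0:\; w=w_0+\lambda\ (\text{requires } w_0<-\lambda),\qquad w=0:\; |w_0|\le\lambda$$

Collecting the three cases gives the soft-threshold operator sign(w₀)·max(|w₀|−λ, 0).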

What is L1/L2 Regularization?

🙋

I hear "regularization" all the time in machine learning. What is it actually for?

🎓

In short, it is a brake against overfitting. A model trying to fit the training data can keep pushing its coefficients (weights) larger and larger. Then it matches the training data perfectly but predicts nothing useful on new data. So we add a rule to the loss function: "pay a penalty if a weight grows too large." The two classic versions are L1 and L2. As you raise λ in this tool, you can see both of them shrink the weights.

🙋

L1 and L2 both "shrink weights", right? So what's the difference?

🎓

The way they shrink is decisively different. L2 (Ridge) uses w₀/(1+λ): it pulls every weight down by the same proportion. With λ = 1, for example, 3 becomes 1.5 and 0.6 becomes 0.3. L1 (Lasso) instead uses a soft-threshold mechanism that drives any weight with |w₀| ≤ λ to exactly zero and moves every larger weight toward zero by λ. Set weight B to 0.6 and λ to 1.0 on the left. L2 gives 0.30, but L1 gives 0.00: feature B disappears from the model entirely.

🙋

Wait, it makes it zero? That means it's throwing the feature away, right?

🎓

Exactly — it is "automatic feature selection". A feature with a zero weight is never used in the prediction, so it effectively drops out of the model. In real problems you often have, say, 100 explanatory variables but only 10 that truly matter. With Lasso the other 90 coefficients become zero, giving an interpretable model where you can see at a glance which variables matter. That flat band in the middle of the shrinkage map is the "zeroing range" set by λ.
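The "100 variables, 10 that matter" picture can be mimicked with a toy example. This is only a sketch: the weight values are made up, and it applies the scalar soft-threshold rule to each coefficient independently, which assumes standardized, uncorrelated features.

```python
def soft_threshold(w0: float, lam: float) -> float:
    """Scalar L1 closed form: sign(w0) * max(|w0| - lam, 0)."""
    magnitude = max(abs(w0) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if w0 > 0 else -magnitude


# Hypothetical OLS estimates: 10 strong features followed by 90 weak ones.
strong = [2.0 if i % 2 == 0 else -2.0 for i in range(10)]
weak = [0.3 if i % 2 == 0 else -0.3 for i in range(90)]
ols_weights = strong + weak

lam = 0.5
lasso_weights = [soft_threshold(w, lam) for w in ols_weights]

# Features whose weight survived the threshold are the "selected" ones.
selected = [i for i, w in enumerate(lasso_weights) if w != 0.0]
num_zeroed = len(lasso_weights) - len(selected)

print("selected feature indices:", selected)
print("weights zeroed by L1:", num_zeroed)
```

Every weak weight lies inside the [−λ, λ] band and is zeroed, so only the ten strong features remain in the model.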

🙋

Then L1 always sounds better — does L2 ever have a job to do?

🎓

No, each has its strength. L2 shrinks weights smoothly, so when there is a group of strongly correlated features it pulls them all down evenly and keeps things stable. L1 in that case tends to keep just one of the correlated group and zero the rest — and which one survives can flip with small data fluctuations. So: use Ridge when "many variables each contribute a little" or variables are correlated, use Lasso when "only a few variables matter." If you want both strengths, mix them with Elastic Net.
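For the same scalar setting used in this tool, Elastic Net also has a closed form: soft-threshold first, then shrink proportionally. The sketch below assumes the objective ½(w−w₀)² + λ₁|w| + (λ₂/2)w². The names `lam1` and `lam2` are illustrative, and libraries typically parameterize the mix differently.

```python
def elastic_net_shrink(w0: float, lam1: float, lam2: float) -> float:
    """Scalar Elastic Net: soft-threshold by lam1, then divide by (1 + lam2)."""
    magnitude = max(abs(w0) - lam1, 0.0)
    if magnitude == 0.0:
        return 0.0
    shrunk = magnitude / (1.0 + lam2)
    return shrunk if w0 > 0 else -shrunk


# lam2 = 0 recovers the Lasso rule; lam1 = 0 recovers the Ridge rule.
for w0 in (3.0, 0.6, -1.5):
    print(
        f"w0={w0:+.2f}",
        f"lasso={elastic_net_shrink(w0, 1.0, 0.0):+.3f}",
        f"ridge={elastic_net_shrink(w0, 0.0, 1.0):+.3f}",
        f"elastic={elastic_net_shrink(w0, 1.0, 1.0):+.3f}",
    )
```

The L1 part still produces exact zeros for small weights, while the L2 part adds the smooth shrinkage that helps with correlated features.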

🙋

How do you decide how big λ should be? Is bigger always safer?

🎓

No, too big is bad. Raising λ shrinks the weights and simplifies the model, but push it too far and you crush even the weights you genuinely need, and the model can no longer capture the structure of the data — that is underfitting. Too small a λ, and you overfit. So in practice you use cross-validation: try many λ values on a logarithmic scale and pick the one with the smallest validation error. On the "Weight A vs λ" chart you will see the L1 weight land on zero at some λ, while L2 approaches zero but never quite touches it.
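The cross-validation procedure can be sketched with the standard library alone. The example below is a simplified illustration on synthetic data: a one-feature ridge regression without intercept, penalized as Σ(y − w·x)² + λw² (a different scaling from the tool's formula), with a hypothetical 5-fold split.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """One-feature ridge without intercept: argmin of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, k=5):
    """Mean squared validation error over k folds (sample i goes to fold i % k)."""
    squared_errors = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        squared_errors.extend((y - w * x) ** 2 for x, y in valid)
    return sum(squared_errors) / len(squared_errors)


# Synthetic data (assumption): true slope 2.0 plus unit-variance noise.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

# Candidate values on a logarithmic scale: 0.001, 0.01, ..., 1000.
lambdas = [10.0 ** e for e in range(-3, 4)]
errors = {lam: cv_error(xs, ys, lam) for lam in lambdas}
best_lam = min(errors, key=errors.get)

for lam in lambdas:
    print(f"lambda={lam:g}  validation MSE={errors[lam]:.3f}")
print("selected lambda:", best_lam)
```

The largest λ in the grid shrinks the slope almost to zero, so its validation error is clearly worse than the selected value's: that is the underfitting end of the trade-off.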

Frequently Asked Questions

The biggest difference is whether a weight can be driven to exactly zero. L1 regularization (Lasso) shrinks weights with the soft-threshold operator w_L1 = sign(w0)·max(|w0|−λ, 0), setting any weight with |w0| ≤ λ to exactly zero. This automatically discards unimportant features and yields a sparse model. L2 regularization (Ridge), by contrast, shrinks every weight by the same proportion through w_L2 = w0/(1+λ) but never reaches an exact zero. A simple way to remember it: L1 does feature selection, L2 does smooth shrinkage.

L1 produces exact zeros because the L1 penalty λ|w| has a non-differentiable corner at w=0. During optimization, as long as the loss reduction from moving the weight slightly away from zero is no larger than the increase in the L1 penalty (slope λ), the weight stays pinned at zero. For an orthonormal design this condition is simply |w0| ≤ λ, which gives the soft-threshold operator sign(w0)·max(|w0|−λ,0). The L2 penalty (λ/2)w² is smooth with zero derivative at w=0, so there is no force pinning the weight to zero and only proportional shrinkage occurs.
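The pinning effect can be checked numerically without any calculus. The sketch below brute-forces the scalar objective ½(w−w₀)² + λR(w) on a fine grid and reports the minimizer. The grid range and resolution are arbitrary choices for illustration.

```python
def grid_argmin(w0, lam, penalty, lo=-5.0, hi=5.0, steps=20001):
    """Return the grid point w minimizing 0.5 * (w - w0)**2 + lam * penalty(w)."""
    best_w, best_val = lo, float("inf")
    for i in range(steps):
        w = lo + (hi - lo) * i / (steps - 1)
        val = 0.5 * (w - w0) ** 2 + lam * penalty(w)
        if val < best_val:
            best_w, best_val = w, val
    return best_w


def l1_penalty(w):
    return abs(w)          # R_L1 = |w|, corner at w = 0


def l2_penalty(w):
    return 0.5 * w * w     # R_L2 = w^2 / 2, flat at w = 0


w0, lam = 0.6, 1.0
w_l1 = grid_argmin(w0, lam, l1_penalty)
w_l2 = grid_argmin(w0, lam, l2_penalty)
print(f"L1 minimizer: {w_l1:.4f}")
print(f"L2 minimizer: {w_l2:.4f}")
```

With |w₀| ≤ λ the L1 minimizer sits exactly on the grid point w = 0, while the L2 minimizer lands near w₀/(1+λ) = 0.3.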

Lasso (L1) suits problems with many features where you believe only a few really matter: the coefficients of irrelevant features become zero, making the model easier to interpret. Ridge (L2) is more stable when many features each contribute a little, or when there are groups of strongly correlated features. Lasso tends to keep just one feature from a correlated group, while Ridge shrinks the whole group evenly. When you want a balance of both, use Elastic Net, which combines an L1 and an L2 penalty.

λ is a hyperparameter that controls the trade-off between overfitting and underfitting, and the standard way to pick it is cross-validation. A larger λ shrinks the weights more and makes the model simpler, but if it is too large the model cannot capture the structure of the data and underfits (higher bias). A smaller λ fits the training data better but risks overfitting (higher variance). In practice you try several λ values on a logarithmic scale and choose the one that minimizes the validation error. Moving the λ slider in this tool shows intuitively how the weights are shrunk and zeroed.

Real-World Applications

Feature selection in high-dimensional data: For problems with thousands or tens of thousands of explanatory variables but few samples — such as gene-expression data or bag-of-words text features — Lasso (L1) is the standard choice. It keeps only the few variables that truly drive the prediction and zeroes the rest, giving an interpretable model that answers "which genes are linked to the disease" or "which words drive the classification". The shrinkage map in this tool visualizes exactly this "zeroing band set by λ".

Stabilizing models with multicollinearity: In regressions where the explanatory variables are strongly correlated — sensor readings or economic indicators, for example — ordinary least squares (OLS) without regularization produces wildly unstable coefficients. Ridge (L2) shrinks the correlated group evenly to stabilize the coefficients and reduce prediction variance. Lasso has the weakness here of keeping just one variable from the correlated group, so the variable that gets selected can flip with data noise.

Weight decay in neural networks: "Weight decay", widely used in deep learning, coincides with L2 regularization under plain SGD (with adaptive optimizers such as Adam the two are no longer identical). By pulling each weight a little toward zero at every update, it discourages a huge network from memorizing the training data and overfitting. Adding an L1 penalty as well can zero the weights of unnecessary connections, making the network sparse and inference lighter.
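For plain gradient descent, the link between weight decay and an L2 penalty can be checked in a few lines. This is a sketch with a hypothetical one-parameter loss ½(w−1)². The function names are illustrative, and the equivalence shown here is specific to this simple update rule.

```python
def sgd_step_l2(w, grad, lr, lam):
    """Gradient step on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)


def sgd_step_weight_decay(w, grad, lr, lam):
    """Shrink the weight by the factor (1 - lr * lam), then take a plain gradient step."""
    return (1.0 - lr * lam) * w - lr * grad


lr, lam = 0.1, 0.01
w_a = w_b = 0.8
for step in range(5):
    # Gradient of the hypothetical data loss 0.5 * (w - 1)**2.
    w_a = sgd_step_l2(w_a, w_a - 1.0, lr, lam)
    w_b = sgd_step_weight_decay(w_b, w_b - 1.0, lr, lam)
    print(f"step {step}: L2-penalty update={w_a:.6f}  weight-decay update={w_b:.6f}")
```

Both update rules expand to the same expression, (1 − ηλ)w − η·grad, so the two trajectories agree up to floating-point rounding.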

Compressed sensing and signal recovery: Compressed sensing reconstructs an original signal from few measurements, and uses L1 minimization under the assumption that the signal is sparse (most components are zero). The zeroing property of L1 correctly estimates components that should be zero as zero. It is applied widely — in fast MRI imaging and in radar and communication signal processing.

Common Misconceptions and Pitfalls

The most common mistake is applying regularization without standardizing the features. Both L1 and L2 penalize the magnitude of the weights, so if the features are on wildly different scales the penalty becomes unfair. Put "height (cm)" and "annual income (currency units)" into the same model, and the large-scale variable gets a small coefficient and unfairly escapes shrinkage. This tool also assumes "features are already standardized" and uses the simple formula w_L2 = w₀/(1+λ). With real data, always standardize each feature to mean 0 and variance 1 before regularizing.
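Standardization itself is a short computation. The sketch below uses made-up values for the two features mentioned above and the population standard deviation. In a real pipeline the mean and standard deviation would normally be estimated on the training split only and reused for new data.

```python
import statistics


def standardize(column):
    """Rescale a feature column to mean 0 and variance 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    if std == 0.0:
        raise ValueError("constant feature cannot be standardized")
    return [(x - mean) / std for x in column]


# Hypothetical raw features on very different scales.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20000.0, 35000.0, 50000.0, 80000.0, 120000.0]

z_heights = standardize(heights_cm)
z_incomes = standardize(incomes)

for name, z in [("height", z_heights), ("income", z_incomes)]:
    print(name, [round(v, 3) for v in z])
```

After this step both columns are on the same scale, so a single λ penalizes their coefficients comparably.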

The next pitfall is interpreting "the variables Lasso selected" as the causally important variables. Lasso simply splits coefficients into zero and non-zero to lower prediction error; it guarantees nothing about causation. In particular, when a group of strongly correlated variables exists, Lasso picks one of them more or less "by chance" and zeroes the rest. Change the data slightly and the selected variable can swap out, and this is not rare. If the stability of variable selection matters, re-estimate many times on bootstrap resamples to find the variables that are selected almost every time, or use Elastic Net to be safe.

The final pitfall is the belief that "the larger λ, the better the model". Regularization suppresses overfitting but always introduces bias (a systematic offset). Push λ too high and even the weights that genuinely matter get shrunk and zeroed, and the model underfits: it can no longer capture the structure of the data. In this tool, pushing λ toward 5 zeroes even weight A under L1 as soon as λ reaches |A|, and under L2 shrinks it to one sixth of its original value at λ = 5. The optimal λ sits between overfitting and underfitting; the proper approach is to pick the point where cross-validation gives the smallest validation error.
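A small sweep over λ makes the over-shrinking visible. The sketch below assumes weight A = 3.0 (an illustrative value) and applies the tool's two closed-form rules for λ from 0 to 5.

```python
def soft_threshold(w0, lam):
    """L1 rule: sign(w0) * max(|w0| - lam, 0)."""
    magnitude = max(abs(w0) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if w0 > 0 else -magnitude


def ridge_shrink(w0, lam):
    """L2 rule: w0 / (1 + lam)."""
    return w0 / (1.0 + lam)


weight_a = 3.0  # assumed value for weight A
lams = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
l1_path = [soft_threshold(weight_a, lam) for lam in lams]
l2_path = [ridge_shrink(weight_a, lam) for lam in lams]

for lam, l1, l2 in zip(lams, l1_path, l2_path):
    print(f"lambda={lam:.0f}  L1={l1:.2f}  L2={l2:.2f}")
```

Under these assumptions the L1 value hits zero at λ = 3 and stays there, while the L2 value keeps shrinking but remains positive.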
