Visualize the behavior of the two regularizers that fight overfitting in machine learning: L1 (Lasso) and L2 (Ridge). Change the regularization strength λ and the original regression coefficients to compare in real time how L1 drives weights to exactly zero for a sparse model while L2 shrinks every coefficient proportionally.
Parameters
Regularization strength λ
Larger means stronger shrinkage. 0 = no regularization (OLS)
Original weight A (OLS estimate)
Unregularized coefficient A (a strong feature)
Original weight B
A small coefficient B (possibly a noise feature)
Original weight C
A negative coefficient C (a feature with opposite sign)
The x-axis is the original weight w₀, the y-axis is the regularized weight. L1 zeroes weights inside the [−λ,λ] band (the pulsing region); L2 is a straight line through the origin with slope 1/(1+λ). Points A, B, C are the current three weights.
L1 regularization shrinks weights with the soft-threshold operator. Weights with |w₀| ≤ λ become exactly zero, giving a sparse model. L2 produces no such exact zeros.
Both balance "closeness to the original estimate w₀" against the penalty R(w). For an orthogonal (orthonormal) design, the solution has a closed form for each penalty: w_L1 = sign(w₀)·max(|w₀|−λ, 0) for L1 and w_L2 = w₀/(1+λ) for L2.
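The two closed-form shrinkage rules can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the function names and the values chosen for A, B and C are assumptions made for the example.

```python
def soft_threshold(w0: float, lam: float) -> float:
    """L1 (Lasso) shrinkage: sign(w0) * max(|w0| - lam, 0)."""
    magnitude = abs(w0) - lam
    if magnitude <= 0.0:
        return 0.0  # inside the [-lam, lam] band: exactly zero
    return magnitude if w0 > 0 else -magnitude


def ridge_shrink(w0: float, lam: float) -> float:
    """L2 (Ridge) shrinkage: w0 / (1 + lam), a line through the origin."""
    return w0 / (1.0 + lam)


# Hypothetical starting weights for the three points A, B, C.
weights = {"A": 3.0, "B": 0.6, "C": -1.5}
lam = 1.0

# Map each name to its (L1, L2) regularized value.
shrunk = {
    name: (soft_threshold(w0, lam), ridge_shrink(w0, lam))
    for name, w0 in weights.items()
}
for name, (l1, l2) in shrunk.items():
    print(f"{name}: w0={weights[name]:+.2f}  L1={l1:+.2f}  L2={l2:+.2f}")
```

With these inputs, B falls inside the zeroing band under L1, while L2 only halves it.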
What is L1/L2 Regularization?
🙋
I hear "regularization" all the time in machine learning. What is it actually for?
🎓
In short, it is a brake against overfitting. A model trying to fit the training data can keep pushing its coefficients (weights) larger and larger. Then it matches the training data perfectly but predicts nothing useful on new data. So we add a rule to the loss function: "pay a penalty if a weight grows too large." The two classic versions are L1 and L2. As you raise λ in this tool, you can see both of them shrink the weights.
🙋
L1 and L2 both "shrink weights", right? So what's the difference?
🎓
The way they shrink is decisively different. L2 (Ridge) uses w₀/(1+λ): it pulls every weight down by the same proportion. At λ = 1, for example, 3 becomes 1.5, 0.6 becomes 0.3, and so on. But L1 (Lasso) uses a soft-threshold mechanism that drives any weight with |w₀| ≤ λ to exactly zero and moves every other weight toward zero by λ. Set weight B to 0.6 and λ to 1.0 on the left. L2 gives 0.30, but L1 gives 0.00: feature B disappears from the model entirely.
🙋
Wait, it makes it zero? That means it's throwing the feature away, right?
🎓
Exactly — it is "automatic feature selection". A feature with a zero weight is never used in the prediction, so it effectively drops out of the model. In real problems you often have, say, 100 explanatory variables but only 10 that truly matter. With Lasso the other 90 coefficients become zero, giving an interpretable model where you can see at a glance which variables matter. That flat band in the middle of the shrinkage map is the "zeroing range" set by λ.
🙋
Then L1 always sounds better — does L2 ever have a job to do?
🎓
No, each has its strength. L2 shrinks weights smoothly, so when there is a group of strongly correlated features it pulls them all down evenly and keeps things stable. L1 in that case tends to keep just one of the correlated group and zero the rest — and which one survives can flip with small data fluctuations. So: use Ridge when "many variables each contribute a little" or variables are correlated, use Lasso when "only a few variables matter." If you want both strengths, mix them with Elastic Net.
🙋
How do you decide how big λ should be? Is bigger always safer?
🎓
No, too big is bad. Raising λ shrinks the weights and simplifies the model, but push it too far and you crush even the weights you genuinely need, and the model can no longer capture the structure of the data — that is underfitting. Too small a λ, and you overfit. So in practice you use cross-validation: try many λ values on a logarithmic scale and pick the one with the smallest validation error. On the "Weight A vs λ" chart you will see the L1 weight land on zero at some λ, while L2 approaches zero but never quite touches it.
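The behavior on the "Weight A vs λ" chart can be reproduced with a short sweep over a logarithmic λ grid. This is a sketch under the tool's closed forms; the starting value 3.0 for weight A and the grid itself are illustrative assumptions.

```python
def soft_threshold(w0, lam):
    """L1 shrinkage: sign(w0) * max(|w0| - lam, 0)."""
    magnitude = abs(w0) - lam
    if magnitude <= 0.0:
        return 0.0
    return magnitude if w0 > 0 else -magnitude


def ridge_shrink(w0, lam):
    """L2 shrinkage: w0 / (1 + lam)."""
    return w0 / (1.0 + lam)


w0_a = 3.0  # hypothetical OLS estimate for weight A

# Half-decade steps from 0.01 to 100.
lambdas = [10 ** (k / 2) for k in range(-4, 5)]
path = [(lam, soft_threshold(w0_a, lam), ridge_shrink(w0_a, lam)) for lam in lambdas]

for lam, l1, l2 in path:
    print(f"lambda={lam:8.3f}  L1={l1:.4f}  L2={l2:.4f}")

# First grid value at which the L1 weight is exactly zero.
first_zero = next(lam for lam, l1, _ in path if l1 == 0.0)
print("L1 reaches zero at lambda =", round(first_zero, 3))
```

On this grid the L1 column hits zero at the first λ that is at least |w₀|, while every L2 value stays strictly positive.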
Frequently Asked Questions
The biggest difference is whether a weight can be driven to exactly zero. L1 regularization (Lasso) shrinks weights with the soft-threshold operator w_L1 = sign(w0)·max(|w0|−λ, 0), setting any weight with |w0| ≤ λ to exactly zero. This automatically discards unimportant features and yields a sparse model. L2 regularization (Ridge), by contrast, shrinks every weight by the same proportion through w_L2 = w0/(1+λ) but never reaches an exact zero. A simple way to remember it: L1 does feature selection, L2 does smooth shrinkage.
L1 can produce exact zeros because its penalty λ|w| has a non-differentiable corner at w=0. During optimization, as long as the loss reduction from moving the weight slightly away from zero is smaller than the increase in the L1 penalty (slope λ), the weight stays pinned at zero. For an orthogonal design this condition is simply |w0| ≤ λ, which gives the soft-threshold operator sign(w0)·max(|w0|−λ,0). The L2 penalty (λ/2)w², scaled here so that the solution is w0/(1+λ), is smooth with zero derivative at w=0, so there is no force pinning the weight to zero; only proportional shrinkage occurs.
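The pinning argument can be written out for a single weight. The derivation below assumes a squared-error term scaled by 1/2 and an L2 penalty scaled by λ/2; these scalings are assumptions chosen so that the results match the tool's formulas exactly.

```latex
\begin{aligned}
\hat{w}_{L1} &= \arg\min_{w}\; \tfrac{1}{2}(w - w_0)^2 + \lambda\,|w| \\[4pt]
0 &\in (\hat{w}_{L1} - w_0) + \lambda\,\partial|\hat{w}_{L1}|,
\qquad
\partial|w| =
\begin{cases}
\{\operatorname{sign}(w)\} & w \neq 0 \\
[-1,\,1] & w = 0
\end{cases} \\[4pt]
\hat{w}_{L1} = 0 &\iff w_0 \in [-\lambda,\,\lambda] \iff |w_0| \le \lambda \\[4pt]
\hat{w}_{L1} \neq 0 &\implies \hat{w}_{L1} = w_0 - \lambda\,\operatorname{sign}(\hat{w}_{L1})
= \operatorname{sign}(w_0)\,\bigl(|w_0| - \lambda\bigr) \\[4pt]
\hat{w}_{L2} &= \arg\min_{w}\; \tfrac{1}{2}(w - w_0)^2 + \tfrac{\lambda}{2}\,w^2
\;\implies\; (\hat{w}_{L2} - w_0) + \lambda\,\hat{w}_{L2} = 0
\;\implies\; \hat{w}_{L2} = \frac{w_0}{1 + \lambda}
\end{aligned}
```

The interval [−1, 1] in the subdifferential at zero is what lets the optimality condition hold at w = 0 for a whole range of w₀; the L2 condition has no such interval, so it is satisfied at zero only when w₀ = 0.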
Lasso (L1) suits problems with many features where you believe only a few really matter: the coefficients of irrelevant features become zero, making the model easier to interpret. Ridge (L2) is more stable when many features each contribute a little, or when there are groups of strongly correlated features. Lasso tends to keep just one feature from a correlated group, while Ridge shrinks the whole group evenly. When you want a balance of both, use Elastic Net, which combines an L1 and an L2 penalty.
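In the same single-weight, orthonormal setting, combining the two penalties also has a closed form: soft-threshold first, then scale. The sketch below assumes the objective ½(w − w₀)² + λ₁|w| + (λ₂/2)w²; the names `lam1` and `lam2` are illustrative, and libraries such as scikit-learn parameterize Elastic Net differently.

```python
def elastic_net_shrink(w0: float, lam1: float, lam2: float) -> float:
    """Closed-form Elastic Net shrinkage for one weight (orthonormal design).

    lam1 is the L1 strength (sets the zeroing band),
    lam2 is the L2 strength (sets the proportional shrinkage).
    """
    magnitude = abs(w0) - lam1
    if magnitude <= 0.0:
        return 0.0  # the L1 part still zeroes small weights
    shrunk = magnitude / (1.0 + lam2)  # the L2 part scales what survives
    return shrunk if w0 > 0 else -shrunk


strong = elastic_net_shrink(3.0, 0.5, 0.5)   # survives, shrunk by both parts
weak = elastic_net_shrink(0.6, 1.0, 0.5)     # inside the band: zero
print(f"strong={strong:.4f}  weak={weak:.4f}")
```

Setting `lam2 = 0` recovers the Lasso rule and setting `lam1 = 0` recovers the Ridge rule, which is one way to see Elastic Net as a blend of the two.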
λ is a hyperparameter that controls the trade-off between overfitting and underfitting, and the standard way to pick it is cross-validation. A larger λ shrinks the weights more and makes the model simpler, but if it is too large the model cannot capture the structure of the data and underfits (higher bias). A smaller λ fits the training data better but risks overfitting (higher variance). In practice you try several λ values on a logarithmic scale and choose the one that minimizes the validation error. Moving the λ slider in this tool shows intuitively how the weights are shrunk and zeroed.
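The selection procedure can be sketched on a toy problem. The example below is a deliberately small illustration using only the standard library: one feature, no intercept, synthetic data with a made-up true weight of 2.0, and a ridge fit that minimizes ½Σ(y − wx)² + (λ/2)w². Real projects would normally rely on a library's cross-validation utilities instead.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Ridge estimate for one feature, no intercept: sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(xs, ys, w):
    """Mean squared prediction error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cv_error(xs, ys, lam, k=5):
    """Mean validation MSE over k contiguous folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        val_idx = set(range(fold * n // k, (fold + 1) * n // k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in val_idx]
        val = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in val_idx]
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        fold_errors.append(mse([x for x, _ in val], [y for _, y in val], w))
    return sum(fold_errors) / k


# Synthetic data (assumption: true weight 2.0, unit Gaussian noise).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

lambdas = [10 ** k for k in range(-3, 4)]  # logarithmic grid: 0.001 ... 1000
errors = {lam: cv_error(xs, ys, lam) for lam in lambdas}
best_lam = min(errors, key=errors.get)

for lam in lambdas:
    print(f"lambda={lam:>8}  cv_mse={errors[lam]:.4f}")
print("selected lambda:", best_lam)
```

The largest λ on the grid shrinks the weight almost to zero, so its validation error is far above the minimum: the underfitting end of the trade-off.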
Real-World Applications
Feature selection in high-dimensional data: For problems with thousands or tens of thousands of explanatory variables but few samples — such as gene-expression data or bag-of-words text features — Lasso (L1) is the standard choice. It keeps only the few variables that truly drive the prediction and zeroes the rest, giving an interpretable model that answers "which genes are linked to the disease" or "which words drive the classification". The shrinkage map in this tool visualizes exactly this "zeroing band set by λ".
Stabilizing models with multicollinearity: In regressions where the explanatory variables are strongly correlated — sensor readings or economic indicators, for example — ordinary least squares (OLS) without regularization produces wildly unstable coefficients. Ridge (L2) shrinks the correlated group evenly to stabilize the coefficients and reduce prediction variance. Lasso has the weakness here of keeping just one variable from the correlated group, so the variable that gets selected can flip with data noise.
Weight decay in neural networks: "Weight decay", widely used in deep learning, is equivalent to L2 regularization under plain gradient descent; with adaptive optimizers such as Adam the two differ, which is why decoupled variants such as AdamW exist. By pulling each weight a little toward zero at every update, it discourages a huge network from memorizing the training data and overfitting. Adding L1 on top can zero the weights of unnecessary connections, making the network sparse and inference lighter.
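The link between weight decay and an L2 penalty can be checked on a single scalar weight. The sketch below assumes plain gradient descent and a penalty of (λ/2)w²; the function names and numbers are illustrative and are not taken from any framework.

```python
def sgd_step_l2(w, grad, lr, lam):
    """Gradient step on loss + (lam/2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)


def sgd_step_weight_decay(w, grad, lr, lam):
    """Multiply the weight by (1 - lr * lam), then take the plain gradient step."""
    return (1.0 - lr * lam) * w - lr * grad


w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01  # hypothetical values
a = sgd_step_l2(w, grad, lr, lam)
b = sgd_step_weight_decay(w, grad, lr, lam)
print(f"L2 penalty step: {a:.6f}   weight decay step: {b:.6f}")
```

The two updates agree here because the gradient enters the step unmodified. An optimizer that rescales the gradient per weight would also rescale the `lam * w` term in the first form but not the decay factor in the second.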
Compressed sensing and signal recovery: Compressed sensing reconstructs an original signal from few measurements, and uses L1 minimization under the assumption that the signal is sparse (most components are zero). The zeroing property of L1 lets components that should be zero be estimated as exactly zero. It is widely applied, for example in fast MRI and in radar and communication signal processing.
Common Misconceptions and Pitfalls
The most common mistake is applying regularization without standardizing the features. Both L1 and L2 penalize the magnitude of the weights, so if the features are on wildly different scales the penalty becomes unfair. Put "height (cm)" and "annual income (currency units)" into the same model, and the large-scale variable gets a small coefficient and unfairly escapes shrinkage. This tool also assumes "features are already standardized" and uses the simple formula w_L2 = w₀/(1+λ). With real data, always standardize each feature to mean 0 and variance 1 before regularizing.
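Standardization itself is a short step. The sketch below uses only the standard library; the height and income values are invented for illustration.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale one feature column to mean 0 and variance 1 (population variance)."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0.0:
        raise ValueError("constant feature: cannot scale to unit variance")
    return [(x - mu) / sigma for x in column]


# Hypothetical raw features on very different scales.
height_cm = [152.0, 165.0, 171.0, 180.0, 168.0]
income = [2.1e6, 4.8e6, 3.5e6, 9.0e6, 5.2e6]

height_z = standardize(height_cm)
income_z = standardize(income)

print("height:", [round(z, 3) for z in height_z])
print("income:", [round(z, 3) for z in income_z])
```

After this step both columns are on the same scale, so the same λ penalizes their coefficients comparably. When a validation split is used, the mean and standard deviation are normally computed on the training split and reused on the validation data.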
The next pitfall is interpreting "the variables Lasso selected" as the causally important variables. Lasso simply splits coefficients into zero and non-zero to lower prediction error; it guarantees nothing about causation. In particular, when a group of strongly correlated variables exists, Lasso picks one of them "by chance" and zeroes the rest. Change the data slightly and the selected variable can swap out; this is not rare. If the stability of variable selection matters, re-estimate many times on bootstrap resamples to find the variables that "are always selected", or use Elastic Net to be safe.
The last pitfall is the belief that "the larger λ, the better the model". Regularization suppresses overfitting but always introduces bias (a systematic offset). Push λ too high and even the weights that genuinely matter get shrunk and zeroed, and the model underfits: it can no longer capture the structure of the data. In this tool, pushing λ toward 5 zeroes even weight A under L1 once λ reaches |w₀|, and crushes it heavily under L2. The optimal λ sits between overfitting and underfitting; the proper approach is to pick the point where cross-validation gives the smallest validation error.
How to Use
Set the regularization strength λ (lambda) with the slider or the numeric input; higher values apply a stronger penalty.
Enter the initial weights w0a, w0b, w0c (e.g., 0.8, 0.5, 0.3), which are the raw model coefficients before regularization. Negative values are allowed.
Run the simulator to compare L1 (Lasso) and L2 (Ridge) outcomes: observe which weights L1 shrinks to exactly zero versus L2's continuous reduction.
Worked Example
Given initial weights w0a=1.2, w0b=0.6, w0c=0.4 and λ=0.5: L2 (Ridge) divides every weight by 1+λ = 1.5 (w0a→0.80, w0b→0.40, w0c→0.27), a uniform reduction of about 33%. L1 (Lasso) subtracts a fixed λ from each magnitude, zeroing w0c entirely (0.4 ≤ 0.5) while reducing w0a→0.7 and w0b→0.1. At λ=1.0, Lasso zeroes two coefficients (w0b and w0c) and leaves w0a at 0.2; Ridge halves all three (0.60, 0.30, 0.20) and keeps them nonzero. Measured as the drop in total absolute weight at λ=0.5, L1 removes about 64% (2.2→0.8) while L2 removes about 33%.
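The arithmetic for these inputs can be checked with a short script. It is a sketch that applies the tool's two closed forms, w₀/(1+λ) and soft-thresholding, to the three starting weights; the helper `total_shrinkage` is an illustrative name, defined here as the fraction of total absolute weight removed.

```python
def soft_threshold(w0, lam):
    """L1 shrinkage: sign(w0) * max(|w0| - lam, 0)."""
    magnitude = abs(w0) - lam
    if magnitude <= 0.0:
        return 0.0
    return magnitude if w0 > 0 else -magnitude


def ridge_shrink(w0, lam):
    """L2 shrinkage: w0 / (1 + lam)."""
    return w0 / (1.0 + lam)


def total_shrinkage(before, after):
    """Fraction of the total absolute weight that regularization removed."""
    return 1.0 - sum(abs(v) for v in after) / sum(abs(v) for v in before)


weights = {"w0a": 1.2, "w0b": 0.6, "w0c": 0.4}

results = {}
for lam in (0.5, 1.0):
    l1 = {k: soft_threshold(v, lam) for k, v in weights.items()}
    l2 = {k: ridge_shrink(v, lam) for k, v in weights.items()}
    results[lam] = (l1, l2)
    print(f"lambda={lam}")
    for k in weights:
        print(f"  {k}: L1={l1[k]:.2f}  L2={l2[k]:.2f}")
    print(f"  total shrinkage: L1={total_shrinkage(weights.values(), l1.values()):.1%}"
          f"  L2={total_shrinkage(weights.values(), l2.values()):.1%}")
```

Changing the entries of `weights` or the λ values in the loop gives the corresponding numbers for any other slider setting.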
Practical Notes
Use Lasso (L1) for feature selection in high-dimensional datasets (e.g., gene expression with 20,000+ features); automatic zeroing eliminates irrelevant predictors without manual screening.
Use Ridge (L2) when multicollinearity is severe in linear regression or neural networks; it distributes penalty evenly across correlated features, preserving all signals.
Monitor the "Weights zeroed by L1" output; as a rough rule of thumb, if more than about 60% of the weights are zeroed, λ may be excessive. Lower it in steps of 0.2–0.3 and check which coefficients become nonzero again.
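In logistic regression or other classification tasks, L1 regularization with λ around 0.3–0.7 on standardized features can be a reasonable starting range for balancing sparsity and predictive accuracy, but the right value depends on the data and on how the library scales the penalty; compare L1 and L2 on hold-out validation sets.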