Why is it called regression if it is used for classification?

Logistic regression does regress a continuous quantity: the probability p(y=1|x). The sigmoid function turns the linear score z = w.x + b into a number between 0 and 1, and the final class is obtained by thresholding this probability at 0.5. The continuous probability output is the regression part, while thresholding gives a classifier. The historical name and the practical use are therefore slightly out of step, which is a common source of early confusion.
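The two stages described above can be sketched in a few lines. This is only an illustration: the weights, bias and input point are hypothetical values, not parameters taken from the simulator.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x, threshold=0.5):
    """Return (probability, class label) for one point x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear score z = w.x + b
    p = sigmoid(z)                                # the "regression" output
    return p, int(p >= threshold)                 # the "classification" step

# Hypothetical weights and input, chosen only for illustration.
p, label = predict(w=[1.0, -2.0], b=0.5, x=[2.0, 0.5])
print(round(p, 3), label)
```

Changing `threshold` moves the cut-off without touching the fitted probabilities, which is why the two stages are worth keeping separate.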

How do you extend it to more than two classes (softmax regression)?

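For three or more classes, replace the sigmoid by the softmax function sigma(z_k) = exp(z_k)/sum_j exp(z_j). Each class gets its own weight vector w_k, which produces the score z_k, and the softmax normalizes the scores so that all class probabilities sum to one. The loss is the categorical cross-entropy and the gradient has the same form as in the binary case: predicted probability minus target, multiplied by the input. In scikit-learn this is a one-liner with LogisticRegression(multi_class='multinomial'); recent releases fit the multinomial model by default for multiclass targets and deprecate the multi_class parameter, so there a plain LogisticRegression() is enough.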
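A minimal sketch of the softmax formula, using only the standard library. The scores are hypothetical; subtracting the maximum score before exponentiating is a common numerical safeguard that the text above does not mention, and it leaves the result unchanged.

```python
import math

def softmax(scores):
    """Softmax over a list of class scores z_k.

    Subtracting the maximum first does not change the result
    but keeps exp() from overflowing for large scores.
    """
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores z_k for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)

# Large scores stay finite thanks to the max shift.
print(softmax([1000.0, 1000.0, 1000.0]))
```

With two classes and scores `[z, 0.0]`, the first softmax output equals the sigmoid of `z`, so the binary model is a special case.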

What does L2 regularization actually do?

L2 regularization adds a penalty lambda * ||w||^2 to the loss, which keeps the weights from growing without bound. When the data is linearly separable, the unregularized loss keeps decreasing as the weights grow, so gradient descent pushes them toward arbitrarily large values to drive predictions to 0 or 1; this causes overfitting and numerical instability. Values of lambda between 0.01 and 1.0 are a common starting range, but the best value depends on the data and should be checked on held-out data. In the simulator, sweeping lambda from 0 to 1 visibly shrinks w_1.
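The shrinking effect can be reproduced with a small sketch. Everything here is an assumption made for illustration: the six-point separable data set, the step size, the iteration count, the update `lam * w` for the penalty term, and the choice to leave the bias unregularized. It is not the simulator's code.

```python
import math

def sigmoid(z):
    # Numerically safe sigmoid for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(points, labels, lam, eta=0.5, iters=2000):
    """Full-batch gradient descent; the L2 term contributes lam * w to the gradient."""
    w1 = w2 = b = 0.0
    n = len(points)
    for _ in range(iters):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in zip(points, labels):
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= eta * (g1 / n + lam * w1)
        w2 -= eta * (g2 / n + lam * w2)
        b -= eta * gb / n  # bias left unregularized (an assumption)
    return w1, w2, b

# Tiny, linearly separable toy data (hypothetical, not the simulator's).
points = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -1.5),
          (1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]
labels = [0, 0, 0, 1, 1, 1]

norms = {}
for lam in (0.0, 0.01, 0.1, 1.0):
    w1, w2, b = train(points, labels, lam)
    norms[lam] = math.hypot(w1, w2)
    print(f"lambda={lam:<5} w1={w1:.3f} w2={w2:.3f} |w|={norms[lam]:.3f}")
```

With `lam=0.0` the weight norm keeps creeping upward as `iters` increases, whereas any positive `lam` settles at a finite value.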

What is the difference between L1 and L2 regularization?

L1 regularization adds lambda * sum |w_i| and can shrink unimportant weights exactly to zero, which doubles as feature selection. L2 adds lambda * sum w_i^2 and shrinks weights smoothly without zeroing them out. In practice L2 is the default; L1 or Elastic Net (L1+L2) is preferred when you also want to prune features. This simulator implements L2 only.
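One standard way to see why L1 produces exact zeros and L2 does not is to compare the shrinkage (proximal) step each penalty induces on a single weight. The sketch below takes that view; the weights and the step value are hypothetical, and this is not how the simulator (which implements L2 only) is written.

```python
def l2_shrink(w, step):
    """Proximal step for the penalty lambda * w^2: rescales w toward zero."""
    return w / (1.0 + 2.0 * step)

def l1_soft_threshold(w, step):
    """Proximal step for the penalty lambda * |w|: moves w by `step` toward zero, clipping at zero."""
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0

weights = [2.0, -0.5, 0.05, -0.01]  # hypothetical weights
step = 0.1                          # plays the role of eta * lambda

after_l2 = [l2_shrink(w, step) for w in weights]
after_l1 = [l1_soft_threshold(w, step) for w in weights]
print(after_l2)  # every weight is smaller, none is exactly zero
print(after_l1)  # the two small weights become exactly 0.0
```

L2 rescales each weight by the same factor, so a small weight stays small but nonzero; L1 subtracts a fixed amount, so any weight smaller than that amount is removed entirely.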

Logistic Regression (2D Binary Classifier) Simulator — Free Online Calculator

Parameters

Learning rate η, iterations, L2 regularization λ, data seed.

Data is generated deterministically by a linear congruential generator (LCG) with the given seed (30 points per class, 60 total, sigma=1.2), so the same seed always reproduces the same points. Gradient descent updates use the full batch.
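A seeded generator of this kind might look like the sketch below. Only the counts (30 points per class) and sigma=1.2 come from the description above. The LCG constants (the well-known Numerical Recipes values), the Box-Muller transform, and the class centres are assumptions, since the simulator's own choices are not stated.

```python
import math

class LCG:
    """Linear congruential generator: state <- (a * state + c) mod m."""

    def __init__(self, seed):
        self.state = seed % 2**32

    def uniform(self):
        # Constants from Numerical Recipes; the simulator's own are unknown.
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return (self.state + 1) / (2**32 + 1)  # strictly inside (0, 1)

    def normal(self, mean, sigma):
        # Box-Muller transform of two uniform draws.
        u1, u2 = self.uniform(), self.uniform()
        return mean + sigma * math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def make_data(seed, per_class=30, sigma=1.2, centres=((-1.5, -1.5), (1.5, 1.5))):
    """Return (points, labels); the class centres are hypothetical."""
    rng = LCG(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centres):
        for _ in range(per_class):
            points.append((rng.normal(cx, sigma), rng.normal(cy, sigma)))
            labels.append(label)
    return points, labels

points, labels = make_data(seed=42)
print(len(points), labels.count(0), labels.count(1))
```

Because the generator's state depends only on the seed, calling `make_data` twice with the same seed returns identical points, which is what makes runs comparable.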

Results

Training accuracy, final cross-entropy loss, weight w₁, bias b.

Scatter, decision boundary and probability contours

Red = class 0 / Blue = class 1 / Green = decision boundary (p=0.5) / Shading = probability contours (p=0.3, 0.5, 0.7) / Bottom-left = (w₁, w₂, b)

Theory & Key Formulas

Logistic regression is a linear classifier that turns a linear score $z = \mathbf{w}\cdot\mathbf{x} + b$ into a probability with the sigmoid function and minimizes the cross-entropy loss by gradient descent.

Predicted probability via the sigmoid function:

$$p(y=1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Cross-entropy loss ($n$ samples, $y_i \in \{0,1\}$):

$$L = -\frac{1}{n}\sum_{i=1}^{n}\bigl[y_i \log p_i + (1 - y_i)\log(1 - p_i)\bigr]$$
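Evaluating this loss literally can fail when a probability rounds to exactly 0 or 1, because the logarithm is then undefined. A common workaround, sketched here as an assumption rather than as the simulator's method, is to compute the same quantity directly from the scores $z_i$ using $-\log p = \log(1+e^{-z})$ and $-\log(1-p) = \log(1+e^{z})$.

```python
import math

def log1pexp(t):
    """Stable evaluation of log(1 + exp(t))."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))

def cross_entropy(scores, labels):
    """Mean of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z), computed from z."""
    total = 0.0
    for z, y in zip(scores, labels):
        # -log p = log(1 + e^{-z}) and -log(1 - p) = log(1 + e^{z})
        total += y * log1pexp(-z) + (1 - y) * log1pexp(z)
    return total / len(scores)

print(cross_entropy([0.0, 0.0], [0, 1]))       # ln 2: the loss at p = 0.5
print(cross_entropy([800.0, -800.0], [1, 0]))  # confident and correct, no overflow
print(cross_entropy([800.0], [0]))             # confident and wrong, still finite
```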

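Gradient and weight update with L2 regularization $\lambda$ (the $\lambda\mathbf{w}$ term is the gradient of the penalty $\tfrac{\lambda}{2}\|\mathbf{w}\|^2$; writing the penalty as $\lambda\|\mathbf{w}\|^2$ instead only rescales $\lambda$ by a factor of 2):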

$$\frac{\partial L}{\partial \mathbf{w}} = \frac{1}{n}\mathbf{X}^{\top}(\mathbf{p}-\mathbf{y}) + \lambda\mathbf{w}, \quad \mathbf{w} \leftarrow \mathbf{w} - \eta\frac{\partial L}{\partial \mathbf{w}}$$
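A quick way to gain confidence in a gradient formula is to compare it with a finite-difference estimate. The sketch below does that on a hypothetical four-point data set, assuming the regularized objective is the cross-entropy plus $\tfrac{\lambda}{2}\|\mathbf{w}\|^2$ (the penalty whose gradient is $\lambda\mathbf{w}$), and then applies one update step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, X, y, lam):
    """Cross-entropy plus (lam / 2) * ||w||^2."""
    ce = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        ce -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ce / len(X) + 0.5 * lam * sum(wj * wj for wj in w)

def gradient(w, b, X, y, lam):
    """Analytic gradient: (1/n) X^T (p - y) + lam * w for w, mean(p - y) for b."""
    n = len(X)
    gw = [0.0] * len(w)
    gb = 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        for j, xj in enumerate(xi):
            gw[j] += err * xj / n
        gb += err / n
    return [g + lam * wj for g, wj in zip(gw, w)], gb

# Hypothetical toy data and parameters.
X = [(0.5, 1.0), (-1.0, 0.3), (1.5, -0.7), (-0.2, -1.1)]
y = [1, 0, 1, 0]
w, b, lam, eta = [0.3, -0.2], 0.1, 0.1, 0.5

gw, gb = gradient(w, b, X, y, lam)

# Central finite differences on each weight.
eps = 1e-6
numeric_gw = []
for j in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric_gw.append((loss(w_plus, b, X, y, lam) - loss(w_minus, b, X, y, lam)) / (2 * eps))
print(gw, numeric_gw)  # the two should agree closely

# One update step: w <- w - eta * dL/dw.
w_new = [wj - eta * gj for wj, gj in zip(w, gw)]
b_new = b - eta * gb
print(loss(w, b, X, y, lam), loss(w_new, b_new, X, y, lam))
```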

The decision boundary (the p=0.5 contour) is the straight line $\mathbf{w}\cdot\mathbf{x}+b=0$, i.e. $w_1 x_1 + w_2 x_2 + b = 0$.
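The same algebra gives every probability contour, not just the boundary: $\sigma(z) = p$ exactly when $z = \log\frac{p}{1-p}$, so each contour is a line parallel to the decision boundary. A small sketch, with hypothetical parameter values and assuming $w_2 \neq 0$:

```python
import math

def logit(p):
    """Inverse of the sigmoid: the score z at which sigma(z) = p."""
    return math.log(p / (1.0 - p))

def contour_x2(w1, w2, b, x1, p=0.5):
    """x2 on the probability-p contour at a given x1 (requires w2 != 0)."""
    # sigma(w1*x1 + w2*x2 + b) = p  <=>  w1*x1 + w2*x2 + b = logit(p)
    return (logit(p) - b - w1 * x1) / w2

# Hypothetical fitted parameters.
w1, w2, b = 1.2, 0.8, -0.4
for p in (0.3, 0.5, 0.7):
    left = contour_x2(w1, w2, b, x1=0.0, p=p)
    right = contour_x2(w1, w2, b, x1=1.0, p=p)
    print(p, round(left, 3), round(right, 3), "slope", round(right - left, 3))
```

All three lines share the slope $-w_1/w_2$; larger weights pull the p=0.3 and p=0.7 contours closer to the boundary, which is how a more "confident" model shows up on the canvas.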

About the logistic regression simulator

🙋

People keep saying "logistic regression" but using it for classification. Is it regression or classification? I'm confused.

🎓

Everyone trips over this at first. Roughly, logistic regression does regress a continuous quantity — the probability that a point belongs to class 1. Then we threshold that probability at 0.5 to get a class label. Look at the canvas above: the green line is the decision boundary, exactly where p=0.5. Red and blue dots are the two classes.

🙋

There's a soft red/blue gradient around the green line. What is that?

🎓

Those are probability contours. The sigmoid $\sigma(z) = 1/(1+e^{-z})$ maps the linear score $z$ into a probability between 0 and 1. The dashed lines mark p=0.3 (red side) and p=0.7 (blue side). Notice how the colour is faint near the green boundary — that's where the classifier is least confident.

🙋

I dropped the learning rate to 0.001 and the line barely moves anymore.

🎓

The learning rate is the step size of gradient descent. Too small and 500 iterations don't reach the minimum — weights stay near zero. Crank it up to 1.0 and the steps overshoot, sometimes oscillating. In practice you start somewhere in 0.01 to 0.1 and watch the loss curve. Try sweeping eta while watching the "final cross-entropy loss" card to find a sweet spot.

🙋

When I bumped L2 regularization to 1.0, w₁ shrank and the boundary became less tilted.

🎓

Nice catch. L2 adds $\lambda\|\mathbf{w}\|^2$ to the loss to penalize large weights. When the data is linearly separable, gradient descent keeps growing the weights to make predictions more "confident", which causes overfitting. L2 reins that in. In practice you raise lambda as long as performance on held-out data improves, even if training accuracy drops a bit.

FAQ

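To handle more than two classes, replace the sigmoid by the softmax function $\sigma(z_k) = e^{z_k}/\sum_j e^{z_j}$, with one weight vector $\mathbf{w}_k$ per class. The result is "softmax regression" or "multinomial logistic regression". The loss becomes the categorical cross-entropy and the gradient has the same form as in the binary case. In scikit-learn it is the one-liner LogisticRegression(multi_class='multinomial'); recent releases fit the multinomial model by default for multiclass targets and deprecate the multi_class parameter, so there a plain LogisticRegression() is enough.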

L2 ($\lambda\sum w_i^2$) shrinks all weights smoothly and gives a stable, low-variance model. L1 ($\lambda\sum|w_i|$) pushes some weights to exactly zero, which doubles as feature selection. The default choice is L2; switch to L1 or Elastic Net (L1+L2) when the feature space is high-dimensional and you want to prune unimportant features. This simulator implements L2 only.

Plain logistic regression only draws straight boundaries, so circular or XOR-shaped data are out of reach. Two common fixes: (1) add polynomial features such as $x_1^2, x_2^2, x_1 x_2$ to lift the problem into a higher-dimensional space where it becomes linearly separable; (2) switch to a more expressive model (neural network, SVM with RBF kernel, gradient boosting). The simulator's data is nearly linearly separable so the default settings reach over 95% training accuracy.
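Fix (1) can be illustrated with a small sketch on concentric rings, a data set no straight line can separate. The data, step size and iteration count are assumptions chosen for the example, and the training loop is a plain full-batch gradient descent without regularization.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit(X, y, eta=0.1, iters=2000):
    """Full-batch gradient descent for logistic regression on feature rows X."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gb += err / n
            for j in range(d):
                gw[j] += err * xi[j] / n
        w = [wj - eta * gj for wj, gj in zip(w, gw)]
        b -= eta * gb
    return w, b

def accuracy(X, y, w, b):
    """Fraction of rows whose predicted class (score >= 0) matches the label."""
    hits = 0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hits += int((z >= 0) == bool(yi))
    return hits / len(X)

def poly_features(x1, x2):
    """Degree-2 feature map: (x1, x2, x1^2, x2^2, x1*x2)."""
    return (x1, x2, x1 * x1, x2 * x2, x1 * x2)

# Circular toy data: class 0 on an inner ring, class 1 on an outer ring.
angles = [2 * math.pi * k / 20 for k in range(20)]
inner = [(0.5 * math.cos(a), 0.5 * math.sin(a)) for a in angles]
outer = [(2.0 * math.cos(a), 2.0 * math.sin(a)) for a in angles]
raw = inner + outer
y = [0] * 20 + [1] * 20

w_lin, b_lin = fit(raw, y)
X_poly = [poly_features(x1, x2) for x1, x2 in raw]
w_poly, b_poly = fit(X_poly, y)

acc_linear = accuracy(raw, y, w_lin, b_lin)
acc_poly = accuracy(X_poly, y, w_poly, b_poly)
print(acc_linear, acc_poly)
```

In the lifted space the boundary is still linear, but a linear function of $x_1^2$ and $x_2^2$ describes a circle or ellipse in the original plane.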

This simulator uses batch gradient descent: every iteration uses all training points to compute the gradient. It is stable but slow for big datasets. Stochastic gradient descent (SGD) updates the weights after each individual sample or mini-batch, trading stability for speed. The logistic regression loss is convex and has no spurious local optima, so the gradient noise of SGD is not needed to find the optimum here; in non-convex models such as neural networks it can help escape shallow local optima. Mini-batch SGD is the default in modern deep-learning libraries.
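For contrast with the full-batch update, a mini-batch SGD loop might look like the sketch below. The data, batch size, step size and the use of Python's seeded `random.Random` for shuffling are all assumptions; the simulator itself uses the full batch.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mean_loss(points, labels, w1, w2, b):
    """Mean cross-entropy, using log(1 + e^z) - y*z per sample."""
    total = 0.0
    for (x1, x2), y in zip(points, labels):
        z = w1 * x1 + w2 * x2 + b
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total / len(points)

def sgd(points, labels, eta=0.1, epochs=50, batch_size=2, seed=0):
    """Mini-batch SGD: reshuffle every epoch, update after every batch."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    w1 = w2 = b = 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            g1 = g2 = gb = 0.0
            for i in batch:
                x1, x2 = points[i]
                err = sigmoid(w1 * x1 + w2 * x2 + b) - labels[i]
                g1 += err * x1 / len(batch)
                g2 += err * x2 / len(batch)
                gb += err / len(batch)
            w1, w2, b = w1 - eta * g1, w2 - eta * g2, b - eta * gb
    return w1, w2, b

# Small overlapping toy data set (hypothetical, not the simulator's data).
points = [(-2.0, -1.0), (-1.0, -2.0), (0.5, 0.2),
          (-0.5, -0.3), (1.0, 2.0), (2.0, 1.0)]
labels = [0, 0, 0, 1, 1, 1]

w1, w2, b = sgd(points, labels)
print(mean_loss(points, labels, 0.0, 0.0, 0.0), mean_loss(points, labels, w1, w2, b))
```

Each update sees only `batch_size` points, so the weights take many cheap, noisy steps per pass over the data instead of one exact step.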

Real-world applications

Clinical decision support: Logistic regression is the workhorse for predicting binary outcomes from clinical features such as blood pressure, glucose level and age. Because the model is linear in the log-odds, every coefficient $w_i$ translates directly into an interpretable odds ratio: $e^{w_i}$ is the factor by which the odds change for a one-unit increase in feature $i$ with the other features held fixed. This is essential when explaining predictions to physicians and regulators. Many medical statistics textbooks introduce odds ratios specifically through the logistic regression coefficient.
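The link between coefficients and odds ratios is a one-line computation. The coefficient names and values below are invented for illustration and are not estimates from any real study.

```python
import math

# Hypothetical coefficients of a fitted model (not real estimates).
coefficients = {
    "age_per_10_years": 0.40,
    "systolic_bp_per_10_mmHg": 0.15,
    "on_treatment": -0.70,
}

# odds = p / (1 - p) = exp(w.x + b), so raising x_i by one unit
# multiplies the odds by exp(w_i): that factor is the odds ratio.
odds_ratios = {name: math.exp(w) for name, w in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1.0 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds)")
```

A positive coefficient gives an odds ratio above 1 and a negative one gives a ratio below 1; note that this is a statement about odds, not about probabilities directly.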

Credit scoring and marketing: Predicting the probability that a loan applicant will default or that a customer will buy a product remains a textbook use case. Even when tree ensembles or deep models are more accurate, regulated industries (banking, insurance) often stick with logistic regression because the decisions must be explainable. Scorecards used in credit operations are built directly from logistic regression coefficients.

Click-through rate prediction: Online ad platforms have used L1-regularized logistic regression with billions of sparse features as a scalable baseline. Published descriptions of ad click prediction at Google and Facebook both feature large-scale logistic regression, valued for its predictable serving cost and explainability.

The default machine-learning baseline: Many practitioners start a new tabular classification problem with logistic regression, simply to get a number to beat. If a heavyweight model only marginally improves on a clean logistic regression baseline, the extra complexity is usually not worth it. This simulator exists to make that baseline's behaviour transparent.

Common misconceptions and pitfalls

The most common misconception is to dismiss logistic regression as "just linear" and therefore weak. The decision boundary is linear, but the model itself, a sigmoid applied to a linear score and trained with cross-entropy loss, is exactly a neural network with a single sigmoid output unit and no hidden layer. Before deep learning took over, logistic regression was one of the most widely used classifiers, and it still matches sophisticated models on many tabular problems when features are well engineered. With sensible eta and lambda, the simulator reliably reaches over 95% training accuracy: "simple" does not mean "weak".

The next pitfall is to chase training accuracy as if it were the goal. This simulator displays accuracy on the training data, which only tells you how well the model memorized what it saw. What actually matters in production is generalization to unseen data. Increase L2 regularization $\lambda$ in the simulator: training accuracy can dip, but on real benchmarks that same shift typically reduces overfitting. In practice you always split data into training, validation and test sets and look at each separately.

Finally, beware of trusting gradient descent to "just converge". The logistic regression loss is convex, so any minimum that gradient descent finds is the global one, and with L2 regularization ($\lambda > 0$) that minimum exists and is unique. Convexity does not guarantee that you reach it, though: a learning rate that is too large makes the iterates oscillate or diverge, while one that is too small leaves the weights stuck near zero when the iterations run out. Try eta = 0.001 or eta = 1.0 in the simulator and watch the w₁ and loss cards to feel the difference. Real systems use learning-rate schedules and adaptive optimizers such as Adam or RMSProp to make this robust.
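The same experiment can be run offline with a short sketch. The six-point data set and the 500-iteration budget are assumptions made for the example, so the numbers will not match the simulator's cards; the point is only the trend across learning rates.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit(points, labels, eta, iters=500):
    """Full-batch gradient descent without regularization; returns (w1, w2, b, loss)."""
    w1 = w2 = b = 0.0
    n = len(points)
    for _ in range(iters):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in zip(points, labels):
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y
            g1 += err * x1 / n
            g2 += err * x2 / n
            gb += err / n
        w1, w2, b = w1 - eta * g1, w2 - eta * g2, b - eta * gb
    loss = 0.0
    for (x1, x2), y in zip(points, labels):
        z = w1 * x1 + w2 * x2 + b
        # log(1 + e^z) - y*z is the per-sample cross-entropy
        loss += (max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z) / n
    return w1, w2, b, loss

# Small overlapping toy data set (hypothetical, not the simulator's data).
points = [(-2.0, -1.0), (-1.0, -2.0), (0.5, 0.2),
          (-0.5, -0.3), (1.0, 2.0), (2.0, 1.0)]
labels = [0, 0, 0, 1, 1, 1]

results = {}
for eta in (0.001, 0.01, 0.1, 1.0):
    results[eta] = fit(points, labels, eta)
    w1, w2, b, loss = results[eta]
    print(f"eta={eta:<6} w1={w1:.3f} w2={w2:.3f} b={b:.3f} loss={loss:.4f}")
```

With the smallest step the weights barely leave zero and the loss stays close to ln 2, the value at the all-zero starting point.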
