Linear SVM Simulator — Soft-Margin 2D Classification

Subgradient descent minimizes hinge loss plus L2 regularization to learn a maximum-margin separating line. See in real time how support vectors emerge and how C and label noise reshape the decision boundary.

Parameters
Regularization C
Learning rate η
Iterations (steps)
Data noise level σ

Dataset is generated by a fixed seed-42 LCG (deterministic). Class +1 centred at (2,2), -1 at (-2,-2), 20 points each, σ=1.0.
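
The page does not state the LCG's constants or how uniform draws become Gaussian noise, so the following is only a sketch under assumptions. The multiplier, increment, and modulus are the widely used Numerical Recipes values, Box-Muller sampling is an assumed choice, and `LCG` and `make_dataset` are hypothetical names. It will not reproduce the simulator's exact points; it only illustrates how a fixed seed makes the dataset deterministic.

```python
import math


class LCG:
    """Linear congruential generator (constants are an assumption)."""

    def __init__(self, seed=42):
        self.state = seed

    def uniform(self):
        # Numerical Recipes constants, modulus 2**32.
        self.state = (1664525 * self.state + 1013904223) % 2**32
        # Map to the open interval (0, 1) so that log(u) is always defined.
        return (self.state + 1) / (2**32 + 1)

    def gauss(self, mu, sigma):
        # Box-Muller transform (assumed; the page does not name a method).
        u1, u2 = self.uniform(), self.uniform()
        return mu + sigma * math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def make_dataset(n_per_class=20, sigma=1.0, seed=42):
    """Class +1 centred at (2, 2), class -1 at (-2, -2)."""
    rng = LCG(seed)
    X, y = [], []
    for label, (cx, cy) in ((1, (2.0, 2.0)), (-1, (-2.0, -2.0))):
        for _ in range(n_per_class):
            X.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
            y.append(label)
    return X, y


X, y = make_dataset()
print(len(X), y.count(1), y.count(-1))
```

Calling `make_dataset()` twice returns identical points, which is what makes runs of the tool comparable across parameter changes.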

Results
Training accuracy
Margin width 2/‖w‖
Support vectors (in/on margin)
‖w‖_2

2D Classification and Maximum-Margin Hyperplane

Blue dots = class +1 / Red crosses = class -1 / Green solid = w·x+b=0 / Green dashed = margin boundaries w·x+b=±1 / Black circles = support vectors

Theory & Key Formulas

A soft-margin linear SVM learns the maximum-margin separating hyperplane by minimizing the sum of a hinge loss and an L2 regularizer.

Decision function. w is the weight vector and b the bias:

$$f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$$

Objective. Minimizing the first term widens the margin, whose width is $2/\|\mathbf{w}\|$; the second term is the total hinge loss, weighted by the regularization parameter C:

$$J(\mathbf{w},b) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\max\bigl(0,\,1 - y_i\,f(\mathbf{x}_i)\bigr)$$
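
As a small worked example, the objective can be evaluated by hand on three points. The points, weights, and the helper names `decision` and `objective` are illustrative assumptions, not values taken from the simulator.

```python
def decision(w, b, x):
    """f(x) = w . x + b for 2-D inputs."""
    return w[0] * x[0] + w[1] * x[1] + b


def objective(w, b, X, y, C):
    """J(w, b) = 0.5 * ||w||^2 + C * sum of hinge losses."""
    regularizer = 0.5 * (w[0] ** 2 + w[1] ** 2)
    hinge = sum(max(0.0, 1.0 - yi * decision(w, b, xi)) for xi, yi in zip(X, y))
    return regularizer + C * hinge


# Illustrative data: two points outside the margin, one inside it.
X = [(2.0, 2.0), (0.5, 0.0), (-2.0, -2.0)]
y = [1, 1, -1]
w, b = (0.5, 0.5), 0.0

J = objective(w, b, X, y, C=1.0)
print(J)  # 0.25 (regularizer) + 0.75 (hinge of the second point) = 1.0
```

Only the second point, with $y f(\mathbf{x}) = 0.25 < 1$, contributes to the hinge term; the other two lie outside the margin and cost nothing.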

Subgradient (only the violating points $1 - y_i f(\mathbf{x}_i) > 0$ contribute):

$$\frac{\partial J}{\partial \mathbf{w}} = \mathbf{w} - C\sum_{i\in\mathcal{V}} y_i\,\mathbf{x}_i,\quad \frac{\partial J}{\partial b} = -C\sum_{i\in\mathcal{V}} y_i$$
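
The update rule implied by these subgradients can be sketched as a full-batch loop. This is a minimal illustration, not the simulator's actual code: the function name `train_svm`, the zero initialization, and the toy data are assumptions, while the defaults for C, η, and the iteration count follow the values quoted elsewhere on this page.

```python
def train_svm(X, y, C=1.0, eta=0.01, steps=500):
    """Full-batch subgradient descent on 0.5*||w||^2 + C * sum(hinge)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(steps):
        # The regularizer contributes w itself to the gradient.
        grad_w = [w[0], w[1]]
        grad_b = 0.0
        for (x1, x2), yi in zip(X, y):
            # Only violating points (1 - y_i f(x_i) > 0) contribute.
            if 1.0 - yi * (w[0] * x1 + w[1] * x2 + b) > 0.0:
                grad_w[0] -= C * yi * x1
                grad_w[1] -= C * yi * x2
                grad_b -= C * yi
        w[0] -= eta * grad_w[0]
        w[1] -= eta * grad_w[1]
        b -= eta * grad_b
    return w, b


# Toy, linearly separable data (an assumption, not the simulator's dataset).
X = [(2.0, 2.0), (3.0, 1.0), (1.0, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]

w, b = train_svm(X, y)
correct = sum(yi * (w[0] * x1 + w[1] * x2 + b) > 0 for (x1, x2), yi in zip(X, y))
accuracy = correct / len(y)
print(w, b, accuracy)
```

With a constant step size the iterates do not settle exactly; they hover around the minimizer, which is one reason the learning rate matters so much for this method.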

The margin width is $2/\|\mathbf{w}\|$. Support vectors are the points with $y_i f(\mathbf{x}_i) \le 1$, up to a small numerical tolerance, that is, the points on or inside the margin.
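
The two result quantities can be computed directly from a given $(\mathbf{w}, b)$. The sketch below counts every point on or inside the margin as a support vector; the tolerance `tol`, the helper names, and the sample points are assumptions, and the simulator's own threshold is not stated on this page.

```python
import math


def margin_width(w):
    """Distance between the lines w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.hypot(w[0], w[1])


def support_vector_indices(w, b, X, y, tol=1e-9):
    """Indices of points with y_i * f(x_i) <= 1 (on or inside the margin)."""
    return [
        i
        for i, ((x1, x2), yi) in enumerate(zip(X, y))
        if yi * (w[0] * x1 + w[1] * x2 + b) <= 1.0 + tol
    ]


# Illustrative values, not taken from the simulator.
w, b = (0.5, 0.5), 0.0
X = [(1.0, 1.0), (3.0, 3.0), (-1.0, -1.0), (-0.5, 0.0)]
y = [1, 1, -1, -1]

width = margin_width(w)
sv = support_vector_indices(w, b, X, y)
print(width, sv)
```

Here points 0 and 2 sit exactly on the margin boundaries, point 3 lies inside the band, and point 1 is far outside and therefore not a support vector.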

What is the Linear SVM Simulator

🙋
I have heard of SVM, but what is it actually doing?
🎓
Roughly, it is an algorithm that draws a line (or a hyperplane in higher dimensions) that splits two classes "with as much breathing room as possible". In the simulator above it is learning the green line that separates the blue dots from the red crosses. The key point is not just to split them, but to choose the line where the distance to the closest point on each side — the margin — is largest. That is why it is called a "maximum-margin classifier".
🙋
There are two green dashed lines too. What are those?
🎓
Those are the margin boundaries, $w\cdot x + b = \pm 1$, and the width of that band, $2/\|w\|$, is the margin width. SVM tries to widen this band as much as possible while still classifying the training points correctly. Push the regularization C from 1 up to 100 — as C grows, the penalty on misclassifications becomes heavier and the band shrinks to fit individual points more aggressively.
🙋
The "support vectors" card shows a number. Are those the points circled in black?
🎓
Exactly. Points sitting on or inside the margin boundaries are the support vectors. The fascinating part of SVM is that the final decision boundary is determined entirely by these support vectors. "Easy" points far outside the margin can be added in any number without changing the boundary at all. That is why they are named "support vectors" — they support the decision.
🙋
When I move the noise slider the blue and red start to overlap, and the margin gets wider. That feels backwards.
🎓
Good observation. As the overlap grows, many points end up inside the margin. The soft-margin SVM "tolerates" this through the hinge loss, with C controlling how heavy the penalty is. Small C means "keep the margin wide and accept some violations"; large C means "punish violations hard and shrink the margin to classify correctly". In practice C is tuned by cross-validation.

Frequently Asked Questions

Raising the learning rate η makes each subgradient step take a bigger jump. Too small and convergence is slow; too large and w and b oscillate or diverge. Around 0.01 is a stable choice for this tool. Pushing η to 0.5, especially with a large C, makes the objective oscillate noticeably and can transiently drop the training accuracy. It is a useful exercise in the numerical stability of optimization.
There are three main causes when the training accuracy stays below 100%. First, large noise makes the data linearly inseparable. Second, C is too small, so the misclassification penalty is light and the model deliberately tolerates a few errors. Third, the iteration count is too low, so optimization has not converged. With noise=0, C=1, η=0.01, and 500 iterations the default dataset reaches 100%.
When the data is not linearly separable, there are three options. First, add features so the data becomes linearly separable in a richer space (e.g., polynomial features such as x1² and x1·x2). Second, use a kernel SVM (RBF, polynomial, sigmoid). Third, accept some errors with a soft margin, prioritizing robustness. With real data the second and third options are usually combined.
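
The feature-expansion option can be illustrated on an XOR-style layout, a standard example of data that no straight line separates. The four points, the coarse search grid, and the names `separates` and `phi` are assumptions chosen for illustration.

```python
from itertools import product

# XOR-style layout: same-sign corners are +1, mixed-sign corners are -1.
X = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
y = [1, 1, -1, -1]


def separates(w, b, feats, labels):
    """True if sign(w . f + b) matches the label for every point."""
    return all(
        yi * (sum(wj * fj for wj, fj in zip(w, f)) + b) > 0
        for f, yi in zip(feats, labels)
    )


# A coarse grid search over (w1, w2, b) finds no separating line in 2-D.
grid = [k / 2 for k in range(-4, 5)]
found_2d = any(separates((w1, w2), b, X, y) for w1, w2, b in product(grid, repeat=3))


def phi(x):
    """Append the product feature x1 * x2."""
    return (x[0], x[1], x[0] * x[1])


# In the expanded space a plane that looks only at x1 * x2 separates the classes.
Z = [phi(x) for x in X]
found_3d = separates((0.0, 0.0, 1.0), 0.0, Z, y)

print(found_2d, found_3d)
```

The grid search is only a spot check, but the failure in 2-D is not an accident of the grid: the two +1 corners force b > 0 while the two -1 corners force b < 0.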
Linear SVM and logistic regression are both linear classifiers but use different losses. SVM uses hinge loss, which is exactly zero for correctly classified points outside the margin. Logistic regression uses cross-entropy and always has a non-zero gradient on every point. As a result, SVM gives a sparse solution determined only by the support vectors, while logistic regression gives a smooth solution with probability outputs. Choose logistic regression when you need probabilities, SVM when you want margin-based discrimination.
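
The difference between the two losses shows up in their derivatives with respect to the signed margin $m = y\,f(\mathbf{x})$. The sketch below compares them at a few margins; the function names are illustrative, and at the kink $m = 1$ the hinge subgradient is taken as 0, matching the strict violation condition used above.

```python
import math


def hinge_grad(m):
    """Subgradient of max(0, 1 - m) with respect to the margin m."""
    return -1.0 if m < 1.0 else 0.0


def logistic_grad(m):
    """Derivative of log(1 + exp(-m)) with respect to the margin m."""
    return -1.0 / (1.0 + math.exp(m))


for m in (0.5, 1.5, 5.0):
    print(f"m={m}: hinge {hinge_grad(m)}, logistic {logistic_grad(m):.4f}")
```

For $m = 5$ the hinge derivative is exactly zero, so such a point has no influence on the update, whereas the logistic derivative is small but never vanishes.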

Real-World Applications

Text classification and spam filtering: SVM was the standard text classifier in the 2000s. Documents are encoded as word-frequency vectors (TF-IDF), and a linear SVM performs spam detection or topic classification. SVM's strength on high-dimensional sparse vectors fits text data, where the vocabulary easily reaches tens of thousands of dimensions. It still serves as a strong baseline for simple text classification.

Image classification and bioinformatics: HOG features + linear SVM for human detection (Dalal & Triggs, 2005) was a de facto standard for pedestrian detection before deep learning. In bioinformatics, kernel SVMs are widely used for gene-expression and protein-structure classification. SVM copes well with the "small n, large p" regime, where there are few samples but many features.

Anomaly detection (One-Class SVM): One-Class SVM learns the boundary of "normal" from normal data alone and flags outliers. It is used in defect detection on production lines, sensor anomaly detection, and network-intrusion detection. It is robust to extreme class imbalance and does not need labelled abnormal samples — major practical advantages.

Education and theory: SVM packages many central machine-learning concepts in one place — margin maximization, dual formulation, kernel methods, convex optimization, structural risk minimization. Even in the deep-learning era, it is treated as a required topic in universities and industry training as a foundational "shape" of modern machine learning.

Common Misconceptions and Cautions

The most common misconception is to think that "the larger C is, the better the performance". C is the weight on the hinge loss; raising it tightens the fit to the training data but also raises the risk of overfitting. Lowering C widens the margin, which often improves generalization at the cost of training accuracy, although a C that is too small underfits. Sweep the simulator's C from 0.01 to 100 and you will see the margin width change continuously. In practice you grid-search C with cross-validation. The right C is dataset-specific; there is no universal value.
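
The effect of C on the margin can be sketched with a compact subgradient trainer on toy data. Everything here is an assumption for illustration: the data, the function name `train_svm`, and the choice to compare only C = 0.01 and C = 1 at a fixed η = 0.01, because with this constant step size larger C values make the iterates oscillate strongly.

```python
import math


def train_svm(X, y, C, eta=0.01, steps=500):
    """Compact full-batch subgradient descent for the soft-margin objective."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        gw, gb = [w[0], w[1]], 0.0
        for (x1, x2), yi in zip(X, y):
            if yi * (w[0] * x1 + w[1] * x2 + b) < 1.0:  # violating point
                gw[0] -= C * yi * x1
                gw[1] -= C * yi * x2
                gb -= C * yi
        w = [w[0] - eta * gw[0], w[1] - eta * gw[1]]
        b -= eta * gb
    return w, b


# Toy data (not the simulator's dataset).
X = [(2.0, 2.0), (3.0, 1.0), (1.0, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]

widths = {}
for C in (0.01, 1.0):
    w, b = train_svm(X, y, C)
    widths[C] = 2.0 / math.hypot(w[0], w[1])
    inside = sum(yi * (w[0] * x1 + w[1] * x2 + b) < 1.0 for (x1, x2), yi in zip(X, y))
    print(f"C={C}: margin width {widths[C]:.2f}, points inside margin {inside}")
```

With the small C the learned band is so wide that every training point falls inside it, even though all points are still on the correct side of the boundary; with C = 1 the band narrows toward the closest points.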

The next most common error is to assume "subgradient descent is the canonical SVM solver". This tool implements it for educational simplicity, but production SVM solvers typically attack the dual problem: LIBSVM, which scikit-learn's SVC wraps, uses an SMO-type (Sequential Minimal Optimization) algorithm. The subgradient method does appear in large-scale online learning (Pegasos), but it generally converges more slowly to a high-accuracy solution and is more sensitive to the step size than dual methods. This tool is a device to feel the relationship between hinge loss and margin maximization.

Finally, beware of the misconception that SVM is "scale-insensitive". Linear SVM is in fact very sensitive to feature scaling. Feed it data with x1 in [0,1] and x2 in [0,10000] and x2 dominates w·x: a tiny w2 is enough for x2 to matter, while x1 would need a large w1 that the L2 penalty discourages, so the contribution of x1 is essentially ignored. This tool uses 2-D data on a common scale so the issue does not appear, but with real data you should always standardize (StandardScaler) or normalize. Forget this and tuning C becomes meaningless.
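
Standardization itself is a two-line computation per feature. The sketch below uses only the standard library and mirrors the mean removal and division by the population standard deviation that scikit-learn's StandardScaler performs; the helper name `standardize` and the sample columns are illustrative assumptions.

```python
import statistics


def standardize(column):
    """Rescale a feature column to zero mean and unit (population) variance."""
    mu = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mu) / sd for v in column]


# Two features with the same shape but very different scales.
x1 = [0.0, 0.25, 0.5, 0.75, 1.0]
x2 = [0.0, 2500.0, 5000.0, 7500.0, 10000.0]

z1 = standardize(x1)
z2 = standardize(x2)
print(z1)
print(z2)
```

After rescaling, both columns take the same values, so neither feature dominates w·x purely because of its units. On real data the mean and standard deviation should be computed on the training split and reused for validation and test data.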