Random Forest Majority Vote Simulator
Machine Learning Simulator

Random Forest Majority Vote — Bagging and Variance Reduction

Train T decision trees with bootstrap sampling and feature subsampling, then predict by majority vote. Compare a single tree against the ensemble and see why aggregation reduces variance.

Parameters
Number of trees T (trees)
Max depth per tree (levels)
Query point x
Query point y

The training set is fixed (LCG seed = 42, N = 80, two classes, σ = 1.5). Tree construction is also deterministic via derived seeds.

Results
Single tree training accuracy
Ensemble training accuracy
Predicted class at query
Top-class vote share
Majority-Vote Decision Boundary and Query Vote Distribution

Background = vote-majority decision region (red = class 0 / blue = class 1) / dots = training data / yellow X = query / bottom-right bar = vote split

Theory & Key Formulas

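Bootstrap sample: draw N points with replacement from the N training points. On average about 63.2% of the original points appear at least once, and each tree is trained on its own bootstrap sample. The probability that a given point is never picked is: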

$$P(\text{not picked}) = (1 - 1/N)^N \;\xrightarrow{N\to\infty}\; e^{-1} \approx 0.368$$
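As a quick numerical check of the limit above, the following sketch draws many bootstrap samples and measures the fraction of distinct points. N = 80 matches the simulator's training set; the function name, trial count and seed are illustrative assumptions, not part of the simulator.

```python
import math
import random

def unique_fraction(n, trials=2000, seed=0):
    """Average fraction of distinct points in a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw n indices with replacement from range(n).
        sample = [rng.randrange(n) for _ in range(n)]
        total += len(set(sample)) / n
    return total / trials

n = 80
simulated = unique_fraction(n)
exact = 1 - (1 - 1 / n) ** n   # expected unique fraction for finite n
limit = 1 - math.exp(-1)       # large-n limit, about 0.632

print(f"simulated={simulated:.3f} exact={exact:.3f} limit={limit:.3f}")
```

For N = 80 the finite-sample value sits slightly above the limit, because (1 - 1/N)^N approaches e^{-1} from below.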

Two-class majority-vote prediction:

$$\hat{y}(\mathbf{x}) = \mathrm{argmax}_{c \in \{0,1\}} \sum_{t=1}^{T} \mathbb{1}[h_t(\mathbf{x}) = c]$$
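The vote count in this formula can be sketched in a few lines. The list of votes below is hypothetical, and the tie-breaking rule (the smaller class label wins) is an assumption, since the simulator's own rule is not stated.

```python
from collections import Counter

def majority_vote(votes):
    """Return (predicted_class, top_class_vote_share) for a list of tree votes."""
    counts = Counter(votes)
    # Assumption: on a tie, the smaller class label wins.
    winner = min(counts, key=lambda c: (-counts[c], c))
    return winner, counts[winner] / len(votes)

# Hypothetical votes h_t(x) from T = 10 trees at a single query point.
tree_votes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
label, share = majority_vote(tree_votes)
print(label, share)
```

The second return value corresponds to the "top-class vote share" reported in the Results panel.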

Variance of an average of weak learners (the bagging effect):

$$\mathrm{Var}(\bar{h}) = \frac{\sigma^2}{T} + \rho\,\sigma^2\,\frac{T-1}{T}$$

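Here σ² is the variance of a single tree's prediction (not the data noise σ above) and ρ is the pairwise correlation between trees. As T grows, the first term vanishes and the variance approaches the floor ρσ². Feature subsampling lowers ρ, and with it that floor, so adding trees keeps paying off for longer.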
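The variance formula can be checked by simulation. The sketch below does not train trees: it assumes each learner's output is a Gaussian with variance σ² built from one shared component and one private component, which gives every pair of learners correlation ρ. All names and parameter values are illustrative.

```python
import random
import statistics

def simulated_variance(T, rho, sigma=1.0, trials=10000, seed=0):
    """Empirical variance of the average of T equicorrelated learners."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # component common to all learners
        outputs = [
            sigma * (rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0.0, 1.0))
            for _ in range(T)
        ]
        averages.append(sum(outputs) / T)
    return statistics.pvariance(averages)

def formula_variance(T, rho, sigma=1.0):
    """Var(h_bar) = sigma^2 / T + rho * sigma^2 * (T - 1) / T."""
    return sigma**2 / T + rho * sigma**2 * (T - 1) / T

results = {}
for rho in (0.0, 0.3, 0.8):
    results[rho] = (simulated_variance(20, rho), formula_variance(20, rho))
    print(f"rho={rho}: simulated={results[rho][0]:.4f} formula={results[rho][1]:.4f}")
```

With ρ = 0 the variance falls to σ²/T, while with ρ = 0.8 it stays close to ρσ² however many learners are averaged.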

What is the Random Forest Majority Vote Simulator

🙋
I have heard that random forest is "build many decision trees and vote". Does that really raise accuracy? If you train on the same data, won't all trees come out identical?
🎓
Sharp question. The trick is two layers of randomization: bootstrap sampling and feature subsampling. Bootstrap draws N points with replacement from the original N, so some samples appear multiple times and about 37% never appear at all. Each tree therefore sees slightly different data. On top of that, each split considers only a random subset of the features, which decorrelates the trees. In the simulator, sweep T from 1 to 10 to 20: the decision boundary changes from a jagged staircase to a smooth curve.
🙋
Right, T = 1 is very jagged but T = 20 is round and smooth. The "single tree training accuracy" and "ensemble training accuracy" don't differ that much, though.
🎓
Training accuracy is high even for a single tree — in fact a deep tree tends to overfit it. What really matters is test accuracy and variance reduction. Mathematically, the variance of an average of T independent learners drops to σ²/T. Real trees aren't fully independent so a ρσ²(T-1)/T term remains, but feature subsampling lowers ρ enough that bagging keeps reducing variance. We don't show a hold-out set for rendering speed, but watch how the boundary morphs toward a clean diagonal line between the two class centers — that's overfitting being averaged away.
🙋
When I put the query at (0, 0) — right between the two classes — the votes split nearly evenly. The "top-class vote share" is 60 % or so. What does that mean?
🎓
It's a measure of prediction uncertainty. The center is an ambiguous region that could belong to either class, so different trees decide differently. Move the query to (-3, -3) or (3, 3) and the share jumps to 95 % or 100 %. In practice, this vote share is used as a "probability" (predict_proba in scikit-learn, which averages each tree's class probabilities): treat samples whose top-class share is close to 0.5 as undecided and those above 0.9 as confident. With two classes the top-class share can never fall below 0.5. Credit scoring and medical decision support use the model in much the same way.
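A minimal scikit-learn sketch of this idea follows. The data set only resembles the simulator's (two blobs, N = 80, σ = 1.5); the class centers, the random seeds and the 0.6 "undecided" cut-off are assumptions made for illustration. It computes the hard vote share by polling each tree, and compares it with predict_proba.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Assumed stand-in for the simulator's data: class 0 near (-2, -2), class 1 near (2, 2).
X, y = make_blobs(n_samples=80, centers=[(-2, -2), (2, 2)],
                  cluster_std=1.5, random_state=42)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

queries = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, -3.0]])

# Hard votes: one 0/1 prediction per tree, then the fraction voting for class 1.
hard_votes = np.stack([tree.predict(queries) for tree in forest.estimators_])
share_class1 = hard_votes.mean(axis=0)
top_share = np.maximum(share_class1, 1.0 - share_class1)

# predict_proba averages the per-tree class probabilities instead of counting votes.
proba = forest.predict_proba(queries)

for q, s, p in zip(queries, top_share, proba):
    if s >= 0.9:
        status = "confident"
    elif s <= 0.6:  # illustrative cut-off, not taken from the simulator
        status = "undecided"
    else:
        status = "intermediate"
    print(q, f"top vote share={s:.2f}", f"predict_proba={p.round(2)}", status)
```

When every leaf is pure, as is typical for fully grown trees, the two quantities coincide; with depth-limited trees they can differ.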
🙋
If I drop the max depth to 1 the accuracy collapses, but at depth 5 it goes near 100 %. So deeper is always better?
🎓
No, that's the trap. Deep trees memorize the training set and hit 100 % training accuracy, but a single deep tree generalizes poorly to test data. Here is the elegant part of random forests: even when individual trees overfit, each one overfits to a different bootstrap sample, so much of the noise cancels out under voting. That's why the usual default in RF is "deep trees, lots of them". Gradient boosting takes the opposite stance: shallow trees, corrected sequentially.
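The simulator shows only training accuracy, so the gap between memorizing and generalizing is easier to see with a held-out split. The sketch below assumes a larger synthetic data set of the same shape (400 points, assumed class centers) and compares one fully grown tree with a forest of fully grown trees.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data: two overlapping blobs, half held out for testing.
X, y = make_blobs(n_samples=400, centers=[(-2, -2), (2, 2)],
                  cluster_std=1.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# max_depth=None lets every tree grow until its leaves are pure.
tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, max_depth=None,
                                random_state=0).fit(X_train, y_train)

scores = {
    "tree_train": tree.score(X_train, y_train),
    "tree_test": tree.score(X_test, y_test),
    "forest_train": forest.score(X_train, y_train),
    "forest_test": forest.score(X_test, y_test),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The single tree reaches perfect training accuracy by construction; how much the forest gains on the test split depends on the data and the seeds.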

Frequently Asked Questions

What is the out-of-bag (OOB) error?
For each tree, bootstrap sampling picks roughly 63 % of the original data for training and leaves about 37 % "out of the bag". For each sample you predict using only the trees that did not see it during training, then aggregate those predictions to get an error rate. This is the OOB error. It provides a generalization estimate without a separate hold-out set, so it is widely used with random forests as a substitute for cross-validation. In scikit-learn, set oob_score=True and read the fitted model's oob_score_ attribute, which is the OOB accuracy (the OOB error is one minus it).
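A minimal scikit-learn sketch of retrieving the OOB estimate is shown below. The data set is an assumed stand-in for the simulator's (two blobs, N = 80, σ = 1.5), and the tree count and seeds are arbitrary.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Assumed stand-in data; the simulator's exact points are not reproduced here.
X, y = make_blobs(n_samples=80, centers=[(-2, -2), (2, 2)],
                  cluster_std=1.5, random_state=42)

# oob_score=True requires bootstrap=True, which is the default.
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)

oob_accuracy = forest.oob_score_   # accuracy on out-of-bag predictions
oob_error = 1.0 - oob_accuracy
print(f"OOB accuracy={oob_accuracy:.3f}  OOB error={oob_error:.3f}")
```

No data is set aside here: every point is scored only by the trees whose bootstrap sample missed it.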
How is feature importance measured?
Two methods are common. MDI (Mean Decrease in Impurity) averages, over all trees, the impurity (e.g. Gini) reduction each feature produces when used in a split. It is fast but biased toward high-cardinality features: those with many categories or fine-grained continuous values. Permutation importance shuffles a feature's values and measures the resulting drop in accuracy on OOB or held-out data, which is more expensive but considered fairer. SHAP values are an alternative that distributes the credit consistently across features.
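The two importance measures can be compared on a small synthetic example. Everything about the data is an assumption made for illustration: one informative feature, one meaningless ID-like feature with a unique value per row, and 10 % label noise. Note that scikit-learn's permutation_importance scores on whatever data you pass in (here a held-out split) rather than on OOB samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)                 # informative feature
row_id = rng.permutation(n).astype(float)   # unique but meaningless ID-like feature
flip = rng.random(n) < 0.1                  # 10 % label noise
y = ((signal > 0) ^ flip).astype(int)
X = np.column_stack([signal, row_id])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# MDI: computed from the training-time impurity reductions, normalized to sum to 1.
mdi = forest.feature_importances_

# Permutation importance: mean drop in held-out accuracy when a column is shuffled.
perm = permutation_importance(forest, X_test, y_test,
                              n_repeats=10, random_state=0).importances_mean

for name, m, p in zip(["signal", "row_id"], mdi, perm):
    print(f"{name}: MDI={m:.3f}  permutation={p:.3f}")
```

In this setup the ID-like column tends to receive a noticeable MDI share because deep trees split on it to fit the noisy labels, while its permutation importance on held-out data stays near zero.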
How does random forest differ from gradient boosting?
Random forest trains trees independently of one another, so they can be built in parallel (bagging), and aggregates them by vote or average. The aim is to reduce variance; it is robust to overfitting and easy to tune. Gradient boosting (XGBoost, LightGBM, CatBoost, etc.) trains trees sequentially so that each one corrects the residuals of the previous ones. It reduces bias and often reaches higher accuracy, but it is more sensitive to the learning rate and tree count and more prone to overfitting if poorly tuned.
How many trees should I use?
In theory the generalization error converges to a limit as T grows, so adding trees does not cause overfitting (Breiman's argument via the law of large numbers). The OOB error estimate follows the same downward trend, although it fluctuates and is not strictly monotonic. In practice accuracy plateaus around T = 100 to 500. Since training and inference cost grow linearly with T, the rule of thumb is to plot the OOB curve and stop once it flattens. This simulator caps T at 50 for rendering speed, but the qualitative behavior is the same: beyond T = 20 or so, the decision boundary barely changes.
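One simple way to trace an OOB curve is to fit a forest at several values of T and record the OOB error each time, as sketched below. The data set, the grid of T values and the seeds are assumptions; the grid starts at 25 so that every point is out of bag for at least a few trees.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Assumed stand-in data: two overlapping blobs.
X, y = make_blobs(n_samples=80, centers=[(-2, -2), (2, 2)],
                  cluster_std=1.5, random_state=42)

tree_counts = (25, 50, 100, 200)
oob_curve = []
for t in tree_counts:
    forest = RandomForestClassifier(n_estimators=t, oob_score=True, random_state=0)
    forest.fit(X, y)
    oob_curve.append(1.0 - forest.oob_score_)  # OOB error for this T

for t, err in zip(tree_counts, oob_curve):
    print(f"T={t:4d}  OOB error={err:.3f}")
```

On a data set this small the curve is noisy, which is why the stopping point is judged from where it levels off rather than from any single value.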

Real-World Applications

Credit scoring and lending decisions: Banks routinely use random forests and gradient boosting to estimate default probability from features like income, employment tenure, delinquency history and transaction patterns. The vote share serves directly as a "risk score", and the relatively interpretable feature-importance ranking makes it easier to satisfy regulators who require explanations.

Medical diagnosis and gene expression analysis: Random forests are robust on high-dimensional "features >> samples" data such as gene expression matrices. Feature subsampling helps even when there are tens of thousands of noisy candidates, and the model produces stable importance rankings of disease-related genes. They are a workhorse in bioinformatics.

Manufacturing defect detection and predictive maintenance: Random forests are easy to deploy on the shop floor for classifying defective products or predicting equipment failure from sensor data. Training parallelizes well, inference is fast, and the model is robust to outliers. Thresholding on the vote share supports the practical workflow of "auto-classifying confident cases and routing the gray zone to a human operator".

Recommender systems and demand forecasting: E-commerce and retail use random forests to predict purchase probability or demand from user attributes and transaction history. They are an excellent baseline before serious feature engineering, since they capture interactions automatically. Even in the deep-learning era, tree ensembles such as random forests remain among the strongest models on tabular data.

Common Misconceptions and Cautions

The most common misconception is that "more trees always means more accuracy". The OOB error does trend downward as T grows, but it plateaus around 100 to 500 trees and gains beyond that are tiny. Meanwhile training cost, inference cost and memory usage all grow linearly with T. Sweep the simulator from 1 to 10 to 30 to 50 trees: the difference between T = 10 and T = 30 is obvious, but between T = 30 and T = 50 you can barely see anything. In real projects the right move is to stop once the OOB curve flattens.

The next pitfall is treating MDI feature importance as gospel. MDI has a known bias that overestimates features with many categories or fine-grained continuous values. If you accidentally include a meaningless but unique feature like a customer ID, it can rank near the top. Pair MDI with permutation importance or SHAP values for safety, and remember that highly correlated features split their importance among themselves.

Finally, beware of thinking random forest is a universal hammer. It is a strong choice on tabular data, but it usually loses to deep learning on images, audio and text. It also cannot extrapolate: outside the training range each tree returns the value of its outermost leaf, so predictions stay flat no matter how far the input moves. And on heavily imbalanced data the majority vote tends to return the dominant class, so remedies such as class_weight, threshold tuning or SMOTE are often needed. In the simulator, push the query to the corners (±5, ±5) and watch the vote share become extreme: the forest stays just as confident far from the data, which is a glimpse of this extrapolation weakness.
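The extrapolation limit is easiest to see in a regression setting. The sketch below is an assumed toy example, unrelated to the simulator's data: a forest is trained on y ≈ 2x for x in [0, 10] and then queried inside and far outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 10.0, size=200)
y_train = 2.0 * x_train + rng.normal(scale=0.5, size=200)  # noisy line y = 2x

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x_train.reshape(-1, 1), y_train)

# x = 5 lies inside the training range; x = 20 lies far outside it.
inside, outside = forest.predict(np.array([[5.0], [20.0]]))
print(f"prediction at x=5:  {inside:.2f} (true trend: 10)")
print(f"prediction at x=20: {outside:.2f} (true trend: 40)")
print(f"largest training target: {y_train.max():.2f}")
```

Each tree predicts an average of training targets, so the forest's output can never exceed the largest target it saw, however far the query moves.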