Gradient Boosting Simulator

Parameters

Number of boosting trees

Number of weak learners M added sequentially

Learning rate (shrinkage) η

How much each tree is shrunk before it is added

Depth of each tree

Levels of a single regression tree (deeper = more expressive)

Data noise standard deviation

Magnitude σ of the Gaussian noise added to observations

Number of data points

Number of 1-D data points used for training

Results

Training MSE

Number of trees

Learning rate η

Error reduction (%)

Noise floor σ²

State

Ensemble build-up — adding one tree at a time

The plot shows the noisy training points (blue), the true function (faint curve) and the gradient-boosting prediction (bold staircase). The animation adds one tree at a time, so the staircase gradually hugs the data, and then loops.

Training error vs number of trees

Ensemble prediction (true function, final fit, data)

Theory & Key Formulas

$$F_m(x)=F_{m-1}(x)+\eta\,h_m(x),\qquad h_m\ \text{fits}\ r_i=y_i-F_{m-1}(x_i)$$

The stage-m ensemble $F_m$ is the previous ensemble $F_{m-1}$ plus a new tree $h_m$ shrunk by the learning rate $\eta$. Each tree $h_m$ fits the residuals $r_i$, which for squared-error loss are the negative gradient of the loss.
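The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the simulator's actual code: it assumes depth-1 trees (stumps), a true function sin(2πx), σ = 0.2 and 60 evenly spaced points, none of which are stated in the text. The names `fit_stump` and `boost` are hypothetical helpers.

```python
import math
import random


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one split) to the residuals by least squares."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None  # (sum of squared errors, threshold, left value, right value)
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # identical x values cannot be separated
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, left_mean, right_mean)
    if best is None:  # every x is identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=50, eta=0.1):
    """F_m = F_{m-1} + eta * h_m, where h_m is fitted to r_i = y_i - F_{m-1}(x_i)."""
    f0 = sum(ys) / len(ys)            # F_0: the constant (mean) prediction
    preds = [f0] * len(xs)
    trees = []
    history = [mse(ys, preds)]        # history[m] = training MSE after m trees
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        trees.append(h)
        preds = [p + eta * h(x) for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    predict = lambda x: f0 + eta * sum(h(x) for h in trees)
    return predict, history


# Assumed toy data: sin(2*pi*x) plus Gaussian noise with sigma = 0.2.
random.seed(0)
n_points, sigma = 60, 0.2
xs = [i / (n_points - 1) for i in range(n_points)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0.0, sigma) for x in xs]

predict, history = boost(xs, ys, n_trees=50, eta=0.1)
print(f"MSE_0 = {history[0]:.4f}, MSE_M = {history[-1]:.4f}")
```

Because every leaf returns a constant, `predict` stays flat for any x outside the training range, which is the staircase behaviour the plots show.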

$$r_i=y_i-F_{m-1}(x_i)=-\left.\frac{\partial L}{\partial F}\right|_{F=F_{m-1}},\qquad L=\tfrac12\,(y-F)^2$$

The residual is the negative gradient of the squared-error loss $L$ with respect to the prediction $F$, so "fit a tree to the residuals, then add it" is one step of gradient descent in function space.
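For readers who want the intermediate step, the derivative can be written out for a single training point $(x_i, y_i)$. The first line only expands the formula above; the numbers in the second line are made up for illustration and assume, for simplicity, a tree that reproduces the residual exactly at $x_i$.

$$\frac{\partial L}{\partial F}=\frac{\partial}{\partial F}\,\tfrac12\,(y_i-F)^2=-(y_i-F)\quad\Longrightarrow\quad-\left.\frac{\partial L}{\partial F}\right|_{F=F_{m-1}(x_i)}=y_i-F_{m-1}(x_i)=r_i$$

$$y_i=3,\ F_{m-1}(x_i)=2.5\ \Rightarrow\ r_i=0.5;\qquad \eta=0.1,\ h_m(x_i)=r_i\ \Rightarrow\ F_m(x_i)=2.5+0.1\times0.5=2.55$$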

$$\text{MSE}=\frac1N\sum_{i=1}^{N}\bigl(y_i-F_M(x_i)\bigr)^2,\qquad \text{reduction}=\Bigl(1-\frac{\text{MSE}_M}{\text{MSE}_0}\Bigr)\times100$$

The training MSE is the mean squared error on the training points after all M trees have been added. Reduction is the percentage improvement over MSE₀, the error of the constant prediction (the mean of y). When the MSE drops well below the noise floor σ², the model is overfitting.
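The two formulas, plus a comparison against the noise floor, might look like this in Python. The `fit_state` rule and its `low` and `high` thresholds are assumptions made for illustration; the text does not say which cut-offs the tool uses for its state label.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def error_reduction(mse_0, mse_m):
    """Percentage improvement of MSE_M over the constant-prediction error MSE_0."""
    return (1.0 - mse_m / mse_0) * 100.0


def fit_state(train_mse, sigma, low=0.8, high=1.5):
    """Hypothetical labelling rule: compare the training MSE with sigma squared."""
    ratio = train_mse / sigma ** 2
    if ratio < low:
        return "overfit"    # training error well below the noise floor
    if ratio > high:
        return "underfit"   # training error well above the noise floor
    return "good"


# Made-up numbers: four targets, the mean prediction, and a two-step staircase.
ys = [1.0, 2.0, 3.0, 4.0]
mean_y = sum(ys) / len(ys)
mse_0 = mse(ys, [mean_y] * len(ys))        # 1.25
mse_m = mse(ys, [1.5, 1.5, 3.5, 3.5])      # 0.25
print(round(error_reduction(mse_0, mse_m), 6))
print(fit_state(0.01, sigma=0.2), fit_state(0.04, sigma=0.2), fit_state(0.5, sigma=0.2))
```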

What is the Gradient Boosting Simulator?

🙋

I hear "gradient boosting" a lot, but what is it actually doing? It doesn't train one big model, does it?

🎓

Right, it is not one giant model. Roughly speaking, it "builds many weak predictors and adds them up". The weak predictor here is usually a shallow regression tree — set the number of trees to 1 and the depth to 1 and you will see it: a model that just splits the input into a few intervals and returns a constant in each. One such tree fits the data terribly. But if the next tree specifically targets "what the previous tree got wrong", and you repeat that dozens of times, the summed-up whole — the ensemble — becomes very capable. That is what is inside XGBoost and LightGBM.

🙋

"The next tree fixes what the previous one got wrong" — I don't quite get that. How is it computed concretely?

🎓

The keyword is "residual". First you start the prediction for every point at something bland, like the mean of y. At each point there is then a gap, "true y minus current prediction" — that is the residual. The next tree treats that residual itself as the target and fits it. In other words it learns only "the part still missing". You shrink that tree by the learning rate η and add it to the whole, so the prediction moves a little closer to the truth. Recompute the residuals, build the next tree — and repeat. Watch the staircase animation above: every time a tree is added, the staircase moves toward the data.

🙋

I see. But why is it called "gradient"? It just looks like fitting trees to residuals.

🎓

Good question. With squared error, the residual r = y − F turns out to be exactly the negative gradient of the loss L = ½(y−F)² with respect to the prediction F. So "fit a tree to the residuals and add it" is literally one step of gradient descent on the loss, performed in function space. The learning rate η is the step size of that descent. Make η small and each correction is small, descending cautiously; make it large and you descend fast but tend to overshoot. Move the η slider on the left and watch how the "training error vs number of trees" curve changes shape.

🙋

So if I keep adding trees and drive the training error to zero, don't I get the strongest possible model?

🎓

That is the trap. The data always carries observation noise, and its variance is the "noise floor" — σ² in this tool. If you add so many trees that the training error falls far below the noise floor, the model memorises the shape of the noise instead of the true function. That is overfitting. With the default settings (50 trees, η 0.10) you will actually see the training error drop below the noise floor of 0.04 and the state reported as "overfit". Cut the trees down to about 10 and you get underfitting instead. The skill is stopping at the right number, and in practice you use "early stopping" — halt once the validation error stops improving.

🙋

Which one should I tune — the number of trees or the learning rate? Being able to change both is confusing.

🎓

Here is the thing: if M·η (trees times learning rate) is about the same, you get a similar fit. So the standard recipe is "fix η to something small, 0.01 to 0.1, and tune with the number of trees M". A small η descends cautiously per tree and tends to generalise better, but it needs more trees and more training time. Depth is the expressive power of a single tree; shallower trees are more "weak-learner-like". Gradient boosting typically uses many shallow trees of depth 2 to 6 added one after another, which is the opposite design philosophy from a random forest, where deep trees are grown independently and averaged.

Frequently Asked Questions

How does gradient boosting differ from a random forest?

Both are ensembles of decision trees, but the trees are grown in opposite ways. A random forest builds many deep trees independently and in parallel, then averages them (bagging). Gradient boosting builds shallow trees sequentially, and each tree learns the residual errors that the current ensemble still gets wrong. Because each tree depends on the previous ones, boosting is harder to parallelise, but adding weak learners that chip away at the error often gives higher accuracy on the same data once it is tuned. XGBoost and LightGBM are fast implementations of gradient boosting.

Why is it called "gradient" boosting?

Because the residual r = y − F(x) that each tree learns equals the negative gradient −∂L/∂F of the squared-error loss L = ½(y−F)² with respect to the prediction F. So "fit a tree to the residuals and add it" is exactly one step of gradient descent on the loss function, performed in function space. The learning rate η is the step size of that descent: small values make each tree's contribution small and the descent cautious, while large values descend faster but raise the risk of overshooting and of overfitting.

How should I choose the number of trees and the learning rate?

The number of trees M and the learning rate η trade off against each other: roughly, a constant M·η gives a similar fit. In practice you fix η to a small value (about 0.01 to 0.1) and increase M until the validation error stops improving (this is "early stopping"). A smaller η usually generalises better but needs more trees and more training time. In this tool, lowering η makes the training-error curve fall more gently, while adding more trees pushes the training error below the noise floor and into overfitting.
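Early stopping can be sketched as a small function over a validation-error curve. This is an illustration under stated assumptions: `early_stopping_round`, `patience` and `min_delta` are hypothetical names, and the error values are made up rather than produced by the tool.

```python
def early_stopping_round(val_errors, patience=5, min_delta=0.0):
    """Return the tree count with the best validation error seen before patience runs out.

    val_errors[m] is assumed to be the validation error after m trees.
    """
    best_round, best_err = 0, float("inf")
    for m, err in enumerate(val_errors):
        if err < best_err - min_delta:
            best_round, best_err = m, err      # new best: remember it
        elif m - best_round >= patience:
            break                              # no improvement for `patience` rounds
    return best_round


# Made-up U-shaped validation curve: improves, then degrades as overfitting sets in.
val_errors = [1.0, 0.6, 0.4, 0.3, 0.25, 0.26, 0.27, 0.3, 0.35]
best_m = early_stopping_round(val_errors, patience=3)
print(best_m)  # 4
```

A larger `patience` tolerates longer plateaus at the cost of training a few extra trees before stopping.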

Is it best to drive the training error as close to zero as possible?

No. Training data always contains observation noise, and its variance is the noise floor (here σ²). If you add trees until the training error drops well below the noise floor, the model memorises the shape of the noise rather than the true function. That is overfitting: the training error looks great but the error on unseen data gets worse. The ideal is a training error that levels off near the noise floor, and this tool automatically labels the result as underfit, good or overfit by comparing the training error with the noise floor.

Real-World Applications

Tabular-data competitions and production forecasting: For prediction tasks on tabular data, such as demand forecasting, credit scoring, ad click-through-rate prediction and fraud detection, gradient boosting (XGBoost, LightGBM, CatBoost) is still a common first choice. Many top solutions in Kaggle tabular competitions belong to this family, and it often delivers high, stable accuracy with less tuning than deep learning.

Search-engine ranking: Re-ordering search results and recommendations is called "learning to rank", and gradient boosting is widely used for it. Gradient boosting learns a relevance score for each query-document pair under a pairwise or listwise loss; LambdaMART is the classic method here.

Surrogate models for engineering and sensor data: You can train gradient boosting on the mapping between input parameters and responses (stress, temperature, efficiency, etc.) obtained from CAE or experiments, and use it as a surrogate model that predicts instantly in place of heavy numerical analysis. In design-space exploration and optimisation loops, this drastically cuts the number of calls to FEM analyses that can take hours each.

Feature-importance analysis: Gradient boosting can tally how much each feature contributed to the splits, and combined with SHAP values it can explain "which input moved the prediction and how". It is useful not only for prediction but also where interpretation is required — root-cause analysis of manufacturing defects, or surfacing risk factors in healthcare and finance.

Common Misconceptions and Pitfalls

A common misconception is that "more trees always means more accuracy". If you push the number of trees in this tool to the maximum of 200 and the learning rate up to 1, the training error drops almost to zero — but that is an overfit state far below the noise floor. The training error only measures how much the model has "memorised" the data; it is distinct from performance on unseen data (the generalisation error). The number of trees should be decided by the validation error, and adding trees while watching only the training error is a textbook failure mode.

Next, the assumption that "a smaller learning rate is always better". Lowering the learning rate does tend to improve generalisation, but the number of trees needed to reach the same fit grows in inverse proportion to η. Make η one tenth as large and you need roughly ten times as many trees. Since training time and memory scale with the number of trees, in practice you use early stopping to find the balance between "a small enough η and a matching number of trees", trading off against your compute budget. Drop η to an extreme without enough trees and you simply get underfitting.
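The "ten times as many trees" rule of thumb can be checked with a back-of-the-envelope model. The sketch assumes an idealised case that is not in the text: every tree reproduces the current residuals exactly, so each round leaves a fraction (1 − η) of the residual and M rounds leave (1 − η)^M ≈ e^(−M·η). The name `trees_needed` is hypothetical.

```python
import math


def trees_needed(eta, remaining=0.01):
    """Trees required to shrink the residual to `remaining` of its initial size.

    Idealised assumption: each tree fits the residuals exactly, so the residual
    is multiplied by (1 - eta) per round. Requires 0 < eta < 1.
    """
    return math.ceil(math.log(remaining) / math.log(1.0 - eta))


for eta in (0.1, 0.01):
    print(eta, trees_needed(eta))
# 0.1 44
# 0.01 459
```

Real trees fit the residuals only approximately, so actual counts differ, but the roughly inverse scaling with η is the same.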

Finally, overconfidence that "gradient boosting is an all-purpose tool that needs no preprocessing". Being tree-based, it genuinely needs no feature scaling, and implementations such as XGBoost and LightGBM handle missing values natively. However, with many outliers the squared-error loss is easily dragged by a single point, and you may need to switch to a robust loss such as the Huber loss. Also, in regions absent from the training data (extrapolation) the staircase prediction stays flat at the edge values, so the error can grow quickly once you leave the training range. You can see the prediction curve become horizontal at both ends in this tool too. For tasks where extrapolation is essential, such as forecasting the future of a time series, use it with this property in mind.
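To see why the Huber loss is more robust, compare the target each tree is fitted to. With squared error it is the raw residual; with the Huber loss it is the residual clipped at ±δ. The function name and the default δ = 1.0 below are illustrative choices, not settings taken from the tool.

```python
def huber_negative_gradient(y, f, delta=1.0):
    """Negative gradient of the Huber loss with respect to the prediction f.

    Huber loss: 0.5 * r**2 if |r| <= delta, else delta * (|r| - 0.5 * delta),
    where r = y - f. Its negative gradient is r clipped to [-delta, delta].
    """
    r = y - f
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


# An ordinary point, a large positive outlier and a large negative outlier.
print(huber_negative_gradient(1.3, 1.0))    # small residual passes through unchanged
print(huber_negative_gradient(26.0, 1.0))   # 1.0: outlier capped at +delta
print(huber_negative_gradient(-3.0, 1.0))   # -1.0: outlier capped at -delta
```

Because an outlier contributes at most δ to the target, a single extreme point can no longer dominate the next tree.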
