What is the difference between Gini impurity and entropy?

Gini impurity is G = 2 p1 p2 with a maximum of 0.5 at p1 = 0.5. Shannon entropy is H = -p1 log2 p1 - p2 log2 p2 with a maximum of 1.0. The two curves are very similar, and in practice the resulting splits are nearly identical. Gini avoids the log computation, so it is faster and is the default in CART.

How do CART, ID3 and C4.5 differ?

CART (Classification And Regression Trees) uses Gini impurity and always builds a binary tree. ID3 uses Shannon entropy via information gain and can build multi-way trees. C4.5 is an improvement on ID3 that uses the gain ratio to correct the bias toward attributes with many values. Scikit-learn's DecisionTreeClassifier follows CART and lets you choose either Gini or entropy.

Why is the misclassification rate not used as the split criterion?

Because the misclassification rate ME = min(p1, p2) is insensitive to many useful splits. It is concave but only piecewise linear, not strictly concave like Gini or entropy, so its information gain is exactly zero whenever every child keeps the same majority class as the parent, even when the split increases purity. The search can therefore miss good splits. Misclassification rate may be used for pruning, but it is avoided as a splitting criterion.
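
The zero-gain behaviour follows from a short Jensen-type argument. The sketch below assumes the parent's class-1 ratio is the sample-weighted average of the children's ratios, which holds for any actual partition of a node. The weights w_k and ratios p_k are notation introduced here, not taken from the simulator.

$$p = \sum_k w_k\, p_k, \qquad w_k = \frac{|S_k|}{|S|}, \qquad \sum_k w_k = 1$$

$$IG = I(p) - \sum_k w_k\, I(p_k) \ge 0 \quad \text{for any concave } I$$

$$\text{If every } p_k \le 0.5: \quad ME(p) - \sum_k w_k\, ME(p_k) = p - \sum_k w_k\, p_k = 0$$

The same cancellation happens when every p_k is at least 0.5. A strictly concave measure such as Gini or entropy gives a strictly positive gain as soon as two children have different ratios.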

What is information gain?

It is the impurity of the parent minus the weighted average impurity of the children: IG = I(parent) - sum (|S_k|/|S|) I(S_k), where |S_k|/|S| is the sample fraction of each child. A larger IG means the split reduces impurity more strongly, that is, separates the classes better. Trees grow greedily by choosing at each node the attribute and split point that maximize IG.
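
The greedy step can be sketched for a single numeric attribute. This is a minimal illustration, not the algorithm of any particular library: the helper names `gini` and `best_threshold`, the midpoint candidate thresholds, and the toy data are all assumptions made here.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Two-class Gini impurity G = 2 * p1 * (1 - p1) for 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)


def best_threshold(xs: List[float], ys: List[int]) -> Tuple[float, float]:
    """Return (threshold, information gain) of the best split 'x <= threshold'."""
    parent = gini(ys)
    n = len(ys)
    best_t, best_ig = float("nan"), 0.0
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        ig = parent - weighted
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig


# Toy data (made up for illustration): the classes separate between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]
threshold, gain = best_threshold(xs, ys)
print(threshold, gain)
```

A full tree builder would repeat this search over every attribute at every node and recurse into the two children. The sketch covers only the single-attribute step.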

Decision Tree Impurity Simulator — Free Online Calculator

Parameters

Parent class-1 ratio p1

Left child sample count (samples)

Left child class-1 ratio

Right child class-1 ratio

The right child sample count is fixed at 100. The left child fraction is leftN/(leftN+100).

Results

Parent Gini impurity

Parent entropy

Gini information gain

Entropy information gain

Impurity Curves and Split Tree Diagram

Top: X = p1, three curves (blue = Gini, green = Entropy, red = Misclassification) / Bottom: parent-to-children split diagram

Theory & Key Formulas

Three impurity measures for a two-class problem (p1 = class-1 ratio, p2 = 1 − p1). Gini impurity:

$$G = 2 p_1 p_2 = 2 p_1 (1 - p_1)$$

Shannon entropy (base 2, with H = 0 at p1 = 0 or 1):

$$H = -p_1 \log_2 p_1 - p_2 \log_2 p_2$$

Misclassification rate:

$$ME = \min(p_1, p_2)$$

Information gain of a parent-to-children split:

$$IG = I(\text{parent}) - \sum_k \frac{|S_k|}{|S|} I(S_k)$$

All three measures peak at p1 = 0.5, where Gini = 0.5, entropy = 1.0 and ME = 0.5.
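
The three formulas translate directly into code. The following is a small sketch with function names chosen here for illustration. The entropy function treats 0 · log2(0) as 0, matching the convention H = 0 at p1 = 0 or 1.

```python
import math


def gini(p1: float) -> float:
    """Two-class Gini impurity G = 2 * p1 * (1 - p1)."""
    return 2.0 * p1 * (1.0 - p1)


def entropy(p1: float) -> float:
    """Two-class Shannon entropy in bits, treating 0 * log2(0) as 0."""
    total = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0.0:
            total -= p * math.log2(p)
    return total


def misclassification(p1: float) -> float:
    """Misclassification rate ME = min(p1, 1 - p1)."""
    return min(p1, 1.0 - p1)


# Tabulate the three curves at a few points along the p1 axis.
for p1 in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p1={p1:.1f}  G={gini(p1):.3f}  H={entropy(p1):.3f}  ME={misclassification(p1):.3f}")
```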

What is the Decision Tree Impurity Simulator

🙋

My mental model of a decision tree is "a stack of if-statements". What does "impurity" actually mean?

🎓

Roughly speaking, it measures how mixed up the classes are at a given node. A node with half apples and half oranges is "impure"; a node that is almost all apples is "pure". A decision tree splits the data each time so as to drive this impurity down. In the simulator above, move "parent class-1 ratio p1" from 0.5 toward 0 or 1. Gini, entropy and misclassification all fall, reaching zero at the endpoints.

🙋

Right, they all peak at 0.5 and reach zero at the ends. But Gini (blue) and entropy (green) are round humps, while misclassification (red) is a triangle. Why is that?

🎓

That is the key point. Gini and entropy are smooth, strictly concave curves; misclassification is also concave but only piecewise linear, with a kink at p1 = 0.5. Concavity guarantees that no split ever makes the information gain negative. Strict concavity goes further: the gain is positive whenever the children have different class ratios. On the straight segments of the misclassification curve that extra sensitivity is lost, so the gain is exactly zero whenever both children keep the same majority class. That is why CART and ID3 do not use misclassification as a split criterion. It is fine for pruning, though.

🙋

So "information gain" is the IG_Gini box at the bottom? Parent 0.500 minus weighted children 0.320 equals 0.180, that kind of thing.

🎓

Exactly. The formula is IG = I(parent) − Σ (|S_k|/|S|) · I(S_k): take the weighted average of the children's impurity by sample count and subtract from the parent. With the defaults (left leftP1=0.8, right rightP1=0.2) both children lean toward one class, so the IG is large. Try setting both children to 0.5 — the IG goes to nearly zero. That means "the split did not add any information".

🙋

Scikit-learn's DecisionTreeClassifier defaults to criterion='gini'. Why Gini and not entropy?

🎓

In practice the splits chosen by the two are almost the same; published comparisons usually report that they disagree in only a few percent of cases. So you may as well use Gini, which is cheaper to compute: entropy needs log2, Gini needs only two multiplications. That is why CART makes Gini the default, while ID3 and C4.5 use entropy via information gain. In real projects, the sound move is to try both during hyperparameter tuning and compare CV scores.
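
The arithmetic from the dialogue (parent 0.500, weighted children 0.320, gain 0.180) can be reproduced in a few lines. This sketch assumes a left child sample count of 100, which is not stated in the text, so that the two children are equally weighted. The function names are illustrative.

```python
import math


def gini(p1: float) -> float:
    return 2.0 * p1 * (1.0 - p1)


def entropy(p1: float) -> float:
    # 0 * log2(0) is treated as 0.
    return -sum(p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0.0)


def information_gain(impurity, parent_p1, left_n, left_p1, right_n, right_p1):
    """IG = I(parent) - sum_k (|S_k| / |S|) * I(S_k) for a two-way split."""
    total = left_n + right_n
    weighted = (left_n / total) * impurity(left_p1) + (right_n / total) * impurity(right_p1)
    return impurity(parent_p1) - weighted


LEFT_N = 100   # assumption: the left child count is not given in the text
RIGHT_N = 100  # the simulator fixes the right child at 100 samples

ig_gini = information_gain(gini, 0.5, LEFT_N, 0.8, RIGHT_N, 0.2)
ig_entropy = information_gain(entropy, 0.5, LEFT_N, 0.8, RIGHT_N, 0.2)
print(f"Gini gain:    {ig_gini:.3f}")
print(f"Entropy gain: {ig_entropy:.3f}")

# Children identical to the parent: the split adds no information.
print(f"No-op split:  {information_gain(gini, 0.5, LEFT_N, 0.5, RIGHT_N, 0.5):.3f}")
```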

Frequently Asked Questions

The maximum Gini impurity for two classes is 0.5. G = 2·p1·p2 = 2·p1·(1−p1) is a quadratic in p1 maximized at p1 = 0.5 with value 2·0.5·0.5 = 0.5. In general, for K classes the maximum is 1 − 1/K, so 0.5 for two classes, 2/3 for three, and 0.9 for ten. Gini can be read as "the probability that two samples drawn at random, with replacement, belong to different classes".
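
The K-class values can be checked numerically with the general form G = 1 − Σ p_k², which reduces to 2·p1·p2 for two classes. This is a short sketch, and the function name is chosen here.

```python
from typing import Sequence


def gini_multiclass(probs: Sequence[float]) -> float:
    """K-class Gini impurity G = 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p * p for p in probs)


# A uniform distribution over K classes gives the maximum, 1 - 1/K.
for k in (2, 3, 10):
    uniform = [1.0 / k] * k
    print(f"K={k:2d}  G={gini_multiclass(uniform):.4f}  1-1/K={1.0 - 1.0 / k:.4f}")
```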

The unit of entropy depends on the logarithm base. Log base 2 gives bits, natural log (ln) gives nats, log10 gives dits. In decision trees and most machine learning literature the convention is log2, so the unit is bits: the maximum for two classes is 1 bit and for K classes with a uniform distribution it is log2(K) bits. This simulator uses log2, so the two-class maximum is 1.0.
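
The effect of the base is easy to see on a fair coin. This is a minimal sketch using only the standard library. The `base` parameter is an illustrative choice made here, not something the simulator exposes.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float], base: float = 2.0) -> float:
    """Shannon entropy of a distribution in the unit set by `base`."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)


fair_coin = [0.5, 0.5]
print("bits:", entropy(fair_coin, 2))
print("nats:", entropy(fair_coin, math.e))
print("dits:", entropy(fair_coin, 10))

# Uniform distribution over K = 8 classes: log2(8) bits.
print("K=8 uniform, bits:", entropy([1 / 8] * 8, 2))
```

The three values for the coin describe the same uncertainty. They differ only by the constant factor between the logarithm bases.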

The gain ratio is information gain normalized by the "split information": Gain Ratio = IG / SplitInfo, where SplitInfo is the entropy of the branch sizes. Raw IG over-rewards attributes with many possible values (for example, a customer ID), so C4.5 uses the gain ratio to correct that bias. Because CART splits are always binary, the problem is comparatively mild there.
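
The correction can be sketched on a made-up node of 8 samples. An ID-like attribute that gives every sample its own branch ties with a clean binary split on raw IG, yet scores much lower on gain ratio. The function names and the example counts are assumptions for illustration, not C4.5's actual code.

```python
import math
from typing import List, Sequence, Tuple


def entropy_from_counts(counts: Sequence[int]) -> float:
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(parent: Sequence[int], children: List[Sequence[int]]) -> Tuple[float, float, float]:
    """Return (IG, SplitInfo, IG / SplitInfo) for per-branch class counts."""
    n = sum(parent)
    sizes = [sum(child) for child in children]
    weighted = sum(size / n * entropy_from_counts(child) for size, child in zip(sizes, children))
    ig = entropy_from_counts(parent) - weighted
    split_info = entropy_from_counts(sizes)  # entropy of the branch sizes
    ratio = ig / split_info if split_info > 0 else 0.0
    return ig, split_info, ratio


parent = [4, 4]                              # hypothetical node: 4 samples per class
id_split = [[1, 0]] * 4 + [[0, 1]] * 4       # one branch per sample (ID-like attribute)
binary_split = [[4, 0], [0, 4]]              # a clean two-way split

ig_id, si_id, gr_id = gain_ratio(parent, id_split)
ig_bin, si_bin, gr_bin = gain_ratio(parent, binary_split)
print(f"ID-like: IG={ig_id:.3f}  SplitInfo={si_id:.3f}  GainRatio={gr_id:.3f}")
print(f"Binary:  IG={ig_bin:.3f}  SplitInfo={si_bin:.3f}  GainRatio={gr_bin:.3f}")
```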

Decision trees can be used on imbalanced data, but with care. With extreme imbalance (say 99 : 1) Gini and entropy are both small to begin with, so the tree can look "pure" without splitting at all. A common fix in scikit-learn is class_weight='balanced', which reweights each class by inverse frequency and raises the cost of misclassifying the minority. Combining this with oversampling such as SMOTE and using metrics like F1 or AUC alongside accuracy is also wise.
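
The effect of inverse-frequency weighting on impurity can be sketched without any library. The weight formula below, n_samples / (n_classes · class_count), is the one scikit-learn documents for 'balanced'. The helper names and the way the weights enter the Gini computation are a simplified illustration, not scikit-learn's internal code.

```python
from collections import Counter
from typing import Dict, List


def balanced_weights(labels: List[int]) -> Dict[int, float]:
    """Per-class weight n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_gini(labels: List[int], weights: Dict[int, float]) -> float:
    """Gini impurity 1 - sum(p_k ** 2) with class proportions taken by weight."""
    mass: Dict[int, float] = {}
    for y in labels:
        mass[y] = mass.get(y, 0.0) + weights[y]
    total = sum(mass.values())
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


labels = [0] * 99 + [1]          # 99 : 1 imbalance
unit = {0: 1.0, 1: 1.0}          # no reweighting
balanced = balanced_weights(labels)

print("weights:", balanced)
print(f"unweighted Gini: {weighted_gini(labels, unit):.4f}")
print(f"balanced Gini:   {weighted_gini(labels, balanced):.4f}")
```

With unit weights the node already looks almost pure. With balanced weights the two classes carry equal total mass, so the node counts as maximally impure and a split that isolates the minority sample becomes worthwhile.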

Real-World Applications

Scikit-learn, XGBoost, LightGBM: Scikit-learn's DecisionTreeClassifier evaluates splits with Gini or entropy and defaults to criterion='gini'. Gradient-boosted trees in XGBoost and LightGBM do not use Gini or entropy directly; they score each split by a gain computed from gradient statistics of the loss, but the principle is the same: choose the split that most reduces an impurity-like objective. The feel of "splitting reduces impurity" you build in this simulator carries over to those libraries.

Credit scoring and lending decisions: Decision trees and random forests are widely used in finance. Attributes like "annual income", "years employed" and "past delinquencies" are split to separate borrowers with high versus low default probability. The "Gini index" used in economics to measure income inequality is named after the same statistician, Corrado Gini, but it is a different quantity from the Gini impurity used here.

Medical diagnosis and risk prediction: Diagnostic support systems that predict disease from symptoms and lab values often prefer decision trees for their interpretability. Branches like "if temperature > 38 then suspect flu, and if cough is also present then..." closely mirror clinical reasoning, and a drop in impurity intuitively corresponds to "how much the symptoms narrowed down the diagnosis".

Manufacturing defect detection and quality control: Systems that classify defective products from sensor data often use decision trees. Explicit rules like "if temperature > 80°C and vibration > 3.0 G, then defective" are easy for operators on the shop floor to understand. The "split with a pure left child" we see in this simulator corresponds in practice to "a criterion that cleanly isolates defective units".

Common Misconceptions and Cautions

The most common misconception is to think that "choosing Gini versus entropy will significantly change model performance". In reality the two curves are very similar in shape and the resulting splits usually differ by only a few percent. The official scikit-learn docs say "there is no clear empirical evidence to suggest that one is better than the other". What matters is not the choice between Gini and entropy but tuning other hyperparameters: tree depth (max_depth), minimum samples per leaf (min_samples_leaf), and pruning strength (ccp_alpha). Moving p1 in the simulator and comparing the two curves makes the similarity immediately obvious.
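
One way to put this advice into practice is a small grid search that treats the criterion as just another hyperparameter. The sketch below assumes scikit-learn is installed. The dataset (`load_breast_cancer`) and the grid values are arbitrary choices made here for illustration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid: the criterion sits alongside the structural hyperparameters.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
    "ccp_alpha": [0.0, 0.01],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))

# Best mean CV score reached under each criterion, for a side-by-side comparison.
results = search.cv_results_
for criterion in param_grid["criterion"]:
    scores = [
        score
        for score, params in zip(results["mean_test_score"], results["params"])
        if params["criterion"] == criterion
    ]
    print(criterion, round(max(scores), 3))
```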

The next most common mistake is to try to use the misclassification rate as a splitting criterion. People reason: "we ultimately want to reduce misclassification, so we should pick the split that minimizes it". But in the simulator, set the parent to 0.3 and the two children to 0.2 and 0.4 with equal sample counts (left child sample count 100), so that both children keep the same majority class. Gini reports a positive information gain (0.42 − 0.40 = 0.02) while the misclassification gain is exactly zero (0.30 − 0.30). The reason is that misclassification is piecewise linear and lacks the sensitivity of a strictly concave function. Misclassification rate is unsuitable as a split criterion, but appropriate as a pruning metric.
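
The zero-gain effect can be checked numerically. This sketch uses an illustrative split chosen here, not a simulator default: parent p1 = 0.3, with two equal-sized children at 0.2 and 0.4, so both children keep the same majority class as the parent.

```python
import math


def gini(p1: float) -> float:
    return 2.0 * p1 * (1.0 - p1)


def entropy(p1: float) -> float:
    # 0 * log2(0) is treated as 0.
    return -sum(p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0.0)


def misclassification(p1: float) -> float:
    return min(p1, 1.0 - p1)


def gain(impurity, parent_p1, left_p1, right_p1, left_fraction=0.5):
    """Information gain of a two-way split; left_fraction is |S_left| / |S|."""
    weighted = left_fraction * impurity(left_p1) + (1.0 - left_fraction) * impurity(right_p1)
    return impurity(parent_p1) - weighted


# Illustrative split: the parent ratio 0.3 is the average of the children's 0.2 and 0.4.
PARENT, LEFT, RIGHT = 0.3, 0.2, 0.4

gini_gain = gain(gini, PARENT, LEFT, RIGHT)
entropy_gain = gain(entropy, PARENT, LEFT, RIGHT)
me_gain = gain(misclassification, PARENT, LEFT, RIGHT)

print(f"Gini gain:              {gini_gain:.4f}")
print(f"Entropy gain:           {entropy_gain:.4f}")
print(f"Misclassification gain: {abs(me_gain):.4f}")
```

Gini and entropy both register the improvement, while the misclassification gain vanishes up to floating-point rounding.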

Finally, remember that large information gain does not automatically mean a good split. As an extreme example, assigning a unique ID to each sample and splitting "ID = 1 to the left, ID = 2 to the right..." makes every leaf pure and maximizes IG — but the model does not generalize at all. C4.5's gain ratio, CART's minimum sample constraints and the regularization terms in gradient boosting all exist precisely to "suppress splits that lead to overfitting". The IG values shown in this simulator are exact theoretical figures, but in real projects they must always be paired with cross-validation or held-out evaluation.
