Decision Trees #
Theory #
- root node: the node at depth 0, at the top of the tree
- split nodes: internal nodes with children, each defined by a split on a feature and threshold
- leaf nodes: nodes without children (terminal nodes)
Gini impurity #
- Pure (Gini = 0) if all instances in a node belong to one class
- Maximally impure if instances are evenly distributed among the classes (Gini = 0.5 for binary classification)
- The training algorithm computes the Gini impurity $G_i$ of the $i$-th node
- The range of Gini impurity is [0, 0.5] for binary classification and [0, 1 - 1/K] for K classes.
$$ G_i = 1 - \sum_{k=1}^{K} p_{i,k}^2 $$
where $G_i$ is the Gini impurity of node $i$ and $p_{i,k}$ is the proportion of class-$k$ instances in node $i$.
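As a quick sanity check, a node's Gini impurity can be computed directly from its class counts. A minimal sketch (the counts below are made up for illustration):

import numpy as np

# Hypothetical class counts in one node of a 3-class problem
counts = np.array([0, 49, 5])
p = counts / counts.sum()      # class proportions p_{i,k}
gini = 1 - np.sum(p ** 2)      # G_i = 1 - sum_k p_{i,k}^2
print(gini)                    # ~0.168 -> fairly pure node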
Entropy #
- Alternative to Gini impurity for DecisionTreeClassifier
- Impurity measure based on information theory
- The range of Entropy is [0, log2(K)] for K classes.
$$ H_i = - \sum_{k=1}^{K} p_{i,k} \log_2(p_{i,k}) $$
where $p_{i,k}$ is the proportion of class-$k$ instances in node $i$ (terms with $p_{i,k} = 0$ are dropped).
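The entropy of the same hypothetical node can be computed analogously (again a sketch with made-up counts); sklearn's DecisionTreeClassifier uses this measure when criterion="entropy" is passed:

import numpy as np

counts = np.array([0, 49, 5])
p = counts / counts.sum()
p = p[p > 0]                            # skip empty classes (0 * log 0 := 0)
entropy = -np.sum(p * np.log2(p))       # H_i = -sum_k p_{i,k} log2(p_{i,k})
print(entropy)                          # ~0.445 bits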
Application #
- Classification and regression
- No feature scaling or centering needed
- Makes few assumptions about the training data
- Can also estimate the probability that an instance belongs to a class (see predict_proba in the coding example below)
- Parameters
- max_depth: maximum depth of the tree
- min_samples_split: minimum number of samples required to split an internal node
- min_samples_leaf: minimum number of samples required to be at a leaf node
- max_features: number of features evaluated at splitting a node
- criterion: function to measure the quality of a split (default="gini" for classification, "squared_error" for regression)
- ccp_alpha: complexity parameter for minimal cost-complexity pruning (larger values prune more)
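A minimal sketch of how these hyperparameters are passed, here to a DecisionTreeRegressor (the noisy quadratic data is made up purely for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up noisy quadratic data, just to have something to fit
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

# Regularize the tree via max_depth, min_samples_leaf and ccp_alpha
tree_reg = DecisionTreeRegressor(
    max_depth=3,
    min_samples_leaf=10,
    ccp_alpha=0.001,
    random_state=42,
)
tree_reg.fit(X, y)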
Coding Example #
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset and keep only two features
iris = load_iris(as_frame=True)
X_iris = iris.data[["petal length (cm)", "petal width (cm)"]].values
y_iris = iris.target

# Train a shallow tree (depth limited to 2 for easier visualization)
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_iris, y_iris)
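Once fitted, the tree can predict both the class and the per-class probability estimates for new instances (the petal measurements below are arbitrary example values):

# Predict the class and class probabilities for a new flower
# (petal length = 5 cm, petal width = 1.5 cm; arbitrary example values)
print(tree_clf.predict([[5, 1.5]]))        # predicted class index
print(tree_clf.predict_proba([[5, 1.5]]))  # per-class probability estimates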