Decision Trees #
Theory #
- root node: the node at depth 0, at the top of the tree
- split nodes: internal nodes with children, each defined by a split on a feature and threshold
- leaf nodes: nodes without children (terminal nodes)
Gini impurity #
- Pure (Gini = 0) if all instances in a node belong to one class
- Maximally impure if instances are evenly distributed among the classes (Gini = 0.5 for binary classification)
- The training algorithm computes the Gini impurity $G_i$ of the $i$-th node
- The range of Gini impurity is [0, 0.5] for binary classification and [0, 1 - 1/K] for K classes.
$$ G_i = 1 - \sum_{k=1}^{K} p_{i,k}^2 $$
where $G_i$ is the Gini impurity of node $i$ and $p_{i,k}$ is the proportion of class-$k$ instances in node $i$.
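As a quick sanity check, a node's Gini impurity can be computed directly from its class counts. A minimal sketch (the counts below are made up for illustration):

import numpy as np

# Hypothetical class counts in one node of a 3-class problem
counts = np.array([0, 49, 5])
p = counts / counts.sum()      # class proportions p_{i,k}
gini = 1 - np.sum(p ** 2)      # G_i = 1 - sum_k p_{i,k}^2
print(gini)                    # ~0.168 -> fairly pure node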
Entropy #
- Alternative to Gini impurity for DecisionTreeClassifier
- Impurity measure based on information theory
- The range of Entropy is [0, log2(K)] for K classes.
$$ H_i = - \sum_{k=1}^{K} p_{i,k} \log_2(p_{i,k}) $$
where $p_{i,k}$ is the proportion of class-$k$ instances in node $i$ (terms with $p_{i,k} = 0$ are dropped).
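The entropy of the same hypothetical node can be computed analogously (again a sketch with made-up counts); sklearn's DecisionTreeClassifier uses this measure when criterion="entropy" is passed:

import numpy as np

counts = np.array([0, 49, 5])
p = counts / counts.sum()
p = p[p > 0]                            # skip empty classes (0 * log 0 := 0)
entropy = -np.sum(p * np.log2(p))       # H_i = -sum_k p_{i,k} log2(p_{i,k})
print(entropy)                          # ~0.445 bits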
Application #
- Classification and regression
- No feature scaling or centering needed
- Makes few assumptions about the training data
- Can also estimate the probability that an instance belongs to a class (see predict_proba in the coding example below)
- Parameters
- max_depth: maximum depth of the tree
- min_samples_split: minimum number of samples required to split an internal node
- min_samples_leaf: minimum number of samples required to be at a leaf node
- max_features: number of features evaluated at splitting a node
- criterion: function to measure the quality of a split (default="gini" for classification, "squared_error" for regression)
- ccp_alpha: complexity parameter for minimal cost-complexity pruning (larger values prune more)
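A minimal sketch of how these hyperparameters are passed, here to a DecisionTreeRegressor (the noisy quadratic data is made up purely for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up noisy quadratic data, just to have something to fit
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 1))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

# Regularize the tree via max_depth, min_samples_leaf and ccp_alpha
tree_reg = DecisionTreeRegressor(
    max_depth=3,
    min_samples_leaf=10,
    ccp_alpha=0.001,
    random_state=42,
)
tree_reg.fit(X, y)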
Coding Example #
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset and keep only two features
iris = load_iris(as_frame=True)
X_iris = iris.data[["petal length (cm)", "petal width (cm)"]].values
y_iris = iris.target

# Train a shallow tree (depth limited to 2 for easier visualization)
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_iris, y_iris)
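Once fitted, the tree can predict both the class and the per-class probability estimates for new instances (the petal measurements below are arbitrary example values):

# Predict the class and class probabilities for a new flower
# (petal length = 5 cm, petal width = 1.5 cm; arbitrary example values)
print(tree_clf.predict([[5, 1.5]]))        # predicted class index
print(tree_clf.predict_proba([[5, 1.5]]))  # per-class probability estimates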