4. Decision Tree(2)

Eunji · April 11, 2026

Data Mining


1. Decision Tree Pruning

why do we need pruning?

an unpruned tree overfits the training data
it becomes overly complex
it performs poorly on unseen data

a very detailed tree memorizes the training data, but fails to generalize

pruning makes the tree smaller, easier to interpret, faster to evaluate, and often more accurate on test data

1.1 Approaches

Pre-pruning

stop splitting during tree construction
use a threshold on a split criterion
- e.g., Information Gain, Gini index
a node becomes a leaf if its best split is not significant
risk: underfitting (stopping too early can miss useful later splits)

Splitting is stopped during tree construction, before the tree is fully grown. A node where splitting stops immediately becomes a leaf node.
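The pre-pruning rule above can be sketched as a small check that runs before each split. This is a minimal illustration, not a full tree learner: the function names and the thresholds `min_gain` and `min_samples` are assumptions chosen for the example, not values from the notes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy reduction obtained by splitting `parent` into `children`."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def should_split(parent, children, min_gain=0.05, min_samples=2):
    """Pre-pruning rule (thresholds are illustrative assumptions):
    split only if every child keeps enough samples and the gain is large enough.
    If this returns False, the node becomes a leaf."""
    if any(len(ch) < min_samples for ch in children):
        return False
    return information_gain(parent, children) >= min_gain


labels = ["a", "a", "b", "b"]
print(should_split(labels, [["a", "a"], ["b", "b"]]))  # a clean split
print(should_split(labels, [["a", "b"], ["a", "b"]]))  # a split that separates nothing
```

Raising `min_gain` stops growth earlier, which is exactly where the underfitting risk comes from.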

Post-pruning (backward pruning)

build full tree first
remove unnecessary subtrees
replace subtree with a leaf (majority class)
more reliable but computationally expensive

This approach builds the entire tree first and then removes the unnecessary parts, which makes it expensive.

Key Idea

balance between model complexity and generalization

1.2 Representative Methods

1. Cost Complexity Pruning (CART)

minimizes a cost function that combines error rate and tree complexity
compare each subtree with its pruned version and keep the one with the lower cost
- $R_{\alpha}(T) = R(T) + \alpha \cdot |T|$, where $R(T)$ is the error rate of tree $T$ and $|T|$ is its number of leaf nodes
a pruning dataset is typically used to choose the optimal complexity parameter $\alpha$

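When $\alpha$ is 0 there is no penalty, so the result is a complex tree. As $\alpha$ grows, the penalty grows, so a simpler tree is selected even if the error increases slightly. This formula balances model complexity against generalization performance.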
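To make the cost function concrete, here is a minimal sketch that evaluates $R_{\alpha}(T)$ for a subtree and for the leaf that would replace it, for one fixed $\alpha$. The `Node` class and the helper names are hypothetical, $R(T)$ is taken to be the training error rate, and the full CART procedure (which derives a whole sequence of $\alpha$ values) is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    counts: Dict[str, int]  # class label -> number of training samples at this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def leaves(node):
    """|T|: number of leaf nodes in the subtree rooted at `node`."""
    return 1 if node.is_leaf() else sum(leaves(c) for c in node.children)


def errors(node):
    """Training misclassifications when every leaf predicts its majority class."""
    if node.is_leaf():
        return sum(node.counts.values()) - max(node.counts.values())
    return sum(errors(c) for c in node.children)


def cost(node, alpha, n_total):
    """R_alpha(T) = R(T) + alpha * |T|, with R(T) as an error rate."""
    return errors(node) / n_total + alpha * leaves(node)


def prune(node, alpha, n_total):
    """Bottom-up: replace a subtree with a majority-class leaf
    whenever the leaf's cost is not higher than the subtree's cost."""
    if node.is_leaf():
        return node
    node.children = [prune(c, alpha, n_total) for c in node.children]
    as_leaf = Node(counts=node.counts)
    if cost(as_leaf, alpha, n_total) <= cost(node, alpha, n_total):
        return as_leaf
    return node


def example_tree():
    return Node({"a": 6, "b": 4},
                [Node({"a": 5, "b": 1}), Node({"a": 1, "b": 3})])


print(leaves(prune(example_tree(), alpha=0.0, n_total=10)))  # no penalty
print(leaves(prune(example_tree(), alpha=0.5, n_total=10)))  # heavy penalty
```

In this toy tree the split lowers the error rate from 0.4 to 0.2 at the price of one extra leaf, so it survives only while $\alpha < 0.2$.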

2. Pessimistic Pruning (C4.5)

uses only training data (no pruning set)
adds a penalty to obtain a more realistic error estimate
a subtree is pruned if the corrected error of the single leaf that would replace it is lower than the corrected error of the subtree

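To guard against overfitting the training data, it conservatively assumes that more errors will occur than were observed, and simplifies the tree accordingly.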
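One way to picture the penalty is a fixed correction added per leaf. The sketch below uses 0.5 errors per leaf, the continuity correction from classical pessimistic pruning. That value and the function names are assumptions for illustration. C4.5 itself derives its pessimistic estimate from a confidence bound on the observed error, which is not implemented here.

```python
def corrected_errors(leaf_errors, penalty=0.5):
    """Training errors summed over leaves, plus a fixed penalty per leaf.
    The 0.5 default is an illustrative assumption, not C4.5's exact estimate."""
    return sum(leaf_errors) + penalty * len(leaf_errors)


def should_prune(subtree_leaf_errors, errors_if_leaf, penalty=0.5):
    """Prune when the corrected error of one replacing leaf
    does not exceed the corrected error of the whole subtree."""
    subtree_estimate = corrected_errors(subtree_leaf_errors, penalty)
    leaf_estimate = corrected_errors([errors_if_leaf], penalty)
    return leaf_estimate <= subtree_estimate


# Four leaves with 2 training errors in total vs. one leaf with 3 errors:
# corrected estimates are 2 + 4 * 0.5 = 4.0 and 3 + 0.5 = 3.5.
print(should_prune([1, 0, 1, 0], errors_if_leaf=3))

# Two perfect leaves vs. one leaf with 4 errors: 1.0 against 4.5.
print(should_prune([0, 0], errors_if_leaf=4))
```

Because the penalty grows with the number of leaves, a large subtree has to earn its size with a clearly lower training error.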

3. MDL-based pruning

select the model that minimizes total encoding length
balances model complexity and data fitting by considering both tree size and classification errors
subtrees are pruned if replacing them with a leaf reduces the overall description length

Both the cost of the tree size and the cost of the errors are converted into units of length to find the optimal model.
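The usual two-part form of this idea can be written as follows. The symbols are notation introduced here for illustration: $L(T)$ is the number of bits needed to encode the tree, and $L(D \mid T)$ is the number of bits needed to encode the training examples the tree gets wrong.

$$L(T, D) = L(T) + L(D \mid T)$$

Under that notation, a subtree $T_s$ is replaced by a leaf $\ell$ when

$$L(\ell) + L(D \mid \ell) < L(T_s) + L(D \mid T_s)$$

The actual bit counts depend on the encoding scheme chosen, which these notes do not specify.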

1.3 Model Evaluation and Selection

after building a classifier, we must ask
- how accurate is our model?
- will it generalize to unseen data?
- how do we compare multiple models?
training accuracy is often overly optimistic due to overfitting
- evaluation must be done on unseen test data

1.4 Confusion Matrix and Metrics

Confusion matrix

TP: correctly predicted positives
TN: correctly predicted negatives
FP: negatives incorrectly predicted as positive (false alarms)
FN: positives incorrectly predicted as negative (missed positives)

Metrics

each metric provides a different perspective on model performance
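For reference, the standard definitions of the common metrics in terms of the four confusion-matrix counts are:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$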

1.5 Limitation on Accuracy

Class Imbalance

in real-world data, the positive class is often rare, while the negative class dominates the dataset
a model can achieve high accuracy by predicting only the majority class

Problem

the model ignores the minority (but often important) class
in critical applications such as medical diagnosis or security, missing positive cases (false negatives) can be very costly

Solution

use evaluation metrics that focus on the positive class
precision: reliability of positive predictions
recall (sensitivity): ability to detect positives (the rare class)
F1-score: harmonic mean of precision and recall, balancing the two
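A small made-up example shows why accuracy alone misleads here. The data below (10 positives among 1000 samples) and the always-negative "model" are invented for illustration, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1; undefined ratios are reported as 0.0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented imbalanced data: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "model" that only predicts the majority class

print(metrics(y_true, always_negative))
```

The always-negative predictor reaches 99% accuracy while its recall and F1 are both zero, which is the failure the positive-class metrics are meant to expose.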

1.6 Reliable Evaluation Methods

1. Holdout

split data into training set and test set
simple and fast, but results may vary depending on the split

Whether the data is split into two parts or three, the split is decided once and then kept fixed.

2. Cross-Validation

divide data into multiple folds
train and test repeatedly on different splits
provides more stable and reliable estimates

Because the data is re-split and evaluated several times, it provides a more stable and reliable estimate.
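The fold mechanics can be sketched with index lists alone. This is a minimal illustration: `fit_and_score` is a hypothetical callback that trains on the training indices and returns a score on the test indices, and the shuffling seed is an arbitrary choice.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.
    Every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Average the score returned by `fit_and_score(train_idx, test_idx)`
    (a hypothetical user-supplied function) over all k folds."""
    scores = [fit_and_score(tr, te)
              for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


for train_idx, test_idx in k_fold_indices(10, k=5):
    print(sorted(test_idx), len(train_idx))
```

Averaging over folds is what smooths out the dependence on any single split that the holdout method suffers from.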

3. Bootstrap

sample data with replacement
evaluate performance across multiple resampled datasets
useful when data is limited
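Sampling with replacement can be sketched as below. The samples that are never drawn (often called out-of-bag samples) can serve as the evaluation set for that resample. The function name and the seed are assumptions for the example.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement.
    Returns (in_bag, out_of_bag): drawn indices (with repeats)
    and the indices that were never drawn."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(42)  # arbitrary seed, for reproducibility only
in_bag, out_of_bag = bootstrap_sample(1000, rng)
unique_fraction = len(set(in_bag)) / 1000
print(unique_fraction, len(out_of_bag))
```

For large $n$, the expected fraction of distinct samples in one resample is $1 - (1 - 1/n)^n \approx 1 - 1/e \approx 0.632$, so roughly a third of the data is left out each time.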

Goal

obtain a reliable estimate of generalization performance
reduce bias and variance in evaluation
ensure fair comparison between different models

2. Ensemble

2.1 Ensemble Learning

ensemble methods combine multiple models $M_1, M_2, ..., M_k$ to build a strong composite model and improve classification performance
each classifier "votes" for the class label of the given data

Process

generate multiple training sets $D_1, D_2, ..., D_k$
train base classifiers $M_i$
each model produces a prediction
final prediction is based on combined outputs

2.2 Voting Mechanism

each classifier outputs a class label ("votes")
final prediction
- (classification) majority voting
- (regression) averaging
improved accuracy through diversity
- ensemble makes an error only when the majority of models are wrong
- more robust and stable than a single model

low correlation between models
- if all models are highly similar, they tend to make the same mistake
- if models are less correlated, errors are less likely to overlap

If the models learn in different ways, then even when they make mistakes, they make them at different points.
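The benefit of uncorrelated errors can be checked with a short calculation. The sketch assumes an idealized setting that real ensembles only approximate: a binary problem, an odd number $k$ of classifiers, each correct with the same probability $p$, and fully independent errors. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(k, p):
    """Probability that more than half of k independent classifiers,
    each correct with probability p, vote for the right class (k odd)."""
    need = k // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(k, m) * p ** m * (1 - p) ** (k - m)
               for m in range(need, k + 1))


for k in (1, 5, 25):
    print(k, round(majority_vote_accuracy(k, 0.7), 4))
```

With $p = 0.7$, five independent voters already reach about 0.837. If the models were perfectly correlated they would all err together and the ensemble would stay at 0.7.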

2.3 Bagging

given a dataset $D$ with $d$ samples:
- we create multiple training sets $D_1, D_2, ..., D_k$
- each dataset is generated using bootstrap sampling (w/ replacement)
- some samples may appear multiple times, while some samples may not appear at all

Training Process

for each iteration $i = 1, 2, ... , k$
- sample dataset $D_i$ from $D$ (w/ replacement)
- train a model $M_i$ using $D_i$
after all $k$ iterations, we obtain multiple models $M_1, M_2, ... , M_k$

Prediction Process

given a new data sample $X$
- each model $M_i$ independently makes a prediction
- each prediction is treated as one vote
final output
- (classification) the class with the highest number of votes is selected $\rightarrow$ majority voting
- (regression) the final output is the average of all predictions
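The training and prediction steps above can be sketched end to end for classification. The base learner here is a hypothetical one-feature decision stump chosen to keep the example short. The notes do not prescribe a base learner, and the toy data and seed are made up.

```python
import random
from collections import Counter


def train_stump(xs, ys):
    """Hypothetical base learner: the single threshold on a 1-D feature
    (with either label orientation) that makes the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errs = sum((left if x <= t else right) != y
                       for x, y in zip(xs, ys))
            if best is None or errs < best[0]:
                best = (errs, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def bagging_fit(xs, ys, k, seed=0):
    """Train k models M_1..M_k, each on a bootstrap sample D_i of (xs, ys)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_stump([xs[i] for i in idx],
                                  [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Majority vote: each model's prediction counts as one vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: label 0 for x in 0..9, label 1 for x in 10..19.
xs = list(range(20))
ys = [0] * 10 + [1] * 10
models = bagging_fit(xs, ys, k=11)  # odd k avoids ties in a binary vote
print(bagging_predict(models, 0), bagging_predict(models, 19))
```

For regression, `bagging_predict` would return the mean of the individual predictions instead of the most common label.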

Why Bagging Works

reducing variance
stabilizing predictions through averaging
- averaging multiple models reduces fluctuations in predictions
handling noise and overfitting
- different training samples produce different models
- and the effect of noise is reduced through aggregation
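The variance argument can be made precise under a simplifying assumption, stated here for illustration: the $k$ model predictions $M_i(X)$ each have variance $\sigma^2$ and every pair has the same correlation $\rho$. The variance of their average is then

$$\mathrm{Var}\left(\frac{1}{k}\sum_{i=1}^{k} M_i(X)\right) = \rho\,\sigma^2 + \frac{1 - \rho}{k}\,\sigma^2$$

The second term shrinks as $k$ grows, while the first term remains. This is why bagging helps most when the bootstrap-trained models are only weakly correlated.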