1.1 Uncertainty in Data Mining
- Many data mining tasks involve making predictions under uncertainty, such as in classification where the goal is to predict a class label Y from an input feature vector X
- This relationship is expressed as the posterior probability P(X∣Y), which is the probability that an instance belongs to class Y given the observed features X
- Real-world datasets contain uncertainty due to noisy data, missing attributes, overlapping feature distributions, and measurement errors
- overlapping feature distributions
- 서로 다른 클래스의 특징값 분포가 겹침
- A 클래스와 B 클래스가 어떤 feature에서 비슷한 값들을 많이 가져서 경계가 뚜렷하지 않은 상태
- Information theory provides a mathematical framework for quantifying this uncertainty using concepts like entropy, conditional entropy, and mututal information
- entropy: H(Y)=−∑p(y)logp(y)
- conditional entropy: H(Y∣X)
- mutual information: H(X;Y)=H(Y)−H(Y∣X)
정보이론으로 불확실성을 수량화하는 수학적 프레임워크를 만들 수 있다.
- Introduced by Claude Shannon in 1948 to study how information can be efficiently transmitted through channels
- Key objectives: Quantifying information in a message, measuring uncertainty in random variables, and understanding the relationship between probability distributions and information
확률 분포와 불확실성의 관계를 측정, 데이터 마이닝에서 확률 분포 간 관계를 정량화한다.
Definition: I(x)=−log2p(x)
- It measures the information content of a single event
- If an event is unlikely (low probability), it provides more information
Properties
- Non-negative: I(x)≥0, 확률 p(x)≤1이므로 로그값 항상 0 이상
- Monotonic: p(x)가 작을수록 I(x)가 커짐
- Additive for independent events: I(x,y)=I(x)+I(y)(x, y 독립일 때)
2.2 Entropy: Uncertainty of a Distribution
Definition: H(X)=−∑p(x)logp(x)
- It is expected value of self-information or the average information produced by a random variable
- Low entropy → high certainty, while High entropy → high uncertainty
Properties
- Non-negativity: H(X)≥0
- 모든 확률 값은 0과 1 사이이므로 엔트로피 값은 음수가 될 수 없음
- Maximum entropy(max): 모든 결과가 동등하게 가능할 때 H(X)=log2n(n은 가능한 결과 개수)
- 모든 결과가 균등하게 발생할 때 불확실성은 최대
- Deteministic case(min): 한 결과만 확정(확률 1)일 때 H(X)=0
- 완전한 확실성으로 인해 결과가 완전히 예측 가능, 불확실성이 없는 상태
3.1 Conditional Entropy
Definition
- Conditional entropy measures the remaining uncertainty of a random variable after observing another variable
- H(Y∣X) measures how much uncertainty in Y remains after knowing X
Mathematical Definition
- H(Y∣X)=−∑p(x,y)logp(y∣x)
- Alternative form: H(Y∣X)=∑p(x)H(Y∣x) (Conditional entropy is the expected uncertainty over X)
- H(Y∣x): x가 딱 하나로 고정, 확률 안 곱함
- H(Y∣X): x가 여러 값, 각각의 확률 p(x)를 가중치로 곱해서 평균냄
Interpretation
- If X strongly predicts Y → conditional entropy is small
- If X provides little information about Y → conditional entropy is large
X를 알면 알수록 Y에 대한 불확실성이 줄어들고 줄어든 정도를 수치화한 것이 조건부 엔트로피이다.
E.g.,
- Computing the conditional entropy H(Y∣X) of exam results Y given study status X
Study 여부에 따른 Pass/Fail 확률 데이터를 통해 특정 조건(X)이 주어졌을 때 Y의 불확실성이 어떻게 변하는지 계산하는 기초로 활용한다.
3.2 Chain Rule of Entropy
Definition
- Entropy satisfies an important decomposition rule called the chain rule of entropy
- H(X,Y)=H(X)+H(Y∣X)
- Similarly, H(X,Y)=H(Y)+H(X∣Y)
전체 불확실성을 어떤 변수를 먼저 기준으로 잡느냐에 따라 두 가지로 쪼갤 수 있다.
Interpretation
- The joint entropy H(X,Y) represents the total uncertainty of two variables together
- The total uncertainty can b e decomposed into two parts:
- The uncertainty of one variable H(X)
- The remaining uncertainty of X + remaining uncertainty of Y after observing X
- Total uncertainty = uncertainty of X + remaining uncertainty of Y after observing X
Definition
- Mutual information measures how much information two random variables share, or how much knowing one variable reduces the uncertainty about the other
Mathematical Definition
- I(X;Y)=H(Y)−H(Y∣X)
- Another formulation: I(X;Y)=H(X)+H(Y)−H(X,Y)
- H(Y): X를 모를 때 Y의 불확실성
- H(Y∣X): X를 알고 난 후 Y의 남은 불확실성
- I(X;Y): X를 앎으로써 줄어든 불확실성
- If X and Y are independent, then I(X;Y)=0
Properties
- Non-negativity
- I(X;Y)≥0
- Dependency
- Larger mutual information indicates a stronger dependency between variables
Interpretation
- Independence
- If X and Y are independent, I(X;Y)=0
- Knowing one variable dose not reduce the uncertainty about the other; they do not share any information
- Perfect Prediction
- If knowing X allows us to perfectly predict Y, then I(X;Y)=H(Y)
- Once X is known, all uncertainty about Y disappears
- MI measures how strongly two variables are related by quantifying the reduction in uncertainty
MI란 X를 알기 전과 X를 알고 난 후의 불확실성 차이, 그 차이가 크면 클수록 두 변수는 강하게 연관되어 있고 차이가 0이면 아무 관련이 없다.
4. Featrue Selection and Decision Trees
- In data mining, a useful feature provides information about the target variable (label)
Core Idea
- Feature selection can be formulated as maximizing the mutual information
Significance
- High MI with the label → significantly reduces the uncertainty of the target
- MI close to zero → the feature provides little or no useful information for prediction
Goal
- Features with larger MI are considered more information and improve learning performance
Decision Tree에서 어떤 기준으로 데이터를 나눌지 결정할 때 사용한다.
- IG=H(parent)−H(children)
- IG=H(Y)−H(Y∣X)
- IG represents how much uncertainty about the class label Y is reduced by knowing the feature X
- 나누기 전 데이터 (parent)
- X 기준으로 나눈 후 (children)
- IG = 나누기 전 불확실성 - 나눈 후 불확실성
Goal
- The ultimate goal of decision tree learning is to reduce uncertainty about the class label
Process
- At each node, the algorithm evaluates candidate features and selects the one with the highest information gain
- Selecting the best feature is equivalent to maximizing the mutual information between feature X and class label Y
- Decision tree learning is an iterative process of reducing uncertainty in the dataset
Higher H means more uncertainty; observing X results in a lower H(Y∣X), leading to a reduction in uncertainty