5. Linear Model (4)

Eunji · May 13, 2026

Data Mining


1. Linear Models

1.1 Core of Linear Models

signal $s = \mathbf{w}^T\mathbf{x}$ : combines input variables linearly
we have seen two models based on this

The weighted inner product over the inputs reflects how important each input is to the result.

1. Linear Regression

signal itself = output
for predicting real (unbounded) response

2. Linear Classification

signal is thresholded at zero to produce $\pm 1$ output
for binary decisions

2. Logistic Regression

2.1 Example: Heart Attack Prediction

based on cholesterol level, blood pressure, age, weight, ...
cannot predict a heart attack with any certainty
- but can predict how likely it is to occur given these factors
a more suitable model than binary decision would be:
- output $y$ that varies continuously between 0 and 1
- the closer $y$ is to 1, the more likely heart attack will occur

For a heart attack, reporting a probability is the more realistic and appropriate way to model the problem. This idea is why logistic regression is called 'soft' binary classification.

2.2 Soft Binary Classification

outputs probability of a binary response
e.g., heart attack or not, dead or alive
- returns 'soft labels' (probability)

Logistic Regression

output: real (like regression) but bounded (like classification)

In logistic regression the output is confined between 0 and 1, because the logistic function is applied to the input signal and restricts the range smoothly.

Linear Classification vs. Logistic Regression

both deal with a binary event
logistic regression: allowed to be uncertain
$\rightarrow$ intermediate values between 0 and 1 reflect this uncertainty

2.3 Example: Probability of Default

annual incomes and monthly credit card balances

(a): individuals who defaulted on credit card payments are shown in orange, and those who did not are shown in blue
(b): boxplots of balance/income as a function of default status

$\mathbb{P}[\text{default} = \text{yes} \mid \text{balance}]$ : probability of default given balance

can be estimated / predicted by regression analysis
logistic regression: more appropriate than linear regression

estimated probability of default:
- (a) linear regression: some estimated probabilities are negative
- (b) logistic regression: all probabilities lie between 0 and 1
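The difference between an estimate that can go negative and one that stays inside [0, 1] can be seen on a tiny example. The sketch below is illustrative only: the balances and default labels are invented, the straight line is an ordinary least-squares fit via `np.polyfit`, and the logistic weights are hand-picked rather than fitted.

```python
import numpy as np

# Invented toy data (not from the post): balance in thousands, default as 0/1.
balance = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
default = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])

# Linear regression on the 0/1 labels: an ordinary least-squares line.
slope, intercept = np.polyfit(balance, default, deg=1)
linear_estimate = slope * balance + intercept

# A linear signal squashed by the logistic function.
# These weights are hand-picked for illustration, not fitted.
signal = 3.0 * (balance - 3.5)
logistic_estimate = 1.0 / (1.0 + np.exp(-signal))

print("linear  :", np.round(linear_estimate, 3))
print("logistic:", np.round(logistic_estimate, 3))
```

With these numbers the least-squares line dips below zero for the smallest balances, while the logistic values stay strictly between 0 and 1.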

3. Logistic Regression Model

Linear Classification

hard threshold on signal $s = \mathbf{w}^T\mathbf{x}$
$h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x})$

Linear Regression

no threshold
$h(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$

Logistic Regression Model

needs something between these two
- smoothly restricts output to probability range [0, 1]
$h(\mathbf{x}) = \theta(\mathbf{w}^T\mathbf{x})$
- $\theta$: the so-called logistic function

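Output characteristics
- the output is real-valued yet confined to the range between 0 and 1
How the threshold is applied
- changes smoothly when the signal is small and approaches the hard threshold of linear classification when the signal is large in magnitude
Certainty vs. uncertainty
- the continuous probability value expresses how confident or uncertain the model is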

3.1 Logistic Function $\theta$

Definition

For $-\infty < s < \infty$ :
$\theta(s) = \displaystyle \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$

output lies between 0 and 1
- can be interpreted as probability for binary events

Formula of $\theta(s)$

allows us to define an error measure that has analytical and computational advantages
other names of logistic function $\theta$
- soft threshold: in contrast to the hard one in classification
- sigmoid: its shape looks like a flattened-out 's'

$1 - \theta(s) = 1 - \displaystyle \frac{1}{1 + e^{-s}} = \frac{e^{-s}}{1 + e^{-s}} = \color{indianred}{\theta(-s)}$
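The definition and the symmetry $1 - \theta(s) = \theta(-s)$ are easy to check numerically. The sketch below is one possible implementation, not the only one: the function name `theta` and the overflow-safe branching on the sign of $s$ are choices made here for illustration.

```python
import numpy as np

def theta(s):
    """Logistic function 1 / (1 + e^{-s}), evaluated without overflow."""
    s = np.asarray(s, dtype=float)
    # e^{-|s|} never overflows; pick the algebraically equal form per sign.
    z = np.exp(-np.abs(s))
    return np.where(s >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

s = np.linspace(-8.0, 8.0, 9)
print(theta(0.0))                              # 0.5
print(np.allclose(1.0 - theta(s), theta(-s)))  # True
```

For $s \ge 0$ the code uses $1/(1+e^{-s})$ and for $s < 0$ the equivalent $e^{s}/(1+e^{s})$, which are exactly the two forms given in the definition.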

3.2 Linear Models

based on signal $s = \displaystyle \sum_{i=0}^{d} w_i x_i$

4. Example: Heart Attack

Prediction

input $\mathbf{x}$
- cholesterol level, age, weight, etc.,
signal $s = \mathbf{w}^T\mathbf{x}$
- risk score: how important (or unimportant) is each attribute?

Linear Classification

$h(\mathbf{x})$ returns $\pm 1$ : heart attack (+1) or not (-1) for sure

Linear Regression

$h(\mathbf{x})$ returns risk score $s$ itself

Logistic Regression

$h(\mathbf{x})$ returns $\theta(s)$ : probability of heart attack
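The three models differ only in what they do with the same signal. A minimal sketch, assuming a hypothetical patient vector and hypothetical weights (all numbers are invented, and $x_0 = 1$ is the usual constant coordinate for the bias weight $w_0$):

```python
import numpy as np

# Hypothetical patient and weights, invented for illustration only.
# x_0 = 1 is the constant coordinate that multiplies the bias weight w_0.
x = np.array([1.0, 2.4, 0.6, 1.1])    # [1, cholesterol, age, weight] (rescaled)
w = np.array([-3.0, 1.0, 0.5, 0.4])   # [bias, w_cholesterol, w_age, w_weight]

s = w @ x                              # signal (risk score)

# Hard threshold sign(s); the boundary case s == 0 is mapped to -1 here.
h_classification = 1.0 if s > 0 else -1.0
h_regression = s                            # no threshold: the score itself
h_logistic = 1.0 / (1.0 + np.exp(-s))       # soft threshold: theta(s)

print(h_classification, h_regression, h_logistic)
```

Here the risk score is slightly positive, so classification commits to +1, while logistic regression reports a probability only a little above one half.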

4.1 Another Popular Sigmoid Function

Hyperbolic Tangent

$\tanh(s) = \displaystyle\frac{e^s - e^{-s}}{e^s + e^{-s}}$
$\tanh(s)$ converges to a hard threshold for large $|s|$
converges to no threshold ($\tanh(s) \approx s$) for small $|s|$
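Both limiting behaviours can be checked numerically. The sketch below also uses the standard identity $\tanh(s) = 2\theta(2s) - 1$, which is not stated in the post but follows directly from the two definitions; the helper name `theta` is chosen here.

```python
import numpy as np

def theta(s):
    """Logistic function 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

s = np.linspace(-5.0, 5.0, 11)

# tanh is the logistic function rescaled to the range (-1, 1).
print(np.allclose(np.tanh(s), 2.0 * theta(2.0 * s) - 1.0))  # True

# Small |s|: tanh(s) is close to s itself (no threshold).
print(np.tanh(0.01))
# Large |s|: tanh(s) is close to sign(s) (hard threshold).
print(np.tanh(10.0), np.tanh(-10.0))
```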

Linear, Logistic, Tanh, and Hard Threshold

$\mathbf{w}^T\mathbf{x}$ is the signal $s$, and applying a function $f$ to it to produce the value $y = f(s)$ is called activation.

5. Learning Target and Error

5.1 Big Picture

$\mathbb{P}[y = +1 | \boldsymbol{x}] = f(\boldsymbol{x}) \sim h(\boldsymbol{x}) = \theta(\boldsymbol{w}^T \boldsymbol{x})$
$P(y|\boldsymbol{x}) = h(\boldsymbol{x})^{[y=+1]} (1 - h(\boldsymbol{x}))^{[y=-1]} = \theta(y \boldsymbol{w}^T \boldsymbol{x})$

1. Linear Signal

$\mathbf{x} \rightarrow \mathbf{w}^T\mathbf{x}$
The blue line $\mathbf{w}^T\mathbf{x}$ takes the input $\mathbf{x}$ and computes the signal, or risk score

2. Nonlinear Transformation

$\mathbf{w}^T\mathbf{x} \rightarrow \theta(s)$
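The green curve $\theta$ takes the signal $s$, which extends along a straight line, and converts it into a smooth probability $h(\mathbf{x})$ between 0 and 1
- the larger and more positive the signal, the closer the probability is to 1; the larger and more negative, the closer to 0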

Observed vs Fitted

Observed
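- the actual data $y_n$ are given as binary samples, $+1$ or $-1$
- in the plot they appear as extreme bars with probability 0 or 1
Fitted
- the output produced by the model $h(\mathbf{x})$
- does not commit to 0 or 1; appears as a continuous value between 0 and 1

The goal of learning is to make the fitted bars match the observed bars as closely as possible.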

5.2 Target

Learning Target

probability of event $y = +1$ given input $\mathbf{x}$
$f(\boldsymbol{x}) = \mathbb{P}[y = +1 | \boldsymbol{x}]$
e.g., probability of heart attack given patient characteristics

Learning means finding the function $f(\mathbf{x})$ that gives the probability that the outcome $y$ is $+1$ given the input $\mathbf{x}$.

Training Data

do not give us the value of $f$ explicitly

The essence of learning is to infer the smooth probability $f(\mathbf{x})$ from nothing but the hard samples $y_n$.

5.3 How to Measure Error?

fitting data $D$ means finding a good $h$
$h$ is good if
- $h(\mathbf{x}_n) = 1$ whenever $y_n = +1$
- $h(\mathbf{x}_n) = 0$ whenever $y_n = -1$

Simple

not very convenient (hard to minimize)

$\displaystyle E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} \left( h(\boldsymbol{x}_n) - \frac{1}{2}(1 + y_n) \right)^2$

Cross-Entropy

looks complicated and ugly, but

based on intuitive probabilistic interpretation
has nice property for gradient-based optimization
- although not as easy as linear regression

When $y_n = +1$
- the larger the output, the better
- $\theta$ approaches 1 and the error approaches 0

When $y_n = -1$, the reverse holds.
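The two error measures of this section can be compared on a few signal values. This is a minimal sketch with hand-picked signals: the cross-entropy error is written as $\ln(1/\theta(y\,s))$ with $s = \mathbf{w}^T\mathbf{x}$, and the function names are chosen here for illustration.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def squared_error(s, y):
    """(h(x) - (1 + y)/2)^2 with h(x) = theta(s); the target is 1 for y=+1 and 0 for y=-1."""
    return (theta(s) - 0.5 * (1 + y)) ** 2

def cross_entropy_error(s, y):
    """ln(1 / theta(y s)) = ln(1 + e^{-y s}), computed stably with logaddexp."""
    return np.logaddexp(0.0, -y * s)

signals = np.array([-8.0, -2.0, 0.0, 2.0, 8.0])
for y in (+1, -1):
    print(f"y = {y:+d}")
    print("  squared      :", np.round(squared_error(signals, y), 4))
    print("  cross-entropy:", np.round(cross_entropy_error(signals, y), 4))
```

For $y_n = +1$ both errors shrink toward 0 as the signal grows. On a confidently wrong prediction the squared error saturates just below 1, whereas the cross-entropy error keeps growing roughly linearly in $|s|$.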

6. Defining Error Measure

6.1 Likelihood & Maximum Likelihood

Likelihood

standard error measure in logistic regression: likelihood
how likely is it to get output $y$ from input $\mathbf{x}$ , if target distribution $P(y|x)$ was indeed captured by $h(\mathbf{x})$ ?

$P(y|\boldsymbol{x}) = \begin{cases} h(\boldsymbol{x}) & \text{for } y = +1 \\ 1 - h(\boldsymbol{x}) & \text{for } y = -1 \end{cases} \; = \theta(y \boldsymbol{w}^T \boldsymbol{x})$

$h(\boldsymbol{x}) = \theta(\boldsymbol{w}^T \boldsymbol{x})$ and $1 - \theta(s) = \theta(-s)$
this simplicity is the reason for defining $\theta(s)$ as $e^s / (1 + e^s)$
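The collapse of the two cases into the single expression $\theta(y\,\mathbf{w}^T\mathbf{x})$ can be verified directly. A small sketch with arbitrary signals and labels (the function names are made up for this example):

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def likelihood_piecewise(s, y):
    """P(y | x) from the two-case definition, with h(x) = theta(s)."""
    h = theta(s)
    return np.where(y == 1, h, 1.0 - h)

def likelihood_compact(s, y):
    """P(y | x) = theta(y * s), using 1 - theta(s) = theta(-s)."""
    return theta(y * s)

s = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # arbitrary signals w^T x
y = np.array([+1, -1, +1, -1, +1])          # arbitrary labels
print(np.allclose(likelihood_piecewise(s, y), likelihood_compact(s, y)))  # True
```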

Maximum Likelihood

assume $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ are independently generated
- probability of getting all $y_n$'s from corresponding $\mathbf{x}_n$'s

$P(y_1|\boldsymbol{x}_1)P(y_2|\boldsymbol{x}_2) \dots P(y_N|\boldsymbol{x}_N) = \prod_{n=1}^{N} P(y_n|\boldsymbol{x}_n)$

the method of maximum likelihood
- select the hypothesis $h$ that maximizes this probability

Raising the probability of all the data amounts to pushing the model to classify every point correctly.

equivalently minimize a more convenient quantity
- $\displaystyle -\frac{1}{N} \ln (\cdot)$ is a monotonically decreasing function

$\displaystyle -\frac{1}{N} \ln \left( \prod_{n=1}^{N} P(y_n|\boldsymbol{x}_n) \right) = \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n|\boldsymbol{x}_n)}$

substituting with $P(y | \mathbf{x}) = \theta(y \mathbf{w}^T \mathbf{x})$ , we would be minimizing (wrt $\mathbf{w}$ )

$\displaystyle \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n \mathbf{w}^T \mathbf{x}_n)} = \frac{1}{N} \sum_{n=1}^{N} \ln \begin{cases} \frac{1}{h(\mathbf{x}_n)} & \text{for } y_n = +1 \\ \frac{1}{1 - h(\mathbf{x}_n)} & \text{for } y_n = -1 \end{cases}$ (3)

$\displaystyle = \frac{1}{N} \sum_{n=1}^{N} \left\{ [y_n = +1] \ln \frac{1}{h(\mathbf{x}_n)} + [y_n = -1] \ln \frac{1}{1 - h(\mathbf{x}_n)} \right\}$ (4)

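The bracket $[\cdot]$ equals 1 if the condition inside holds and 0 otherwise.

When the label is $y_n = +1$
- only the left term survives; the right term is multiplied by 0 and vanishes
When the label is $y_n = -1$
- the left term vanishes and only the right term contributes

This is a switch structure: for each data point, exactly one error term, the one matching the label, is selected. The two cases are written as a single sum because that form is convenient for mathematical optimization.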
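The step from the product of probabilities to the sum with brackets can be checked on a handful of values. A minimal sketch, assuming hypothetical fitted probabilities $h(\mathbf{x}_n)$ and labels $y_n$ (both invented):

```python
import numpy as np

# Hypothetical fitted probabilities h(x_n) and labels y_n, invented for illustration.
h = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([+1, -1, +1, -1])
N = len(y)

# Product form: -(1/N) ln of the product of P(y_n | x_n).
p = np.where(y == 1, h, 1.0 - h)
nll_product = -np.log(np.prod(p)) / N

# Bracket form: the indicators act as 0/1 switches on the two log terms.
is_pos = (y == 1).astype(float)    # [y_n = +1]
is_neg = (y == -1).astype(float)   # [y_n = -1]
nll_switch = np.mean(is_pos * np.log(1.0 / h) + is_neg * np.log(1.0 / (1.0 - h)))

print(nll_product, nll_switch)
```

Both numbers agree. In practice the sum of logs is also the safer form, since a product of many probabilities can underflow to zero in floating point.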

6.2 In-sample Error $E_{in}$

the fact that we are minimizing quantity (3)
- allows us to treat it as an error measure
substituting the functional form for $\theta(y_n \mathbf{w}^T \mathbf{x}_n)$ produces
- in-sample error measure for logistic regression

$\displaystyle E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln(1 + e^{-y_n \mathbf{w}^T \mathbf{x}_n})$

implied pointwise error
- $\displaystyle e(h(\boldsymbol{x}_n), y_n) = \ln(1 + e^{-y_n \boldsymbol{w}^T \boldsymbol{x}_n})$

Summing the pointwise errors and dividing by the number of data points gives the overall average error ( $E_{in}$ ).
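Since $1/\theta(s) = 1 + e^{-s}$, the in-sample error is a one-line computation. The sketch below is one way to write it: the function name, the data, and the use of `np.logaddexp` for numerical stability are assumptions of this example, not part of the original notes.

```python
import numpy as np

def in_sample_error(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n)).

    X has shape (N, d+1) with a leading column of ones; y holds +1/-1 labels.
    """
    margins = y * (X @ w)
    # logaddexp(0, -m) = ln(1 + e^{-m}) without overflow for very negative m.
    return np.mean(np.logaddexp(0.0, -margins))

# Made-up data for illustration only.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, -2.5]])
y = np.array([+1, +1, -1, -1])

print(in_sample_error(np.zeros(2), X, y))           # ln 2 at w = 0
print(in_sample_error(np.array([0.0, 2.0]), X, y))  # smaller: every y_n w^T x_n > 0
```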

6.3 Optimization

$\max \quad \prod_{n=1}^{N} P(y_n | \mathbf{x}_n)$

$\Leftrightarrow \max \quad \ln \left( \prod_{n=1}^{N} P(y_n | \mathbf{x}_n) \right)$

$\equiv \max \quad \sum_{n=1}^{N} \ln P(y_n | \mathbf{x}_n)$

$\Leftrightarrow \min \quad -\frac{1}{N} \sum_{n=1}^{N} \ln P(y_n | \mathbf{x}_n)$

$\equiv \min \quad \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n | \mathbf{x}_n)}$

$\equiv \min \quad \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n \mathbf{w}^T \mathbf{x}_n)}$ - substitute $h$

$\equiv \min \quad \frac{1}{N} \sum_{n=1}^{N} \ln (1 + e^{-y_n \mathbf{w}^T \mathbf{x}_n})$

Remarks

small when $y_n \mathbf{w}^T \mathbf{x}_n$ is large and positive
- which would imply that $\mathrm{sign}(\mathbf{w}^T \mathbf{x}_n) = y_n$
therefore, as our intuition expects
- the error measure encourages $\mathbf{w}$ to classify each $\mathbf{x}_n$ correctly
alternative derivation of $E_{in}(\mathbf{w})$ is possible
- based on the notion of cross entropy
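To see the error measure pushing $\mathbf{w}$ toward correct classification, one can minimize $E_{in}$ numerically. This section does not derive a gradient or an algorithm, so everything below is an assumption of this sketch: the gradient is the standard one obtained by differentiating $\ln(1 + e^{-y_n \mathbf{w}^T \mathbf{x}_n})$, the optimizer is plain gradient descent with an arbitrary fixed step size, and the data are invented.

```python
import numpy as np

def in_sample_error(w, X, y):
    """E_in(w) = mean of ln(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w, X, y):
    """Gradient of E_in: -(1/N) * sum_n y_n x_n * theta(-y_n w^T x_n)."""
    margins = y * (X @ w)
    weights = 1.0 / (1.0 + np.exp(margins))   # theta(-margin)
    return -(X * (y * weights)[:, None]).mean(axis=0)

# Made-up, non-separable toy data: column 0 is the constant x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 1.5],
              [1.0, -1.0], [1.0, -2.5], [1.0, 0.8]])
y = np.array([+1, +1, +1, -1, -1, -1])

w = np.zeros(2)
eta = 0.5                                     # step size, chosen arbitrarily
start = in_sample_error(w, X, y)
for _ in range(200):
    w = w - eta * gradient(w, X, y)

print(start, in_sample_error(w, X, y), w)
```

The error drops from its starting value of $\ln 2$ at $\mathbf{w} = \mathbf{0}$. Because the toy data are not linearly separable, the error cannot reach zero.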

6.4 Cross-Entropy

consider two pmfs $\{p, 1-p\}$ and $\{q, 1-q\}$ with binary outcomes
cross entropy for these two pmfs: defined by
- $\displaystyle p \log \frac{1}{q} + (1-p) \log \frac{1}{1-q}$
cross entropy measures the error for approximating
- observed (ground-truth) pmf $\{p, 1-p\}$ by fitted (predicted) pmf $\{q, 1-q\}$

It quantifies how far the prediction is from the ground truth, i.e. how different the two probability distributions are.
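A direct transcription of the definition shows how the cross entropy shrinks as the fitted pmf approaches the observed one. This is a minimal sketch: the function name is made up, the natural logarithm is assumed, and terms with zero weight are dropped by convention.

```python
import numpy as np

def cross_entropy(p, q):
    """p * log(1/q) + (1 - p) * log(1/(1 - q)) for binary pmfs {p, 1-p} and {q, 1-q}.

    Terms with zero weight are dropped (0 * log(1/0) is treated as 0).
    """
    total = 0.0
    if p > 0:
        total += p * np.log(1.0 / q)
    if p < 1:
        total += (1.0 - p) * np.log(1.0 / (1.0 - q))
    return total

# Observed pmf is a hard label (p = 1); the fitted q moves closer to it.
for q in (0.5, 0.9, 0.99):
    print(q, cross_entropy(1.0, q))
```

With a hard label $p = 1$ only the first term remains, so the value is simply $\log(1/q)$.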

Recall

$P(y|\mathbf{x}) = \begin{cases} h(\mathbf{x}) & \text{for } y = +1 \\ 1 - h(\mathbf{x}) & \text{for } y = -1 \end{cases}$

Using the symmetry of the sigmoid function $\theta$, the expression above can be written compactly as follows.

$P(y|\mathbf{x}) = \theta(y\mathbf{w}^T\mathbf{x})$

It can also be merged into a single expression using indicator (bracket) exponents.

$P(y|\mathbf{x}) = h(\mathbf{x})^{[y=+1]} (1 - h(\mathbf{x}))^{[y=-1]}$

Likelihood of data

The likelihood of the whole dataset is the product of the probabilities of the individual data points.

$\displaystyle \prod_{n=1}^{N} P(y_n|\mathbf{x}_n) = \prod_{n=1}^{N} h(\mathbf{x}_n)^{[y_n=+1]} (1 - h(\mathbf{x}_n))^{[y_n=-1]}$

Negative Log-Likelihood

Take the log of the likelihood and negate it to turn the problem into a minimization. Dividing by $N$ gives the average loss.

$\begin{aligned} NLL(\mathbf{w}) &\propto -\frac{1}{N} \log \{ \prod_{n=1}^{N} P(y_n|\mathbf{x}_n) \} \\ &= -\frac{1}{N} \log \{ \prod_{n=1}^{N} h(\mathbf{x}_n)^{[y_n=+1]} (1 - h(\mathbf{x}_n))^{[y_n=-1]} \} \\ &= \frac{1}{N} \sum_{n=1}^{N} \{ [y_n = +1] \log \frac{1}{h(\mathbf{x}_n)} + [y_n = -1] \log \frac{1}{1 - h(\mathbf{x}_n)} \} \end{aligned}$

The term inside the summation above is exactly the cross-entropy loss.

Cross-entropy Loss = Log Loss
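The equivalence can be confirmed numerically by taking the observed pmf to be $\{[y_n = +1], [y_n = -1]\}$ and the fitted pmf to be $\{h(\mathbf{x}_n), 1 - h(\mathbf{x}_n)\}$. The weights and data below are invented for this check.

```python
import numpy as np

# Made-up weights and data for illustration; labels use the +1/-1 convention.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, -2.5]])
y = np.array([+1, -1, -1, +1])
w = np.array([0.2, 0.7])

h = 1.0 / (1.0 + np.exp(-(X @ w)))     # fitted probabilities h(x_n)
t = (1 + y) / 2                        # observed pmf: t_n = [y_n = +1], 1 - t_n = [y_n = -1]

# Average cross entropy between observed {t, 1-t} and fitted {h, 1-h}.
log_loss = np.mean(t * np.log(1.0 / h) + (1 - t) * np.log(1.0 / (1.0 - h)))

# In-sample error in the form ln(1 + exp(-y_n w^T x_n)).
e_in = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))

print(np.isclose(log_loss, e_in))      # True
```

The mapping $t_n = (1 + y_n)/2$ is the same 0/1 target that appears in the squared-error measure of section 5.3.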

6.5 Big Picture
