[ML&DL] 3. Classification

KBC · October 7, 2024

Classification

  • Qualitative variables take values in an unordered set $C$, such as
    $$\text{eye color}\in\{\text{brown, blue, green}\},\qquad \text{email}\in\{\text{spam, ham}\}$$
  • Given a feature vector $X$ and a qualitative response $Y$ taking values in the set $C$

    The classification task is to build a function $C(X)$ that takes as input the feature vector $X$ and predicts its value for $Y$

  • Often we are more interested in estimating the probabilities that $X$ belongs to each category in $C$.

Can we use Linear Regression?

  • Suppose for the Default classification task that we code

    $$Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes} \end{cases}$$
  • Can we simply perform a linear regression of $Y$ on $X$ and classify as Yes if $\hat Y > 0.5$?

    • In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis, which we discuss later
    • Since in the population $E(Y|X=x)=\Pr(Y=1|X=x)$, we might think that regression is perfect for this task
    • However, linear regression might produce probabilities less than zero or greater than one

      Logistic regression is more appropriate

  • Now suppose that

    $$Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}$$
  • Linear regression is not appropriate here

    Multiclass logistic regression or discriminant analysis is more appropriate

Logistic Regression

  • Let's write $p(X) = \Pr(Y=1|X)$ for short and consider using balance to predict default. Logistic regression uses the form
    $$p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$
  • $e\approx2.71828$ is a mathematical constant (Euler's number)
  • $p(X)$ will have values between 0 and 1
  • A bit of rearrangement gives the log odds, or logit:
    $$\ln\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X
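A minimal fitting sketch for this model, assuming a hypothetical Default-style dataset with a single balance feature and a 0/1 default label (all names and numbers here are illustrative, not from the source):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: balance in dollars and a 0/1 default indicator.
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=1000)
true_p = 1 / (1 + np.exp(-(-10.65 + 0.0055 * balance)))  # assumed "true" model
default = rng.binomial(1, true_p)

# A very large C makes the fit essentially unregularized,
# i.e. plain maximum likelihood as described in the next section.
model = LogisticRegression(C=1e10, max_iter=1000)
model.fit(balance.reshape(-1, 1), default)
print(model.intercept_, model.coef_)  # estimates of beta_0 and beta_1
```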

Maximum Likelihood

$$\ell(\beta_0,\beta_1)=\prod_{i:y_i=1}p(x_i)\prod_{i:y_i=0}(1-p(x_i))$$
  • This likelihood gives the probability of the observed zeros and ones in the data
  • We pick $\beta_0$ and $\beta_1$ to maximize the likelihood of the observed data
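Equivalently, one maximizes the log-likelihood. A small sketch of evaluating it for candidate coefficients, assuming the same illustrative balance and default arrays as in the sketch above:

```python
import numpy as np

def log_likelihood(beta0, beta1, x, y):
    """log l(beta0, beta1): sum of log p(x_i) over y_i = 1
    plus sum of log(1 - p(x_i)) over y_i = 0."""
    p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# e.g. compare candidate fits on the illustrative data:
# log_likelihood(-10.65, 0.0055, balance, default)
```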

Making Predictions

  • What is the estimated probability of default for someone with a balance of $1000?
    $$\hat p(X) = \frac{e^{\hat\beta_0+\hat\beta_1X}}{1 +e^{\hat\beta_0+\hat\beta_1X}} = \frac{e^{-10.6513+0.0055\times1000}}{1 + e^{-10.6513+0.0055\times1000}} \approx 0.006$$
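The same arithmetic as a quick check, using the coefficient estimates quoted in the formula above:

```python
import numpy as np

beta0, beta1 = -10.6513, 0.0055
x = 1000  # balance of $1000
p_hat = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))
print(round(p_hat, 4))  # roughly 0.006
```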

Logistic Regression with several variables

$$\ln\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p,\qquad p(X)=\frac{e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}$$

Logistic Regression with more than two classes

$$\Pr(Y=k\,|\,X)=\frac{e^{\beta_{0k}+\beta_{1k}X_1+\cdots+\beta_{pk}X_p}}{\sum_{l=1}^Ke^{\beta_{0l}+\beta_{1l}X_1+\cdots+\beta_{pl}X_p}}$$
  • Multiclass logistic regression is also referred to as multinomial regression

Discriminant Analysis

  • Here the approach is to model the distribution of $X$ in each of the classes separately, and then use Bayes' theorem to flip things around and obtain $\Pr(Y|X)$
  • When we use Normal or Gaussian distributions for each class, this leads to linear or quadratic discriminant analysis

Bayes theorem for classification

  • Bayes theorem

    $$\Pr(Y=k\,|\,X=x)=\frac{\Pr(X=x\,|\,Y=k)\cdot\Pr(Y=k)}{\Pr(X=x)}$$
  • One writes this slightly differently for discriminant analysis:

    $$\Pr(Y=k\,|\,X=x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_lf_l(x)},\quad\text{where}$$
    • $f_k(x) = \Pr(X=x\,|\,Y=k)$ is the density for $X$ in class $k$
    • $\pi_k=\Pr(Y=k)$ is the marginal or prior probability for class $k$
  • When the priors are different, we take them into account as well, and compare $\pi_kf_k(x)$.

  • In the textbook figure, when the blue class has the larger prior, we favor the blue class and the decision boundary shifts to the left

Why discriminant analysis?

  • When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem
  • If $n$ is small and the distribution of the predictors $X$ is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model
  • Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data

Linear Discriminant Analysis when p = 1 (Uncorrelated)

  • The Gaussian density has the form
    $$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\,e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma_k}\right)^2}$$
  • Here $\mu_k$ is the mean and $\sigma_k^2$ the variance (in class $k$)
  • We will assume that all the $\sigma_k=\sigma$ are the same
  • Plugging this into Bayes' formula, we get a rather complex expression for $p_k(x) = \Pr(Y=k\,|\,X=x)$:
    $$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma}\right)^2}}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu_l}{\sigma}\right)^2}}$$

Discriminant functions

  • To classify at the value $X=x$, we need to see which of the $p_k(x)$ is largest
  • Taking logs, and discarding terms that do not depend on $k$, we see that this is equivalent to assigning $x$ to the class with the largest discriminant score:
    $$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
  • Note that $\delta_k(x)$ is a linear function of $x$
  • If there are $K=2$ classes and $\pi_1=\pi_2=0.5$, then one can see that the decision boundary is at
    $$x=\frac{\mu_1+\mu_2}{2}$$
  • In practice the parameters are estimated from the training data (see the sketch after this list):
    $$\hat{\pi}_k = \frac{n_k}{n},\qquad \hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i,\qquad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2 = \sum_{k=1}^{K} \frac{n_k - 1}{n - K}\, \hat{\sigma}_k^2$$
  • where $\hat{\sigma}_k^2 = \frac{1}{n_k - 1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2$ is the usual formula for the estimated variance in the $k$th class
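A compact sketch of these one-dimensional LDA estimates and discriminant scores; the array names and data layout are purely illustrative:

```python
import numpy as np

def lda_1d_fit(x, y):
    """Estimate pi_k, mu_k, and the pooled variance sigma^2 for 1-D LDA."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    pi = np.array([np.mean(y == k) for k in classes])        # n_k / n
    mu = np.array([x[y == k].mean() for k in classes])       # class means
    sigma2 = sum(((x[y == k] - m) ** 2).sum()
                 for k, m in zip(classes, mu)) / (n - K)     # pooled variance
    return classes, pi, mu, sigma2

def lda_1d_scores(x0, pi, mu, sigma2):
    """delta_k(x0) = x0 * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log pi_k."""
    return x0 * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)

# classify x0 to classes[np.argmax(lda_1d_scores(x0, pi, mu, sigma2))]
```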

Linear Discriminant Analysis when p > 1 (Correlated)

  • Density:
    $$f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}$$
  • Discriminant function:
    $$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
    (the first term is linear in $x$; the other terms are constants)
  • Despite its complex form, this is just
    $$\delta_k(x) = c_{k0} + c_{k1}x_1 + c_{k2}x_2 + \cdots + c_{kp}x_p,$$
    a linear function of $x$ (a small sketch follows below)
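A minimal sketch of this multivariate discriminant score; Sigma, mu_k, and pi_k are assumed to be the (estimated) common covariance matrix, class mean, and class prior:

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma, pi_k):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 * mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv_mu = np.linalg.solve(Sigma, mu_k)   # Sigma^{-1} mu_k without an explicit inverse
    return x @ Sinv_mu - 0.5 * mu_k @ Sinv_mu + np.log(pi_k)

# In practice scikit-learn's LinearDiscriminantAnalysis estimates these
# quantities from data and classifies to the class with the largest score.
```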

Fisher's Discriminant Plot

  • When there are $K$ classes, linear discriminant analysis can be viewed exactly in a $K-1$ dimensional plot

    This is because it essentially classifies to the closest centroid, and the centroids span a $K-1$ dimensional plane

  • Even when $K > 3$, we can find the best 2-dimensional plane for visualizing the discriminant rule

From delta function to probabilities

  • Once we have estimates $\hat\delta_k(x)$, we can turn these into estimates for class probabilities:
    $$\widehat{\Pr}(Y=k\,|\,X=x)=\frac{e^{\hat\delta_k(x)}}{\sum_{l=1}^Ke^{\hat\delta_l(x)}}$$
  • So classifying to the largest $\hat\delta_k(x)$ amounts to classifying to the class for which $\widehat{\Pr}(Y=k\,|\,X=x)$ is largest
  • When $K=2$, we classify to class 2 if $\widehat{\Pr}(Y=2\,|\,X=x) \geq 0.5$, and to class 1 otherwise

Confusion Matrix

  The example numbers below refer to the textbook's confusion matrix for the Default data (10,000 observations).

  • misclassification rate : $(23 + 252) / 10000 = 2.75\%$
  • precision : $81/104$
  • recall : $81/333$
  • false positive rate : the fraction of negative examples that are classified as positive ($0.2\%$ in the example)
  • false negative rate : the fraction of positive examples that are classified as negative ($75.7\%$ in the example)
  • We produced this table by classifying to class Yes if
    $$\widehat{\Pr}(\text{Default}=\text{Yes}\,|\,\text{Balance},\text{Student})\geq0.5$$
    • Lower threshold : higher false positive rate and lower false negative rate
    • Higher threshold : lower false positive rate and higher false negative rate

      By varying the threshold we can trace out the ROC curve (see the sketch after this list)
    • If the classifier performs well, the AUC approaches 1
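A sketch of computing these quantities from predicted probabilities, assuming y_true (0/1 labels) and p_hat (fitted probabilities) come from any of the classifiers above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def report(y_true, p_hat, threshold=0.5):
    """Confusion-matrix summaries at a given probability threshold."""
    y_pred = (p_hat >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    n = len(y_true)
    print("misclassification rate:", (fp + fn) / n)
    print("precision:", tp / (tp + fp))
    print("recall:", tp / (tp + fn))
    print("false positive rate:", fp / (fp + tn))
    print("false negative rate:", fn / (fn + tp))
    # AUC summarizes performance over all thresholds (the ROC curve).
    print("AUC:", roc_auc_score(y_true, p_hat))
```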

Other forms of Discriminant Analysis

$$\Pr(Y=k\,|\,X=x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_lf_l(x)}$$
  • With Gaussians but a different $\Sigma_k$ in each class, we get quadratic discriminant analysis
  • With Gaussians and the same $\Sigma_k = \Sigma$ in each class, we get linear discriminant analysis
  • With $f_k(x)=\prod_{j=1}^pf_{jk}(x_j)$ (a conditional independence model) in each class, we get naive Bayes. For Gaussians this means the $\Sigma_k$ are diagonal
    • $\Sigma_k$ : the covariance matrix for class $k$; the QDA discriminant function is
      $$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k - \frac{1}{2} \log |\Sigma_k|$$

Naive Bayes

  • Assume features are independent in each class
  • Useful when $p$ is large, so that multivariate methods like QDA and even LDA break down
  • Gaussian naive Bayes assumes each $\Sigma_k$ is diagonal:
    $$\delta_k(x) \propto \log \left[ \pi_k \prod_{j=1}^{p} f_{kj}(x_j) \right] = -\frac{1}{2} \sum_{j=1}^{p} \left[ \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} + \log \sigma_{kj}^2 \right] + \log \pi_k$$
  • For mixed feature vectors (qualitative and quantitative):

    if $X_j$ is qualitative, replace $f_{kj}(x_j)$ with a probability mass function (histogram) over the discrete categories (see the sketch below)
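A minimal Gaussian naive Bayes sketch on made-up quantitative features (for qualitative features, scikit-learn's CategoricalNB plays the role of the per-category mass function):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Illustrative data: two classes, three quantitative features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)),
               rng.normal(1.0, 1.0, (50, 3))])
y = np.repeat([0, 1], 50)

# Each feature is modelled by its own univariate Gaussian within each class,
# which is exactly the diagonal-Sigma_k assumption above.
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:2]))   # estimated Pr(Y=k | X=x) for two observations
```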

Logistic Regression versus LDA

  • For a two-class problem, one can show that for LDA
    $$\log \left( \frac{p_1(x)}{1 - p_1(x)} \right) = \log \left( \frac{p_1(x)}{p_2(x)} \right) = c_0 + c_1 x_1 + \cdots + c_p x_p$$
  • So it has the same form as logistic regression
  • The difference is in how the parameters are estimated
    • Logistic regression uses the conditional likelihood based on $\Pr(Y|X)$ (discriminative learning)
    • LDA uses the full likelihood based on $\Pr(X,Y)$ (generative learning)
  • Logistic Regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model

Multinomial Logistic Regression

  • The simplest representation uses different linear functions for each class, combined with the softmax function to form probabilities
    $$\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}$$
  • There is a redundancy here; we really only need $K-1$ functions
  • We fit by maximizing the multinomial log-likelihood (cross-entropy), a generalization of the binomial case (a softmax sketch follows below)
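A sketch of the softmax step itself, assuming a hypothetical coefficient matrix B whose rows hold $(\beta_{k0}, \beta_{k1}, \dots, \beta_{kp})$ for each class:

```python
import numpy as np

def softmax_probs(X, B):
    """Pr(Y=k | X=x) from K linear functions combined with the softmax.
    X: (n, p) feature matrix;  B: (K, p+1) coefficients, intercept first."""
    Xb = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
    scores = Xb @ B.T                              # (n, K) linear scores
    scores -= scores.max(axis=1, keepdims=True)    # stabilize the exponentials
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)  # each row sums to 1
```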

Generative Models and Naive Bayes

  • Logistic regression models $\Pr(Y=k|X=x)$ directly, via the logistic function

  • Similarly the multinomial logistic regression uses the softmax function

  • These all model the conditional distribution of $Y$ given $X$

  • By contrast, generative models start with the conditional distribution of $X$ given $Y$, and then use Bayes' formula to turn things around:

    $$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$
  • $f_k(x)$ is the density of $X$ given $Y=k$

  • $\pi_k=\Pr(Y=k)$ is the marginal probability that $Y$ is in class $k$

  • Linear and quadratic discriminant analysis derive from generative models where the $f_k(x)$ are Gaussian

  • Often useful if some classes are well separated, a situation where logistic regression is unstable

  • Naive Bayes assumes that the densities $f_k(x)$ in each class factor:

    $$f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$$
  • Equivalently this assumes that the features are independent within each class

  • Then using Bayes formula:

    $$\Pr(Y = k \mid X = x) = \frac{\pi_k \times f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)}{\sum_{l=1}^{K} \pi_l \times f_{l1}(x_1) \times f_{l2}(x_2) \times \cdots \times f_{lp}(x_p)}$$
  • The numerator term $f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$ is the product of the per-feature densities of class $k$ evaluated at the observed values

Why the independence assumption for naive Bayes?

  • Difficult to specify and model high-dimensional densities. Much easier to specify one-dimensional densities
  • Can handle mixed features
    • If feature $j$ is quantitative, we can model it as a univariate Gaussian: we estimate $\mu_{jk}$ and $\sigma_{jk}^2$ from the data, and then plug them into the Gaussian density formula for $f_{jk}(x_j)$
    • Alternatively, we can use a histogram estimate of the density, and directly estimate $f_{jk}(x_j)$ by the proportion of observations in the bin into which $x_j$ falls
    • If feature $j$ is qualitative, we can simply model the proportion in each category

      In the textbook's worked example with a particular test observation $x^*$, this gives $\Pr(Y=1|X=x^*)=0.944$ and $\Pr(Y=2|X=x^*)=0.056$

Naive Bayes and GAMs

$$\log \left( \frac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)} \right) = \log \left( \frac{\pi_k f_k(x)}{\pi_K f_K(x)} \right) = \log \left( \frac{\pi_k \prod_{j=1}^{p} f_{kj}(x_j)}{\pi_K \prod_{j=1}^{p} f_{Kj}(x_j)} \right) = \log \left( \frac{\pi_k}{\pi_K} \right) + \sum_{j=1}^{p} \log \left( \frac{f_{kj}(x_j)}{f_{Kj}(x_j)} \right) = a_k + \sum_{j=1}^{p} g_{kj}(x_j),$$
$$\text{where } a_k = \log \left( \frac{\pi_k}{\pi_K} \right) \text{ and } g_{kj}(x_j) = \log \left( \frac{f_{kj}(x_j)}{f_{Kj}(x_j)} \right)$$
  • Hence, the naive Bayes model takes the form of a generalized additive model, discussed in a later chapter

Generalized Linear Models

  • Generalized linear models provide a unified framework for dealing with many different response types (non-negative responses, skewed distributions, and more)
  • In the left-hand plot of the textbook's Bikeshare figure, we see that the variance mostly increases with the mean
  • Taking log(bikers) alleviates this, but has its own problems: e.g. predictions are on the wrong scale, and some counts are zero

Poisson Regression Model

  • The Poisson distribution is useful for modeling counts:
    $$\Pr(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!} \quad \text{for } k = 0,1,2,\dots$$
  • $\lambda=E(Y)=\text{Var}(Y)$, so there is a mean/variance dependence
  • With covariates, we model
    $$\log(\lambda(X_1, \ldots, X_p)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
    or equivalently
    $$\lambda(X_1, \ldots, X_p) = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}$$
  • The model automatically guarantees that the predictions are non-negative (see the sketch below)
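A small sketch of fitting such a model with statsmodels on made-up count data (the data and coefficients are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative count data generated from an assumed log-linear Poisson model.
rng = np.random.default_rng(2)
x = rng.uniform(0, 2, 200)
y = rng.poisson(np.exp(0.5 + 1.2 * x))

X = sm.add_constant(x)                               # intercept + covariate
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                                    # estimates of beta_0, beta_1
print(fit.predict(X[:3]))                            # fitted lambda(x), always >= 0
```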

Three GLMs

  • We have covered three GLMs: Gaussian, binomial, and Poisson
  • They each have a characteristic link function (written out in code below). This is the transformation of the mean that is modelled linearly:
    $$\eta(\mathbb{E}(Y \mid X_1, X_2, \ldots, X_p)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
    • linear regression : $\eta(\mu)=\mu$
    • logistic regression : $\eta(\mu)=\log\left(\frac{\mu}{1-\mu}\right)$
    • Poisson regression : $\eta(\mu)=\log(\mu)$
  • They also each have characteristic variance functions
  • The models are fit by maximum likelihood
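For reference, the three link functions written out as plain Python functions:

```python
import numpy as np

def identity_link(mu):   # Gaussian / linear regression
    return mu

def logit_link(mu):      # binomial / logistic regression
    return np.log(mu / (1 - mu))

def log_link(mu):        # Poisson regression
    return np.log(mu)
```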

All content is based on the GIST Machine Learning & Deep Learning course (Instructor: Prof. Sun-dong Kim).
