Lecture 04. Perceptron & Generalized Linear Model

cryptnomy · November 22, 2022

CS229: Machine Learning

Lecture video link: https://youtu.be/iZTeva0WSTQ

Topics for today

  • Perceptron
  • Exponential Family
  • Generalized Linear Models
  • Softmax Regression (Multiclass Classification)

Perceptron learning algorithm

$$\theta_j \leftarrow \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x^{(i)}_j$$

where $h_\theta(x)=g(\theta^T x)$ and $g(z)=\begin{cases} 1 & z\geq 0 \\ 0 & z < 0 \end{cases}$.
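This update rule can be sketched in a few lines of NumPy; the toy AND dataset, learning rate, and epoch count below are illustrative choices, not part of the lecture.

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, epochs=10):
    """Train a perceptron with the update
    theta_j <- theta_j + alpha * (y_i - h_theta(x_i)) * x_ij.
    X: (m, n) data matrix, y: (m,) labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    g = lambda z: (z >= 0).astype(float)   # threshold activation
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            h = g(theta @ x_i)             # prediction in {0, 1}
            theta += alpha * (y_i - h) * x_i
    return theta

# Linearly separable toy data (last column is the intercept term)
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([0., 0., 0., 1.])             # logical AND of the first two features
theta = perceptron_train(X, y)
preds = (X @ theta >= 0).astype(float)
```

Because the data are linearly separable, the updates stop once every example is classified correctly.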

Exponential family

A family of distributions whose pdf (or pmf) can be written as

$$p(y; \eta) = b(y)\exp\left(\eta^T T(y) - a(\eta)\right)$$

where

$y$: data

$\eta$: natural parameter

$T(y)$: sufficient statistic

$b(y)$: base measure

$a(\eta)$: log-partition function

Bernoulli

$\phi$: probability of the event ($y = 1$)

$$\begin{aligned} p(y;\phi) &= \phi^y (1-\phi)^{1-y} \\ &= \exp\left(\log\left(\phi^y (1-\phi)^{1-y}\right)\right) \\ &= 1 \cdot \exp\left(\log\left(\frac{\phi}{1-\phi}\right)y+\log(1-\phi)\right) \end{aligned}$$

where

$$\begin{aligned} b(y) &= 1 \\ T(y) &= y \\ \eta &= \log\left(\frac{\phi}{1-\phi}\right) \Rightarrow \phi = \frac{1}{1+e^{-\eta}} \\ a(\eta) &= -\log(1-\phi) = -\log\left(1-\frac{1}{1+e^{-\eta}}\right)=\log(1+e^\eta). \end{aligned}$$
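These identities are easy to check numerically. A minimal sketch (the value $\phi = 0.3$ is an arbitrary choice):

```python
import numpy as np

phi = 0.3                                  # Bernoulli parameter (arbitrary)
eta = np.log(phi / (1 - phi))              # natural parameter eta = log(phi/(1-phi))
a = np.log(1 + np.exp(eta))                # log-partition a(eta) = log(1 + e^eta)

# phi is recovered from eta by the sigmoid
assert np.isclose(1 / (1 + np.exp(-eta)), phi)

# b(y) exp(eta*T(y) - a(eta)) reproduces phi^y (1-phi)^(1-y) for y in {0, 1}
for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)
    ef = 1.0 * np.exp(eta * y - a)         # b(y) = 1, T(y) = y
    assert np.isclose(direct, ef)
```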

Gaussian (with fixed variance)

Assume $\sigma^2=1$.

$$\begin{aligned} p(y;\mu) &= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2}\right) \\ &= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right)\exp\left(\mu y-\frac{1}{2}\mu^2\right)\end{aligned}$$

where

$$\begin{aligned} b(y) &= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right) \\ T(y) &= y \\ \eta &= \mu \\ a(\eta) &= \frac{\mu^2}{2} = \frac{\eta^2}{2}. \end{aligned}$$
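As with the Bernoulli case, this factorization can be verified numerically; $\mu = 1.5$ and $y = 0.7$ below are arbitrary choices.

```python
import numpy as np

mu = 1.5                                    # mean (sigma^2 fixed at 1)
eta = mu                                    # natural parameter eta = mu
y = 0.7                                     # arbitrary observation

b = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)  # base measure b(y)
a = eta**2 / 2                              # log-partition a(eta) = eta^2 / 2

direct = np.exp(-(y - mu)**2 / 2) / np.sqrt(2 * np.pi)  # N(y; mu, 1) density
ef = b * np.exp(eta * y - a)                # exponential-family form
assert np.isclose(direct, ef)
```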

Properties

  1. The log-likelihood is concave in $\eta$, so MLE w.r.t. $\eta$ is a concave maximization problem (equivalently, the negative log-likelihood is convex).

  2. $\mathbb{E}[y;\eta] = \frac{\partial}{\partial\eta}a(\eta)$

  3. $\mathbb{V}[y;\eta] = \frac{\partial^2}{\partial\eta^2}a(\eta)$
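Properties 2 and 3 can be sanity-checked with finite differences for the Bernoulli family, where $a(\eta)=\log(1+e^\eta)$, $\mathbb{E}[y]=\phi$, and $\mathbb{V}[y]=\phi(1-\phi)$; the point $\eta = 0.8$ and the step size are arbitrary choices.

```python
import numpy as np

def a(eta):
    """Bernoulli log-partition: a(eta) = log(1 + e^eta)."""
    return np.log(1 + np.exp(eta))

eta, h = 0.8, 1e-4
phi = 1 / (1 + np.exp(-eta))                          # canonical parameter

# Central finite differences approximate the derivatives of a
mean = (a(eta + h) - a(eta - h)) / (2 * h)            # ~ a'(eta)
var = (a(eta + h) - 2 * a(eta) + a(eta - h)) / h**2   # ~ a''(eta)

assert np.isclose(mean, phi, atol=1e-6)               # E[y; eta] = phi
assert np.isclose(var, phi * (1 - phi), atol=1e-4)    # V[y; eta] = phi(1 - phi)
```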

cf. Probability distributions

Real - Gaussian

Binary - Bernoulli

Count - Poisson

$\mathbb{R}^+$ (positive reals) - Gamma, Exponential

Distributions over probabilities - Beta, Dirichlet (used in Bayesian statistics)

GLM (Generalized linear models)

Assumptions / Design choices

  1. $y \mid x;\theta \sim \text{Exponential Family}(\eta)$

  2. $\eta=\theta^T x$, where $\theta\in\mathbb{R}^n,\ x\in\mathbb{R}^n$

  3. At test time, output $\mathbb{E}[y|x;\theta]$

    $\Rightarrow h_\theta(x) = \mathbb{E}[y|x;\theta]$

Train time: maximize the log-likelihood,

$$\max\limits_\theta\ \sum_i \log p\left(y^{(i)}; \theta^T x^{(i)}\right)$$

Test time: output the expectation,

$$h_\theta(x) = \mathbb{E}[y|x;\theta] = \mathbb{E}[y;\eta]$$

GLM Training

Learning update rule

$$\theta_j \leftarrow \theta_j + \alpha\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
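A sketch of this universal GLM update in NumPy: the same training loop works for any family once the hypothesis $h$ is set to that family's canonical response function. The toy dataset and hyperparameters are illustrative choices; plugging in the sigmoid gives logistic regression.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def glm_sgd(X, y, h, alpha=0.1, epochs=200):
    """Stochastic gradient ascent with the GLM update
    theta_j <- theta_j + alpha * (y_i - h(theta^T x_i)) * x_ij.
    h is the canonical response function of the chosen family.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            theta += alpha * (y_i - h(theta @ x_i)) * x_i
    return theta

# Bernoulli output -> h = sigmoid, i.e. logistic regression
X = np.array([[0., 1.], [1., 1.], [2., 1.], [3., 1.]])  # last column: intercept
y = np.array([0., 0., 1., 1.])
theta = glm_sgd(X, y, h=sigmoid)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
```

Swapping `h` for the identity recovers least-squares regression, and `np.exp` would give Poisson regression, without changing the loop.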

Terminology

$\eta$: natural parameter

$\mu=\mathbb{E}[y;\eta]=g(\eta)$: canonical response function

$\eta = g^{-1}(\mu)$: canonical link function

$g(\eta)=\dfrac{\partial}{\partial \eta}a(\eta)$

3 parameterizations

| Model param. | Natural param. | Canonical param. |
| --- | --- | --- |
| $\theta$ | $\eta$ | $\phi$ (Bernoulli), $(\mu, \sigma^2)$ (Gaussian), $\lambda$ (Poisson) |

Learning happens on the model parameter $\theta$; the design choice $\eta = \theta^T x$ links it to the natural parameter, and the canonical response function $g$ (with the link function $g^{-1}$ going back) maps between $\eta$ and the canonical parameter.

Logistic regression

$$h_\theta (x) = \mathbb{E}[y|x;\theta] = \phi = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\theta^T x}}.$$

Softmax regression (cross entropy)

$K$: number of classes

$x^{(i)}\in\mathbb{R}^n$

$y$: one-hot vector of length $K$

Learn → Predict → Compare

How do we minimize the distance between the two distributions $p(y)$ (target) and $\hat{p}(y)$ (prediction)?

→ Minimize the cross entropy between two distributions

$$\begin{aligned} \text{CrossEnt}(p, \hat{p}) &= -\sum_{y\in\text{classes}}p(y)\log\hat{p}(y) \\ &= -\log\hat{p}(y_{\text{target}}) \\ &= -\log\frac{e^{\theta_{\text{target}}^T x}}{\sum_{c\in \text{classes}}e^{\theta_c^T x}}. \end{aligned}$$

Treat the cross entropy above as the loss and minimize it with gradient descent.
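A minimal sketch of that loss and one gradient-descent step on it, assuming $K = 3$ classes and a toy 2-dimensional input; the gradient of $-\log\hat{p}(y_{\text{target}})$ w.r.t. $\theta_c$ is $(\hat{p}_c - 1\{c = \text{target}\})\,x$, which is what the step below implements.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_step(Theta, x, y_onehot, alpha=0.5):
    """One gradient-descent step on -log p_hat(y_target) for one example.
    Theta: (K, n) matrix with one parameter row theta_c per class."""
    p_hat = softmax(Theta @ x)
    return Theta - alpha * np.outer(p_hat - y_onehot, x)

# Toy 3-class example (bias folded into x)
K, n = 3, 2
Theta = np.zeros((K, n))
x = np.array([1.0, 1.0])
y = np.array([0.0, 1.0, 0.0])       # one-hot target: class 1

for _ in range(100):
    Theta = cross_entropy_step(Theta, x, y)
loss = -np.log(softmax(Theta @ x) @ y)   # -log p_hat(y_target), driven toward 0
```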
