Lecture 03. Locally Weighted & Logistic Regression

CS229: Machine Learning

Lecture video link: https://youtu.be/het9HFqo1TQ

Outline

  • Linear Regression (recap)
  • Locally Weighted Regression
  • Probabilistic Interpretation
  • Logistic Regression
  • Newton’s Method

Recap

$(x^{(i)}, y^{(i)})$: $i^{th}$ example

$x^{(i)} \in \mathbb{R}^{n+1}, \;\; y^{(i)} \in \mathbb{R}, \;\; x_0 = 1$

$m$: # of examples

$n$: # of features

$h_\theta(x) = \sum\limits_{j=0}^{n}\theta_j x_j = \theta^T x$

$J(\theta) = \frac{1}{2} \sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
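
A minimal NumPy sketch of these two definitions, as a sanity check (the toy data and all names below are my own choices for illustration, not from the lecture):

```python
import numpy as np

def h(theta, X):
    """Hypothesis h_theta(x) = theta^T x for every row of X.

    X is an (m, n+1) design matrix whose first column is x_0 = 1.
    """
    return X @ theta

def J(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = h(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

# Tiny example: m = 3 examples, n = 1 feature (plus the intercept term x_0 = 1).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])
print(J(theta, X, y))  # 0.0 -- this theta fits the toy data exactly
```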

Locally weighted regression

“Parametric” learning algorithm:

Fit a fixed set of parameters $\theta_i$ to the data.

“Non-parametric” learning algorithm:

The amount of data/parameters you need to keep grows (linearly) with the size of the training set.

To evaluate $h$ at a certain $x$:

  • Linear regression (LR)

Fit $\theta$ to minimize $\frac{1}{2} \sum\limits_{i=1}^{m} \left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^2$.

Return $\theta^T x$.

  • Locally weighted regression

Fit $\theta$ to minimize $\sum\limits_{i=1}^m w^{(i)}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$

where $w^{(i)}$ is a “weighting” function, $w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$.

$\tau$: bandwidth.

If $|x^{(i)} - x|$ is small, then $w^{(i)} \approx 1$.

If $|x^{(i)} - x|$ is large, then $w^{(i)} \approx 0$.
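
A minimal sketch of how this could look in code, assuming the weighted least-squares problem at each query point is solved with the weighted normal equations $(X^T W X)\theta = X^T W y$ (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau):
    """Predict y at x_query by fitting theta to minimize
    sum_i w_i (y_i - theta^T x_i)^2, then returning theta^T x_query.

    X: (m, n+1) design matrix with x_0 = 1; y: (m,) targets; tau: bandwidth.
    """
    # Gaussian-shaped weights w_i = exp(-||x_i - x_query||^2 / (2 tau^2)).
    # (The intercept column contributes zero to the distance.)
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: (X^T W X) theta = X^T W y.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ x_query

# Usage: predict at x = 1.5 with bandwidth tau = 0.8.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])
print(lwr_predict(np.array([1.0, 1.5]), X, y, tau=0.8))
```

Note that $\theta$ is re-fit for every query point, which is why the whole training set has to be kept around — the non-parametric behavior described above.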

Why least squares?

Assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$

where $\epsilon^{(i)}$ captures unmodeled effects (e.g., one day the seller is in an unusually good or bad mood, so the price goes higher or lower) and random noise.

Assume the $\epsilon^{(i)}$ have a Gaussian distribution, i.e., $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ (which means $p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$).

(Why Gaussian? $\because$ the central limit theorem: in many situations, when independent random variables are summed, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.)

(Source: https://en.wikipedia.org/wiki/Central_limit_theorem)

Then we have i.i.d. (independent and identically distributed) noise around $\theta^T x$, which means

$$p(y^{(i)}|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right),$$

i.e., $y^{(i)} \,|\, x^{(i)};\theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$,

where the “$;$” symbol is read as “parameterized by”.

(cf. Since $\theta$ is not a random variable, it needs to be distinguished from $x^{(i)}$, hence the “$;$” rather than a conditioning bar.)

Likelihood of $\theta$

$$\begin{aligned} L(\theta) &= p(y|x;\theta) \\ &= \prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta) \\ &= \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right). \end{aligned}$$

Probability vs. Likelihood

View $p(y|x;\theta)$ as …

… a function of the parameters $\theta$ with the data $(x, y)$ held fixed → Likelihood.

… a function of the data $(x, y)$ with the parameters $\theta$ held fixed → Probability.

Log likelihood

$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\ &= m\log\frac{1}{\sqrt{2\pi}\sigma} + \sum_{i=1}^m \left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right). \end{aligned}$$

MLE (Maximum Likelihood Estimation)

Choose $\theta$ to maximize $L(\theta)$.

I.e., choose $\theta$ to minimize $\frac{1}{2}\sum\limits_{i=1}^m \left(y^{(i)} - \theta^T x^{(i)}\right)^2 = J(\theta)$.

$\therefore$ Choosing the value of $\theta$ that minimizes the least-squares errors

= finding the maximum likelihood estimate of the parameters $\theta$, under the assumptions we made that the error terms are Gaussian and i.i.d.
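
A small numerical illustration of this equivalence on synthetic data (the data, $\sigma$, and the use of the closed-form least-squares solution below are my own choices for the sketch): the $\theta$ that minimizes the squared error also attains the highest Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data matching the modeling assumption: y = 2 + 3x + Gaussian noise.
m = 100
x = rng.uniform(0.0, 5.0, size=m)
X = np.column_stack([np.ones(m), x])
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=m)

def log_likelihood(theta, X, y, sigma=0.5):
    """Gaussian log-likelihood
    l(theta) = m log(1/(sqrt(2 pi) sigma)) - sum_i (y_i - theta^T x_i)^2 / (2 sigma^2)."""
    r = y - X @ theta
    return len(y) * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - np.sum(r ** 2) / (2.0 * sigma ** 2)

# Closed-form least-squares solution (minimizes J(theta)).
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# The least-squares theta scores a higher log-likelihood than other candidates.
for theta in [theta_ls, np.array([0.0, 0.0]), np.array([2.0, 2.5])]:
    print(theta, log_likelihood(theta, X, y))
```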

Classification

$y \in \{0, 1\}$ (binary classification).

Logistic regression

Want $h_\theta(x) \in [0, 1]$.

$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$.

$g(z) = \frac{1}{1 + e^{-z}}$: the “sigmoid” or “logistic” function.

$$p(y=1|x;\theta) = h_\theta(x) \\ p(y=0|x;\theta) = 1 - h_\theta(x)$$

$\Rightarrow p(y|x;\theta) = h_\theta(x)^y \left(1 - h_\theta(x)\right)^{1-y}$.
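
A minimal sketch of this model in code, before moving on to the likelihood (function and variable names are mine):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, X):
    """Hypothesis h_theta(x) = g(theta^T x), applied to every row of X."""
    return sigmoid(X @ theta)

def bernoulli_prob(theta, X, y):
    """p(y | x; theta) = h(x)^y * (1 - h(x))^(1 - y), elementwise over the examples."""
    p = h(theta, X)
    return p ** y * (1.0 - p) ** (1.0 - y)
```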

Then for the log likelihood we obtain

$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right). \end{aligned}$$

Choose $\theta$ to maximize $l(\theta)$.

  • Batch gradient ascent:
$$\theta_j \leftarrow \theta_j + \alpha\frac{\partial}{\partial\theta_j}l(\theta), \quad \text{or equivalently} \quad \theta_j \leftarrow \theta_j + \alpha\sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}.$$
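
A minimal sketch of this update as a batch gradient-ascent loop; the learning rate, iteration count, and toy data below are arbitrary illustrative choices of mine, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_ga(X, y, alpha=0.05, num_iters=10000):
    """Batch gradient ascent on the logistic log-likelihood l(theta).

    Vectorized form of theta_j <- theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        error = y - sigmoid(X @ theta)   # y^(i) - h_theta(x^(i)) for all i
        theta += alpha * (X.T @ error)   # alpha * gradient of l(theta)
    return theta

# Toy usage: 1-D inputs (plus intercept), labels roughly increasing with x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0],
              [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = fit_logistic_ga(X, y)
print(sigmoid(X @ theta))  # predicted probabilities, increasing with x
```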

Newton’s method

Have a function $f$.

Want to find $\theta$ s.t. $f(\theta) = 0$.

→ Want to maximize $l(\theta)$, i.e., want $l'(\theta) = 0$.

Start from $\theta^{(0)}$.

$$\theta^{(1)} \leftarrow \theta^{(0)} - \Delta = \theta^{(0)} - \frac{f(\theta^{(0)})}{f'(\theta^{(0)})} \quad \left(f'\left(\theta^{(0)}\right) = \frac{f(\theta^{(0)})}{\Delta} \Rightarrow \Delta = \frac{f(\theta^{(0)})}{f'(\theta^{(0)})}\right)$$

$$\cdots$$

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}.$$

Letting $f(\theta) = l'(\theta)$, we obtain

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}.$$
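
A minimal sketch of this scalar update, using $f(x) = x^2 - 2$ (whose root is $\sqrt{2}$) as a stand-in example of my own so that the quadratic convergence discussed next is visible:

```python
def newton(f, f_prime, theta, num_iters=6):
    """Scalar Newton's method: theta <- theta - f(theta) / f'(theta)."""
    for t in range(num_iters):
        theta = theta - f(theta) / f_prime(theta)
        print(t + 1, theta)
    return theta

# Root of f(x) = x^2 - 2, i.e., sqrt(2) = 1.41421356...
newton(lambda x: x ** 2 - 2, lambda x: 2 * x, theta=1.0)
# The number of correct digits roughly doubles at each iteration.
```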

Quadratic convergence

0.01 error → 0.0001 error → 0.00000001 error → …

→ Newton’s method requires only a few iterations (roughly, the number of correct digits doubles with each step).

When $\theta$ is a vector $\left(\theta \in \mathbb{R}^{n+1}\right)$:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - H^{-1}\nabla_\theta l$$

where $H \in \mathbb{R}^{(n+1)\times(n+1)}$ is the Hessian matrix, which satisfies

$$H_{ij} = \frac{\partial^2 l}{\partial\theta_i \, \partial\theta_j}.$$
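
A minimal sketch of this vector update applied to the logistic-regression log-likelihood. It assumes the standard expressions for the gradient, $\nabla_\theta l = X^T\left(y - h_\theta(X)\right)$, and the Hessian, $H = -X^T S X$ with $S = \mathrm{diag}\left(h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right)\right)$, which are not derived in this lecture; the toy data is the same illustrative set used above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, num_iters=10):
    """Newton's method for maximizing the logistic log-likelihood l(theta).

    Per iteration: grad = X^T (y - h), H = -X^T S X with S = diag(h * (1 - h)),
    then theta <- theta - H^{-1} grad.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h_vals = sigmoid(X @ theta)
        grad = X.T @ (y - h_vals)
        H = -(X.T * (h_vals * (1.0 - h_vals))) @ X   # Hessian of l(theta)
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Toy usage: Newton converges in a handful of iterations on this data,
# far fewer than the gradient-ascent loop sketched earlier.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0],
              [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
print(fit_logistic_newton(X, y))
```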
