Lecture video link: https://youtu.be/het9HFqo1TQ
Outline
Linear Regression (recap)
Locally Weighted Regression
Probabilistic Interpretation
Logistic Regression
Newton’s Method
Recap
$(x^{(i)}, y^{(i)})$: $i^{\text{th}}$ example
$x^{(i)} \in \mathbb{R}^{n+1}$, $y^{(i)} \in \mathbb{R}$, $x_0 = 1$
$m$: # examples
$n$: # features
$h_\theta(x) = \sum\limits_{j=0}^{n} \theta_j x_j = \theta^T x$
$J(\theta) = \frac{1}{2} \sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
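For concreteness, here is how the hypothesis and cost above might look in code (a NumPy sketch; the design matrix is assumed to carry the intercept column $x_0 = 1$, and the names are illustrative):

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, evaluated for every row x^{(i)} of the design matrix X."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = (1/2) * sum_i (h_theta(x^{(i)}) - y^{(i)})^2."""
    return 0.5 * np.sum((hypothesis(theta, X) - y) ** 2)
```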
Locally weighted regression
“Parametric” learning algorithm:
Fit a fixed set of parameters ($\theta_i$) to data.
“Non-parametric” learning algorithm:
The amount of data/parameters you need to keep grows (linearly) with the size of the data set.
To evaluate $h$ at a certain $x$:
Fit $\theta$ to minimize $\frac{1}{2} \sum\limits_{i=1}^{m} \left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^2$.
Return $\theta^T x$.
Locally weighted regression
Fit $\theta$ to minimize $\sum\limits_{i=1}^m w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$
where $w^{(i)}$ is a “weighting” function, $w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$.
$\tau$: bandwidth
If $|x^{(i)} - x|$ is small, $w^{(i)} \approx 1$.
If $|x^{(i)} - x|$ is large, $w^{(i)} \approx 0$.
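A minimal sketch of this procedure, assuming a design matrix with an intercept column and using the closed-form weighted normal equation; `tau` and the query point in the usage example are arbitrary illustrative choices:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.8):
    """Predict at a single query point with locally weighted linear regression.

    X: (m, n+1) design matrix with x_0 = 1, y: (m,) targets,
    x_query: (n+1,) query point (with leading 1), tau: bandwidth.
    """
    # Gaussian weights w^{(i)} = exp(-||x^{(i)} - x||^2 / (2 tau^2))
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * tau ** 2))

    # Fit theta to minimize sum_i w^{(i)} (y^{(i)} - theta^T x^{(i)})^2
    # via the weighted normal equation theta = (X^T W X)^{-1} X^T W y.
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ x_query

# Usage: noisy sine data, predict near x = 1.5.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + 0.1 * rng.standard_normal(50)
print(lwr_predict(X, y, np.array([1.0, 1.5])))
```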
Why least squares?
Assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$
where $\epsilon^{(i)}$ captures unmodeled effects (e.g., one day the seller is in an unusually good or bad mood, so the price ends up higher or lower) and random noise.
Assume the $\epsilon^{(i)}$ have a Gaussian distribution, i.e., $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, which means $p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$.
(Why Gaussian? Because of the central limit theorem: in many situations, when independent random variables are summed, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.)
(Source: https://en.wikipedia.org/wiki/Central_limit_theorem )
Then we have i.i.d. (independent and identically distributed) noise around $\theta^T x$, which means
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right), \quad \text{i.e.,}\;\; y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2),$$
where the “$;$” symbol is read as “parameterized by”.
(cf. $\theta$ is not a random variable, so we write “$;\theta$” rather than conditioning on it the way we do with $x^{(i)}$.)
Likelihood of $\theta$
$$\begin{aligned} L(\theta) &= p(y \mid x; \theta) \\ &= \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}; \theta) \\ &= \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right). \end{aligned}$$
Probability vs. Likelihood
View $p(y \mid x; \theta)$ as …
… a function of the parameters $\theta$, holding the data $(x, y)$ fixed → likelihood.
… a function of the data $(x, y)$, holding the parameters $\theta$ fixed → probability.
Log likelihood
$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\ &= m \log \frac{1}{\sqrt{2\pi}\sigma} + \sum_{i=1}^m \left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right). \end{aligned}$$
MLE (Maximum Likelihood Estimation)
Choose $\theta$ to maximize $L(\theta)$, or equivalently $l(\theta)$.
Since the first term of $l(\theta)$ does not depend on $\theta$, this amounts to choosing $\theta$ to minimize $\frac{1}{2}\sum\limits_{i=1}^m \left(y^{(i)} - \theta^T x^{(i)}\right)^2 = J(\theta)$.
$\therefore$ Choosing the value of $\theta$ that minimizes the least-squares error
= finding the maximum likelihood estimate of the parameters $\theta$, under the assumption we made that the error terms are Gaussian and i.i.d.
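A small numeric check of this equivalence (a sketch with made-up synthetic data; `sigma` and the perturbation `delta` are arbitrary choices): the negative log-likelihood equals a constant plus $J(\theta)/\sigma^2$, so both objectives are minimized by the same $\theta$.

```python
import numpy as np

def neg_log_likelihood(theta, X, y, sigma=1.0):
    """-l(theta) = m*log(sqrt(2*pi)*sigma) + sum_i (y^{(i)} - theta^T x^{(i)})^2 / (2*sigma^2)."""
    residuals = y - X @ theta
    return len(y) * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(residuals ** 2) / (2 * sigma ** 2)

def least_squares_cost(theta, X, y):
    """J(theta) = (1/2) * sum_i (y^{(i)} - theta^T x^{(i)})^2."""
    return 0.5 * np.sum((y - X @ theta) ** 2)

# Synthetic data from y = 1 + 2x + noise; perturb the least-squares solution and
# observe that both objectives are smallest at the same theta (delta = 0).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 5, 100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 100)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
for delta in (-0.1, 0.0, 0.1):
    theta = theta_ls + delta
    print(delta, least_squares_cost(theta, X, y), neg_log_likelihood(theta, X, y))
```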
Classification
$y \in \{0, 1\}$ (binary classification).
Logistic regression
Want $h_\theta(x) \in [0, 1]$.
$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$.
$g(z) = \frac{1}{1 + e^{-z}}$: “sigmoid” or “logistic” function.
$$p(y = 1 \mid x; \theta) = h_\theta(x) \\ p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$
$\Rightarrow p(y \mid x; \theta) = h(x)^y \left(1 - h(x)\right)^{1-y}$.
Then for the log-likelihood we obtain
$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right). \end{aligned}$$
Choose $\theta$ to maximize $l(\theta)$.
$$\theta_j \leftarrow \theta_j + \alpha \frac{\partial}{\partial \theta_j} l(\theta), \quad \text{i.e.,} \quad \theta_j \leftarrow \theta_j + \alpha \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}.$$
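A minimal sketch of this update as batch gradient ascent on $l(\theta)$ (the learning rate `alpha` and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_ga(X, y, alpha=0.1, num_iters=1000):
    """Fit logistic regression by batch gradient ascent on the log-likelihood.

    X: (m, n+1) design matrix with intercept column, y: (m,) labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # h_theta(x^{(i)}) for every example
        gradient = X.T @ (y - h)      # [sum_i (y^{(i)} - h_theta(x^{(i)})) x_j^{(i)}]_j
        theta += alpha * gradient     # ascend: theta_j <- theta_j + alpha * dl/dtheta_j
    return theta
```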
Newton’s method
Have $f$.
Want to find $\theta$ s.t. $f(\theta) = 0$.
→ Want to maximize $l(\theta)$, i.e., want $l'(\theta) = 0$.
Start from $\theta^{(0)}$.
$$\theta^{(1)} \leftarrow \theta^{(0)} - \Delta = \theta^{(0)} - \frac{f(\theta^{(0)})}{f'(\theta^{(0)})} \qquad \left(f'(\theta^{(0)}) = \frac{f(\theta^{(0)})}{\Delta} \;\Rightarrow\; \Delta = \frac{f(\theta^{(0)})}{f'(\theta^{(0)})}\right)$$
$$\cdots$$
$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}.$$
Letting $f(\theta) = l'(\theta)$, we obtain
$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}.$$
Quadratic convergence
0.01 error → 0.0001 error → 0.00000001 error → … (the number of correct significant digits roughly doubles with each iteration).
→ Newton's method typically requires only a few iterations.
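A quick numerical illustration of the scalar update and its quadratic convergence (a sketch; the test function $f(\theta) = \theta^2 - 2$ and the starting point are arbitrary):

```python
def newton(f, f_prime, theta0, num_iters=6):
    """Scalar Newton's method: theta <- theta - f(theta) / f'(theta)."""
    theta = theta0
    for t in range(num_iters):
        theta -= f(theta) / f_prime(theta)
        print(f"iteration {t + 1}: theta = {theta:.12f}, |f(theta)| = {abs(f(theta)):.2e}")
    return theta

# Find the root of f(theta) = theta^2 - 2 (i.e., sqrt(2)); the error roughly
# squares at each step, so the number of correct digits doubles per iteration.
newton(lambda th: th ** 2 - 2, lambda th: 2 * th, theta0=2.0)
```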
When $\theta$ is a vector ($\theta \in \mathbb{R}^{n+1}$):
$$\theta^{(t+1)} \leftarrow \theta^{(t)} - H^{-1} \nabla_\theta l$$
where $H \in \mathbb{R}^{(n+1) \times (n+1)}$ is the Hessian matrix, which satisfies
$$H_{ij} = \frac{\partial^2 l}{\partial \theta_i \partial \theta_j}.$$
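A sketch of this vectorized update for logistic regression, where the gradient is $\nabla_\theta l = X^T(y - h)$ and differentiating once more gives the Hessian $H = -X^T \operatorname{diag}\left(h^{(i)}(1 - h^{(i)})\right) X$ (names and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_newton(X, y, num_iters=10):
    """Fit logistic regression with Newton's method on the log-likelihood l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                      # nabla_theta l
        H = -(X.T * (h * (1 - h))) @ X            # Hessian of l (negative definite)
        theta = theta - np.linalg.solve(H, grad)  # theta <- theta - H^{-1} nabla_theta l
    return theta
```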