Lecture video link: https://youtu.be/het9HFqo1TQ
Outline
Linear Regression (recap)
Locally Weighted Regression
Probabilistic Interpretation
Logistic Regression
Newton’s Method
Recap
$(x^{(i)}, y^{(i)})$: $i^{\text{th}}$ example
$x^{(i)} \in \mathbb{R}^{n+1}$, $y^{(i)} \in \mathbb{R}$, $x_0 = 1$
$m$: # examples
$n$: # features
$h_\theta(x) = \sum\limits_{j=0}^{n} \theta_j x_j = \theta^T x$
$J(\theta) = \frac{1}{2} \sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
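For concreteness, here is how the hypothesis and cost above might look in code (a NumPy sketch; the design matrix is assumed to carry the intercept column $x_0 = 1$, and the names are illustrative):

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, evaluated for every row x^{(i)} of the design matrix X."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = (1/2) * sum_i (h_theta(x^{(i)}) - y^{(i)})^2."""
    return 0.5 * np.sum((hypothesis(theta, X) - y) ** 2)
```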
Locally weighted regression
“Parametric” learning algorithm:
Fit a fixed set of parameters ($\theta_i$) to data.
“Non-parametric” learning algorithm:
The amount of data/parameters you need to keep grows (linearly) with the size of the data set.
To evaluate $h$ at a certain $x$:
Fit $\theta$ to minimize $\frac{1}{2} \sum\limits_{i=1}^{m} \left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^2$.
Return $\theta^T x$.
Locally weighted regression
Fit $\theta$ to minimize $\sum\limits_{i=1}^m w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$
where $w^{(i)}$ is a “weighting” function, $w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$.
$\tau$: bandwidth
If $|x^{(i)} - x|$ is small, $w^{(i)} \approx 1$.
If $|x^{(i)} - x|$ is large, $w^{(i)} \approx 0$.
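A minimal sketch of this procedure, assuming a design matrix with an intercept column and using the closed-form weighted normal equation; `tau` and the query point in the usage example are arbitrary illustrative choices:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.8):
    """Predict at a single query point with locally weighted linear regression.

    X: (m, n+1) design matrix with x_0 = 1, y: (m,) targets,
    x_query: (n+1,) query point (with leading 1), tau: bandwidth.
    """
    # Gaussian weights w^{(i)} = exp(-||x^{(i)} - x||^2 / (2 tau^2))
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * tau ** 2))

    # Fit theta to minimize sum_i w^{(i)} (y^{(i)} - theta^T x^{(i)})^2
    # via the weighted normal equation theta = (X^T W X)^{-1} X^T W y.
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ x_query

# Usage: noisy sine data, predict near x = 1.5.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + 0.1 * rng.standard_normal(50)
print(lwr_predict(X, y, np.array([1.0, 1.5])))
```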
Why least squares?
Assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$
where $\epsilon^{(i)}$ captures unmodeled effects (e.g., one day the seller is in an unusually good or bad mood, so the price ends up higher or lower) and random noise.
Assume the $\epsilon^{(i)}$ have a Gaussian distribution, i.e., $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, which means $p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$.
(Why Gaussian? Because of the central limit theorem: in many situations, when independent random variables are summed, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.)
(Source: https://en.wikipedia.org/wiki/Central_limit_theorem )
Then we have i.i.d. (independent and identically distributed) noise around $\theta^T x$, which means
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2}\right), \quad \text{i.e.,}\;\; y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2),$$
where the “$;$” symbol is read as “parameterized by”.
(cf. $\theta$ is not a random variable, so we write “$;\theta$” rather than conditioning on it the way we do with $x^{(i)}$.)
Likelihood of $\theta$
$$\begin{aligned} L(\theta) &= p(y \mid x; \theta) \\ &= \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}; \theta) \\ &= \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right). \end{aligned}$$
Probability vs. Likelihood
View $p(y \mid x; \theta)$ as …
… a function of the parameters $\theta$, holding the data $(x, y)$ fixed → likelihood.
… a function of the data $(x, y)$, holding the parameters $\theta$ fixed → probability.
Log likelihood
$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\ &= m \log \frac{1}{\sqrt{2\pi}\sigma} + \sum_{i=1}^m \left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right). \end{aligned}$$
MLE (Maximum Likelihood Estimation)
Choose $\theta$ to maximize $L(\theta)$, or equivalently $l(\theta)$.
Since the first term of $l(\theta)$ does not depend on $\theta$, this amounts to choosing $\theta$ to minimize $\frac{1}{2}\sum\limits_{i=1}^m \left(y^{(i)} - \theta^T x^{(i)}\right)^2 = J(\theta)$.
$\therefore$ Choosing the value of $\theta$ that minimizes the least-squares error
= finding the maximum likelihood estimate of the parameters $\theta$, under the assumption we made that the error terms are Gaussian and i.i.d.
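A small numeric check of this equivalence (a sketch with made-up synthetic data; `sigma` and the perturbation `delta` are arbitrary choices): the negative log-likelihood equals a constant plus $J(\theta)/\sigma^2$, so both objectives are minimized by the same $\theta$.

```python
import numpy as np

def neg_log_likelihood(theta, X, y, sigma=1.0):
    """-l(theta) = m*log(sqrt(2*pi)*sigma) + sum_i (y^{(i)} - theta^T x^{(i)})^2 / (2*sigma^2)."""
    residuals = y - X @ theta
    return len(y) * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(residuals ** 2) / (2 * sigma ** 2)

def least_squares_cost(theta, X, y):
    """J(theta) = (1/2) * sum_i (y^{(i)} - theta^T x^{(i)})^2."""
    return 0.5 * np.sum((y - X @ theta) ** 2)

# Synthetic data from y = 1 + 2x + noise; perturb the least-squares solution and
# observe that both objectives are smallest at the same theta (delta = 0).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 5, 100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 100)
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
for delta in (-0.1, 0.0, 0.1):
    theta = theta_ls + delta
    print(delta, least_squares_cost(theta, X, y), neg_log_likelihood(theta, X, y))
```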
Classification
$y \in \{0, 1\}$ (binary classification).
Logistic regression
Want $h_\theta(x) \in [0, 1]$.
$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$.
$g(z) = \frac{1}{1 + e^{-z}}$: “sigmoid” or “logistic” function.
$$p(y = 1 \mid x; \theta) = h_\theta(x) \\ p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$
$\Rightarrow p(y \mid x; \theta) = h(x)^y \left(1 - h(x)\right)^{1-y}$.
Then for the log-likelihood we obtain
$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right). \end{aligned}$$
Choose $\theta$ to maximize $l(\theta)$.
$$\theta_j \leftarrow \theta_j + \alpha \frac{\partial}{\partial \theta_j} l(\theta), \quad \text{i.e.,} \quad \theta_j \leftarrow \theta_j + \alpha \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}.$$
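A minimal sketch of this update as batch gradient ascent on $l(\theta)$ (the learning rate `alpha` and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_ga(X, y, alpha=0.1, num_iters=1000):
    """Fit logistic regression by batch gradient ascent on the log-likelihood.

    X: (m, n+1) design matrix with intercept column, y: (m,) labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # h_theta(x^{(i)}) for every example
        gradient = X.T @ (y - h)      # [sum_i (y^{(i)} - h_theta(x^{(i)})) x_j^{(i)}]_j
        theta += alpha * gradient     # ascend: theta_j <- theta_j + alpha * dl/dtheta_j
    return theta
```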
Newton’s method
Have $f$.
Want to find $\theta$ s.t. $f(\theta) = 0$.
→ Want to maximize $l(\theta)$, i.e., want $l'(\theta) = 0$.
Start from $\theta^{(0)}$.
$$\theta^{(1)} \leftarrow \theta^{(0)} - \Delta = \theta^{(0)} - \frac{f(\theta^{(0)})}{f'(\theta^{(0)})} \qquad \left(f'(\theta^{(0)}) = \frac{f(\theta^{(0)})}{\Delta} \;\Rightarrow\; \Delta = \frac{f(\theta^{(0)})}{f'(\theta^{(0)})}\right)$$
$$\cdots$$
$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}.$$
Letting $f(\theta) = l'(\theta)$, we obtain
$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}.$$
Quadratic convergence
0.01 error → 0.0001 error → 0.00000001 error → … (the number of correct significant digits roughly doubles with each iteration).
→ Newton's method typically requires only a few iterations.
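A quick numerical illustration of the scalar update and its quadratic convergence (a sketch; the test function $f(\theta) = \theta^2 - 2$ and the starting point are arbitrary):

```python
def newton(f, f_prime, theta0, num_iters=6):
    """Scalar Newton's method: theta <- theta - f(theta) / f'(theta)."""
    theta = theta0
    for t in range(num_iters):
        theta -= f(theta) / f_prime(theta)
        print(f"iteration {t + 1}: theta = {theta:.12f}, |f(theta)| = {abs(f(theta)):.2e}")
    return theta

# Find the root of f(theta) = theta^2 - 2 (i.e., sqrt(2)); the error roughly
# squares at each step, so the number of correct digits doubles per iteration.
newton(lambda th: th ** 2 - 2, lambda th: 2 * th, theta0=2.0)
```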
When $\theta$ is a vector ($\theta \in \mathbb{R}^{n+1}$):
$$\theta^{(t+1)} \leftarrow \theta^{(t)} - H^{-1} \nabla_\theta l$$
where $H \in \mathbb{R}^{(n+1) \times (n+1)}$ is the Hessian matrix, which satisfies
$$H_{ij} = \frac{\partial^2 l}{\partial \theta_i \partial \theta_j}.$$
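A sketch of this vectorized update for logistic regression, where the gradient is $\nabla_\theta l = X^T(y - h)$ and differentiating once more gives the Hessian $H = -X^T \operatorname{diag}\left(h^{(i)}(1 - h^{(i)})\right) X$ (names and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_newton(X, y, num_iters=10):
    """Fit logistic regression with Newton's method on the log-likelihood l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                      # nabla_theta l
        H = -(X.T * (h * (1 - h))) @ X            # Hessian of l (negative definite)
        theta = theta - np.linalg.solve(H, grad)  # theta <- theta - H^{-1} nabla_theta l
    return theta
```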