Lecture video link: https://youtu.be/4b4MUYve_U8
Linear Regression
training set → learning algorithm → $h$: hypothesis
ex.
Housing price prediction
Modeling: size → $h$ → price
$$h(x) = \sum_{j=0}^{2} \theta_j x_j,$$
where $x_0 = 1$.
Notations
$\theta$: parameters
$m$: # training examples (# rows in the table above)
$x$: "input" / features
$y$: "output" / target variable
$(x, y)$: a training example
$(x^{(i)}, y^{(i)})$: the $i^\text{th}$ training example
Choose $\theta$ s.t. $h(x) \simeq y$ for the training examples.
(cf. $h_\theta(x) = h(x)$; abbreviated notation)
For linear regression, we want to minimize
$$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2.$$
Gradient Descent
Start with some $\theta$ (say $\theta=\vec{0}$).
Keep changing $\theta$ to reduce $J(\theta)$:
$$\theta_j \leftarrow \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta),$$
where $\alpha$ is the learning rate.
For a single training example, the partial derivative is
$$\begin{aligned}\frac{\partial}{\partial\theta_j}J(\theta)&=\frac{\partial}{\partial\theta_j}\frac{1}{2}\bigl(h_\theta(x)-y\bigr)^{2}\\&=\frac{1}{2}\cdot2\bigl(h_\theta(x)-y\bigr)\cdot\frac{\partial}{\partial\theta_j}\bigl(h_\theta(x)-y\bigr)\\&=\bigl(h_\theta(x)-y\bigr)\cdot\frac{\partial}{\partial\theta_j}\bigl(\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n-y\bigr)\\&=\bigl(h_\theta(x)-y\bigr)\,x_j.\end{aligned}$$
Hence, the algorithm becomes
Repeat until convergence:
$$\theta_j \leftarrow \theta_j-\alpha\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$$
for $j=0, 1, \cdots, n$ (updating every $\theta_j$ simultaneously).
Batch gradient descent
Batch? … All $m$ training examples are processed together as one batch of data at every update.
Disadvantage? … To make a single update to the parameters, i.e., to take even one step of gradient descent, you must compute the sum on the right-hand side of the update rule above. You have to scan through the entire dataset even when $m$ is very large (100 million, etc.).
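The batch update can be sketched in numpy. The synthetic data below (true intercept 2, slope 3), the learning rate, and all variable names are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Illustrative synthetic data: y = 2 + 3*x plus a little noise.
rng = np.random.default_rng(0)
m = 100
x1 = rng.uniform(0, 10, size=m)
y = 2.0 + 3.0 * x1 + rng.normal(0, 0.1, size=m)

# Design matrix with x_0 = 1 prepended, as in the notes.
X = np.column_stack([np.ones(m), x1])   # shape (m, n+1)
theta = np.zeros(X.shape[1])            # start with theta = 0

alpha = 1e-4                            # small, since the notes' rule has no 1/m factor
for _ in range(10_000):
    grad = X.T @ (X @ theta - y)        # sum_i (h_theta(x^(i)) - y^(i)) * x^(i), all j at once
    theta = theta - alpha * grad

print(theta)                            # should be close to [2, 3]
```

Note that each iteration touches all $m$ rows of `X`, which is exactly the disadvantage described above.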
Stochastic gradient descent
You loop through $i$ from $1$ to $m$, taking a gradient descent step using the derivative of just one training example at a time.
Repeat {
For $i = 1$ to $m$ {
$\theta_j \leftarrow \theta_j - \alpha\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x^{(i)}_j$ for every $j$.
}
}
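The stochastic loop can be sketched the same way (same kind of illustrative synthetic data; shuffling the example order each pass is a common refinement, not stated in the notes):

```python
import numpy as np

# Illustrative synthetic data: y = 2 + 3*x plus a little noise.
rng = np.random.default_rng(1)
m = 200
x1 = rng.uniform(0, 10, size=m)
y = 2.0 + 3.0 * x1 + rng.normal(0, 0.1, size=m)
X = np.column_stack([np.ones(m), x1])
theta = np.zeros(2)

alpha = 0.005
for _ in range(50):                      # "Repeat { ... }"
    for i in rng.permutation(m):         # "For i = 1 to m", in shuffled order
        err = X[i] @ theta - y[i]        # h_theta(x^(i)) - y^(i)
        theta = theta - alpha * err * X[i]   # updates every theta_j at once
```

Each step uses one example, so the parameters start improving immediately instead of after a full pass over the data; the trade-off is that $\theta$ oscillates around the minimum rather than converging exactly.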
Notation
$$\nabla_\theta J(\theta) = \begin{bmatrix} \frac{\partial J}{\partial\theta_0} \\ \vdots \\ \frac{\partial J}{\partial\theta_n} \end{bmatrix},$$
where $\theta \in \mathbb{R}^{n+1}$.
ex.
For $A \in \mathbb{R}^{2\times2}$ and $f: \mathbb{R}^{2\times2} \rightarrow \mathbb{R}$, define $A=\begin{bmatrix}A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix}$ and $f(A)=A_{11}+A_{12}^2$.
Then,
$$\begin{aligned} \nabla_A f(A) &= \begin{bmatrix}\frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}}\end{bmatrix} \\ &= \begin{bmatrix} 1 & 2A_{12} \\ 0 & 0 \end{bmatrix}. \end{aligned}$$
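This matrix gradient is easy to verify numerically with central finite differences; a small sketch (the specific matrix entries are arbitrary):

```python
import numpy as np

def f(A):
    # f(A) = A_11 + A_12^2, with zero-based indices in code
    return A[0, 0] + A[0, 1] ** 2

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])              # arbitrary test matrix
analytic = np.array([[1.0, 2 * A[0, 1]],
                     [0.0, 0.0]])

# Central finite difference for each partial derivative df/dA_ij.
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(2):
    for j in range(2):
        E = np.zeros_like(A)
        E[i, j] = eps
        numeric[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(numeric, analytic, atol=1e-6))   # True
```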
Trace
$\operatorname{tr} A$ := sum of the diagonal entries of a square matrix $A$; read "trace of $A$".
Characteristics
$\operatorname{tr} A = \operatorname{tr} A^T$
If $f(A) = \operatorname{tr} AB$,
then $\nabla_A f(A) = B^T$.
$\operatorname{tr} ABC = \operatorname{tr} CAB$ (the trace is invariant under cyclic permutations)
$\nabla_A \operatorname{tr} AA^TC = CA + C^TA$
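These four identities can be spot-checked numerically; a sketch with random $3\times3$ matrices (sizes and seed are arbitrary, and `num_grad` is a helper introduced here, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))   # arbitrary square matrices

# tr A = tr A^T
assert np.isclose(np.trace(A), np.trace(A.T))

# tr ABC = tr CAB (cyclic permutation)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))

def num_grad(g, A, eps=1e-6):
    """Central-difference gradient of a scalar function g w.r.t. matrix A."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (g(A + E) - g(A - E)) / (2 * eps)
    return G

# grad_A tr(AB) = B^T
assert np.allclose(num_grad(lambda M: np.trace(M @ B), A), B.T, atol=1e-5)

# grad_A tr(A A^T C) = CA + C^T A
assert np.allclose(num_grad(lambda M: np.trace(M @ M.T @ C), A),
                   C @ A + C.T @ A, atol=1e-4)

print("all four identities verified")
```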
Define
$$X = \begin{bmatrix}x^{(1)T} \\ x^{(2)T} \\ \vdots \\ x^{(m)T} \end{bmatrix}\quad\text{and}\quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)}\end{bmatrix}.$$
Then the loss function becomes
$$\begin{aligned} J(\theta) &= \frac{1}{2} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2 \\ &= \frac{1}{2}(X\theta-y)^T(X\theta-y). \end{aligned}$$
Hence,
$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2}(X\theta-y)^T(X\theta-y) \\ &= \frac{1}{2}\nabla_\theta (\theta^T X^T-y^T)(X\theta-y) \\ &= \frac{1}{2}\nabla_\theta \bigl(\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y\bigr) \\ &= \frac{1}{2} \bigl(X^T X\theta + X^T X\theta - X^T y - X^T y\bigr) \\ &= X^T X\theta - X^T y \stackrel{\text{set}}{=} \vec{0}. \end{aligned}$$
$$\therefore\ \theta = (X^T X)^{-1}X^T y.$$
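This closed-form solution (the normal equation) can be checked against the same kind of illustrative synthetic data used for gradient descent. Note that solving the linear system $X^TX\theta = X^Ty$ is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Illustrative synthetic data: true parameters [2, 3] plus a little noise.
rng = np.random.default_rng(0)
m = 100
x1 = rng.uniform(0, 10, size=m)
y = 2.0 + 3.0 * x1 + rng.normal(0, 0.1, size=m)
X = np.column_stack([np.ones(m), x1])

# Normal equation: theta = (X^T X)^{-1} X^T y,
# computed by solving X^T X theta = X^T y rather than inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # should be close to [2, 3]
```

Unlike gradient descent, this finds the minimizer in one step, at the cost of solving an $(n{+}1)\times(n{+}1)$ system, which becomes expensive when the number of features is large.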