Lecture 02. Linear Regression & Gradient Descent

CS229: Machine Learning

Lecture video link: https://youtu.be/4b4MUYve_U8

Linear Regression

training set → learning algorithm → h: hypothesis

ex.

Housing price prediction

Modeling: size → h → price

h(x) = \sum_{j=0}^{2} \theta_j x_j

where x_0 = 1.
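
As a minimal sketch of this hypothesis in NumPy (the function name and the parameter/feature values below are hypothetical, chosen only to illustrate the x_0 = 1 intercept term):

```python
import numpy as np

def hypothesis(theta, x):
    """h(x) = sum_j theta_j * x_j, where x already has x_0 = 1 prepended."""
    return theta @ x

# Hypothetical parameters: [intercept, weight per sq. ft., weight per bedroom]
theta = np.array([50.0, 0.1, 20.0])
x = np.array([1.0, 2104.0, 3.0])  # x_0 = 1, x_1 = size, x_2 = # bedrooms
print(hypothesis(theta, x))       # 320.4 (a predicted price)
```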

Notations

\theta: parameters

m: # training examples (rows of the training set table)

x: "input" / features

y: "output" / target variable

(x, y): a training example

(x^{(i)}, y^{(i)}): the i^\text{th} training example

Choose \theta s.t. h(x) \simeq y for the training examples.

(cf. h_\theta(x) = h(x); abbreviated notation)

For linear regression, we want to minimize

J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2.
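
A small NumPy sketch of this cost function on a toy dataset (the data and the `cost` helper name are illustrative; each row of X starts with the intercept feature x_0 = 1):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])       # columns: x_0 = 1, x_1 = feature
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 1.0]), X, y))  # 0.0 -- this theta fits the data exactly
print(cost(np.array([0.0, 0.0]), X, y))  # 7.0 -- a worse theta
```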

Gradient Descent

Start with some \theta (say \theta = \vec{0}).

Keep changing θ\theta to reduce J(θ)J(\theta).

\theta_j \leftarrow \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)

(Here "\leftarrow" denotes assignment, as in a \leftarrow a + 1.)

where \alpha is the learning rate.

Computing the derivative for a single training example (x, y):

\begin{aligned}\frac{\partial}{\partial\theta_j}J(\theta)&=\frac{\partial}{\partial\theta_j}\frac{1}{2}\left(h_\theta(x)-y\right)^{2}\\&=\frac{1}{2}\cdot2\left(h_\theta(x)-y\right)\cdot\frac{\partial}{\partial\theta_j}\left(h_\theta(x)-y\right)\\&=\left(h_\theta(x)-y\right)\cdot\frac{\partial}{\partial\theta_j}\left(\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n-y\right)\\&=\left(h_\theta(x)-y\right)x_j.\end{aligned}

Hence, the algorithm becomes

Repeat until convergence:

\theta_j \leftarrow \theta_j - \alpha\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}

for j = 0, 1, \ldots, n.
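
A sketch of batch gradient descent implementing this update rule (the learning rate, iteration count, and toy data are illustrative choices, not from the lecture):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, num_iters=1000):
    """Repeat: theta_j <- theta_j - alpha * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y)  # sums over all m examples at once
        theta -= alpha * gradient
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(batch_gradient_descent(X, y))       # approximately [1, 1], i.e. y = 1 + x
```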

Batch gradient descent

Batch? … We treat the entire training set as one batch of data and process all of it for every update.

Disadvantage? … To make even a single update to the parameters, i.e., to take one step of gradient descent, you need to compute the sum on the right-hand side of the equation above. That means scanning through the entire dataset on every step, even when m is very large (100 million, etc.).

Stochastic gradient descent

You loop through i = 1 to m, taking a gradient descent step using the derivative of just a single example at a time.

Repeat {
   For i=1i = 1 to mm {
       \theta_j \leftarrow \theta_j - \alpha\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}_j for every j.
   }
}
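
A corresponding sketch of stochastic gradient descent, where each parameter update uses only a single example (again, the data, hyperparameters, and helper name are illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, num_epochs=500):
    """For each example i: theta <- theta - alpha * (h_theta(x^(i)) - y^(i)) * x^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in range(X.shape[0]):
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]  # update from one example only
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(stochastic_gradient_descent(X, y))  # close to [1, 1]
```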

Notation

\nabla_\theta J(\theta) = \begin{bmatrix} \frac{\partial J}{\partial\theta_0} \\ \vdots \\ \frac{\partial J}{\partial\theta_n} \end{bmatrix}

where \theta \in \mathbb{R}^{n+1}.

ex.

For A \in \mathbb{R}^{2\times2} and f: \mathbb{R}^{2\times2} \rightarrow \mathbb{R}, write A = \begin{bmatrix}A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix} and define f(A) = A_{11} + A_{12}^2.

Then,

\begin{aligned} \nabla_A f(A) &= \begin{bmatrix}\frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}}\end{bmatrix} \\ &= \begin{bmatrix} 1 & 2A_{12} \\ 0 & 0 \end{bmatrix}. \end{aligned}
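
One way to sanity-check a matrix gradient like this is a finite-difference approximation; here is a small sketch (not part of the lecture; `numerical_gradient` and the matrix entries are made up for illustration):

```python
import numpy as np

def f(A):
    return A[0, 0] + A[0, 1] ** 2  # f(A) = A_11 + A_12^2 (0-based indexing)

def numerical_gradient(func, A, eps=1e-6):
    """Approximate d func(A) / d A_ij by central differences."""
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            grad[i, j] = (func(A + E) - func(A - E)) / (2 * eps)
    return grad

A = np.array([[5.0, 1.0], [7.0, 3.0]])
print(numerical_gradient(f, A))    # ~ [[1, 2*A_12], [0, 0]] = [[1, 2], [0, 0]]
```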

Trace

\textnormal{tr}\,A := the sum of the diagonal entries of a square matrix A ("trace of A").

Characteristics

  • \textnormal{tr}\,A = \textnormal{tr}\,A^T
  • If f(A) = \textnormal{tr}\,AB, then \nabla_A f(A) = B^T
  • \textnormal{tr}\,ABC = \textnormal{tr}\,CAB
  • \nabla_A \textnormal{tr}\,AA^TC = CA + C^TA
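
These identities can be checked numerically with random matrices; a quick sketch (assuming NumPy; the `num_grad` helper and random test matrices are not from the lecture):

```python
import numpy as np

def num_grad(func, A, eps=1e-6):
    """Finite-difference approximation of d func(A) / d A_ij."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (func(A + E) - func(A - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

print(np.isclose(np.trace(A), np.trace(A.T)))                # tr A = tr A^T
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # tr ABC = tr CAB
print(np.allclose(num_grad(lambda M: np.trace(M @ B), A), B.T))                     # grad tr(AB) = B^T
print(np.allclose(num_grad(lambda M: np.trace(M @ M.T @ C), A), C @ A + C.T @ A))   # grad tr(AA^T C)
```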

Define

X = \begin{bmatrix}x^{{(1)}^{T}} \\ x^{{(2)}^{T}} \\ \vdots \\ x^{{(m)}^{T}} \end{bmatrix}\;\textnormal{and}\;\; y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)}\end{bmatrix}.

Then the loss function becomes

\begin{aligned} J(\theta) &= \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \\ &= \frac{1}{2}(X\theta-y)^T(X\theta-y). \end{aligned}

Hence,

\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2}(X\theta-y)^T(X\theta-y) \\ &= \frac{1}{2}\nabla_\theta (\theta^T X^T-y^T)(X\theta-y) \\ &= \frac{1}{2}\nabla_\theta (\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y) \\ &= \frac{1}{2} (X^T X\theta + X^T X\theta - X^T y - X^T y) \\ &= X^T X\theta - X^T y \\ &\stackrel{\mathclap{\textnormal{set}}}{=} \vec0. \end{aligned}

\therefore \theta = (X^T X)^{-1}X^T y.
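
A sketch of solving the normal equations numerically (solving X^T X \theta = X^T y with np.linalg.solve rather than forming the inverse explicitly, which is the usual numerical practice; the toy data is illustrative):

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = normal_equation(X, y)
print(theta)                      # [1, 1]
print(X.T @ X @ theta - X.T @ y)  # ~ [0, 0]: the gradient vanishes at this theta
```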
