Lecture 02. Linear Regression & Gradient Descent

CS229: Machine Learning

Lecture video link: https://youtu.be/4b4MUYve_U8

Linear Regression

training set → learning algorithm → h: hypothesis

ex.

Housing price prediction

Modeling: size → h → price

h(x) = \sum_{j=0}^{2} \theta_j x_j

where x_0 = 1.
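
As a minimal sketch of this hypothesis in NumPy (the function name and the parameter/feature values below are hypothetical, chosen only to illustrate the x_0 = 1 intercept term):

```python
import numpy as np

def hypothesis(theta, x):
    """h(x) = sum_j theta_j * x_j, where x already has x_0 = 1 prepended."""
    return theta @ x

# Hypothetical parameters: [intercept, weight per sq. ft., weight per bedroom]
theta = np.array([50.0, 0.1, 20.0])
x = np.array([1.0, 2104.0, 3.0])  # x_0 = 1, x_1 = size, x_2 = # bedrooms
print(hypothesis(theta, x))       # 320.4 (a predicted price)
```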

Notations

\theta: parameters

m: # training examples (rows of the training set table)

x: "input" / features

y: "output" / target variable

(x, y): a training example

(x^{(i)}, y^{(i)}): the i^\text{th} training example

Choose \theta s.t. h(x) \simeq y for the training examples.

(cf. h_\theta(x) = h(x); abbreviated notation)

For linear regression, we want to minimize

J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2.
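
A small NumPy sketch of this cost function on a toy dataset (the data and the `cost` helper name are illustrative; each row of X starts with the intercept feature x_0 = 1):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])       # columns: x_0 = 1, x_1 = feature
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 1.0]), X, y))  # 0.0 -- this theta fits the data exactly
print(cost(np.array([0.0, 0.0]), X, y))  # 7.0 -- a worse theta
```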

Gradient Descent

Start with some \theta (say \theta = \vec{0}).

Keep changing θ\theta to reduce J(θ)J(\theta).

\theta_j \leftarrow \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)

(Here "\leftarrow" denotes assignment, as in a \leftarrow a + 1.)

where \alpha is the learning rate.

Computing the derivative for a single training example (x, y):

\begin{aligned}\frac{\partial}{\partial\theta_j}J(\theta)&=\frac{\partial}{\partial\theta_j}\frac{1}{2}\left(h_\theta(x)-y\right)^{2}\\&=\frac{1}{2}\cdot2\left(h_\theta(x)-y\right)\cdot\frac{\partial}{\partial\theta_j}\left(h_\theta(x)-y\right)\\&=\left(h_\theta(x)-y\right)\cdot\frac{\partial}{\partial\theta_j}\left(\theta_0x_0+\theta_1x_1+\cdots+\theta_nx_n-y\right)\\&=\left(h_\theta(x)-y\right)x_j.\end{aligned}

Hence, the algorithm becomes

Repeat until convergence:

\theta_j \leftarrow \theta_j - \alpha\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}

for j = 0, 1, \ldots, n.
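
A sketch of batch gradient descent implementing this update rule (the learning rate, iteration count, and toy data are illustrative choices, not from the lecture):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, num_iters=1000):
    """Repeat: theta_j <- theta_j - alpha * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y)  # sums over all m examples at once
        theta -= alpha * gradient
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(batch_gradient_descent(X, y))       # approximately [1, 1], i.e. y = 1 + x
```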

Batch gradient descent

Batch? … We treat the entire training set as one batch of data and process all of it for every update.

Disadvantage? … To make even a single update to the parameters, i.e., to take one step of gradient descent, you need to compute the sum on the right-hand side of the equation above. That means scanning through the entire dataset on every step, even when m is very large (100 million, etc.).

Stochastic gradient descent

You loop through i = 1 to m, taking a gradient descent step using the derivative of just a single example at a time.

Repeat {
   For i=1i = 1 to mm {
       \theta_j \leftarrow \theta_j - \alpha\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}_j for every j.
   }
}
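
A corresponding sketch of stochastic gradient descent, where each parameter update uses only a single example (again, the data, hyperparameters, and helper name are illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, num_epochs=500):
    """For each example i: theta <- theta - alpha * (h_theta(x^(i)) - y^(i)) * x^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in range(X.shape[0]):
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]  # update from one example only
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(stochastic_gradient_descent(X, y))  # close to [1, 1]
```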

Notation

\nabla_\theta J(\theta) = \begin{bmatrix} \frac{\partial J}{\partial\theta_0} \\ \vdots \\ \frac{\partial J}{\partial\theta_n} \end{bmatrix}

where \theta \in \mathbb{R}^{n+1}.

ex.

For A \in \mathbb{R}^{2\times2} and f: \mathbb{R}^{2\times2} \rightarrow \mathbb{R}, write A = \begin{bmatrix}A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix} and define f(A) = A_{11} + A_{12}^2.

Then,

\begin{aligned} \nabla_A f(A) &= \begin{bmatrix}\frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}}\end{bmatrix} \\ &= \begin{bmatrix} 1 & 2A_{12} \\ 0 & 0 \end{bmatrix}. \end{aligned}
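
One way to sanity-check a matrix gradient like this is a finite-difference approximation; here is a small sketch (not part of the lecture; `numerical_gradient` and the matrix entries are made up for illustration):

```python
import numpy as np

def f(A):
    return A[0, 0] + A[0, 1] ** 2  # f(A) = A_11 + A_12^2 (0-based indexing)

def numerical_gradient(func, A, eps=1e-6):
    """Approximate d func(A) / d A_ij by central differences."""
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            grad[i, j] = (func(A + E) - func(A - E)) / (2 * eps)
    return grad

A = np.array([[5.0, 1.0], [7.0, 3.0]])
print(numerical_gradient(f, A))    # ~ [[1, 2*A_12], [0, 0]] = [[1, 2], [0, 0]]
```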

Trace

\textnormal{tr}\,A := the sum of the diagonal entries of a square matrix A ("trace of A").

Characteristics

  • \textnormal{tr}\,A = \textnormal{tr}\,A^T
  • If f(A) = \textnormal{tr}\,AB, then \nabla_A f(A) = B^T
  • \textnormal{tr}\,ABC = \textnormal{tr}\,CAB
  • \nabla_A \textnormal{tr}\,AA^TC = CA + C^TA
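
These identities can be checked numerically with random matrices; a quick sketch (assuming NumPy; the `num_grad` helper and random test matrices are not from the lecture):

```python
import numpy as np

def num_grad(func, A, eps=1e-6):
    """Finite-difference approximation of d func(A) / d A_ij."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (func(A + E) - func(A - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

print(np.isclose(np.trace(A), np.trace(A.T)))                # tr A = tr A^T
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # tr ABC = tr CAB
print(np.allclose(num_grad(lambda M: np.trace(M @ B), A), B.T))                     # grad tr(AB) = B^T
print(np.allclose(num_grad(lambda M: np.trace(M @ M.T @ C), A), C @ A + C.T @ A))   # grad tr(AA^T C)
```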

Define

X = \begin{bmatrix}x^{{(1)}^{T}} \\ x^{{(2)}^{T}} \\ \vdots \\ x^{{(m)}^{T}} \end{bmatrix}\;\textnormal{and}\;\; y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)}\end{bmatrix}.

Then the loss function becomes

\begin{aligned} J(\theta) &= \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \\ &= \frac{1}{2}(X\theta-y)^T(X\theta-y). \end{aligned}

Hence,

\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2}(X\theta-y)^T(X\theta-y) \\ &= \frac{1}{2}\nabla_\theta (\theta^T X^T-y^T)(X\theta-y) \\ &= \frac{1}{2}\nabla_\theta (\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y) \\ &= \frac{1}{2} (X^T X\theta + X^T X\theta - X^T y - X^T y) \\ &= X^T X\theta - X^T y \\ &\stackrel{\mathclap{\textnormal{set}}}{=} \vec0. \end{aligned}

\therefore \theta = (X^T X)^{-1}X^T y.
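
A sketch of solving the normal equations numerically (solving X^T X \theta = X^T y with np.linalg.solve rather than forming the inverse explicitly, which is the usual numerical practice; the toy data is illustrative):

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = normal_equation(X, y)
print(theta)                      # [1, 1]
print(X.T @ X @ theta - X.T @ y)  # ~ [0, 0]: the gradient vanishes at this theta
```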
