Week 2 | Neural Networks Basics | Logistic Regression as a Neural Network

Hyungseop Lee · June 19, 2023

Logistic Regression sigmoid

[Coursera | DL Specialization | 1 ] Neural Networks and DeepLearning]


Binary Classification

Logistic Regression is an algorithm for Binary Classification.
An example of a Binary Classification problem :
- Suppose we have an input image.
  The output label $y$ is $1$ if a cat is recognized in the image and $0$ otherwise.
- Let's look at how a computer represents this image.
  The computer stores the image as three matrices, one for each of the Red, Green, and Blue color channels.
  If the input image is $[64 \times 64]$ pixels,
  the matrices holding its RGB pixel intensity values have a combined shape of $[3 \times 64 \times 64]$.
- To turn the pixel intensity values in the $[3 \times 64 \times 64]$ matrices
  into a feature vector $x$ ,
  we unroll all of the pixel intensity values into a single vector.
  The feature vector $x$ then has dimension $[12,288 \times 1]$ . (3 x 64 x 64 = 12,288)
  We write the dimension of the input feature $x$ as $n = n_x = 12,288$ .
- So the goal of Binary Classification is
  to learn a classifier that takes an image represented by a feature vector $x$ as input
  and predicts whether the corresponding label $y$ is 1 or 0,
  that is, whether the image is a cat image or not.
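
As a rough illustration of the unrolling step, here is a minimal NumPy sketch. The image is filled with random intensities (an assumption purely for demonstration), and `reshape(-1, 1)` is one possible way to flatten it; the order in which pixels land in the vector is not specified in these notes.

```python
import numpy as np

# Hypothetical 64 x 64 RGB image with random pixel intensities in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(3, 64, 64))

# Unroll all pixel intensity values into a single column vector x.
x = image.reshape(-1, 1)
n_x = x.shape[0]

print(x.shape)  # (12288, 1)
```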

Notation

A single training example is represented by a pair $(x, y)$ :
- $x \in \mathbb{R}^{n_x}$
- $y \in \{0, 1\}$
- $m$ training examples
- first training example : ( $x^{(1)}$ , $y^{(1)}$ )
- second training example : ( $x^{(2)}$ , $y^{(2)}$ )
- last training example : ( $x^{(m)}$ , $y^{(m)}$ )
- training set : { $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ... , (x^{(m)}, y^{(m)})$ }
- To emphasize the number of training examples : $m = m_{train}$
- To stack all of the training examples into matrices : $X$ , $Y$ (column $i$ of $X$ is $x^{(i)}$ , column $i$ of $Y$ is $y^{(i)}$ )
  $n_x$ : size of one image (number of input features)
  $m$ : total number of images
- Python Command
  - X.shape = ( $n_x, m$ )
  - Y.shape = ( $1, m$ )
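
The shapes above can be checked with a small sketch. The data here is random and `m = 5` is an arbitrary choice for illustration; stacking with `np.hstack` is just one way to build the matrices.

```python
import numpy as np

n_x, m = 12288, 5  # m = 5 is an arbitrary illustrative choice
rng = np.random.default_rng(0)

# Hypothetical training set: m feature vectors of shape (n_x, 1) with labels in {0, 1}.
examples = [rng.random((n_x, 1)) for _ in range(m)]
labels = [0, 1, 1, 0, 1]

X = np.hstack(examples)             # column i holds the i-th training example
Y = np.array(labels).reshape(1, m)  # row vector of labels

print(X.shape)  # (12288, 5)
print(Y.shape)  # (1, 5)
```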

Logistic Regression

Logistic Regression is an algorithm that you use when the output labels $y$ in a supervised learning problem are all either zero or one,
that is, for binary classification problems.

input feature vector $x$ ( $x \in R^{n_x}$ ):
an image that you want to recognize as either a cat picture or not a cat picture.
Parameters( $w \in R^{n_x}$ , $b \in R$ )
Ground Truth label $y$
estimate of $y$ = $\hat y$ = $P(y=1 | x)$ :
the chance that this is a cat picture

Notation

How do we generate the output $\hat{y}$ ?

$\hat{y}=w^Tx + b$ :
This is not a very good algorithm for binary classification.
We want $\hat{y}$ to be the probability that $y=1$ ,
so $\hat{y}$ should take a value between 0 and 1.
But $w^Tx + b$ can be greater than 1 or negative,
so it is not a valid probability.
$\hat{y} = \sigma(w^Tx + b) = \sigma(z)$
: So we instead use the value obtained by applying the sigmoid function $\sigma()$ to $w^Tx + b$ .

(Note) : Other textbooks sometimes use a different notation,
$\theta$ = { $\theta_0$ , $\theta_1$ , ..., $\theta_{n_x}$ } ➡️ $w$ = { $\theta_1$ , ..., $\theta_{n_x}$ }, $b$ = $\theta_0$ .

Sigmoid Function

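The sigmoid function $\sigma(z) = \frac{1}{1+e^{-z}}$ is an S-shaped curve that rises from 0 to 1 and crosses 0.5 at $z=0$ .

$z = w^Tx + b$
- if $z$ is a large negative number, $\sigma(z) \simeq 0$
- if $z$ is a large positive number, $\sigma(z) \simeq 1$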
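
A minimal sketch of this behaviour, assuming made-up parameters `w`, `b` and input `x` with three features (none of these values come from the course):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and input with n_x = 3 features.
w = np.array([[2.0], [-1.0], [0.5]])
b = 4.0
x = np.array([[1.0], [3.0], [-2.0]])

z = (w.T @ x).item() + b  # linear output; not restricted to [0, 1]
y_hat = sigmoid(z)        # squashed into the interval (0, 1)

print(z, y_hat)
print(sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0))
```

Here the linear output `z` is 2.0, which cannot be a probability, while `y_hat` always falls strictly between 0 and 1.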

Cost Function

To train the parameters $w$ , $b$ of the logistic regression model,
we need to define a cost function.
$\hat{y} = \sigma(w^Tx + b)$ , where $\sigma(z) = \frac{1}{1+e^{-z}}$
Given { $(x^{(1)}, y^{(1)})$ ,..., $(x^{(m)}, y^{(m)})$ }, we want $\hat{y}^{(i)} \simeq y^{(i)}$
Loss Function = Error Function :
it measures how well you're doing on a single training example.
1. Squared Error : $L(\hat{y}, y)=\frac{1}{2}(\hat{y}-y)^2$
  ➡️ Not usually used in logistic regression,
  because it makes the optimization problem we will study later non-convex.
2. $L(\hat{y}, y)=-(y\log\hat{y} + (1-y)\log(1-\hat{y}))$
  if $y=1$ : $L(\hat{y}, y)=-\log(\hat{y})$ ➡️ want $\log\hat{y}$ large, want $\hat{y}$ large (max 1)
  if $y=0$ : $L(\hat{y}, y)=-\log(1-\hat{y})$ ➡️ want $\log(1-\hat{y})$ large, want $\hat{y}$ small (min 0)
Cost Function :
it measures how well you're doing on the entire training set.

$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})]$

In training our logistic regression model,
we're going to try to find parameters $w$ and $b$
that minimize the overall Cost Function $J$ .
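
To make the loss and cost concrete, here is a small sketch. The predictions and labels are invented for illustration, and the function names `loss` and `cost` are assumptions of this example, not names used in the course.

```python
import numpy as np

def loss(y_hat, y):
    """Loss for a single example (or element-wise for arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Average of the per-example losses over the m training examples."""
    m = Y.shape[1]
    return np.sum(loss(Y_hat, Y)) / m

# Hypothetical predictions and labels for m = 4 examples, shape (1, m).
Y = np.array([[1, 0, 1, 0]])
Y_hat = np.array([[0.9, 0.1, 0.6, 0.4]])

print(loss(0.9, 1))   # small loss: prediction close to the label
print(loss(0.1, 1))   # large loss: prediction far from the label
print(cost(Y_hat, Y))
```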

Gradient Descent

recap :
$\hat{y}=\sigma(w^Tx + b)$ , $\sigma(z)=\frac{1}{1+e^{-z}}$
$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})]$
Want to find $w$ , $b$ that minimize $J(w, b)$

Convex Function

convex function : simply put, a bowl-shaped function.
(If you connect any two points on the function with a line segment,
every point on that segment lies on or above the function.)
Assumption : Cost Function $J$ is a convex function
- initialize $w$ and $b$ :
  For logistic regression, almost any initialization method works.
  
  Usually they are initialized to 0.
  
  Random initialization also works,
  but because $J$ is convex, gradient descent reaches the same minimum
  no matter where it starts, so random initialization is not usually used.
Gradient descent algorithm : used to train (learn) the parameters $w$ and $b$ on our training set.
- There's a function $J(w, b)$ that we want to minimize. ( $b$ is omitted to make it easier to draw.)
  Repeatedly carry out the following update :
  ➡️ $w = w - \alpha \frac{\partial J(w, b)}{\partial w}$ ( $\alpha$ : learning rate)
  ➡️ $b = b - \alpha \frac{\partial J(w, b)}{\partial b}$ ( $\alpha$ : learning rate)
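
The update rule can be tried on a toy problem. The sketch below uses a hypothetical one-parameter convex function $J(w) = (w-3)^2$ as a stand-in (it is not the logistic regression cost), with an assumed learning rate of 0.1, to show that two different starting points end up at the same minimum.

```python
def J(w):
    """Hypothetical convex function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative of J with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w_init, alpha=0.1, num_iterations=200):
    w = w_init
    for _ in range(num_iterations):
        w = w - alpha * dJ_dw(w)  # repeated update: w = w - alpha * dJ/dw
    return w

w_from_zero = gradient_descent(0.0)
w_from_far = gradient_descent(10.0)
print(w_from_zero, w_from_far)  # both end up very close to 3
```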

Derivatives

Derivative just means the slope of a function.
formal definition : if the slope is 3, $f(a)$ goes up three times as much as
whatever tiny, tiny, tiny amount (infinitesimal amount) we nudge $a$ to the right.
In the examples below, $a$ is nudged by 0.001 along the $a$ axis for intuition,
but the actual definition of a derivative uses an infinitely small nudge (infinitesimal amount).
- Derivative
  = Slope of the function
  = $\frac{height}{width}$
  = $\frac{df(a)}{da}$ ( $da$ : infinitesimal amount)

More Derivative Examples

On a straight line, the function's derivative doesn't change.
The slope of a function can be different at different points on the curve.

$f(a)=a^2$
formula : $\frac{df(a)}{da} = 2a$
- $a=2, f(a)=4$
  $a=2.001, f(a) = 4.004001$
  (The difference of 0.001 between 2 and 2.001 is not infinitesimally small, so the result differs slightly from the value given by the formula.)

$f(a)=a^3$
formula : $\frac{df(a)}{da} = 3a^2$
- by formula, $a=2, \frac{df(a)}{da}=12$
  we can see that the formula is correct.
  $a=2, f(a) = 8$
  $a=2.001, f(a) \simeq 8.012$
  ➡️ $0.001 * 12 = 0.012$
$f(a) = log_e(a) = ln(a)$
formula : $\frac{df(a)}{da} = \frac{1}{a}$
- by formula, $a=2, \frac{df(a)}{da}=\frac{1}{2}= 0.5$
  we can see that the formula is correct.
  $a=2, f(a) \simeq 0.69315$
  $a=2.001, f(a) \simeq 0.69365$
  ➡️ $0.001 * 0.5 = 0.0005$
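
These checks can be reproduced numerically. The following is a small sketch, where `numerical_slope` is a helper name made up for this example; it computes height / width for a small nudge and compares it with each formula at $a = 2$.

```python
import math

def numerical_slope(f, a, nudge=0.001):
    """Slope = height / width for a small (not infinitesimal) nudge of a."""
    return (f(a + nudge) - f(a)) / nudge

examples = [
    ("a^2",   lambda a: a ** 2, lambda a: 2 * a),
    ("a^3",   lambda a: a ** 3, lambda a: 3 * a ** 2),
    ("ln(a)", math.log,         lambda a: 1 / a),
]

a = 2.0
for name, f, derivative in examples:
    print(name, numerical_slope(f, a), derivative(a))

# Shrinking the nudge moves the numerical slope toward the formula's value.
coarse = numerical_slope(lambda a: a ** 2, a, nudge=0.001)
fine = numerical_slope(lambda a: a ** 2, a, nudge=0.000001)
print(coarse, fine)
```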

Computation Graph

The computations of a neural network are organized in terms of a
forward pass (= forward propagation step), in which we compute the output of the neural network,
followed by a backward pass (= backward propagation step), which we use to compute gradients or derivatives.
The computation graph explains why it is organized this way.
Simple example :
Suppose we have a function $J(a, b, c) = 3(a+bc)$ of the variables $a, b, c$ .
- $J(a, b, c) = 3(a + bc)$ can be broken down into three steps.
1. $u = bc$
2. $v = a+u$
3. $J = 3v$
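
The three steps map directly onto a forward pass. Here is a minimal sketch; the input values $a=5$, $b=3$, $c=2$ are chosen only for illustration.

```python
def forward(a, b, c):
    """Forward pass through the computation graph of J = 3(a + bc)."""
    u = b * c
    v = a + u
    J = 3 * v
    return u, v, J

u, v, J = forward(a=5, b=3, c=2)
print(u, v, J)  # 6 11 33
```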

Derivatives with a Computation Graph

Key Point :
When computing all of these derivatives,
the most efficient way to do so is through a right-to-left computation
following the direction of the red arrows (from $J$ back toward the inputs).
1. First, compute the derivative with respect to $v$ .
2. That result is then useful for computing the derivatives with respect to $a$ and $u$ .
3. Those in turn are useful for computing the derivatives with respect to $b$ and $c$ .
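
As a sketch of this right-to-left order for $J = 3(a + bc)$, the code below runs the forward pass and then reuses each derivative via the chain rule. The function name and the input values are assumptions for illustration.

```python
def forward_backward(a, b, c):
    # Forward pass (left to right).
    u = b * c
    v = a + u
    J = 3 * v

    # Backward pass (right to left), reusing each result via the chain rule.
    dJ_dv = 3.0
    dJ_da = dJ_dv * 1.0  # v = a + u  ->  dv/da = 1
    dJ_du = dJ_dv * 1.0  # v = a + u  ->  dv/du = 1
    dJ_db = dJ_du * c    # u = b * c  ->  du/db = c
    dJ_dc = dJ_du * b    # u = b * c  ->  du/dc = b
    return J, {"a": dJ_da, "b": dJ_db, "c": dJ_dc}

J, grads = forward_backward(a=5.0, b=3.0, c=2.0)
print(J, grads)
```

With these inputs the derivatives with respect to $a$, $b$, $c$ come out as 3, 6, and 9.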

Logistic Regression Gradient Descent

Let's look at how to perform Gradient Descent for Logistic Regression
on a single training example.

$z = w^Tx + b$
$\hat{y} = a = \sigma(z)$ : output of logistic regression
$L(a, y) = -(y\log(a) + (1-y)\log(1-a))$ : loss, where $y$ is the ground truth label
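
Applying the chain rule backward through these three equations gives $\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$ , $\frac{\partial L}{\partial z} = a - y$ , $\frac{\partial L}{\partial w_j} = x_j (a - y)$ and $\frac{\partial L}{\partial b} = a - y$ . The sketch below assumes two features and made-up values for the parameters, the input, and the learning rate, and takes one gradient descent step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def single_example_gradients(w1, w2, b, x1, x2, y):
    # Forward pass.
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))

    # Backward pass via the chain rule.
    da = -y / a + (1 - y) / (1 - a)  # dL/da
    dz = da * a * (1 - a)            # dL/dz, which simplifies to a - y
    dw1 = x1 * dz
    dw2 = x2 * dz
    db = dz
    return loss, dw1, dw2, db

# Hypothetical values for illustration.
w1, w2, b = 0.5, -0.3, 0.1
x1, x2, y = 1.0, 2.0, 1
alpha = 0.1

loss, dw1, dw2, db = single_example_gradients(w1, w2, b, x1, x2, y)

# One gradient descent step on this single example.
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db
new_loss, *_ = single_example_gradients(w1, w2, b, x1, x2, y)
print(loss, new_loss)  # the loss decreases after the step
```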

Gradient descent on m examples

Recall :
$J(w,b) = \frac{1}{m}\sum_{i=1}^m L(a^{(i)}, y^{(i)})$
$a^{(i)}= \hat{y}^{(i)}= \sigma(z^{(i)})=\sigma(w^Tx^{(i)}+b)$

The derivatives computed earlier using only a single training example ( $x^{(i)}, y^{(i)}$ ) :
- $dw_1^{(i)} = \frac{\partial}{\partial w_1}L(a^{(i)}, y^{(i)})$
- $dw_2^{(i)} = \frac{\partial}{\partial w_2}L(a^{(i)}, y^{(i)})$
- $db^{(i)} = \frac{\partial}{\partial b}L(a^{(i)}, y^{(i)})$
Logistic regression on m examples :
- The overall gradients are the averages of the per-example derivatives, e.g. $\frac{\partial}{\partial w_1}J(w,b) = \frac{1}{m}\sum_{i=1}^m dw_1^{(i)}$ .
- But, we need to write two for loops
  1. first for loop : a for loop over the $m$ training examples
  2. second for loop : a for loop over all the features ( $dw_1, dw_2$ )
    With $n$ features, we need a for loop over all $n$ features.
- When implementing deep learning algorithms, explicit for loops make the code less efficient.
  In the deep learning era datasets get very large,
  so we need to be able to implement algorithms without explicit for loops.
  That is why vectorization has become so important.
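
To illustrate the two for loops and what removing them looks like, here is a sketch on a made-up toy dataset (2 features, 4 examples). The function names are assumptions of this example, and the vectorized version is only one possible NumPy formulation; both functions compute the same averaged cost and gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_with_loops(w, b, X, Y):
    """Cost and gradients using the two explicit for loops."""
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros(n_x), 0.0
    for i in range(m):            # first loop: over training examples
        z = np.dot(w, X[:, i]) + b
        a = sigmoid(z)
        y = Y[0, i]
        J += -(y * np.log(a) + (1 - y) * np.log(1 - a))
        dz = a - y
        for j in range(n_x):      # second loop: over features
            dw[j] += X[j, i] * dz
        db += dz
    return J / m, dw / m, db / m

def gradients_vectorized(w, b, X, Y):
    """Same computation without explicit for loops."""
    m = X.shape[1]
    y = Y[0]
    A = sigmoid(w @ X + b)        # shape (m,)
    J = -np.mean(y * np.log(A) + (1 - y) * np.log(1 - A))
    dZ = A - y
    return J, X @ dZ / m, np.sum(dZ) / m

# Hypothetical toy data: n_x = 2 features, m = 4 examples.
X = np.array([[0.5, 1.5, -1.0, 2.0],
              [1.0, -0.5, 0.3, 0.8]])
Y = np.array([[1, 0, 0, 1]])
w, b = np.zeros(2), 0.0

J_loop, dw_loop, db_loop = gradients_with_loops(w, b, X, Y)
J_vec, dw_vec, db_vec = gradients_vectorized(w, b, X, Y)
print(J_loop, dw_loop, db_loop)
print(np.allclose(dw_loop, dw_vec))
```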