Deep Neural Network Exercises

박성철 | Department of Mathematical Data Science | Hanyang University (ERICA) · April 13, 2026

Problem 4.1 Consider composing the two neural networks in figure 4.8. Draw a plot of the relationship between the input $x$ and output $y'$ for $x \in [-1, 1].$

figure 4.8:

Problem 4.2 Identify the four hyperparameters in figure 4.6.
figure 4.6 :

$\begin{aligned} &D_1: \text{size of the first hidden layer} \\ &D_2: \text{size of the second hidden layer} \\ &D_3: \text{size of the third hidden layer} \\ &K = 3: \text{number of hidden layers} \end{aligned}$

Problem 4.3 Using the non-negative homogeneity property of the ReLU function (see problem 3.5), show that:

\text{ReLU} \left[ \boldsymbol{\beta}_1 + \lambda_1 \cdot \boldsymbol{\Omega}_1 \text{ReLU} \left[ \boldsymbol{\beta}_0 + \lambda_0 \cdot \boldsymbol{\Omega}_0 \mathbf{x} \right] \right] = \lambda_0 \lambda_1 \cdot \text{ReLU} \left[ \frac{1}{\lambda_0 \lambda_1} \boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \text{ReLU} \left[ \frac{1}{\lambda_0} \boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x} \right] \right]

where $\lambda_0$ and $\lambda_1$ are non-negative scalars. From this, we see that the weight matrices can be rescaled by any magnitude as long as the biases are also adjusted, and the scale factors can be re-applied at the end of the network.

\mathrm{ReLU}\!\left[\beta_1 + \lambda_1 \Omega_1 \mathrm{ReLU}\!\left[\beta_0 + \lambda_0 \Omega_0 x\right]\right] = \lambda_0 \lambda_1 \,\mathrm{ReLU}\!\left[ \frac{1}{\lambda_0\lambda_1}\beta_1 + \Omega_1 \mathrm{ReLU}\!\left[ \frac{1}{\lambda_0}\beta_0 + \Omega_0 x \right] \right]

where $\lambda_0, \lambda_1 \ge 0$.
The key to this problem is the non-negative homogeneity property of ReLU:

\mathrm{ReLU}(az)=a\,\mathrm{ReLU}(z) \qquad (a \ge 0)

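That is, a non-negative scalar factor inside a ReLU can be pulled outside it.
First consider the inner ReLU. Taking $\lambda_0 > 0$ so that the division is defined,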

\beta_0 + \lambda_0 \Omega_0 x = \lambda_0\left(\frac{1}{\lambda_0}\beta_0 + \Omega_0 x\right)

so applying the ReLU property gives

\mathrm{ReLU}\!\left[\beta_0 + \lambda_0 \Omega_0 x\right] = \lambda_0 \,\mathrm{ReLU}\!\left[\frac{1}{\lambda_0}\beta_0 + \Omega_0 x\right]

Substituting this into the original expression gives

\mathrm{ReLU}\!\left[ \beta_1 + \lambda_1 \Omega_1 \left( \lambda_0 \,\mathrm{ReLU}\!\left[\frac{1}{\lambda_0}\beta_0 + \Omega_0 x\right] \right) \right]

that is,

= \mathrm{ReLU}\!\left[ \beta_1 + \lambda_0\lambda_1 \Omega_1 \mathrm{ReLU}\!\left[\frac{1}{\lambda_0}\beta_0 + \Omega_0 x\right] \right]

Now factor $\lambda_0\lambda_1$ out of the entire outer argument (again taking $\lambda_0\lambda_1 > 0$):

\beta_1 + \lambda_0\lambda_1 \Omega_1 \mathrm{ReLU}\!\left[\frac{1}{\lambda_0}\beta_0 + \Omega_0 x\right] = \lambda_0\lambda_1 \left[ \frac{1}{\lambda_0\lambda_1}\beta_1 + \Omega_1 \mathrm{ReLU}\!\left[\frac{1}{\lambda_0}\beta_0 + \Omega_0 x\right] \right]

Applying the ReLU property once more therefore gives

\mathrm{ReLU}\!\left[\beta_1 + \lambda_1 \Omega_1 \mathrm{ReLU}\!\left[\beta_0 + \lambda_0 \Omega_0 x\right]\right] = \lambda_0 \lambda_1 \,\mathrm{ReLU}\!\left[ \frac{1}{\lambda_0\lambda_1}\beta_1 + \Omega_1 \mathrm{ReLU}\!\left[ \frac{1}{\lambda_0}\beta_0 + \Omega_0 x \right] \right]

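This result means the following.

Once the biases are divided accordingly, the positive scales $\lambda_0, \lambda_1$ that multiply the weight matrices can be moved to a single overall scale $\lambda_0\lambda_1$ at the end of the network.
In other words, the weights of a ReLU network can be rescaled internally and, as long as the biases are adjusted with them, the network still represents a function of the same form.
Applying the non-negative homogeneity property $\mathrm{ReLU}(az)=a\,\mathrm{ReLU}(z), \quad a\ge0$ twice moves the positive scales inside the network to the outside.
A ReLU network therefore has representational flexibility in how the scale of its weights is distributed.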
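The identity can also be checked numerically. The sketch below is a minimal check in plain Python, assuming strictly positive $\lambda_0, \lambda_1$ (so that the divisions are defined) and small random layer sizes. The helper names `affine`, `lhs`, `rhs`, and `max_gap` are made up for this illustration.

```python
import random

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, t) for t in v]

def affine(beta, omega, v, scale=1.0):
    """Return beta + scale * (omega @ v) using plain lists."""
    return [b + scale * sum(w * t for w, t in zip(row, v))
            for b, row in zip(beta, omega)]

def lhs(x, beta0, omega0, beta1, omega1, lam0, lam1):
    """Left-hand side: the scales sit inside the network."""
    h = relu(affine(beta0, omega0, x, lam0))
    return relu(affine(beta1, omega1, h, lam1))

def rhs(x, beta0, omega0, beta1, omega1, lam0, lam1):
    """Right-hand side: rescaled biases, overall scale applied at the end."""
    h = relu(affine([b / lam0 for b in beta0], omega0, x))
    out = relu(affine([b / (lam0 * lam1) for b in beta1], omega1, h))
    return [lam0 * lam1 * t for t in out]

def max_gap(seed=0, trials=100):
    """Largest absolute difference between both sides over random draws."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(3)]
        beta0 = [rng.uniform(-1, 1) for _ in range(4)]
        omega0 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
        beta1 = [rng.uniform(-1, 1) for _ in range(2)]
        omega1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
        # Assumption: strictly positive scales, so 1/lam0 is defined.
        lam0, lam1 = rng.uniform(0.1, 5), rng.uniform(0.1, 5)
        a = lhs(x, beta0, omega0, beta1, omega1, lam0, lam1)
        b = rhs(x, beta0, omega0, beta1, omega1, lam0, lam1)
        worst = max(worst, max(abs(p - q) for p, q in zip(a, b)))
    return worst

print(max_gap())  # should be at the level of floating-point round-off
```

The two sides agree only up to round-off because the divisions and multiplications are carried out in floating point.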

Problem 4.4 Write out the equations for a deep neural network that takes $D_i = 5$ inputs, $D_o = 4$ outputs and has three hidden layers of sizes $D_1 = 20$, $D_2 = 10$, and $D_3 = 7$, respectively, in both the forms of equations 4.15 and 4.16. What are the sizes of each weight matrix $\boldsymbol{\Omega}_\bullet$ and bias vector $\boldsymbol{\beta}_\bullet$?

equation 4.15 :

equation 4.16 :

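Consider a deep neural network with input dimension $D_i = 5$ and output dimension $D_o = 4$.
The network has three hidden layers, whose sizes are

D_1 = 20,\quad D_2 = 10,\quad D_3 = 7

respectively.

That is, the network consists of:

Input layer: 5 dimensions
First hidden layer: 20 units
Second hidden layer: 10 units
Third hidden layer: 7 units
Output layer: 4 dimensions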

1. Equation 4.15 form

There are three hidden layers, so $K=3$, and the equation for each layer, with activation function $a[\cdot]$, can be written as follows.

\mathbf{h}_1 = a\left[\boldsymbol{\beta}_0 + \Omega_0 \mathbf{x}\right]

\mathbf{h}_2 = a\left[\boldsymbol{\beta}_1 + \Omega_1 \mathbf{h}_1\right]

\mathbf{h}_3 = a\left[\boldsymbol{\beta}_2 + \Omega_2 \mathbf{h}_2\right]

\mathbf{y} = \boldsymbol{\beta}_3 + \Omega_3 \mathbf{h}_3

2. Equation 4.16 form

Writing the equations above as a single nested expression gives:

\mathbf{y} = \boldsymbol{\beta}_3 + \Omega_3 a\left[ \boldsymbol{\beta}_2 + \Omega_2 a\left[ \boldsymbol{\beta}_1 + \Omega_1 a\left[ \boldsymbol{\beta}_0 + \Omega_0 \mathbf{x} \right] \right] \right]
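The two forms describe the same computation, which a small forward pass can confirm. This is only a sketch in plain Python: the activation $a[\cdot]$ is assumed to be ReLU, the parameters are random, and the function names are hypothetical.

```python
import random

def relu(v):
    return [max(0.0, t) for t in v]

def affine(beta, omega, v):
    """Return beta + omega @ v using plain lists."""
    return [b + sum(w * t for w, t in zip(row, v)) for b, row in zip(beta, omega)]

def init_params(sizes, seed=0):
    """Random (beta_k, Omega_k) pairs; Omega_k has shape sizes[k+1] x sizes[k]."""
    rng = random.Random(seed)
    return [([rng.gauss(0, 1) for _ in range(n_out)],
             [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)])
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward_layerwise(x, params):
    """Equation 4.15 style: one hidden layer at a time."""
    (b0, w0), (b1, w1), (b2, w2), (b3, w3) = params
    h1 = relu(affine(b0, w0, x))
    h2 = relu(affine(b1, w1, h1))
    h3 = relu(affine(b2, w2, h2))
    return affine(b3, w3, h3)

def forward_nested(x, params):
    """Equation 4.16 style: a single nested expression."""
    (b0, w0), (b1, w1), (b2, w2), (b3, w3) = params
    return affine(b3, w3, relu(affine(b2, w2, relu(affine(b1, w1, relu(affine(b0, w0, x)))))))

params = init_params([5, 20, 10, 7, 4])  # D_i, D_1, D_2, D_3, D_o
x = [0.5, -1.0, 2.0, 0.0, 1.5]           # arbitrary 5-dimensional input
y = forward_layerwise(x, params)
print(len(y))  # 4 outputs
```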

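3. Sizes of the weight matrices and bias vectors

The input vector $\mathbf{x}$ has size

\mathbf{x} \in \mathbb{R}^{5 \times 1}

Following the dimensions layer by layer gives the size of each weight matrix and bias vector.

(1) Input layer $\rightarrow$ first hidden layer

The input is 5-dimensional and the first hidden layer is 20-dimensional, so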

\Omega_0 \in \mathbb{R}^{20 \times 5}, \qquad \boldsymbol{\beta}_0 \in \mathbb{R}^{20 \times 1}

(2) First hidden layer $\rightarrow$ second hidden layer

The first hidden layer is 20-dimensional and the second hidden layer is 10-dimensional, so

\Omega_1 \in \mathbb{R}^{10 \times 20}, \qquad \boldsymbol{\beta}_1 \in \mathbb{R}^{10 \times 1}

(3) Second hidden layer $\rightarrow$ third hidden layer

The second hidden layer is 10-dimensional and the third hidden layer is 7-dimensional, so

\Omega_2 \in \mathbb{R}^{7 \times 10}, \qquad \boldsymbol{\beta}_2 \in \mathbb{R}^{7 \times 1}

(4) Third hidden layer $\rightarrow$ output layer

The third hidden layer is 7-dimensional and the output layer is 4-dimensional, so

\Omega_3 \in \mathbb{R}^{4 \times 7}, \qquad \boldsymbol{\beta}_3 \in \mathbb{R}^{4 \times 1}

4. Summary

The sizes of the weight matrices and bias vectors are therefore:

\Omega_0 : 20 \times 5,\qquad \boldsymbol{\beta}_0 : 20 \times 1

\Omega_1 : 10 \times 20,\qquad \boldsymbol{\beta}_1 : 10 \times 1

\Omega_2 : 7 \times 10,\qquad \boldsymbol{\beta}_2 : 7 \times 1

\Omega_3 : 4 \times 7,\qquad \boldsymbol{\beta}_3 : 4 \times 1
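The same bookkeeping can be done mechanically from the list of layer sizes. The helper below is a small illustrative sketch (the name `layer_shapes` is made up here): each $\boldsymbol{\Omega}_k$ has one row per unit in the next layer and one column per unit in the current layer.

```python
def layer_shapes(d_in, hidden, d_out):
    """Return [(Omega_k shape, beta_k shape), ...] for a fully connected network."""
    sizes = [d_in, *hidden, d_out]
    return [((n_out, n_in), (n_out, 1))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

shapes = layer_shapes(5, [20, 10, 7], 4)
for k, (omega, beta) in enumerate(shapes):
    print(f"Omega_{k}: {omega[0]} x {omega[1]},  beta_{k}: {beta[0]} x {beta[1]}")
```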

Problem 4.5 Consider a deep neural network with $D_i = 1$ input, $D_o = 1$ output, and $K = 10$ layers, with $D = 10$ hidden units in each. Would the number of weights increase more if we increased the depth by one or the width by one? Provide your reasoning.

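This problem compares how much the number of weights grows when the depth is increased by one versus when the width is increased by one.

The current network is:

Input: 1 dimension
Output: 1 dimension
Hidden layers: 10
Hidden units per hidden layer: 10

1. Current number of weights

The number of weights from the input to the first hidden layer is

1 \times 10 = 10

With 10 hidden layers there are 9 connections between consecutive hidden layers,
and each connection has

10 \times 10 = 100

weights, so the total number of hidden-to-hidden weights is

9 \times 100 = 900

The number of weights from the last hidden layer to the output is

10 \times 1 = 10

so the total number of weights is

10 + 900 + 10 = 920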

2. Increasing the depth by one

Increasing the depth by one means adding one more hidden layer.

The new hidden layer also has 10 units, so it adds exactly one hidden-to-hidden connection,
and the number of added weights is

10 \times 10 = 100

So increasing the depth by one adds 100 weights.

3. Increasing the width by one

Increasing the width by one means the number of units in every hidden layer changes as

10 \rightarrow 11

The new weight counts are:

Input $\rightarrow$ first hidden layer:
$1 \times 11 = 11$
Between hidden layers:
$9 \times (11 \times 11) = 1089$
Last hidden layer $\rightarrow$ output:
$11 \times 1 = 11$

so the total number of weights is

11 + 1089 + 11 = 1111

and the increase is

1111 - 920 = 191

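4. Conclusion

Increasing the depth by one adds 100 weights,
while increasing the width by one adds 191 weights.

So for this network, increasing the width by one increases the number of weights more.

The reason is that adding depth adds only one new $10 \times 10$ weight matrix, whereas adding width enlarges every weight matrix in the network, including all nine hidden-to-hidden matrices.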
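The arithmetic above is easy to reproduce in a few lines. This is a minimal sketch that counts weights only (biases excluded, as the question asks); `count_weights` is a hypothetical helper name.

```python
def count_weights(d_in, d_out, depth, width):
    """Weights (no biases) for `depth` hidden layers of `width` units each."""
    return d_in * width + (depth - 1) * width * width + width * d_out

base = count_weights(1, 1, depth=10, width=10)
deeper = count_weights(1, 1, depth=11, width=10)
wider = count_weights(1, 1, depth=10, width=11)
print(base, deeper - base, wider - base)  # 920 100 191
```

In general the depth step adds $D^2$ weights while the width step adds $(K-1)(2D+1) + 2$, so for this network width wins, but a shallower or much wider network could tip the comparison the other way.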

Problem 4.6 Consider a network with $D_i = 1$ input, $D_o = 1$ output, and $K = 10$ layers, with $D = 10$ hidden units in each. Would the number of weights increase more if we increased the depth by one

Pass

Problem 4.7 Choose values for the parameters $\phi = \{ \phi_0, \phi_1, \phi_2, \phi_3, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}, \theta_{30}, \theta_{31} \}$ for the shallow neural network in equation 3.1 (with ReLU activation functions) that will define an identity function over a finite range $x \in [a, b].$

equation 3.1 :

It suffices that every ReLU is active over the whole interval $[a, b]$ and that the resulting linear function of $x$ is equal to $x$ itself.
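One concrete choice can be checked directly. If all three hidden units are active on $[a, b]$, the output is $\phi_0 + \sum_i \phi_i\theta_{i0} + \left(\sum_i \phi_i\theta_{i1}\right)x$, so the identity needs $\phi_0 + \sum_i \phi_i\theta_{i0} = 0$ and $\sum_i \phi_i\theta_{i1} = 1$. The sketch below uses one such choice (every hidden unit computes $\mathrm{ReLU}(x - a)$); this particular choice and the helper names are assumptions made here, and many other parameter values work equally well.

```python
def relu(z):
    return max(0.0, z)

def identity_params(a):
    """One hypothetical choice: every hidden unit computes ReLU(x - a)."""
    return {"phi": [a, 1 / 3, 1 / 3, 1 / 3],   # phi_0 .. phi_3
            "theta": [(-a, 1.0)] * 3}          # (theta_i0, theta_i1) per unit

def shallow(x, p):
    """Equation 3.1 with ReLU activations."""
    phi0, *phis = p["phi"]
    return phi0 + sum(ph * relu(t0 + t1 * x)
                      for ph, (t0, t1) in zip(phis, p["theta"]))

p = identity_params(-2.0)
for x in (-2.0, 0.0, 3.0):
    print(x, shallow(x, p))  # output matches x up to round-off for x >= a
```

Note that with this choice the network is the identity for every $x \ge a$, not only up to $b$; below $a$ the hidden units switch off and the output stays at $a$.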

Problem 4.8 Figure 4.9 shows the activations in the three hidden units of a shallow network (as in figure 3.3). The slopes in the hidden units are $1.0$, $1.0$, and $-1.0$, respectively, and the "joints" in the hidden units are at positions $1/6$, $2/6$, and $4/6$. Find values of $\phi_0, \phi_1, \phi_2,$ and $\phi_3$ that will combine the hidden unit activations as $\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3$ to create a function with four linear regions that oscillate between output values of zero and one. The slope of the leftmost region should be positive, the next one negative, and so on. How many linear regions will we create if we compose this network with itself? How many will we create if we compose it with itself $K$ times?

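The three hidden units have joints at $x = 1/6$, $2/6$, and $4/6$, so
the overall function has four linear regions.

We need to choose the output layer
$f(x) = \phi_0 + \phi_1 h_1(x) + \phi_2 h_2(x) + \phi_3 h_3(x)$
so that the signs of the slopes in the four regions are
+, -, +, -

That is, $\phi_1$, $\phi_2$, and $\phi_3$ are chosen so that the total slope in each region,
which is the sum of the contributions of the hidden units active there,
alternates between positive and negative.

For the output to oscillate between 0 and 1,
the function values at the region boundaries (the two ends of the input range and the three joints) must follow
0 → 1 → 0 → 1 → 0
and the intercept $\phi_0$ is adjusted together with the other weights to make this hold.

The resulting function is an oscillating piecewise linear function
with four linear regions.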
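Concrete values can be worked out and checked with exact arithmetic. The sketch below makes two assumptions that are not stated in the text above: the input range is $[0, 1]$, and the hidden units are $h_1 = \mathrm{ReLU}(x - 1/6)$, $h_2 = \mathrm{ReLU}(x - 2/6)$, $h_3 = \mathrm{ReLU}(4/6 - x)$ (a unit with slope $-1$ can only be non-negative to the left of its joint). Matching the slopes $+6, -6, +3, -3$ region by region then gives $\phi_3 = -6$, $\phi_1 = -12$, $\phi_2 = 9$, and $f(0) = 0$ gives $\phi_0 = 4$. The helper `count_regions` is made up for this illustration.

```python
from fractions import Fraction as F

def relu(z):
    return max(F(0), z)

def hidden(x):
    """Assumed hidden units: joints at 1/6, 2/6, 4/6 with slopes 1, 1, -1."""
    return relu(x - F(1, 6)), relu(x - F(2, 6)), relu(F(4, 6) - x)

PHI = (F(4), F(-12), F(9), F(-6))  # phi_0, phi_1, phi_2, phi_3 (worked out here)

def f(x):
    h1, h2, h3 = hidden(x)
    return PHI[0] + PHI[1] * h1 + PHI[2] * h2 + PHI[3] * h3

def count_regions(g, n):
    """Count linear regions of g on [0, 1] from exact slopes on a grid of step 1/n.

    Only valid when every joint of g lies on a grid point.
    """
    slopes = [(g(F(i + 1, n)) - g(F(i, n))) * n for i in range(n)]
    return 1 + sum(s != t for s, t in zip(slopes, slopes[1:]))

print([f(x) for x in (F(0), F(1, 6), F(2, 6), F(4, 6), F(1))])  # values 0, 1, 0, 1, 0
print(count_regions(f, 6))                    # 4
print(count_regions(lambda x: f(f(x)), 36))   # 16
```

Because every region of $f$ sweeps the full range between 0 and 1, the outer copy replays all four of its regions inside each region of the inner copy. Composing the network with itself therefore gives 16 regions, and each further copy multiplies the count by four, which suggests $4^K$ regions for a chain of $K$ copies.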

Problem 4.9 Following Problem 4.8, is it possible to create a function with three linear regions that oscillates back and forth between output values of zero and one using a shallow network with two hidden units? Is it possible to create a function with five linear regions that oscillates in the same way using a shallow network with four hidden units?

figure 4.9 :

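Answer

The first case is not possible; the second is possible, but only for suitable joint positions.

A shallow ReLU network with a one-dimensional input and $D$ hidden units can create at most $D+1$ linear regions on the input axis.
So two hidden units allow up to three linear regions, and four hidden units allow up to five.

Having enough regions is not sufficient, however, because the slopes of the regions are not independent.
Each hidden unit adds a fixed slope $c_i$ on the active side of its joint and nothing on the other side, so the slope of the leftmost region must equal the sum of the $c_i$ of the units that are active to the left of their joints.

With two hidden units, the slopes of the three regions can only be

(c_1 + c_2,\; c_2,\; 0), \quad (c_1,\; 0,\; c_2), \quad (c_2,\; c_1 + c_2,\; c_1), \quad (0,\; c_1,\; c_1 + c_2)

depending on which side of its joint each unit is active. An oscillation

0 \rightarrow 1 \rightarrow 0 \rightarrow 1

needs slope signs +, -, +, but three of these patterns contain a zero slope, and in the remaining one the middle slope is the sum of the two outer slopes, so it cannot have the opposite sign to both.

With four hidden units, an oscillation

0 \rightarrow 1 \rightarrow 0 \rightarrow 1 \rightarrow 0 \rightarrow 1

over five regions can be created, provided the joints are spaced so that the slope constraint above is met, just as the joints $1/6$, $2/6$, $4/6$ in Problem 4.8 are.

Therefore:

With two hidden units, a function with three linear regions that oscillates between 0 and 1 cannot be created.
With four hidden units, a function with five linear regions that oscillates between 0 and 1 can be created for suitable joint positions.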
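A five-region example with four hidden units can be checked directly. The network below is a hypothetical construction, not taken from the book: the joints $1/4$, $1/2$, $5/8$, $3/4$ on $[0, 1]$ and the output weights were chosen here so that the region slopes are $+4, -4, +8, -8, +4$.

```python
from fractions import Fraction as F

def relu(z):
    return max(F(0), z)

def g(x):
    """Hypothetical four-unit network with joints at 1/4, 1/2, 5/8, 3/4."""
    h1 = relu(x - F(1, 4))   # active to the right of 1/4
    h2 = relu(F(1, 2) - x)   # active to the left of 1/2
    h3 = relu(F(5, 8) - x)   # active to the left of 5/8
    h4 = relu(x - F(3, 4))   # active to the right of 3/4
    return 4 - 8 * h1 + 12 * h2 - 16 * h3 + 12 * h4

knots = [F(0), F(1, 4), F(1, 2), F(5, 8), F(3, 4), F(1)]
values = [g(x) for x in knots]
print(values)  # values 0, 1, 0, 1, 0, 1 at the region boundaries
```

Since `g` is linear between consecutive knots, the alternating values at the knots are enough to confirm five oscillating regions.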

Problem 4.10 Consider a deep neural network with a single input, a single output, and $K$ hidden layers, each of which contains $D$ hidden units. Show that this network will have a total of $3D + 1 + (K-1)D(D+1)$ parameters.

Understood.
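For reference, the count can be reproduced by summing weights and biases layer by layer: the first layer has $D$ weights and $D$ biases, each of the $K-1$ hidden-to-hidden layers has $D^2$ weights and $D$ biases, and the output layer has $D$ weights and one bias. A short sketch (function names are hypothetical) compares this sum with the closed form:

```python
def count_params(k, d, d_in=1, d_out=1):
    """Weights plus biases for k hidden layers of d units each."""
    sizes = [d_in] + [d] * k + [d_out]
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def closed_form(k, d):
    return 3 * d + 1 + (k - 1) * d * (d + 1)

# Compare the layer-by-layer sum with the formula for a range of sizes.
print(all(count_params(k, d) == closed_form(k, d)
          for k in range(1, 7) for d in range(1, 9)))  # True
```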

Problem 4.11 Consider two neural networks that map a scalar input $x$ to a scalar output $y$. The first network is shallow and has $D = 95$ hidden units. The second is deep and has $K = 10$ layers, each containing $D = 5$ hidden units. How many parameters does each network have? How many linear regions can each network make (see equation 4.17)? Which would run faster?

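By the formula in Problem 4.10, the shallow network ($K = 1$, $D = 95$) has $3 \cdot 95 + 1 = 286$ parameters and the deep network ($K = 10$, $D = 5$) has $3 \cdot 5 + 1 + 9 \cdot 5 \cdot 6 = 286$, so the two networks are designed to have the same number of parameters.
However, the deep network has to pass through its layers sequentially, so the shallow network, whose hidden units can all be computed in parallel, usually runs faster.
On the other hand, the deep network can create far more linear regions with the same number of parameters, so it is more expressive.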
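The numbers behind this comparison can be computed as follows. The parameter count uses the formula from Problem 4.10. For the region count, the sketch assumes that the bound in equation 4.17 reduces to $(D+1)^K$ for a scalar input, which is an assumption made here rather than something quoted from the equation itself.

```python
def n_params(k, d):
    """Parameter count from Problem 4.10 (scalar input and output)."""
    return 3 * d + 1 + (k - 1) * d * (d + 1)

def max_regions(k, d):
    """Assumed scalar-input form of the region bound: (d + 1) ** k."""
    return (d + 1) ** k

shallow = (n_params(1, 95), max_regions(1, 95))
deep = (n_params(10, 5), max_regions(10, 5))
print(shallow)  # (286, 96)
print(deep)     # (286, 60466176)
```

Under that assumption the deep network can create roughly 630,000 times as many linear regions as the shallow one with an identical parameter count.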
