Identity Mappings in Deep Residual Networks-ResNet(2), 2016

TEMP · August 24, 2021

https://arxiv.org/abs/1603.05027

1. Introduction

The original ResNet unit can be written as follows.
- $\mathbf{y}_l = h(\mathbf{x}_l) + F(\mathbf{x}_l,\mathcal{W}_l)$
  $\mathbf{x}_{l+1} = f(\mathbf{y}_l)$
- $\mathbf{x}_l$ : input to the $l$-th unit
  $\mathbf{x}_{l+1}$ : output of the $l$-th unit
  $F$ : residual function
  $f$ : activation function (ReLU)
  $h$ : identity mapping (shortcut)
ResNets performed well even in models with more than 100 layers.
The core idea of ResNet is to add an identity mapping $h$ so that the stacked layers learn the residual $F$.
This paper analyzes deep residual networks by looking not only at a single Residual Unit but at the network as a whole, focusing on the direct path (shortcut) that carries information.
To understand the role of skip connections, the authors try a variety of alternatives for the identity mapping $h$.
As a result, keeping $h(\mathbf{x}_l) = \mathbf{x}_l$ clean gave the lowest training loss and the fastest error reduction among all the variants that insert extra operations on the shortcut.
( The word "keeping" is used because the previous paper already set $h(\mathbf{x}_l) = \mathbf{x}_l$. )
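The two equations above can be sketched in a few lines of plain Python. This is only a toy illustration on lists of floats: `toy_residual` is a hypothetical stand-in for $F$ (a real unit would use convolutions and BN), and the function names are not from the paper.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """f: element-wise ReLU."""
    return [max(0.0, a) for a in v]


def identity(v: Vector) -> Vector:
    """h: the clean shortcut, h(x) = x."""
    return list(v)


def residual_unit(
    x: Vector,
    residual_fn: Callable[[Vector], Vector],
    h: Callable[[Vector], Vector] = identity,
    f: Callable[[Vector], Vector] = relu,
) -> Vector:
    """Original unit: y = h(x) + F(x), then x_next = f(y)."""
    y = [a + b for a, b in zip(h(x), residual_fn(x))]
    return f(y)


def toy_residual(v: Vector) -> Vector:
    """Hypothetical F: scales every element by 0.5 (assumption for illustration)."""
    return [0.5 * a for a in v]


# With f = ReLU the negative component is clipped after the addition.
x_next = residual_unit([1.0, -2.0], toy_residual)
print(x_next)

# With f = identity the sum passes through unchanged.
x_next_clean = residual_unit([1.0, -2.0], toy_residual, f=identity)
print(x_next_clean)
```

Passing a different `h` or `f` is how the variants discussed in the rest of the post can be tried out in this toy setting.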

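Figure 1 of the paper shows the unit redesigned on the basis of this result.

(a) is the unit from the previous paper: the identity $h(\mathbf{x}_l)$ and the residual $F(\mathbf{x}_l,\mathcal{W}_l)$ are added, and the sum passes through the ReLU $f$.
(b) is the modified unit proposed in this paper. To keep the identity path $h(\mathbf{x}_l)$ clean,
it uses pre-activation: the activation (BN and ReLU) is moved to just before each convolution, so that it acts only on the residual branch.
With this new Residual Unit, a 1001-layer network was trained and it outperformed the original ResNet, which suggests that going deeper, a key ingredient of current deep learning, can still pay off.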

2. Analysis of Deep Residual Networks

The modified unit in this paper starts from
the original ResNet unit
$\mathbf{y}_l = h(\mathbf{x}_l) + F(\mathbf{x}_l,\mathcal{W}_l)$
$\mathbf{x}_{l+1} = f(\mathbf{y}_l)$

and moves the activation $f$ inside $F(\mathbf{x}_l,\mathcal{W}_l)$ as a pre-activation.

That is, both the after-addition activation $f$ and the shortcut $h$ become identity mappings, so $\mathbf{x}_{l+1} = \mathbf{x}_l + F(\mathbf{x}_l,\mathcal{W}_l)$. Applying this recursively (for units whose features have the same dimension) gives the simple form
$\mathbf{x}_L = \mathbf{x}_l + \displaystyle\sum_{i=l}^{L-1} F(\mathbf{x}_i,\mathcal{W}_i)$ .
This form has several nice properties.
The feature $\mathbf{x}_L$ of any deeper unit $L$ can be written as the feature $\mathbf{x}_l$ of any shallower unit $l$ plus the sum of residual functions
$\displaystyle\sum_{i=l}^{L-1} F(\mathbf{x}_i,\mathcal{W}_i)$ .
In other words, the model is in a residual fashion between any two units $L$ and $l$.
In particular, the feature of any deep unit $L$ can always be written as $\mathbf{x}_L = \mathbf{x}_0 + \displaystyle\sum_{i=0}^{L-1} F(\mathbf{x}_i,\mathcal{W}_i)$ .
In a plain network, by contrast, $\mathbf{x}_L = \displaystyle\prod_{i=0}^{L-1} \mathcal{W}_i \mathbf{x}_0$ , a chain of matrix multiplications (ignoring BN and ReLU).
The additive form $\mathbf{x}_L = \mathbf{x}_l + \displaystyle\sum_{i=l}^{L-1} F(\mathbf{x}_i,\mathcal{W}_i)$ is also convenient for backpropagation.
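Let $\mathcal{E}$ be the loss function. By the chain rule,
$\cfrac{\partial \mathcal{E}}{\partial \mathbf{x}_l} = \cfrac{\partial \mathcal{E}}{\partial \mathbf{x}_L}\Bigg(1+\cfrac{\partial}{\partial \mathbf{x}_l}\displaystyle\sum_{i=l}^{L-1}F(\mathbf{x}_i,\mathcal{W}_i)\Bigg)$
so $\cfrac{\partial \mathcal{E}}{\partial \mathbf{x}_l}$ splits into two additive terms.
- The first term, $\cfrac{\partial \mathcal{E}}{\partial \mathbf{x}_L}$ , carries the gradient of the deeper unit $L$ directly to the shallower unit $l$ without passing through any weight layers.
- The second term, $\cfrac{\partial \mathcal{E}}{\partial \mathbf{x}_L}\cfrac{\partial}{\partial \mathbf{x}_l}\displaystyle\sum_{i=l}^{L-1}F(\mathbf{x}_i,\mathcal{W}_i)$ , propagates through the weight layers. The factor $\Bigg(1+\cfrac{\partial}{\partial \mathbf{x}_l}\displaystyle\sum_{i=l}^{L-1}F(\mathbf{x}_i,\mathcal{W}_i)\Bigg)$ is very unlikely to be exactly 0 for every sample in every mini-batch (the derivative of the sum would have to be exactly $-1$ each time), so the gradient is unlikely to vanish even when the weights are arbitrarily small.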
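The additive form and its gradient can be checked numerically on a toy example. The sketch below assumes scalar features and a hypothetical residual function `w * tanh(x)`; neither choice comes from the paper, they only make the check easy to run. It verifies that $\mathbf{x}_L = \mathbf{x}_l + \sum F$ and that the derivative of $\mathbf{x}_L$ with respect to $\mathbf{x}_l$ equals 1 plus the derivative of the residual sum.

```python
import math
from typing import List


def residual_fn(x: float, w: float) -> float:
    """Hypothetical scalar residual function F(x, w) (assumption for illustration)."""
    return w * math.tanh(x)


def forward(x_l: float, weights: List[float]) -> List[float]:
    """Apply x_{i+1} = x_i + F(x_i, w_i) and return every intermediate feature."""
    xs = [x_l]
    for w in weights:
        xs.append(xs[-1] + residual_fn(xs[-1], w))
    return xs


def residual_sum(x_l: float, weights: List[float]) -> float:
    """Sum of F(x_i, w_i) over all units, starting from x_l."""
    xs = forward(x_l, weights)
    return sum(residual_fn(x, w) for x, w in zip(xs[:-1], weights))


def central_diff(fn, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


weights = [0.3, -0.2, 0.5, 0.1]
x_l = 0.7

x_L = forward(x_l, weights)[-1]
unrolled = x_l + residual_sum(x_l, weights)

grad_total = central_diff(lambda x: forward(x, weights)[-1], x_l)
grad_residual = central_diff(lambda x: residual_sum(x, weights), x_l)

print(x_L, unrolled)                   # the two forward values agree
print(grad_total, 1.0 + grad_residual) # the two gradient values agree
```

The constant 1 in the second printed pair is the direct shortcut path; the remainder is what flows through the weight layers.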

3. On the Importance of Identity Skip Connections

The section above showed the advantage of keeping the skip connection clean, i.e. a pure identity.
This section shows what goes wrong when the identity shortcut is broken.

Replace the identity with a simple scaling, $h(\mathbf{x}_l) = \lambda_l\mathbf{x}_l$ , so that
$\mathbf{x}_{l+1} = \lambda_l\mathbf{x}_l + F(\mathbf{x}_l,\mathcal{W}_l)$

Applying this recurrence repeatedly gives
$\mathbf{x}_L = \bigg(\displaystyle\prod_{i=l}^{L-1}\lambda_i \bigg)\mathbf{x}_l + \displaystyle\sum_{i=l}^{L-1} \hat{\mathcal{F}}(\mathbf{x}_i,\mathcal{W}_i)$

$\hat{\mathcal{F}}$ is notation that absorbs the $\lambda$ factors multiplying the residual functions $F$.
( $f$ is still assumed to be identity; even this single change already causes problems. )

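As before, the gradient for backpropagation (with $\mathcal{E}$ the loss) is
$\cfrac{\partial \mathcal{E}}{\partial \mathbf{x}_l} = \cfrac{\partial \mathcal{E}}{\partial \mathbf{x}_L}\Bigg(\bigg(\displaystyle\prod_{i=l}^{L-1}\lambda_i\bigg)+\cfrac{\partial}{\partial \mathbf{x}_l}\displaystyle\sum_{i=l}^{L-1}\hat{\mathcal{F}}(\mathbf{x}_i,\mathcal{W}_i)\Bigg)$ .
Unlike the earlier expression (the gradient with a pure identity), the constant 1 is replaced by the factor $\displaystyle\prod_{i=l}^{L-1}\lambda_i$ .
If $\lambda_i>1$ for all $i$ , this factor grows exponentially with depth; if $\lambda_i<1$ for all $i$ , it shrinks exponentially and the gradient through the shortcut vanishes, so the signal is forced through the weight layers and backpropagation is hampered. This is the constant-scaling case; the same problem arises, and even more so, when the shortcut is a more complex transformation such as a convolution.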
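To get a feel for how quickly the shortcut factor $\prod \lambda_i$ leaves the useful range, the following sketch multiplies a constant scale over a stack of units. The depth of 54 units and the three scale values are illustrative assumptions, and `shortcut_gain` is a hypothetical helper name.

```python
from typing import Iterable


def shortcut_gain(lambdas: Iterable[float]) -> float:
    """Product of the shortcut scales: the factor that replaces 1 in the gradient."""
    gain = 1.0
    for lam in lambdas:
        gain *= lam
    return gain


NUM_UNITS = 54  # illustrative depth (assumption)

for lam in (0.5, 1.0, 1.5):
    gain = shortcut_gain([lam] * NUM_UNITS)
    print(f"lambda={lam}: shortcut factor over {NUM_UNITS} units = {gain:.3e}")
```

Only the scale of exactly 1 leaves the direct gradient path untouched; the other two values shrink it to almost nothing or inflate it by many orders of magnitude.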

3.1 Experiments on Skip Connections

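To check this empirically, the networks are designed as follows.
- Here the after-addition activation is kept, and only the identity mapping is changed so that it is no longer clean.

The results are as follows.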

The network is ResNet-110, which has 54 two-layer Residual Units.
- Constant scaling was tried in three settings.
  - First, the shortcut is multiplied by 0, which kills it and turns the network into a plain one; this failed to converge.
  - Second, the shortcut is multiplied by 0.5 while $\mathcal{F}$ is left unchanged; this also failed to converge.
  - Third, both the shortcut and $\mathcal{F}$ are multiplied by 0.5, i.e. the two are averaged; this converged but performed worse than the original network.
- Exclusive gating follows the Highway Networks of https://arxiv.org/abs/1505.00387 and https://arxiv.org/abs/1507.06228 .
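Exclusive gating can be read as a data-dependent version of constant scaling: the shortcut is multiplied by $1-g$ and the residual by $g$, where $g$ is a sigmoid gate. The sketch below treats the gate input as a fixed scalar, which is a simplifying assumption (in a real network it is computed from $\mathbf{x}$ with learned weights), and the values 0 and -6 are only illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def exclusive_gating_unit(x: float, residual: float, gate_input: float) -> float:
    """Highway-style unit: (1 - g) * x + g * F, with g = sigmoid(gate_input)."""
    g = sigmoid(gate_input)
    return (1.0 - g) * x + g * residual


def shortcut_scale(gate_input: float) -> float:
    """Scale applied to the shortcut by the gate, i.e. 1 - g."""
    return 1.0 - sigmoid(gate_input)


def accumulated_scale(gate_input: float, num_units: int) -> float:
    """Shortcut scale after stacking num_units units with the same gate input."""
    return shortcut_scale(gate_input) ** num_units


for gate_input in (0.0, -6.0):
    print(gate_input, shortcut_scale(gate_input), accumulated_scale(gate_input, 54))
```

When the gate input is strongly negative the shortcut scale stays close to 1, so the unit behaves almost like a clean identity shortcut; a gate input near 0 halves the shortcut at every unit, as in the constant-scaling case.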
Conclusion: keeping the information on the shortcut clean is the best!

4. On the Usage of Activation Functions

The experiments above kept the after-addition activation and varied the shortcut, and showed that leaving the shortcut clean works best. So, to keep the shortcut completely clean, $f$ is turned into an identity mapping by rearranging the activation functions.

4.1 Experiments on Activation

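ResNet-110 and ResNet-164, which uses the bottleneck structure, were used.
- BN after addition gave the worst result: the BN layer alters the signal on the shortcut and impedes its propagation.
- ReLU before addition: building on that problem and on the earlier experiments, this variant removes the activation from the shortcut by moving the ReLU before the addition. The output of $\mathcal{F}$ is then non-negative, so the forward signal can only increase monotonically, and this hurt the results.
- Post-activation or pre-activation? In the original design,
  $\mathbf{y}_{l+1} = f(\mathbf{y}_{l}) + \mathcal{F}(f(\mathbf{y}_{l}),\mathcal{W}_{l+1})$ , so the activation affects both the shortcut and the residual path. Moving the activation into $\mathcal{F}$ only and renaming the variables gives $\mathbf{x}_{l+1} = \mathbf{x}_{l} + \mathcal{F}(\hat{f}(\mathbf{x}_{l}),\mathcal{W}_{l})$ .
  That is, an activation $\hat{f}$ applied only on the residual path is the same as a pre-activation of the next Residual Unit.

  As for the results,
  full pre-activation performed best.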
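The difference between "ReLU before addition" and pre-activation can be seen on a scalar toy model. In the sketch below the weight layer is reduced to a single multiplication by `w` and BN is left out entirely; both are simplifying assumptions, and the function names are hypothetical.

```python
from typing import Callable, List


def relu(a: float) -> float:
    return max(0.0, a)


def unit_relu_before_addition(x: float, w: float) -> float:
    """F ends with a ReLU, so the residual added to x is never negative."""
    return x + relu(w * x)


def unit_pre_activation(x: float, w: float) -> float:
    """ReLU comes before the weight, so the residual w * relu(x) may be negative."""
    return x + w * relu(x)


def run(unit: Callable[[float, float], float], x0: float, weights: List[float]) -> List[float]:
    """Stack one unit per weight and return the feature after every unit."""
    xs = [x0]
    for w in weights:
        xs.append(unit(xs[-1], w))
    return xs


weights = [-0.8, 0.5, -0.3]  # illustrative weights (assumption)
relu_before_add = run(unit_relu_before_addition, 1.0, weights)
pre_activation = run(unit_pre_activation, 1.0, weights)

print(relu_before_add)  # never decreases from one unit to the next
print(pre_activation)   # free to move in both directions
```

With the ReLU placed before the addition the features can only grow or stay the same, whereas the pre-activation unit lets the residual take either sign while the shortcut remains an untouched identity.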

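Below is the ResNet that ships with Keras.
Plotting the model shows that it is the earlier ResNet, not the one from this paper.
The order is convolution-BN-ReLU, and the identity is added before the sum passes through a ReLU.
The pre-activation design from this paper is available separately in Keras as `tf.keras.applications.ResNet50V2`.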

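    import tensorflow as tf
    from tensorflow.keras.utils import plot_model  # needs pydot and graphviz installed

    # Build the stock Keras ResNet50 with ImageNet weights.
    model = tf.keras.applications.ResNet50(
        include_top=True,
        weights="imagenet",
        input_tensor=None,
        input_shape=None,
        pooling=None,
        classes=1000,
    )

    # Draw the layer graph to inspect the order of conv, BN, ReLU and the addition.
    plot_model(model)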
