Wonkwang's Techlog

[ML/DL] SwiGLU

Wonkwang · August 17, 2023

Activation NLP SwiGLU

📌 Summary

Swish: an activation function whose behavior changes with the value of $\beta$

GLU: Component-wise product of two linear transformations of input

SwiGLU: an activation function that applies the GLU structure with Swish as its non-linear function

1. Swish

1.1 Characteristics of Swish

Searching for Activation Functions
- https://arxiv.org/pdf/1710.05941.pdf (2017, Google Brain)
- $\beta$ is either a constant or a trainable parameter

$$\text{Swish}(x) = x \cdot \sigma(\beta x)$$

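Depending on the value of $\beta$:
- When $\beta = 1$, Swish is the Sigmoid-weighted Linear Unit (SiLU)
- When $\beta = 0$, $\sigma(0) = 0.5$, so Swish becomes the scaled linear function $x/2$
- As $\beta \to \infty$, $\sigma$ approaches a 0-1 step function, so Swish approaches ReLU
In other words, when $\beta$ is a trainable parameter, Swish can be viewed as a smooth function that nonlinearly interpolates between a linear function and ReLU, with the degree of interpolation determined during training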
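
The three regimes of $\beta$ can be checked numerically. The following is a minimal sketch in plain Python; the helper names `sigmoid`, `swish`, and `relu` are chosen here for illustration, and $\beta = 50$ stands in for "large $\beta$" as an arbitrary assumption.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    return max(0.0, x)


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.5, 2.0):
        print(
            f"x={x:+.1f}  "
            f"beta=0: {swish(x, 0.0):+.4f} (x/2={x / 2:+.4f})  "
            f"beta=1: {swish(x, 1.0):+.4f}  "
            f"beta=50: {swish(x, 50.0):+.4f} (ReLU={relu(x):+.4f})"
        )
```

With $\beta = 0$ the output equals $x/2$ exactly, and with a large $\beta$ it is numerically very close to ReLU; $\beta = 1$ sits in between.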

1.2 Swish experimental results

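According to the paper, the following properties help Swish outperform ReLU on a variety of models
- Unbounded above (for values greater than 0)
- Bounded below (for values less than 0); unlike ReLU, however, small negative values are passed through rather than set to zero
- Non-monotonicity
  - A property that distinguishes Swish from most common activation functions
  - In the preactivation distribution, a large share of values falls in the bump region $(-5 < x < 0)$, so this region matters
In the paper's experiments, Swish matches or outperforms other activations across a variety of models
- e.g. ImageNet classification with ResNet and MobileNet variants, and a machine-translation Transformer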
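
The boundedness and non-monotonicity properties listed above can be seen by scanning Swish (with $\beta = 1$) over a grid. This is only an illustrative sketch: the grid range $[-10, 10]$ and the step of $0.001$ are arbitrary assumptions, not values from the paper.

```python
import math


def swish(x: float, beta: float = 1.0) -> float:
    """Swish(x) = x * sigmoid(beta * x), written with a stable sigmoid."""
    z = beta * x
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return x * s


# Scan a grid on [-10, 10]; range and step are arbitrary choices for illustration.
xs = [i / 1000.0 for i in range(-10000, 10001)]
ys = [swish(x) for x in xs]

# The global minimum on the grid marks the bottom of the "bump".
min_y = min(ys)
min_x = xs[ys.index(min_y)]
print(f"minimum on the grid: y={min_y:.4f} at x={min_x:.3f}")

# Left of the minimum the function rises back toward 0 (non-monotonic, bounded below);
# on the positive side it keeps growing with x (unbounded above).
print(f"Swish(-5)  = {swish(-5.0):.4f}")
print(f"Swish(-10) = {swish(-10.0):.6f}")
print(f"Swish(10)  = {swish(10.0):.4f}")
```

Setting the derivative of $x\,\sigma(x)$ to zero gives $x = -(1 + e^{x})$, so the minimum sits near $x \approx -1.28$ with value $-e^{x} \approx -0.28$, which is what the grid search finds.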

2. SwiGLU

https://arxiv.org/pdf/2002.05202.pdf (2020, Google)
- A paper showing that applying GLU with various activation functions improves Transformer performance
LLaMA 1 and LLaMA 2, the open LLMs recently released by Meta AI, both adopt SwiGLU

2.1 GLU

Gated Linear Units
component-wise product of two linear transformations of input
- Uses $\sigma(x)$ as the non-linear function

$$\text{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)$$
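
A direct transcription of the GLU formula into NumPy might look like the sketch below. The toy dimensions, the random inputs, and the function names `sigmoid` and `glu` are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def glu(x, W, V, b, c):
    """GLU(x, W, V, b, c) = sigmoid(xW + b) * (xV + c), element-wise product."""
    gate = sigmoid(x @ W + b)   # gate values lie in (0, 1)
    value = x @ V + c           # plain linear projection
    return gate * value


rng = np.random.default_rng(0)
d_model, d_ff = 4, 6            # toy sizes, chosen only for illustration
x = rng.normal(size=(2, d_model))
W = rng.normal(size=(d_model, d_ff))
V = rng.normal(size=(d_model, d_ff))
b = np.zeros(d_ff)
c = np.zeros(d_ff)

out = glu(x, W, V, b, c)
print(out.shape)
```

Because the gate lies in $(0, 1)$, each output element is a scaled-down copy of the corresponding element of $xV + c$; the gate decides how much of the linear projection passes through.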

2.2 SwiGLU experimental results

SwiGLU uses $\text{Swish}(x)$ instead of $\sigma(x)$ as the non-linear function in GLU

$$\text{SwiGLU}(x, W, V, b, c, \beta) = \text{Swish}_\beta(xW + b) \otimes (xV + c)$$

$$\text{FFN}_\text{SwiGLU}(x, W, V, W_2) = (\text{Swish}_1(xW) \otimes xV)W_2$$
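
The bias-free $\text{FFN}_\text{SwiGLU}$ layer can be sketched in NumPy as follows. This is a minimal forward pass under assumed toy dimensions and random weights, not the implementation used in the paper or in LLaMA; the names `swish` and `ffn_swiglu` are hypothetical.

```python
import numpy as np


def swish(z: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Swish_beta(z) = z * sigmoid(beta * z), element-wise."""
    return z / (1.0 + np.exp(-beta * z))


def ffn_swiglu(x, W, V, W2):
    """FFN_SwiGLU(x, W, V, W2) = (Swish_1(xW) * xV) W2, with no bias terms."""
    hidden = swish(x @ W) * (x @ V)   # shape: (..., d_ff)
    return hidden @ W2                # shape: (..., d_model)


rng = np.random.default_rng(0)
d_model, d_ff = 8, 16           # toy sizes, chosen only for illustration
x = rng.normal(size=(3, d_model))
W = rng.normal(size=(d_model, d_ff))
V = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

y = ffn_swiglu(x, W, V, W2)
print(y.shape)

# Three weight matrices instead of the two in a standard FFN.
n_params = W.size + V.size + W2.size
print(n_params)
```

Since this layer has three weight matrices rather than two, the GLU Variants paper reduces the hidden size $d_{ff}$ by a factor of $2/3$ for the GLU-based layers, so that parameter count and computation stay comparable to the standard two-matrix FFN.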

The experiments apply various GLU variants to the FFN layer of the T5 architecture
Pre-training results (log-perplexity)
Fine-tuning results (GLUE)

2.3 Conclusion of the SwiGLU experiments

GLU variants built on various activation functions were tested in the Transformer architecture, and SwiGLU (along with GEGLU) was among the best-performing variants
