Rounding error, Underflow, Overflow

DDME · September 24, 2020

Goodfellow Deep Learning

Deep learning (2017, MIT)

TIMESTAMP
Started @200924

The following notes are based on Ian Goodfellow's Ch4 slides.

Numerical Precision: A deep learning super skill

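Deep learning often appears to work well.
- The loss goes down and the model reaches SOTA accuracy.
- There appear to be no bugs in it.
Conversely, deep learning sometimes appears to explode (NaNs, large values).

The culprit is usually loss of numerical precision.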

Rounding and truncation errors

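Digital computers represent real numbers with formats such as float32.
A real number x is rounded to x + delta for some small delta.
Overflow: an x of large magnitude is replaced by inf.
Underflow: an x very close to zero is replaced by 0.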
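The three effects above can be observed directly in NumPy. This is a minimal sketch; the helper name `to_float32` and the sample values are illustrative assumptions, not part of the slides.

```python
import numpy as np

def to_float32(value):
    """Round a Python float (a float64) to the nearest float32."""
    # Ignore NumPy's overflow/underflow warnings; we want to inspect the results.
    with np.errstate(over="ignore", under="ignore"):
        return np.float32(value)

rounded = to_float32(0.1)
delta = float(rounded) - 0.1      # small but nonzero rounding error
overflowed = to_float32(1e39)     # above the float32 maximum (about 3.4e38)
underflowed = to_float32(1e-46)   # below the smallest positive float32 (about 1.4e-45)

print(delta, overflowed, underflowed)
```

Here `delta` is tiny but not zero, `overflowed` is inf, and `underflowed` is exactly 0.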

Example

Adding a very small number to a large one may appear to have no effect, but it can cause a large change downstream:

>>> import numpy as np
>>> a = np.array([0., 1e-8]).astype('float32')
>>> a.argmax()
1
>>> (a+1).argmax()
0

Secondary effects

If x and y have both overflowed to inf, then computing x - y gives inf - inf = NaN.
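A short sketch of this secondary effect, assuming float32 arithmetic in NumPy; the particular values are arbitrary and only chosen to exceed the float32 range.

```python
import numpy as np

big = np.float32(3e38)   # close to the float32 maximum

with np.errstate(over="ignore", invalid="ignore"):
    x = big * np.float32(10.0)   # overflows to inf
    y = big * np.float32(2.0)    # also overflows to inf
    difference = x - y           # inf - inf

print(x, y, difference)
```

Mathematically x - y is a large positive number, but once both operands are inf the result is NaN.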

exp

exp(x) overflows for large x.
- x does not even need to be very large for this to happen.
- float32: 89 overflows.
  
  Never use large x!
exp(x) underflows for very negative x.
- This is worst when exp(x) is a denominator or the argument of a log.
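The float32 threshold quoted above can be checked with a few calls. This is only a sketch: the value -110 is an arbitrary choice of a "very negative" input.

```python
import numpy as np

one = np.float32(1.0)

with np.errstate(over="ignore", under="ignore", divide="ignore"):
    exp_88 = np.exp(np.float32(88.0))      # about 1.65e38, still finite
    exp_89 = np.exp(np.float32(89.0))      # exceeds the float32 maximum
    exp_neg = np.exp(np.float32(-110.0))   # too small for float32, becomes 0
    reciprocal = one / exp_neg             # exp(x) as a denominator

print(exp_88, exp_89, exp_neg, reciprocal)
```

The underflow to 0 looks harmless on its own; it becomes a problem once the result is used as a denominator, as `reciprocal` shows.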

Subtraction

Suppose $x$ and $y$ have similar magnitudes and $x$ is always greater than $y$.
On a computer, $x-y$ can come out negative because of rounding error.
Example: the variance $\operatorname{Var}(f(x))=\mathbb{E}[(f(x)-\mathbb{E}[f(x)])^2] \\ =\mathbb{E}[f(x)^2]-\mathbb{E}[f(x)]^2$. The first expression is safe, while the second is dangerous.
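The two variance formulas can be compared on a small float32 array with a large mean and a small spread. The data below is a hypothetical example, picked so that every step of the first formula happens to be exact in float32.

```python
import numpy as np

# Hypothetical data: large mean, small spread.
x = np.array([10000.0, 10000.25, 10000.5, 10000.75], dtype=np.float32)

mean = x.mean()
two_pass = np.mean((x - mean) ** 2)       # E[(f - E[f])^2]
one_pass = np.mean(x * x) - mean * mean   # E[f^2] - E[f]^2

print(two_pass)   # 0.078125, the exact variance
print(one_pass)   # a multiple of 8 in float32, so it cannot be 0.078125
```

Near 1e8 the spacing between adjacent float32 values is 8, so the second formula subtracts two numbers whose rounding error is far larger than the variance itself. Depending on how the sums round, the result can be zero or negative.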

log and sqrt

log(0) = -inf
- log(<negative>) is imaginary, usually nan in software
sqrt(0) is 0, but its derivative has a divide by zero
Make sure the argument never underflows or rounds to a negative value.
- e.g. standard_dev = sqrt(variance)
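Each of these cases can be reproduced in a couple of lines. This sketch assumes float32 scalars in NumPy; `sqrt_grad` is a hypothetical helper that writes out the analytic derivative of sqrt, and the slightly negative variance is a made-up value.

```python
import numpy as np

def sqrt_grad(x):
    """Analytic derivative of sqrt: d/dx sqrt(x) = 1 / (2 * sqrt(x))."""
    return np.float32(1.0) / (np.float32(2.0) * np.sqrt(x))

with np.errstate(divide="ignore", invalid="ignore"):
    log_zero = np.log(np.float32(0.0))          # -inf
    log_negative = np.log(np.float32(-1.0))     # nan
    grad_at_zero = sqrt_grad(np.float32(0.0))   # inf: divide by zero
    variance = np.float32(-1e-7)                # hypothetical variance that rounded below zero
    standard_dev = np.sqrt(variance)            # nan

print(log_zero, log_negative, grad_at_zero, standard_dev)
```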

log exp

log exp(x) should be simplified to x.

Avoid overflow in exp, and avoid underflow in exp that would produce -inf in log.
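To see why the simplification matters, here is a sketch that evaluates log(exp(x)) literally in float32. The function name `naive_log_exp` is an assumption for illustration.

```python
import numpy as np

def naive_log_exp(x):
    """Compute log(exp(x)) literally, without simplifying it to x."""
    with np.errstate(over="ignore", under="ignore", divide="ignore"):
        return np.log(np.exp(np.float32(x)))

print(naive_log_exp(1.0))      # close to 1.0
print(naive_log_exp(100.0))    # inf: exp overflowed
print(naive_log_exp(-200.0))   # -inf: exp underflowed to 0
```

In both failing cases the mathematically correct answer is simply x, which is well within the float32 range.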

Which is the better hack?

normalized_x = x / st_dev
eps = 1e-7
Should we use
- st_dev=sqrt(eps+variance)
- st_dev=eps+sqrt(variance)?
What if variance is implemented safely and will never round to negative?
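One way to probe the question is to evaluate both candidates at the problematic inputs. This is a sketch and not an answer from the slides; the function names and the slightly negative test value are assumptions.

```python
import numpy as np

eps = np.float32(1e-7)

def st_dev_inside(variance):
    """Option 1: add eps before taking the square root."""
    return np.sqrt(eps + variance)

def st_dev_outside(variance):
    """Option 2: add eps after taking the square root."""
    with np.errstate(invalid="ignore"):
        return eps + np.sqrt(variance)

slightly_negative = np.float32(-1e-8)   # hypothetical variance that rounded below zero
zero = np.float32(0.0)

print(st_dev_inside(slightly_negative))    # small positive number
print(st_dev_outside(slightly_negative))   # nan
print(st_dev_inside(zero), st_dev_outside(zero))
```

If variance can never round to a negative value, both options stay finite. They still differ at variance = 0: option 1 gives sqrt(eps), roughly 3e-4, while option 2 gives eps itself and still passes 0 to sqrt, whose derivative is infinite there.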

log(sum(exp))

Naive implementation:
tf.log(tf.reduce_sum(tf.exp(array)))
Failure modes:
- If any entry is very large, exp overflows
- If all entries are very negative, all exp underflow... and then log is -inf
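Both failure modes show up with small inputs. The sketch below is a NumPy translation of the naive expression (an assumption made so that it runs without TensorFlow), using float32 arrays.

```python
import numpy as np

def naive_log_sum_exp(array):
    """Direct NumPy translation of log(sum(exp(array))), with no stabilization."""
    array = np.asarray(array, dtype=np.float32)
    with np.errstate(over="ignore", under="ignore", divide="ignore"):
        return np.log(np.sum(np.exp(array)))

print(naive_log_sum_exp([1.0, 2.0, 3.0]))            # about 3.4076
print(naive_log_sum_exp([1.0, 2.0, 100.0]))          # inf
print(naive_log_sum_exp([-200.0, -300.0, -400.0]))   # -inf
```

The correct answers for the last two inputs are about 100 and -200, so neither inf nor -inf is a reasonable approximation.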

Stable version

mx = tf.reduce_max(array)
safe_array = array - mx
log_sum_exp = mx + tf.log(tf.reduce_sum(tf.exp(safe_array)))

Built-in version: tf.reduce_logsumexp

Why does the logsumexp trick work?

Algebraically equivalent to the original version: $m+\log\sum_i\exp(a_i-m) \\=m+\log\sum_i\frac{\exp(a_i)}{\exp(m)} \\=m+\log\left(\frac{1}{\exp(m)}\sum_i\exp(a_i)\right) \\=m-\log\exp(m)+\log\sum_i\exp(a_i) \\=\log\sum_i\exp(a_i)$

No overflow:
- Entries of safe_array are at most 0
Some of the exp terms underflow, but not all
- At least one entry of safe_array is 0
- The sum of exp terms is at least 1
- The sum is now safe to pass to the log
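The properties listed above can be checked numerically. This is a NumPy sketch of the max-shifted computation, assuming a 1-D float32 array with finite entries; returning the intermediate values is only for inspection.

```python
import numpy as np

def log_sum_exp(array):
    """NumPy sketch of the max-shifted log(sum(exp(array)))."""
    array = np.asarray(array, dtype=np.float32)
    mx = np.max(array)
    safe_array = array - mx               # every entry is <= 0, the largest is exactly 0
    with np.errstate(under="ignore"):
        exp_terms = np.exp(safe_array)    # each term is in [0, 1]; some may underflow to 0
    total = np.sum(exp_terms)             # >= 1 because exp(0) = 1 is one of the terms
    return mx + np.log(total), safe_array, total

value, safe_array, total = log_sum_exp([1.0, 2.0, 100.0])
print(value, safe_array.max(), total)

negative_value, _, _ = log_sum_exp([-200.0, -300.0, -400.0])
print(negative_value)
```

On these inputs the results are 100 and -200 instead of inf and -inf.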

Softmax

Softmax: use your library's built-in softmax function

If you build your own, use:

safe_logits = logits - tf.reduce_max(logits)
softmax = tf.nn.softmax(safe_logits)

Similar to logsumexp
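For a fully hand-rolled version, the same max shift can be written out in NumPy. This sketch assumes a single 1-D vector of finite logits and spells out the exp and the normalization instead of calling a library softmax.

```python
import numpy as np

def stable_softmax(logits):
    """Hand-rolled softmax over a 1-D array, shifted by the max logit."""
    logits = np.asarray(logits, dtype=np.float32)
    safe_logits = logits - np.max(logits)   # largest entry becomes 0
    with np.errstate(under="ignore"):
        exp_terms = np.exp(safe_logits)     # no overflow: every argument is <= 0
    return exp_terms / np.sum(exp_terms)    # the denominator is at least 1

probs = stable_softmax([1000.0, 1000.0, 0.0])
print(probs)   # entries are 0.5, 0.5 and 0.0
```

Without the shift, exp(1000) would overflow and the division would produce inf / inf = NaN.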

Sigmoid

Use your library's built-in sigmoid function
If you build your own:
- Recall that sigmoid is just softmax with one of the logits hard-coded to 0
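Following that observation, a sigmoid can be sketched as a two-way softmax over the logits (x, 0), using the same max shift. The function name `stable_sigmoid` is an assumption for illustration.

```python
import numpy as np

def stable_sigmoid(x):
    """sigmoid(x) as softmax over the two logits (x, 0), shifted by max(x, 0)."""
    x = np.asarray(x, dtype=np.float32)
    m = np.maximum(x, 0)                  # max of the two logits
    with np.errstate(under="ignore"):
        numerator = np.exp(x - m)         # exp of the shifted logit x
        denominator = numerator + np.exp(-m)   # plus exp of the shifted logit 0
    return numerator / denominator

values = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(values)   # entries are 0.0, 0.5 and 1.0
```

Both exponents are at most 0, so nothing overflows, and the denominator is at least 1.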

Cross-entropy

Cross-entropy loss for softmax (and sigmoid) has both softmax and logsumexp in it
Compute it using the logits not the probabilities
The probabilities lose gradient due to rounding error where the softmax saturates
Use tf.nn.softmax_cross_entropy_with_logits or similar
If you roll your own, use the stabilization tricks for softmax and logsumexp
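The difference between the two routes can be sketched in NumPy for a single example. Both function names are hypothetical, and the logits are chosen so that the softmax saturates in float32.

```python
import numpy as np

def cross_entropy_from_logits(logits, label):
    """-log softmax(logits)[label], computed as logsumexp(logits) - logits[label]."""
    logits = np.asarray(logits, dtype=np.float32)
    mx = np.max(logits)
    with np.errstate(under="ignore"):
        log_sum_exp = mx + np.log(np.sum(np.exp(logits - mx)))
    return log_sum_exp - logits[label]

def cross_entropy_from_probs(logits, label):
    """Same loss, but routed through explicit probabilities (for comparison)."""
    logits = np.asarray(logits, dtype=np.float32)
    with np.errstate(under="ignore", divide="ignore"):
        exp_terms = np.exp(logits - np.max(logits))
        probs = exp_terms / np.sum(exp_terms)
        return -np.log(probs[label])

logits = [200.0, 0.0]
print(cross_entropy_from_logits(logits, 1))   # 200.0
print(cross_entropy_from_probs(logits, 1))    # inf
```

In the second route the probability of the true class rounds to exactly 0, so the information needed for a finite loss (and a useful gradient) is already gone before the log is taken.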

Bug hunting strategies

If you increase your learning rate and the loss gets stuck, you are probably rounding your gradient to zero somewhere: maybe computing cross-entropy using probabilities instead of logits

For a correctly implemented loss, too high a learning rate should usually cause an explosion
If you see an explosion (NaNs, very large values), immediately suspect:
- log
- exp
- sqrt
- division
Always suspect the code that changed most recently
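One way to narrow down where an explosion starts is to check intermediate values right after each suspect operation. The helper below is a hypothetical sketch, not something from the slides; `check_finite` and the sample arrays are assumptions.

```python
import numpy as np

def check_finite(name, value):
    """Raise as soon as an intermediate value contains NaN or inf."""
    value = np.asarray(value)
    if not np.all(np.isfinite(value)):
        raise FloatingPointError(f"{name} contains NaN or inf")
    return value

# Hypothetical probabilities where one entry has already rounded to 0.
probs = check_finite("probs", np.array([1.0, 0.0], dtype=np.float32))

with np.errstate(divide="ignore"):
    log_probs = np.log(probs)   # suspect operation: log of 0 gives -inf

try:
    check_finite("log_probs", log_probs)
    message = "ok"
except FloatingPointError as err:
    message = str(err)

print(message)   # log_probs contains NaN or inf
```

Placing such checks after each log, exp, sqrt and division makes the first failing step visible instead of only the final NaN in the loss.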
