PyHessian

YOUNG·2022년 8월 3일

논문

목록 보기

1/1

요약

PyHessian은 헤시안 정보, 즉 second-order statistics를 효율적으로 계산해 주는 파이토치 라이브러리 입니다. 여기서 헤시안 정보란 모델 파라미터에 대한 2차미분인 헤시안 행렬의 eigenvalues, trace, ESD 등을 말합니다.

사이즈가 큰 모델의 경우 헤시안 행렬을 전부 구하려면 계산 부담이 너무 크니까 second-order information을 우회적으로 얻는 방법을 제안합니다. 예를 들어 matvec을 이용해서 eigenvalue를 구하고, Hutchinson method를 이용해서 trace를 구하고, SLQ를 이용해서 ESD를 구합니다.

본 논문에서는 이렇게 얻은 헤시안 정보를 가지고 loss landscape를 분석하는데요, Batch Normalization layer가 landscape를 smooth하게 만들어 준다는 기존의 통념이 틀렸음을 보여 줍니다.

우리가 BN layer를 사용하는 이유는 loss landscape를 smooth하게 만들어서 일반화 성능을 높이기 위해서인데, 이러한 smoothing behavior는 헤시안 스펙트럼 (trace, ESD)이 작아짐을 통해 확인할 수 있습니다. 그런데 본 연구에서는 ResNet 모델에서 BN layer를 제거했을 때 헤시안 스펙트럼이 오히려 작아지는 것을 관찰했습니다. 즉, BN layer를 제거했는데도 수렴 포인트가 더 smooth해진 것이죠. (참고로 loss landscape가 울퉁불퉁하다면 loss가 파라미터 변화에 민감해서 일반화 성능이 낮을 수 있음을 의미합니다)

이 논문의 의의는 단순히 BN layer에 대한 새로운 발견이 아니라, PyHessian이라는 프레임워크를 활용해서 다양한 모델의 loss landscape를 분석할 수 있다는 점이고 실제로 이 프레임워크를 가져다가 vision transformer를 분석한 후속 연구가 존재합니다.

Introduction

ResNet의 핵심 요소인 residual connection과 BN은 직관에 의해 motivated 되었고, 정량적으로 검증되지는 않았다
PyHessian은 헤시안 정보 (e.g., top Hessian eigenvalues, Hessian trace, Hessian eigenvalue spectral density)를 계산할 수 있게 해 주는 오픈소스 프레임워크이다
- 본 연구에서는 이걸 resudual connections와 BN이 신경망 훈련에 미치는 영향을 분석하는 데 사용한다
contributions
- 헤시안 정보를 효율적으로 계산하는 프레임워크인 PyHessian 제안
- ResNet에서 BN layers 제거했을 때, 특히 깊은 모델에서 헤시안 스펙트럼이 급증함을 발견
  - 얕은 모델에서는 헤시안 스펙트럼이 평평해졌는데, 이는 BN layer가 loss landscape를 smooth하게 만든다는 기존의 상식과는 다른 결과임
  - 깊은 모델에서는 BN을 제거했을 때 뾰족한 local minima에 수렴함을 발견
- ResNet에서 residual connections 제거했을 때 헤시안 스펙트럼 증가함을 보임
- 헤시안 분석 결과 BN이 stage의 끝에 위치할 때성능을 더 끌어올림을 발견

Hessian and large-scale Hessian computation
- second-order information 얻기 위해 꼭 헤시안 행렬 안 구해도 된다
  - 대신에 “matvec” 계산하면 됨
- 헤시안 분석으로 신경망 훈련과 추론 관련 연구
- …
Residual connections and batch normalization
- RC: 훈련 중 vanishing gradient 방지, smoother loss landscape 보임
  - 기존 연구에서는 헤시안 스펙트럼 분석 대신 3D plot으로 이걸 확인했다
    - 파라미터가 엄청나게 많으면 그리기 어렵다는 단점
- BN: Internal Covariance Shift(ICS)를 줄이고, smoother loss landscape 보임
  - 기존 연구에서는 first-order analysis 했는데, 본 연구에서 second-order analysis 통해서 사실 smoother 아닌거 밝힘

Method

Neural network Hessian Matvec (Hessian eigenvalues 구하기)
- the gradient of the loss w.r.t. parameters (=a vector)
  - $\frac{\partial L}{\partial \theta}=g_{\theta} \in \mathbb{R}^{m}$
- the second derivative of the loss (=a matrix)
  - $H=\frac{\partial^{2} L}{\partial \theta^{2}}=\frac{\partial g_{\theta}}{\partial \theta} \in \mathbb{R}^{m \times m}$
- matvec, an oracle to compute the application of the Hessian to a random vector $v$
  - $\frac{\partial g_{\theta}^{T} v}{\partial \theta}=\frac{\partial g_{\theta}^{T}}{\partial \theta} v+g_{\theta}^{T} \frac{\partial v}{\partial \theta}=\frac{\partial g_{\theta}^{T}}{\partial \theta} v=H v$
    - 이걸 가지고 eigenvalues 구할 수 있다
    - 근데 파라미터 너무 많으면 eigenvalues가 loss landscape 대표하지 못할 수 있으므로 trace and ESD도 구한다
Hutchinson method for Hessian trace computation (trace 구하기)
- $\begin{aligned}\operatorname{Tr}(H) &=\operatorname{Tr}(H I)=\operatorname{Tr}\left(H \mathbb{E}\left[v v^{T}\right]\right)=\mathbb{E}\left[\operatorname{Tr}\left(H v v^{T}\right)\right] \\&=\mathbb{E}\left[v^{T} H v\right]\end{aligned}$
Full eigenvalue spectral density (ESD 구하기)
- $\phi(t)=\frac{1}{m} \sum_{i=1}^{m} \delta\left(t-\lambda_{i}\right)$
- SLQ: an efficient matrix-free algorithm (본 논문)
  - 따라서 full Hessian 구할 필요가 없고

Results

BN과 residual connection이 헤시안 스펙트럼에 미치는 영향

Experimental setting
- 훈련 중 checkpoints마다 헤시안 스펙트럼 계산함
Full network Hessian analysis
- BN
  - 헤시안 스펙트럼
    - BN 제거하면 일반화 성능 하락함 (특히 깊은 모델에서)
    - 얕은 모델에서 BN 제거하면 trace, ESD 오히려 작아짐
    - BN 제거하면 훈련 힘들어지긴 하지만 꼭 헤시안 스펙트럼을 크게 만드는 건 아니다 (기존의 통념과는 달리)
    - 깊은 모델에서 BN 제거하면 trace, ESD 커짐 → 기존의 통념
  - 3D loss landscapes: 파라미터 흔들어서 계산 (across the first and second eigenvectors of the Hessian)
    - 얕은 모델에서 BN 제거하면 평평한 local minimum에 수렴 (BN이 smoother loss landscape 만든다는 통념과 반대됨)
    - 깊은 모델에서 BN 제거하면 높은 loss값에 수렴
- Residual connection
  - 헤시안 스펙트럼
    - 깊은 모델, 얕은 모델 모두 Res 제거하면 trace, ESD 커진다
  - 3D loss landscape
    - 모델 깊어질 수록 Res 제거하면 수렴 포인트가 sharp해진다
Stage-wise Hessian analysis
- 나중 stage에서 BN을 제거했을 때 trace 더 급증한다
  - 또 정확도도 가장 크게 떨어짐

Conclusions

BN 추가한다고 해서 꼭 smoother loss landscape 아니다
- 기존의 통념은 깊은 모델에서만 관찰되었다
  - BN 제거했을 때 sharp한 local minima에 수렴했고, 이는 훈련 loss가 높고 일반화 성능이 낮다
Residual connections 제거하면 울퉁불퉁한 (coarse) loss landscape

YOUNG

SteveJobsOfficial

PyHessian

논문

요약

Introduction

Related work

Method

Results

Conclusions

1개의 댓글