Ten Lessons From Three Generations Shaped Google’s TPUv4i

Sonny020402 · July 2, 2023

Accelerator

Introduction to TPU

Ten Lessons Learned since 2015

How the 10 Lessons Shaped TPUv4i’s Design

TPUv4i Performance Analysis

Performance in more Depth: CMEM

Accelerators & Training at Scale

Introduction to TPU

Commercial Domain Specific Architectures (DSAs) for Deep Neural Networks became necessary to compensate for sluggish CPU improvement and a diminishing Moore’s Law.

In the figure above, the left side is TPU v1, which started going into Google datacenters in 2015, and the right side is TPU v2/v3. The MXU (matrix multiply unit) of v1 consists of 64K 8-bit integer MACs, so it can perform 64K multiply-accumulates per cycle (a 256*256 tile). It communicates with the host CPU over PCIe to exchange model inputs, outputs, and so on.
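To make the "64K MACs per cycle" number concrete, here is a minimal Python sketch of the arithmetic. It is not how the MXU is actually implemented (the real unit is hardware, not a loop). The names `macs_per_cycle`, `peak_ops_per_second`, `int8_mac`, and `matvec_int8` are made up for illustration, and the clock frequency is left as a parameter because this post does not state it.

```python
TILE = 256  # MXU edge length from the text (256 * 256 tile)


def macs_per_cycle(tile=TILE):
    """Number of multiply-accumulate units in a tile x tile array."""
    return tile * tile


def peak_ops_per_second(clock_hz, tile=TILE):
    """Peak throughput, counting each MAC as 2 operations (multiply + add).

    clock_hz is an assumption supplied by the caller, not a value from the post.
    """
    return 2 * macs_per_cycle(tile) * clock_hz


def int8_mac(acc, a, w):
    """One multiply-accumulate: 8-bit integer inputs, wider accumulator."""
    assert -128 <= a <= 127 and -128 <= w <= 127
    return acc + a * w


def matvec_int8(weights, activations):
    """Multiply a small weight tile by an activation vector using only MACs."""
    out = []
    for row in weights:
        acc = 0
        for w, a in zip(row, activations):
            acc = int8_mac(acc, a, w)
        out.append(acc)
    return out


if __name__ == "__main__":
    print(macs_per_cycle())                       # MAC units in the tile
    print(matvec_int8([[1, 2], [3, 4]], [5, 6]))  # tiny 2 * 2 example
```

As a rough check, passing a clock of 700 MHz (the figure usually quoted for TPU v1, not stated in this post) to `peak_ops_per_second` gives about 92 tera-operations per second.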

Unlike the previous version, TPU v2 can also do training (v1 was only for inference, i.e. serving). Training is harder! Beyond the computation itself, many parallel resources have to be coordinated to produce a single consistent set of weights. On the computation side, backprop requires derivatives, matrices are transposed frequently, and activation functions have to be handled at high precision. On top of that, unlike inference, the activations (maybe not all of them, given activation checkpointing? anyway) must be kept in memory for the weight update, so training needs far more memory! Training also sums many small values more often than inference does, and capturing them sufficiently well requires floating point, which further increases memory usage. Finally, the best training optimization algorithm keeps changing (e.g. RMSprop, ADAM), so the hardware has to be programmable as well.
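A tiny example may help show why backprop brings in transposes and stored activations. The sketch below assumes a single linear layer y = W x with no bias and uses plain Python lists. The helper names (`linear_forward`, `linear_backward`, and so on) are hypothetical and not taken from the paper or from any TPU software.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]


def linear_forward(W, x):
    """Compute y = W x and return the input that backward will need."""
    y = matvec(W, x)
    saved_x = x  # training keeps this activation; inference could drop it
    return y, saved_x


def linear_backward(W, saved_x, grad_y):
    """Given dL/dy, return (dL/dW, dL/dx) for y = W x."""
    grad_W = outer(grad_y, saved_x)        # uses the stored activation
    grad_x = matvec(transpose(W), grad_y)  # uses W transposed
    return grad_W, grad_x


if __name__ == "__main__":
    W = [[1, 2], [3, 4]]
    y, saved = linear_forward(W, [5, 6])
    grad_W, grad_x = linear_backward(W, saved, [1, -1])
    print(y, grad_W, grad_x)
```

Inference only needs `linear_forward`. Training also needs `saved_x` to survive until `linear_backward` runs, which is where the extra memory goes, and the backward pass is where the transposed weights appear.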