hanlyang0522.log

DL Basic) Required & Optional Assignments

한량 · August 10, 2021

[U-stage] DeepLearning Basics

Required Assignment 1) ViT: Vision Transformer

Paper: https://arxiv.org/abs/2010.11929
Review: https://jeonsworld.github.io/vision/vit/
Implementation reference: https://yhkim4504.tistory.com/5

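ViT applies the transformer used in NLP to CV.

In NLP, attention was adopted to pull in information from nodes that are far apart (distant within the sentence).
In an image, each pixel can be treated as a node.
However, computing the relationship (attention) between every pair of pixels is inefficient,
so the image is split into sub-patches instead.
The patches are created with the einops library
(highly recommended, since it is much simpler than torch.view).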
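
To make the efficiency argument concrete, here is a rough count of how many query-key pairs self-attention has to score when every pixel is a token versus when every patch is a token. The 224x224 image size and the 16x16 patch size are illustrative assumptions, and `attention_pairs` is a hypothetical helper, not something from the assignment.

```python
def attention_pairs(height: int, width: int, patch: int = 1) -> int:
    """Number of query-key pairs when the image is cut into patch x patch tokens."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = (height // patch) * (width // patch)
    # Self-attention scores every token against every token.
    return tokens * tokens


# Assumed sizes, chosen only for illustration.
pixel_level = attention_pairs(224, 224)            # every pixel is a token
patch_level = attention_pairs(224, 224, patch=16)  # 16x16 patches are tokens

print(pixel_level, patch_level, pixel_level // patch_level)
```

Since the token count shrinks by a factor of $P^2$, the number of pairs shrinks by $P^4$, which is why attending over patches is so much cheaper than attending over pixels.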

Patch Embedding

According to the paper, "To handle 2D images, we reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection."

$H, W$ : resolution (height, width) of the input image
$C$ : number of channels of the input image
$N$ : number of patches
$P$ : side length of a patch
$D$ : dimension of the embedding vector
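
The reshape described in the quote can be sketched as follows. This version uses plain NumPy so that it runs without einops or PyTorch. The einops pattern in the comment should be the equivalent one-liner. The channels-first layout, the 32x32 image size, $P = 8$, $D = 64$, and the names `patchify`, `W_embed`, and `b_embed` are all assumptions for illustration. In a real model the projection weights would be trained rather than sampled once.

```python
import numpy as np


def patchify(images: np.ndarray, patch: int) -> np.ndarray:
    """(B, C, H, W) -> (B, N, P*P*C), where N = H*W / P**2."""
    b, c, h, w = images.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # Assumed einops equivalent:
    # rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch, p2=patch)
    x = images.reshape(b, c, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 3, 5, 1)  # (B, gh, gw, p1, p2, C)
    return x.reshape(b, gh * gw, patch * patch * c)


rng = np.random.default_rng(0)
images = rng.standard_normal((2, 3, 32, 32))  # B=2, C=3, H=W=32 (assumed)
P, D = 8, 64

patches = patchify(images, P)  # (2, 16, 192): N=16, P^2*C=192

# Stand-in for the trainable linear projection to D dimensions.
W_embed = rng.standard_normal((P * P * 3, D)) * 0.02
b_embed = np.zeros(D)
tokens = patches @ W_embed + b_embed  # (2, 16, 64)

print(patches.shape, tokens.shape)
```

Here $N = 32 \cdot 32 / 8^2 = 16$, so each image becomes a sequence of 16 tokens of dimension $D$.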

Encoder

MHA, Multi-head Attention

Linear Projection

The patch + position embedded vectors are linearly projected to match the embedding size.
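
A minimal sketch of this step, assuming the projection produces the queries, keys, and values with one $D \times D$ weight matrix each. The shapes and the names `pos_embed`, `W_q`, `W_k`, and `W_v` are illustrative assumptions. The class token from the paper is left out to keep the example short, and the random weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 2, 16, 64  # batch, number of patches, embedding size (assumed)

patch_tokens = rng.standard_normal((B, N, D))      # output of patch embedding
pos_embed = rng.standard_normal((1, N, D)) * 0.02  # learnable, shared across the batch
x = patch_tokens + pos_embed                       # broadcast over the batch axis

# One projection matrix each for Q, K, and V; all keep the embedding size D.
W_q, W_k, W_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

print(x.shape, Q.shape, K.shape, V.shape)
```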

Multi head

Q, K, and V, each the same size as the embedding vector, are split into head_num heads.
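
The split into heads is only a reshape, which the sketch below tries to show. `split_heads` and `merge_heads` are hypothetical helper names, and the shapes (embedding size 64, `head_num` 8) are assumed for illustration.

```python
import numpy as np


def split_heads(t: np.ndarray, head_num: int) -> np.ndarray:
    """(B, N, D) -> (B, head_num, N, D // head_num)."""
    b, n, d = t.shape
    if d % head_num:
        raise ValueError("embedding size must be divisible by head_num")
    return t.reshape(b, n, head_num, d // head_num).transpose(0, 2, 1, 3)


def merge_heads(t: np.ndarray) -> np.ndarray:
    """Inverse of split_heads: (B, head_num, N, Dh) -> (B, N, head_num * Dh)."""
    b, h, n, dh = t.shape
    return t.transpose(0, 2, 1, 3).reshape(b, n, h * dh)


# Dummy Q with distinct values so the slicing is easy to check.
Q = np.arange(2 * 16 * 64, dtype=np.float64).reshape(2, 16, 64)
Q_heads = split_heads(Q, head_num=8)  # (2, 8, 16, 8)

print(Q_heads.shape)
```

Each head sees a contiguous slice of the embedding: head 1 of the first token holds `Q[0, 0, 8:16]`, and `merge_heads` restores the original layout.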

SDPA, Scaled Dot-Product Attention

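Applying scaled dot-product attention in each head, $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$, and then combining the heads yields the MHA output.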
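
As a sketch of the whole computation, the block below applies $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$ per head, merges the heads, and applies an output projection. The shapes, the random inputs, and the names `softmax`, `scaled_dot_product_attention`, and `W_o` are assumptions for illustration, not the assignment's reference code.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """q, k, v: (B, heads, N, Dh). Returns the output and the attention weights."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # (B, heads, N, N)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
B, H, N, Dh = 2, 8, 16, 8  # assumed sizes; D = H * Dh = 64
q, k, v = (rng.standard_normal((B, H, N, Dh)) for _ in range(3))

out, weights = scaled_dot_product_attention(q, k, v)

# Merge the heads back to (B, N, D) and apply the output projection.
merged = out.transpose(0, 2, 1, 3).reshape(B, N, H * Dh)
W_o = rng.standard_normal((H * Dh, H * Dh)) / np.sqrt(H * Dh)
mha_output = merged @ W_o

print(out.shape, weights.shape, mha_output.shape)
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the head dimension, which would otherwise push the softmax into regions with very small gradients.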
