Transformers in Vision: A Survey, Part 3

이준석 · December 15, 2022

3.1 Single-head Self-Attention

3.1.1 Self-Attention in CNNs

Inspired by the non-local means operation [69], which was designed mainly for image denoising, Wang et al. [70] proposed a differentiable non-local operation for deep neural networks that captures long-range dependencies in both space and time in a feed-forward fashion.
Note: "differentiable" here means the operation admits gradients, so it can be trained end to end with backpropagation; "dependencies" refers to relationships between distant positions.
Given a feature map, their proposed operator [70] computes the response at each position as a weighted sum of the features at all positions in the feature map. This way, the non-local operation can capture interactions between any two positions in the feature map, regardless of the distance between them.
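The weighted sum described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation from [70]: the function name `non_local`, the choice of an exponentiated dot product as the pairwise weight, the normalization over all positions, and the absence of learned projections are all assumptions made here for clarity.

```python
import math

def non_local(features):
    """Compute a non-local response for every position.

    features: list of N feature vectors, one per position (for example a
    flattened H*W image map or T*H*W video map), each a list of C floats.

    Assumptions (not taken from the text above): the pairwise weight is
    exp(dot product) normalized over all positions, and features are
    aggregated as-is, without any learned projections.
    """
    outputs = []
    for x_i in features:
        # Similarity of position i to every position j (including itself).
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Normalize the scores into weights that sum to 1.
        # Subtracting the maximum keeps exp() numerically stable.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Response at i: weighted sum of the features at all positions.
        y_i = [
            sum(w * x_j[c] for w, x_j in zip(weights, features))
            for c in range(len(x_i))
        ]
        outputs.append(y_i)
    return outputs

# A 2x2 feature map with 2 channels, flattened to 4 positions.
feature_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
responses = non_local(feature_map)
```

Because every position is compared with every other position, the weights do not depend on how far apart two positions are, only on how similar their features are. The cost of that is quadratic growth in the number of positions.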

Video classification is an example of a task where long-range interactions between pixels exist in both space and time. By modeling these long-range interactions, Wang et al. [70] showed that non-local deep neural networks achieve more accurate video classification on the Kinetics dataset [71].
