Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems.

Transformer models [1] have recently demonstrated exemplary performance on a broad range of language tasks, e.g., text classification and machine translation.

For a given entity in the sequence, self-attention basically computes the dot-product of its query with all keys, which is then normalized using a softmax operator to obtain the attention scores.
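The computation above can be sketched as a minimal NumPy implementation of scaled dot-product self-attention; the projection matrices and variable names here are illustrative assumptions, not taken from the text.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d).

    Wq, Wk, Wv are learned query/key/value projections (passed in as plain
    arrays for this sketch).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # dot-product of each query with all keys
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax normalization
    return weights @ V                            # attention-weighted sum of values

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                   # 4 entities, feature dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)               # shape (4, 8)
```

Each output row is thus a convex combination of the value vectors, with weights determined by query-key similarity.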

Inspired by the non-local means operation [69], which was mainly designed for image denoising, Wang et al. [70] proposed a differentiable non-local operation for deep neural networks that captures long-range dependencies in a feed-forward fashion.
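A minimal sketch of an embedded-Gaussian non-local block in this spirit, operating on a flattened feature map, might look as follows; the weight matrices and the residual form are assumptions for illustration and not a definitive reproduction of Wang et al.'s implementation.

```python
import numpy as np

def non_local_block(x, theta, phi, g, w):
    """Embedded-Gaussian non-local operation on features x of shape (n, c),
    where n is the number of spatial positions and c the channel dimension.
    theta, phi, g, w are (c, c) projection weights (illustrative)."""
    f = (x @ theta) @ (x @ phi).T                 # pairwise similarities f(x_i, x_j)
    f -= f.max(axis=-1, keepdims=True)            # numerical stability
    f = np.exp(f)
    f /= f.sum(axis=-1, keepdims=True)            # normalize over all positions j
    y = f @ (x @ g)                               # aggregate g(x_j) from every position
    return x + y @ w                              # residual connection

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 32))                 # 16 positions, 32 channels
theta, phi, g, w = (rng.standard_normal((32, 32)) for _ in range(4))
y = non_local_block(x, theta, phi, g, w)          # shape (16, 32)
```

Because every position attends to every other position, the block models long-range dependencies that a stack of local convolutions would need many layers to cover.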