Abstract
Problem definition
- Grid-like artifacts in ViTs' feature maps hurt performance on downstream tasks.
- Positional embeddings at the input stage are the cause.
Model introduction
- A novel noise model that is applicable to all ViTs.
- Two-stage approach
Stage 1: Per-image optimization
- Dissects ViT outputs into three components (one semantic term + two artifact-related terms conditioned on pixel location)
Stage 2: Learnable denoiser
- Predicts artifact-free features directly from raw, unprocessed ViT outputs
- Generalizes to novel, unseen data
Model strength
- DVT does not require re-training (it can be applied to existing pre-trained ViTs).
- Consistently and significantly improves existing state-of-the-art models.
We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings
Introduction
![](https://velog.velcdn.com/images/jinotter3/post/35362e95-0575-4b3f-a4a2-db0aa1023adc/image.png)
- Persistent noise is observable across multiple ViT models
![](https://velog.velcdn.com/images/jinotter3/post/7cfde04e-d986-41ab-8773-d215a6944348/image.png)
- These artifacts hinder the performance of downstream tasks (e.g., the noise clusters in the figure above).
Main Research Question
Is it feasible to effectively denoise these artifacts in pre-trained ViTs, ideally without model re-training?
Key Contributions
- We identify and highlight the widespread occurrence of noise artifacts in ViT features, pinpointing positional embeddings as a crucial underlying factor.
- We introduce a novel noise model tailored for ViT outputs, paired with a neural field-based denoising technique. This combination effectively isolates and removes noise artifacts from features.
- We develop a streamlined and generalizable feature denoiser for real-time and robust inference.
- Our approach significantly improves the performance of multiple pre-trained ViTs in a range of downstream tasks, confirming its utility and effectiveness (e.g., as high as a 3.84 mIoU improvement after denoising).
Backgrounds
ViTs
- ViTs trained with diverse training objectives exhibit commonly observed noise artifacts in their outputs.
- This paper enhances the quality of local features, as evidenced by improvements in semantic segmentation and depth prediction tasks.
ViT artifacts
- A fundamental noise issue, noticeable as noisy attention maps in both supervised and unsupervised ViTs
- Previously noticed, yet largely unexplored.
- Recent work ("Vision Transformers Need Registers", 2023) explained them as 'high-norm' patches in low-informative background regions, suggesting that their occurrence is limited to large and sufficiently trained ViTs.
- In contrast, this paper finds a strong correlation between the artifacts and the positional embeddings (PE).
Preliminaries
- The ViT architecture has largely remained consistent with its original design.
![](https://velog.velcdn.com/images/jinotter3/post/bcc628f6-8296-4e87-9a96-4a62fd369323/image.png)
Factorizing ViT outputs
- While visual features should be translation- and reflection-invariant, ViTs intertwine patch embeddings with positional embeddings, thus breaking the transformation invariance.
![](https://velog.velcdn.com/images/jinotter3/post/928cf5fd-ce6d-439e-9123-6e2368cc9650/image.png)
The figure above shows that the artifacts stay almost fixed at their relative positions within a frame.
Decomposing ViT outputs into three terms:
- f(x) : input-dependent, noise-free semantics term
- g(Epos): input-independent artifact term related to spatial positions
- h(x,Epos) : residual term accounting for the co-dependency of semantics and positions.
ViT(x) = f(x) + g(Epos) + h(x, Epos) (a minimal sketch follows this list)
- If the output feature map were spatially invariant (no PE), g and h would become zero functions.
- If every feature depended on both position and semantics, f and g would turn into zero functions (everything collapses into h).
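As a reading aid, here is a minimal sketch (not the paper's code) of how this assumed noise model composes; `f`, `g`, and `h` are placeholder callables standing in for the three terms.

```python
def decomposed_vit_output(x, e_pos, f, g, h):
    """Sketch of the assumed noise model ViT(x) = f(x) + g(Epos) + h(x, Epos).
    f, g, h are placeholder callables for the semantics term, the
    position-only artifact term, and the residual term, respectively."""
    semantics = f(x)          # input-dependent, artifact-free features
    artifact = g(e_pos)       # input-independent, position-dependent artifact
    residual = h(x, e_pos)    # co-dependency of semantics and positions
    return semantics + artifact + residual
```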
Per-image Denoising with Neural Fields
- Directly solving the decomposition above from a single image is hard.
- Cross-view feature and artifact consistencies
- Feature consistency: transformation invariance of visual features (even under spatial transformations, the essential semantic content remains invariant)
- Artifact consistency: the input-independent artifact remains observable and constant across all transformations
Neural fields as feature mappings
- A semantics field F and an artifact field G (sketched below)
- F (holistic image semantics) is optimized for each individual image
- G (spatial artifact feature representation) is shared by all transformed views
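A minimal PyTorch sketch of the two fields, purely for intuition: F as a small coordinate MLP and G as a learnable per-patch-position grid. The actual field parameterizations, sizes, and hidden widths are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SemanticsField(nn.Module):
    """F: maps normalized pixel coordinates to per-image semantic features.
    A plain coordinate MLP here; the paper may use a more efficient field."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, coords):      # coords: (N, 2), values in [0, 1]
        return self.mlp(coords)     # (N, feat_dim)


class ArtifactField(nn.Module):
    """G: an input-independent artifact feature per patch position,
    shared by all transformed views (and, in Stage 1, fit per image)."""
    def __init__(self, feat_dim, grid_h, grid_w):
        super().__init__()
        self.grid = nn.Parameter(torch.zeros(grid_h, grid_w, feat_dim))

    def forward(self):
        return self.grid            # (grid_h, grid_w, feat_dim)
```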
Learning the decomposition
- Learn the semantics field F, the artifact field G, and the residual term Δ (predicted by h_ψ; see Optimization below) by minimizing a regularized reconstruction loss (a hedged sketch follows the figures below)
![](https://velog.velcdn.com/images/jinotter3/post/d965eefa-22a0-4ccd-8de0-c52a33b1c979/image.png)
cos: cosine similarity; sg: stop-gradient; t: a random transformation sampled from the augmentation distribution; coords: pixel coordinates of the transformed views in the original image
![](https://velog.velcdn.com/images/jinotter3/post/996d766c-0f07-463e-888d-d71bfe4674eb/image.png)
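Since the exact loss from the figures is not reproduced here, the following is only a hedged sketch of the ingredients described above (an L2 + cosine distance with a stop-gradient on the target, plus a regularizer keeping the residual small). The weight `alpha` and the exact form of the regularizer are assumptions.

```python
import torch
import torch.nn.functional as Fn

def distance_loss(pred, target):
    """Patch-wise distance between reconstructed and observed ViT features:
    an L2 term plus a cosine term with a stop-gradient on the target."""
    l2 = (pred - target).pow(2).mean()
    cos = 1.0 - Fn.cosine_similarity(pred, target.detach(), dim=-1).mean()
    return l2 + cos

def recon_loss(semantics, artifact, residual, target, alpha=0.1):
    """Regularized reconstruction loss: reconstruct ViT(t(x)) from the three
    terms and keep the residual small (alpha and the regularizer are guesses)."""
    recon = semantics + artifact + residual
    return distance_loss(recon, target) + alpha * residual.abs().mean()
```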
Optimization
- Phase 1: train F_θ and G_ξ using only L_distance
  => captures a significant portion of the ViT outputs
- Phase 2: freeze G_ξ and continue to train F_θ and h_ψ using L_recon (a schematic training loop follows)
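A self-contained schematic of the two-phase optimization, reusing the field and loss sketches above with dummy data; the residual predictor, grid size, learning rates, and step counts are placeholders, not the paper's recipe.

```python
import torch
import torch.nn as nn

feat_dim, grid_h, grid_w = 768, 14, 14
F_theta = SemanticsField(feat_dim)                     # sketch above
G_xi = ArtifactField(feat_dim, grid_h, grid_w)         # sketch above
h_psi = nn.Linear(feat_dim, feat_dim)                  # placeholder residual predictor

coords = torch.rand(grid_h * grid_w, 2)                # stand-in view coordinates
target = torch.randn(grid_h * grid_w, feat_dim)        # stand-in noisy ViT features

# Phase 1: fit F_theta and G_xi with L_distance only.
opt1 = torch.optim.Adam([*F_theta.parameters(), *G_xi.parameters()], lr=1e-3)
for _ in range(100):
    pred = F_theta(coords) + G_xi().reshape(-1, feat_dim)
    loss = distance_loss(pred, target)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Phase 2: freeze G_xi, train F_theta and h_psi with L_recon.
for p in G_xi.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam([*F_theta.parameters(), *h_psi.parameters()], lr=1e-3)
for _ in range(100):
    residual = h_psi(target)                            # conditioned on the noisy features
    loss = recon_loss(F_theta(coords), G_xi().reshape(-1, feat_dim), residual, target)
    opt2.zero_grad(); loss.backward(); opt2.step()
```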
Generalizable Denoiser
- The per-image denoising method can effectively remove artifacts from ViT outputs, but it is neither run-time efficient nor robust to distribution shifts.
- Accumulate a dataset of pairs of noisy ViT outputs and their denoised counterparts produced by per-image denoising.
- Train a denoiser network Dξ on these pairs (sketched below the figure):
- a single Transformer block + an additional learnable positional embedding
![](https://velog.velcdn.com/images/jinotter3/post/670a744b-892f-42ca-8fb4-f4822f14bdd9/image.png)
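A hedged sketch of such a denoiser: one Transformer block applied to raw ViT patch features after adding an extra learnable positional embedding. Head count, norm placement, and grid size are assumptions.

```python
import torch
import torch.nn as nn

class FeatureDenoiser(nn.Module):
    """Generalizable denoiser sketch: one Transformer block plus an extra
    learnable positional embedding on top of raw ViT patch features."""
    def __init__(self, dim, grid_h, grid_w, num_heads=8):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, grid_h * grid_w, dim))
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)

    def forward(self, noisy_feats):                        # (B, N, dim) raw ViT outputs
        return self.block(noisy_feats + self.pos_embed)    # (B, N, dim) denoised

# Training (schematic): regress the Stage-1 denoised features from the noisy
# ones, e.g. with the same cosine + L2 distance as above.
```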
Experiments
Correlation between artifacts and positions
- Maximal information coefficient (MIC) between grid features and the normalized patch index (a rough illustration follows the figure)
![](https://velog.velcdn.com/images/jinotter3/post/9a9a192d-ac72-425a-bce9-0e6068a12ae7/image.png)
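A rough illustration (not the paper's evaluation code) of measuring how strongly a feature channel depends on the normalized patch index via MIC, using the `minepy` package; the random stand-in features would be replaced by actual ViT grid features.

```python
import numpy as np
from minepy import MINE   # pip install minepy

feats = np.random.randn(196, 768)             # stand-in 14x14 grid features
patch_index = np.arange(196) / 195.0          # normalized patch index in [0, 1]

mine = MINE(alpha=0.6, c=15)                  # default MINE settings
mine.compute_score(feats[:, 0], patch_index)  # dependence of channel 0 on position
print("MIC:", mine.mic())
```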
Qualitative results
Visual analysis of ViTs
![](https://velog.velcdn.com/images/jinotter3/post/a4158272-213f-42d2-a63b-5a99126f3709/image.png)
Emergent object discovery ability
![](https://velog.velcdn.com/images/jinotter3/post/c9c3009e-5d95-4f7e-ad48-fa0403b98e75/image.png)
Ablation study
![](https://velog.velcdn.com/images/jinotter3/post/a224aa01-7045-40b6-8b97-c0aada9124bc/image.png)
Future works
Alternative approaches for position embeddings
- The problem appears to stem from positional embeddings being added directly to the input tokens.
- Non-additive alternatives, such as Rotary Positional Embeddings (RoPE), the relative key-query bias from T5, or ALiBi from the LLM literature, could be used instead (a small sketch follows).
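For instance, an ALiBi-style bias is added to the attention logits rather than to the token embeddings. The 1-D sketch below follows the LLM setting; extending it to 2-D patch grids is a design choice left open here.

```python
import torch

def alibi_bias(num_heads, seq_len, device=None):
    """1-D ALiBi-style attention bias: a per-head linear penalty on relative
    distance, added to attention logits instead of the input tokens."""
    # Geometric head slopes (power-of-two head counts assumed, as in the ALiBi paper).
    slopes = 2.0 ** (-8.0 * torch.arange(1, num_heads + 1, device=device) / num_heads)
    pos = torch.arange(seq_len, device=device)
    dist = (pos[None, :] - pos[:, None]).abs().float()   # (L, L) relative distances
    return -slopes[:, None, None] * dist[None]           # (num_heads, L, L)

# Usage (schematic): attn_logits = q @ k.transpose(-2, -1) / dim**0.5 + alibi_bias(h, L)
```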