TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, Part 1

이준석 · October 21, 2022

Abstract

The dominant CNN-based methods for cross-view image geo-localization rely on polar transform and fail to model global correlation. We propose a pure transformer-based approach (TransGeo) to address these limitations from a different perspective.

TransGeo takes full advantage of the strengths of transformer related to global information modeling and explicit position information encoding. We further leverage the flexibility of transformer input and propose an attention-guided non-uniform cropping method, so that uninformative image patches are removed with a negligible drop in performance to reduce computation cost.

The saved computation can be reallocated to increase resolution only for informative patches, resulting in performance improvement with no additional computation cost. This “attend and zoom-in” strategy is highly similar to human behavior when observing images. Remarkably, TransGeo achieves state-of-the-art results on both urban and rural datasets, with significantly less computation cost than CNN-based methods.

It does not rely on polar transform and infers faster than CNN-based methods. Code is available at https://github.com/Jeff-Zilence/TransGeo2022

5. Conclusion and Discussion

We propose the first pure transformer method (TransGeo) for cross-view image geo-localization. It achieves state-of-the-art results on both aligned and unaligned datasets, with less computational cost than CNN-based methods. The proposed method does not rely on polar transform or data augmentation, and is thus generic and flexible.

1. Introduction

Image-based geo-localization aims to determine the location of a query street-view image by retrieving the most similar images in a GPS-tagged reference database. It has great potential for noisy GPS correction [2, 33] and navigation [12, 17] in crowded cities.
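The retrieval step described above can be pictured as a nearest-neighbor search over image embeddings. The following is a minimal sketch, assuming every image has already been mapped to a feature vector by some encoder; the function names, the toy vectors, and the GPS coordinates are hypothetical and do not come from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def localize(query_embedding, reference_db):
    """Return the GPS tag of the reference whose embedding best matches the query.

    reference_db is a list of (gps, embedding) pairs (hypothetical layout).
    """
    best_gps, _ = max(
        reference_db,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
    )
    return best_gps

# Toy GPS-tagged reference database (made-up values for illustration).
reference_db = [
    ((37.5665, 126.9780), [0.9, 0.1, 0.0]),
    ((35.1796, 129.0756), [0.1, 0.8, 0.3]),
    ((33.4996, 126.5312), [0.0, 0.2, 0.9]),
]
query = [0.2, 0.9, 0.2]  # embedding of the street-view query
predicted_gps = localize(query, reference_db)
```

In a real system the reference database is large, so the exhaustive `max` would typically be replaced by an approximate nearest-neighbor index.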

Due to the complete coverage and easy access of aerial images from Google Map API [1], a thread of works [10, 14, 19, 21–23, 25, 29, 35] focuses on cross-view geo-localization, where the satellite/aerial images are collected as reference images for both rural [14, 34] and urban areas [29, 36].

They generally train a two-stream CNN (Convolutional Neural Network) framework employing metric learning loss [10, 35]. However, such cross-view retrieval systems suffer from the great domain gap between street and aerial views, as CNNs do not explicitly encode the position information of each view.
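To make "metric learning loss" concrete, here is a minimal sketch of one common choice, a soft-margin triplet loss. The excerpt does not say which loss is used, so this particular form, the scale `alpha`, and the toy embeddings are assumptions for illustration only.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_margin_triplet_loss(anchor, positive, negative, alpha=10.0):
    """log(1 + exp(alpha * (d_pos - d_neg))).

    The loss is small when the matching pair (anchor, positive) is closer
    than the non-matching pair (anchor, negative). alpha is an assumed scale.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return math.log1p(math.exp(alpha * (d_pos - d_neg)))

street = [1.0, 0.0]          # street-view embedding (anchor)
matching_aerial = [0.9, 0.1] # aerial embedding of the same location
other_aerial = [0.0, 1.0]    # aerial embedding of a different location

easy_loss = soft_margin_triplet_loss(street, matching_aerial, other_aerial)
hard_loss = soft_margin_triplet_loss(street, other_aerial, matching_aerial)
```

In a two-stream setup, the anchor would come from the street-view branch and the positive and negative from the aerial branch.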

To bridge the domain gap, recent works apply a predefined polar transform [21, 22, 26] on the aerial-view images. The transformed aerial images have a geometric layout similar to that of the street-view query images, which results in a significant boost in retrieval performance.
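As a rough illustration of what a polar transform does, the sketch below resamples a square aerial image so that columns correspond to viewing direction and rows to distance from the center. The orientation convention, the nearest-neighbor sampling, and the function name are assumptions; the exact formula used in [21, 22, 26] may differ.

```python
import math

def polar_transform(aerial, out_h, out_w):
    """Resample a square aerial image (list of rows) into an out_h x out_w polar layout.

    Assumed convention: columns sweep the azimuth angle clockwise starting from
    "up" in the aerial image; the bottom row is the image center (radius 0) and
    the top row is the largest radius that fits. Requires out_h >= 2.
    """
    size = len(aerial)
    center = (size - 1) / 2
    out = []
    for i in range(out_h):
        radius = center * (out_h - 1 - i) / (out_h - 1)
        row = []
        for j in range(out_w):
            theta = 2 * math.pi * j / out_w
            y = center - radius * math.cos(theta)
            x = center + radius * math.sin(theta)
            # Nearest-neighbor sampling, clamped to the image bounds.
            yi = min(size - 1, max(0, round(y)))
            xi = min(size - 1, max(0, round(x)))
            row.append(aerial[yi][xi])
        out.append(row)
    return out

# 5x5 toy "image" whose pixel value encodes its position (row * 5 + col).
aerial = [[r * 5 + c for c in range(5)] for r in range(5)]
panorama_like = polar_transform(aerial, out_h=3, out_w=4)
```

Note that the transform hard-codes the assumption that the street-view camera sits at the center of the aerial image.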

However, the polar transform relies on the prior knowledge of the geometry corresponding to the two views, and may fail when the street query is not spatially aligned at the center of aerial images [36] (this point is further demonstrated in Sec. 4.5).

Recently, vision transformer [7] has achieved significant performance on various vision tasks due to its powerful global modeling ability and self-attention mechanism.
Although CNN-based methods are still predominant for cross-view geo-localization, we argue vision transformer is more suitable for this task due to three advantages: 1) Vision transformer explicitly encodes the position information, and thus can directly learn the geometric correspondence between two views with the learnable position embedding.

2) The multi-head attention [28] module can model global long-range correlation between all patches starting from the first layer, while CNNs have limited receptive field [7] and only learn global information in top layers. Such strong global modeling ability can help learn the correspondence when two objects are close in one view while far from each other in the other view.

3) Since each patch has an explicit position embedding, it is possible to apply non-uniform cropping, which removes arbitrary patches without changing the input of other patches, while CNNs can only apply uniform cropping (i.e. cropping a rectangular area). Such flexibility of patch selection is beneficial for geo-localization.

Since some objects in the aerial view may not appear in the street view due to occlusion, they can be removed with non-uniform cropping to reduce computation and GPU memory footprint, while keeping the position information of other patches.
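The point about non-uniform cropping can be illustrated with a small sketch: because each patch token carries its own position embedding, dropping some patches leaves the remaining tokens unchanged. The function name and the toy numbers are hypothetical, and a real model would use learned embeddings of much higher dimension.

```python
def add_position(patch_features, position_embeddings, keep_indices=None):
    """Build transformer input tokens as patch feature + position embedding.

    If keep_indices is given, only those patches are kept (non-uniform
    cropping). The kept patches may form any subset, not just a rectangle.
    """
    if keep_indices is None:
        keep_indices = range(len(patch_features))
    return [
        [f + p for f, p in zip(patch_features[i], position_embeddings[i])]
        for i in keep_indices
    ]

# Four patches with 2-dimensional features (toy values).
patch_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
position_embeddings = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]]

full_tokens = add_position(patch_features, position_embeddings)
# Drop patches 1 and 2 (e.g. occluded regions) and keep patches 0 and 3.
cropped_tokens = add_position(patch_features, position_embeddings, keep_indices=[0, 3])
```

The shorter token sequence is what reduces the attention cost and memory, while each surviving token still encodes where its patch came from.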

However, vanilla vision transformer [7] has some limitations on training data size and memory consumption, which must be addressed when applied to cross-view geo-localization. The original ViT [7] requires extremely large training datasets to achieve state-of-the-art results, e.g. JFT-300M [7] or ImageNet-21k [5] (a superset of the original ImageNet-1K).

It does not generalize well if trained on medium-scale datasets, because it does not have the inductive biases [7] inherent in CNNs, e.g. shift-invariance and locality. Recently, DeiT [27] applies strong data augmentation, knowledge distillation, and regularization techniques, in order to outperform CNN on ImageNet-1K [5], with similar parameters and inference throughput.
throughput: the amount processed (per unit time)

However, mixup techniques used in DeiT (e.g. CutMix [27, 32]) are not straightforward for metric learning losses [10].

In this paper, we propose the first pure transformer-based method for cross-view geo-localization (TransGeo).
To make our method more flexible without relying on data augmentations, we incorporate Adaptive Sharpness-Aware Minimization (ASAM) [11], which avoids overfitting to local minima by optimizing the adaptive sharpness of the loss landscape and improves model generalization performance.
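For reference, the idea behind ASAM can be written roughly as the min-max objective below. This is a sketch recalled from the ASAM paper [11], not taken from this post: w denotes the model weights, L the training loss, \epsilon a weight perturbation, \rho the neighborhood radius, and T_w a weight-dependent normalization operator (for example, element-wise |w|). The exact form and hyperparameters used for TransGeo may differ.

```latex
\min_{w} \; \max_{\| T_w^{-1} \epsilon \|_p \le \rho} \; L(w + \epsilon)
```

Intuitively, the inner maximization looks for the worst-case loss in a neighborhood of w whose shape adapts to the scale of the weights, and the outer minimization prefers weights whose whole neighborhood has low loss.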

Moreover, by analyzing the attention map of the top transformer encoder, we observe that most of the occluded regions in aerial images have negligible contribution to the output.
occluded: hidden from view

This motivates us to introduce the attention-guided non-uniform cropping, which first attends to informative image regions based on the attention map of the transformer encoder, then increases the resolution only on the selected regions, resulting in an “attend and zoom-in” procedure, similar to human vision. Our method achieves state-of-the-art performance with significantly lower computation cost (GFLOPs) than CNN-based methods, e.g. SAFA [21].
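The "attend and zoom-in" idea can be sketched in two steps: rank patches by attention and keep the top fraction, then map the kept patches onto a finer grid so only those regions are processed at higher resolution. This is an illustration under stated assumptions, not the paper's exact procedure: the function names, the keep ratio, the 2x scale, and the attention values are all hypothetical.

```python
def select_informative_patches(attention, keep_ratio):
    """Return the indices of the most-attended patches, in grid order."""
    k = max(1, round(len(attention) * keep_ratio))
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:k])

def zoom_in_indices(kept, grid_w, scale=2):
    """Map kept patch indices on a grid_w-wide grid to the indices of the
    patches covering the same regions on a grid that is scale times finer."""
    fine_w = grid_w * scale
    fine = []
    for idx in kept:
        r, c = divmod(idx, grid_w)
        for dr in range(scale):
            for dc in range(scale):
                fine.append((r * scale + dr) * fine_w + (c * scale + dc))
    return sorted(fine)

# Attention received by each patch of a 2x2 aerial grid (toy values).
attention = [0.05, 0.40, 0.45, 0.10]
kept = select_informative_patches(attention, keep_ratio=0.5)  # "attend"
fine_patches = zoom_in_indices(kept, grid_w=2, scale=2)       # "zoom in"
```

In this toy case, half of the low-resolution patches are dropped, and the high-resolution pass processes 8 of the 16 fine patches instead of all of them.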

We summarize our contributions as follows:

• The first pure transformer-based method (TransGeo) for cross-view image geo-localization, without relying on polar transform or data augmentation.

• A novel attention-guided non-uniform cropping strategy that removes a large number of uninformative patches in reference aerial images to reduce computation with negligible performance drop. The performance is further improved by reallocating the saved computation to higher image resolution of the informative regions.

• State-of-the-art performance on both urban and rural datasets with less computation cost, GPU memory consumption, and inference time than CNN-based methods.
