[ML models A-to-Z] Week2. Spatial Transformer Networks

Nopego Keepgo · March 26, 2023

1 Introduction


Since AlexNet won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), shrinking feature maps with max-pooling has become standard practice.
There are some benefits to max-pooling. (i) The first is well known: it downsizes the input feature map with no learnable parameters. By keeping only the max value within each kernel window, we retain only the features that are useful to us. (ii) The second is spatial invariance, which is the main topic discussed in Section 2.1.
However, the authors of this paper point out that 'conv + max-pooling' has limitations as a source of spatial invariance. They propose the Spatial Transformer Network (STN), which learns the transformation of the input image itself in an end-to-end structure and transforms the given image by 2D sampling.

2 Problem Formulation


2.1 Max pooling: Spatially invariant


As mentioned in Section 1, max-pooling can make a network somewhat spatially invariant. An example: suppose pixel [3, 4] carries a "nose" feature and pixel [1, 2] carries an "eye" feature. Under standard 2x2 max pooling they are mapped to output positions [1, 2] and [0, 1] respectively, and as long as each feature only moves within its own 2x2 kernel window, the pooled output stays exactly the same. Therefore, we can say that max-pooling preserves spatial invariance within a single kernel window.
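To make this concrete, here is a minimal sketch (PyTorch, with a toy 4x4 feature map of my own) showing that a shift within the 2x2 pooling window is invisible to max-pooling, while a shift across a window boundary is not:

```python
import torch
import torch.nn.functional as F

# Toy 4x4 feature map: a single "nose" activation at row 3, col 3.
a = torch.zeros(1, 1, 4, 4)
a[0, 0, 3, 3] = 1.0

# Shift the activation by one pixel, staying inside the same 2x2 window.
b = torch.zeros(1, 1, 4, 4)
b[0, 0, 2, 2] = 1.0

pa = F.max_pool2d(a, kernel_size=2)  # both pool down to a 2x2 map
pb = F.max_pool2d(b, kernel_size=2)
print(torch.equal(pa, pb))  # True: a shift within the window is invisible

# Shift across the window boundary: the invariance is lost.
c = torch.zeros(1, 1, 4, 4)
c[0, 0, 1, 2] = 1.0
print(torch.equal(F.max_pool2d(c, kernel_size=2), pa))  # False
```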
However, here is the problem. Since max-pooling is only invariant within a single kernel window, and the commonly used kernel size is just 2x2, this invariance has little effect unless the input has already been abstracted into sufficiently high-level features. In other words, recovering from a transformed image with max-pooling alone is insufficient, for example with respect to scale invariance. This is the main problem formulation of the paper.

3 Spatial Transformer Network


The main idea behind the solution comes from linear algebra. A Spatial Transformer Network is a neural network module whose input is an image or a feature map and whose output is a transformation matrix. The STN serves three main functions: (i) it can simplify the input feature map by cropping and scale normalization; (ii) multiple STNs in parallel can localise objects in a co-localisation task; (iii) the STN is also capable of spatial attention.

3.1 Idea: Trained transformation


As mentioned above, what the STN aims to do is learn the affine transformation underlying a given distorted image; by applying the same transformation to an identity grid, we can sample the input feature map with the transformed grid, which leads to undoing the distortion. The detailed architecture of the STN follows.

3.2 Architecture


The overall architecture of the STN consists of three steps. (i) Localisation net: this network takes the input feature map as input and outputs the parameters of a transformation matrix (the network can be a CNN or an FCN). (ii) Grid generator: using the predicted transformation matrix, the grid generator transforms an identity grid. (iii) Sampler: by 2D sampling (Eq. 2), the input feature map is sampled on this equally distorted basis, which yields the recovered feature map. The details follow. In conclusion, what the STN implements is: by learning the "warp" of the input feature and sampling with an equally warped grid, the result is disentangled from the warp. A minimal code sketch of the three steps is shown below.
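As a rough sketch of these three steps (a minimal PyTorch module of my own, not the paper's exact architecture; the layer sizes assume a 1x28x28 input and are purely illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal spatial transformer for 1x28x28 inputs (sizes are illustrative)."""
    def __init__(self):
        super().__init__()
        # (i) Localisation net: predicts the 6 affine parameters.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialise to the identity transform so training starts from "no warp".
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                # (i) 2x3 matrix A_theta
        grid = F.affine_grid(theta, x.size(),
                             align_corners=False)          # (ii) grid generator
        return F.grid_sample(x, grid,
                             align_corners=False)          # (iii) bilinear sampler
```

Here `F.affine_grid` and `F.grid_sample` play the roles of the grid generator and the bilinear sampler, so the whole module is differentiable and can simply be placed in front of a classification network.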

3.2.1 Localisation and Parameterised Sampling Grid


First, the localisation net takes an image or an intermediate feature map as input and outputs 6 transformation parameters (we want a 2x3 transformation matrix). It can be implemented as either an FCN or a CNN.
After extracting the transformation parameters, we can generate the affine sampling grid. The grid coordinates are normalized to [-1, 1], so the grid is independent of the feature map's size. The extracted transformation can represent cropping, translation, rotation, scaling, and skew.

For the $i$-th target coordinate, the grid generator computes

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

where $G_i = (x_i^t, y_i^t)$ is a point of the identity (target) grid and $A_\theta$ is the transformation generated by the localisation net. The generated coordinates of the affine grid are therefore $(x_i^s, y_i^s)$.
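As a quick check of the grid convention (assuming PyTorch's `F.affine_grid`, which uses exactly this normalized [-1, 1] coordinate system):

```python
import torch
import torch.nn.functional as F

# Identity transform: the sampling grid equals the identity grid G.
theta_id = torch.tensor([[[1., 0., 0.],
                          [0., 1., 0.]]])      # A_theta, shape (N, 2, 3)
grid = F.affine_grid(theta_id, size=(1, 1, 4, 4), align_corners=True)
print(grid[0, 0, 0])   # tensor([-1., -1.]): the corner maps to itself

# Scaling by 0.5 zooms in: source coordinates span only [-0.5, 0.5],
# i.e. the sampler reads a cropped, centered patch of the input.
theta_zoom = torch.tensor([[[0.5, 0., 0.],
                            [0., 0.5, 0.]]])
grid = F.affine_grid(theta_zoom, size=(1, 1, 4, 4), align_corners=True)
print(grid[0, 0, 0])   # tensor([-0.5000, -0.5000])
```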

3.2.2 Differentiable Sampling


The sampling with the affine grid from Section 3.2.1 is performed using the 2D sampling formulation familiar from signals and systems.

$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, k(x_i^s - m;\, \Phi_x)\, k(y_i^s - n;\, \Phi_y)$$

where $k(\cdot)$ is the sampling kernel, $U_{nm}^c$ is the value of channel $c$ at pixel $(n, m)$ of the input, $V$ is the transformed image, and $\Phi_x, \Phi_y$ are the parameters of the kernel.
Note that the important point here is that the kernel $k(\cdot)$ must be (sub-)differentiable with respect to $(x_i^s, y_i^s)$; if not, the whole STN cannot be backpropagated in an end-to-end scheme. For the bilinear kernel $k(d) = \max(0, 1 - |d|)$, for example, the derivative with respect to the sampling coordinate is

$$\frac{\partial V_i^c}{\partial x_i^s} = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max(0, 1 - |y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases}$$

and the derivative with respect to $\theta$ then follows by the chain rule through $\partial x_i^s / \partial \theta$ and $\partial y_i^s / \partial \theta$.

Therefore, as long as the kernel we use is differentiable, θ can be learned in an end-to-end scheme from a loss on the output V. A toy check of this gradient flow is sketched below.
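As a toy sanity check that gradients actually reach θ through the sampler (assuming PyTorch's bilinear `grid_sample`, which implements the bilinear kernel above):

```python
import torch
import torch.nn.functional as F

U = torch.rand(1, 1, 8, 8)                      # input feature map U
theta = torch.tensor([[[1., 0., 0.2],
                       [0., 1., -0.1]]], requires_grad=True)  # affine params

grid = F.affine_grid(theta, U.size(), align_corners=False)
V = F.grid_sample(U, grid, mode='bilinear', align_corners=False)  # output V

loss = V.sum()      # any differentiable loss on V
loss.backward()
print(theta.grad)   # non-zero: dLoss/dtheta flows through k(.) and the grid
```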

4 Experiments


Experiments were conducted with three types of data: distorted MNIST, Street View House Numbers (SVHN), and fine-grained classification (species of birds).
Comparing CNN-maxpool with ST-CNN (1st row of Table 1) shows that ST-CNN works better, so the ST can be an alternative way to achieve spatial invariance. The 2nd row shows that ST-CNN remains better even though the CNN contains max-pooling, which adds its own spatial invariance. Additionally, the thin plate spline (TPS) was the most powerful among the transformation classes tested. The STN considerably reduced the complexity of the input, which in turn reduces the complexity required of the classification network that follows. As a result, the ST model has the effect of warping the input to an upright pose.

The SVHN dataset is more spatially transformed than the former, because SVHN is more distorted, real-world data than distorted MNIST. The authors experimented with a single STN consisting of 4 conv layers and a multi-STN setup consisting of 2 FC layers each. The former is located directly after the input image, while the latter are located after the input image and at intermediate stages of the CNN (Table 2). The authors expected the intermediate STNs to predict transformations based on richer features.
The last dataset was fine-grained classification with 200 species of birds. The remarkable point here is that the authors used multiple parallel STNs, parameterised for attention and acting on the input image.

In particular, the bounding boxes predicted by the 2-STN model are interesting: one spatial transformer learned to detect the head, while the other fixated on the center of the bird's body. The result of this experiment is shown in Table 3. Considering that the 4-parallel STN deals with 448px input images, both models showed considerable accuracy, which was state-of-the-art.

5 Discussion


The Spatial Transformer Network trains the transformation of the input image in an end-to-end manner without touching the loss function. Come to think of it, it may seem questionable why the addition of simple FCNs or CNNs allows us to learn the transformation we want without modifying the loss function. The reason lies in the sampling process: through grid transformation and sampling, an inductive bias is indirectly imposed on the output of the localisation net.
In general, if we want to train end-to-end while assigning a meaning to an intermediate output of a network, there must be some assumption about the network's parameterised outputs and a mathematical formula or process that supports those parameters. This point differs from ResNet or VGGNet, which simply learn features of the given image; that is why this style of end-to-end learning is more difficult. The STN achieves the end-to-end learning approach by applying a parameterised grid and differentiable sampling, as shown in Section 3.2. This realisation from lab04 was very interesting.
