[Contents]
1) Image classification 1
2) Annotation data efficient learning
Image classification 1
- We rely heavily on our senses, especially vision, to see and understand the objects around us
- Computer vision applies this same process to computers
- As the first session on computer vision (CV), this part briefly introduces CV and its most fundamental task, image classification
- Image classification is the task of assigning a given image to a specific category
- We first learn what distinguishes deep learning-based image classification from traditional machine learning
- Next, we study AlexNet, a representative CNN model, and practice implementing it
- Finally, we study VGGNet, one of the most famous classification models
Further Reading
Course overview
Why is visual perception important?
What is computer vision?
- How humans perceive and understand a scene
- How computers perceive and understand a scene
- Rendering: the technique of drawing a 2D image from scene information
- Visual perception & intelligence
- Input : visual data (image or video)
- Class of visual perception
- Color perception
- Motion perception
- 3D perception
- Semantic-level perception
- Social perception (emotion perception)
- Visuomotor perception, etc.
- Also, computer vision includes understanding human visual perception capability!
- Our visual perception is imperfect
- to develop machine visual perception,
- we need to understand the good and bad of our visual perception
- we need to come up with how to compensate for the imperfection
- How to implement?
what you will learn in this course
- Fundamental image tasks
- deep learning based tasks
- Data augmentation and knowledge distillation
- Multi-modal learning (vision + {text, sound, 3D, etc.})
- Conditional generative model
- Neural network analysis by visualization
- We also learn visualization tools for debugging and understanding deep learning models
What is Classification
- Classifier: A mapping f(.) that maps an image to a category level
- Classifies which object appears in an image
- Input: an image
- Output: the category (class) the image belongs to
An ideal approach for image recognition
- What if we could memorize all the data in the world?
- All the classification problems could be solved by k Nearest Neighbors (k-NN)
- k Nearest Neighbors (k-NN)
- classifies a query data point according to reference points closest to the query
- All the classification problems could be solved by k-NN!
- Is it realizable?
- If we really had that much data,
- search time would grow in proportion to the number of data points,
- and memory usage would also grow in proportion to the number of data points
- So no matter how fast computers become, holding all the data in the world is infeasible
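- A minimal k-NN classifier sketch in Python/NumPy, just to make the idea concrete (the Euclidean distance and k=3 below are illustrative choices, not something specified in the lecture):
import numpy as np

def knn_predict(query, ref_images, ref_labels, k=3):
    # flatten images and measure Euclidean distance from the query to every reference point
    dists = np.linalg.norm(ref_images.reshape(len(ref_images), -1) - query.reshape(1, -1), axis=1)
    nearest = ref_labels[np.argsort(dists)[:k]]   # labels of the k closest reference points
    return np.bincount(nearest).argmax()          # majority vote among the neighbors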
Convolutional Neural Networks (CNN)
- Compress all the data we have into the neural network
- If there is too much data, the system complexity grows beyond what is feasible
- Instead, we can view training as compressing a massive dataset into a system of limited complexity (a neural network)
- Let's look at a simple model, the perceptron (a single-layer neural network, i.e., a fully connected layer), that takes every pixel of an image as input
- But, is this model suitable for solving the image classification problem?
- Visualization of single fully connected layer networks
- Problem 1) With only a single layer, the model is too simple: it can represent little more than an average image per class
- A problem of single fully connected layer networks
- Problem 2) Problems at test time
- During training, each class is learned from objects that fill the whole image, so the model learns a representative template pattern for that class
- At test time, a slightly different pattern has never been seen during training, so any mismatch in position or scale against the template produces a very different, often wrong, interpretation
- CNNs were introduced to solve these problems
- Instead of a fully connected layer, which looks at every pixel to compute one feature, a locally connected layer exploits the spatial structure of images and connects each feature only to a local region
- The number of required parameters drops dramatically (see the parameter-count sketch below)
- Effective features can be extracted with far fewer parameters, and overfitting is also reduced
- Convolution neural networks are locally connected neural networks
- local feature learning
- parameter sharing
- CNN is used as a backbone of many CV tasks
- in this lecture, we will focus on image-level classification
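- A rough parameter-count comparison in Python, illustrating why local connections with shared filters need far fewer weights (the 32x32x3 input and 64 output features are arbitrary example numbers):
fc_params = (32 * 32 * 3) * 64    # fully connected: every pixel connects to every output feature -> 196,608 weights
conv_params = (3 * 3 * 3) * 64    # 64 shared 3x3 convolution filters over 3 channels -> 1,728 weights
print(fc_params, conv_params)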
CNN architectures for image classification 1
Brief History
AlexNet
- LeNet-5
- A very simple CNN architecture introduced by Yann LeCun in 1998
- Overall architecture: Conv - Pool - Conv - Pool - FC - FC
- Convolution: 5 x 5 filters with stride 1
- Pooling: 2 x 2 max pooling with stride 2
- Similar with LeNet-5, But
- Bigger (7 hidden layers, 650k neurons, 60 million parameters)
- Trained with ImageNet (a large amount of data, 1.2 million images)
- Using better activation function (ReLU) and regularization technique (dropout)
- Overall architecture
- Conv - Pool - LRN - Conv - Pool - LRN - Conv - Conv - Conv - Pool - FC - FC - FC
- The network path is split in two
- At the time, GPU memory was too small, so the network was split in half and each half was placed on one of two GPUs
- Looking at the structure, the activation maps cross between the two paths at a few intermediate layers
- Cross-GPU communication at every layer would be slow, so the design exchanges information only at a few points
- To pass the max-pooled activation map (a 3D tensor) to the linear (FC) layers, it must be vectorized
- Vectorization options: flatten the tensor directly, or apply average pooling before flattening
- Note in the figure above that the FC layer dimension here is 4096
- Because of the GPU memory limitation, each half of the network was trained on its own GPU, which is why the two-stream figure lists 2048 (per GPU) instead
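- A minimal single-stream AlexNet-style sketch, assuming PyTorch; LRN is omitted and the layer sizes follow the common single-GPU 227x227 formulation rather than the original two-stream split:
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)   # (B, 256, 6, 6) for a 227x227 input
        x = x.flatten(1)       # vectorize the 3D activation map for the FC layers
        return self.classifier(x)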
AlexNet (deprecated components)
- local response normalization(LRN)
- Normalizes brightness (contrast) within the activation map
- Lateral inhibition : the capacity of an excited neuron to subdue its neighbors
- LRN normalizes around the local neighborhood of the excited neuron
- Excited neuron becomes even more sensitive as compared to its neighbors
- Batch normalization is used nowadays instead of LRN
- 11 x 11 convolution filter
- The filter size is increased, as the input size of the image has increased
- input size of LeNet: 28x28
- input size of AlexNet: 227x227
- Larger size filters are used to cover a wider range of the input image
- Modern network architectures no longer use such large filter sizes
AlexNet
- Receptive field in CNN
- The region in the input space that a particular CNN feature is looking at
- suppose K x K convolution filters with stride 1, and a pooling layer of size P x P,
- then a value of each unit in the pooling layer depends on an input patch of size : (P + K - 1) x (P + K - 1)
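- A quick worked example of this rule in Python (K=3 and P=2 are illustrative values):
K, P = 3, 2                   # 3x3 conv with stride 1, followed by 2x2 pooling
receptive_side = P + K - 1    # = 4, so each pooled unit depends on a 4x4 input patch
print(receptive_side)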
VGGNet
- Deeper architecture
- simpler architecture
- no local response normalization
- only 3x3 conv filters blocks, 2x2 max pooling
- better performance
- significant performance improvement over AlexNet (2nd place in ILSVRC'14)
- better generalization
- final features generalize well to other tasks even without fine-tuning
- Summary
- deeper architecture
- simpler architecture
- better performance
- better generalization
overall architecture
- Stacking many small convolution layers also yields a large receptive field (see the block sketch below)
- A large receptive field with respect to the input means
- that the decision is made by considering a large portion of the image
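- A minimal VGG-style block sketch, assuming PyTorch; two stacked 3x3 convs cover the same 5x5 receptive field as a single 5x5 conv while using fewer parameters (2*3*3*C^2 vs 5*5*C^2 weights):
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2, stride=2),   # 2x2 max pooling halves the spatial size
    )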
Annotation data (label data) efficient learning
- It is well known that supervised learning works best for training deep learning models on computer vision problems
- However, collecting enough high-quality labeled data to train a deep model is usually infeasible or very expensive
- Here we introduce methods for using a given dataset as efficiently as possible: Data Augmentation, Knowledge Distillation, Transfer Learning, Learning without Forgetting, Semi-supervised Learning, and Self-training
- These methods make the distribution of the given dataset as similar as possible to the real data distribution, leverage already-learned knowledge to learn a new dataset better, or even exploit unlabeled data for training
Further Reading
Data Augmentation
Learning representation of dataset
- learning representation from a dataset
- neural networks learn compact features(information) of a dataset
- Dataset is (almost) always biased
- Images taken by a camera (training data) ≠ real data
- The training dataset is sparse samples of real data
- The training dataset contains only fractional part of real data
- The training dataset and real data always have a gap
- Suppose a training dataset has only bright images
- during test time, if a dark image is fed as input, the trained model may be confused
- problem : datasets do not fully represent real data distribution
- Augmenting data to fill more space and to close the gap
- examples of augmentations to make a dataset denser
Data Augmentation
- image data augmentation
- applying various image transformations to the dataset
- crop, shear, brightness, perspective, rotate, etc.
- OpenCV and Numpy have various methods useful for data augmentation
- Goal: make the training dataset's distribution similar to the real data distribution
Various data augmentation methods
- brightness adjustment
- various brightness in dataset
- brightness adjustment (brightening) using numpy
import numpy as np

def brightness_augmentation(img):
    # cast to a wider dtype so that adding 100 does not overflow uint8 pixels
    img = img.astype(np.int32)
    img[:, :, 0] = img[:, :, 0] + 100
    img[:, :, 1] = img[:, :, 1] + 100
    img[:, :, 2] = img[:, :, 2] + 100
    img[img > 255] = 255   # clamp values back into the valid pixel range
    return img.astype(np.uint8)
- Rotate, flip
- Diverse angles in dataset
- Rotating (flipping) image using OpenCV
img_rotated = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
img_flipped = cv2.rotate(image, cv2.ROTATE_180)
- crop
- learning with only part of images
- cropping image using numpy
y_start = 500
crop_y_size = 400
x_start = 300
crop_x_size = 800
img_cropped = image[y_start : y_start + crop_y_size, x_start : x_start + crop_x_size, :]
- Affine transformation
- preserve 'line', 'length ratio', and 'parallelism' in image
- for example, transforming a rectangle into a parallelogram
- see the shear transform example below
- Affine transformation (shear) using OpenCV
rows, cols, ch = image.shape
pts1 = np.float32([[50,50], [200,50], [50,200]])
pts2 = np.float32([[10,100], [200,50], [100,250]])
M = cv2.getAffineTransform(pts1, pts2)
shear_img = cv2.warpAffine(image, M, (cols, rows))
Modern augmentation techniques
- CutMix
- 'Cut' and 'mix' training examples to help the model better localize objects
- Generating new training image
- mixing both images and labels
- Gives a meaningful performance improvement and, at the same time, trains the model to localize objects more precisely
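- A minimal sketch of the CutMix idea, assuming PyTorch tensors; the Beta(1,1) sampling and box construction follow the general recipe, but the function name and defaults here are illustrative:
import numpy as np
import torch

def cutmix(images, labels, alpha=1.0):
    # images: (B, C, H, W) tensor, labels: (B,) tensor of class indices
    lam = np.random.beta(alpha, alpha)       # mixing ratio
    perm = torch.randperm(images.size(0))    # pair each image with a random partner
    H, W = images.shape[2], images.shape[3]
    cut_h, cut_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]   # paste the cut patch
    lam = 1 - ((y2 - y1) * (x2 - x1) / (H * W))                  # adjust by the actual patch area
    # train with: lam * CE(output, labels) + (1 - lam) * CE(output, labels[perm])
    return images, labels, labels[perm], lam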
- RandAugment
- Combines multiple image-processing operations to generate training data that looks quite different from the original
- many augmentation methods exist. Hard to find best augmentations to apply
- automatically finding the best sequence of augmentations to apply
- Random sample, apply, and evaluate augmentations
- Randomly samples augmentation policies, applies them, and keeps the ones that perform best
- example of augmented images in RandAug
- Augmentation policy has two parameters
- which augmentation to apply
- magnitude of augmentation to apply (how much to augment)
- parameters used in the above example
- which augmentations to apply : 'shearX' & 'AutoContrast'
- Magnitude of augmentations to apply : 9
- Randomly testing augmentations policies
- finding the best augmentations policy
- sample a policy : Policy = {N augmentations to apply} by random sampling
- train with a sampled policy, and evaluate the accuracy
- Augmentation helps model learning
- Higher test accuracy than training w/o augmentation
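- A minimal sketch of the RandAugment sampling idea in Python/PIL; the three operations in the pool and the magnitude scaling are illustrative assumptions, not the paper's full operation list:
import random
from PIL import ImageEnhance, ImageOps

# illustrative pool of operations; each takes an image and a shared magnitude
AUGMENT_POOL = {
    "Rotate": lambda img, m: img.rotate(m * 3),                                  # magnitude -> degrees
    "AutoContrast": lambda img, m: ImageOps.autocontrast(img),
    "Brightness": lambda img, m: ImageEnhance.Brightness(img).enhance(1 + m / 30),
}

def rand_augment(img, n=2, magnitude=9):
    # randomly pick n operations from the pool and apply them with the shared magnitude
    for name in random.sample(list(AUGMENT_POOL), n):
        img = AUGMENT_POOL[name](img, magnitude)
    return img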
Transfer learning
- The high-quality dataset is expensive and hard to obtain
- supervised learning requires a very large-scale dataset for training
- annotating data is very expensive, and its quality is not ensured
- transfer learning : A practical training method with a small dataset
- Benefits when using transfer learning
- by transfer learning, we can easily adapt to a new task by leveraging pre-trained knowledge(feature)!
- Motivational observation : similar datasets share common information
- A technique for reusing knowledge learned on one dataset for another dataset
- E.g., 4 distinct datasets with similar images
- knowledge learned from one dataset can be applied to other datasets
- Approach 1: Transfer knowledge from a pre-trained task to a new task
- Given a model pre-trained on a 10-class dataset,
- Chop off the final layer of the pre-trained model, add and only re-train a new FC layer
- Extracted features preserve all the knowledge from pre-training
- The new FC layer is trained to adapt to the new task
- This new FC layer can be trained well even from a small amount of data,
- because the pre-trained body does not need to be trained, so there are far fewer parameters to learn
- Approach 2: Fine-tuning the whole model
- Given a model pre-trained on a dataset
- Replace the final layer of the pre-trained model to a new one, and re-train the whole model
- set learning rates differently
- In the figure below, the convolution layers are trained together but with a low learning rate
- The parts with a low learning rate are updated slowly, while the new FC layer uses a higher learning rate so that it quickly adapts to the new target task (see the sketch of both approaches below)
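- A minimal sketch of both approaches, assuming a recent torchvision with pre-trained VGG-11 and a hypothetical 5-class target task:
import torch
import torchvision

model = torchvision.models.vgg11(weights="IMAGENET1K_V1")                     # pre-trained backbone
model.classifier[-1] = torch.nn.Linear(model.classifier[-1].in_features, 5)   # replace the final FC layer

# Approach 1: freeze the pre-trained body and train only the new FC layer
for p in model.features.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(model.classifier[-1].parameters(), lr=1e-2)

# Approach 2 (alternative): fine-tune the whole model, with a lower learning rate for pre-trained layers
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(), "lr": 1e-4},      # slow updates for pre-trained conv layers
    {"params": model.classifier.parameters(), "lr": 1e-2},    # fast updates for the new head
])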
Knowledge distillation
- Passing what a model learned to 'another' smaller model (teacher-student learning)
- 'Distillate' knowledge of a trained model into another smaller model
- used for model compression (Mimicking what a larger model knows)
- Also, used for pseudo-labeling (generating pseudo-labels for an unlabeled dataset)
- Teacher-student network structure
- the student network learns what the teacher network knows
- the student network mimics outputs of the teacher network
- The difference between the two models' outputs is measured with a KL-divergence loss, and only the student model is updated through backpropagation
- Unsupervised learning, since training can be done only with unlabeled data
- knowledge distillation
- when labeled data is available, can leverage labeled data for training(student loss)
- Distillation loss to 'predict similar outputs with the teacher model'
- Semantic information is not considered in distillation
- Each output dimension of the teacher corresponds to one of the classes of the previous task used during pre-training
- The data used for the student loss may come from a completely different task than the pre-training dataset (with non-overlapping labels)
- So even when no information overlaps, what matters is not the semantics of each individual dimension of the soft label; the output as a whole expresses the teacher's knowledge in an abstract form, and the key point is to make the student imitate that behavior, not the semantic meaning of each element
- Hard label vs Soft label
- Hard label(ground truth, one-hot vector)
- typically obtained from the dataset
- indicates whether a class is 'true answer' or not
- Soft label
- typical output of the model(=inference result)
- regard it as 'knowledge'. Useful to observe how the model thinks
- softmax with temperature(T)
- Softmax with temperature: controls difference in output between small & large input values
- A large T smoothens large input value differences
- Rather than producing only extreme values of 0 and 1, the softened output takes intermediate values while still changing sensitively with the input, which helps the student follow the teacher more closely (see the numeric example below)
- Useful to synchronize the student and teacher models' outputs
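- A small numeric example of softmax with temperature in Python/NumPy (the logits and temperatures are arbitrary):
import numpy as np

def softmax_T(z, T=1.0):
    # softmax with temperature: exp(z/T) / sum_j exp(z_j/T)
    e = np.exp(z / T)
    return e / e.sum()

logits = np.array([5.0, 2.0, 1.0])
print(softmax_T(logits, T=1))     # ~[0.94, 0.05, 0.02] -> close to one-hot
print(softmax_T(logits, T=10))    # ~[0.41, 0.31, 0.28] -> softened 'soft label'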
- intuition about distillation loss and student loss
- Distillation loss
- KLdiv(soft label, soft prediction)
- Loss = difference between the teacher and student network's inference
- Learn what teacher network knows by mimicking
- Student Loss
- CrossEntropy(hard label, soft prediction)
- Loss = difference between the student network's inference and the true label
- Learn the 'right answer'
- Loss functions in knowledge distillation
- Weighted sum of 'distillation loss' and 'student loss'
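- A minimal sketch of the weighted loss, assuming PyTorch; T and alpha are hyperparameters chosen for illustration:
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # distillation loss: KL divergence between the softened teacher and student outputs
    distillation = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                       # rescale gradients as in Hinton et al.
    # student loss: ordinary cross-entropy against the hard (ground-truth) label
    student = F.cross_entropy(student_logits, labels)
    return alpha * distillation + (1 - alpha) * student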
Leveraging unlabeled dataset for training
Semi-supervised learning
- There are lots of unlabeled data
- typically, only a small portion of data is labeled
- is there any way to learn from unlabeled data?
- Semi-supervised learning : Unsupervised (no label) + Fully supervised (fully labeled)
- Semi-supervised learning with pseudo labeling
- pseudo-labeling unlabeled data using a pre-trained model, then use for training
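- A minimal pseudo-labeling sketch, assuming PyTorch; the 0.9 confidence threshold is an illustrative choice:
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold=0.9):
    model.eval()
    images, labels = [], []
    for x in unlabeled_loader:                  # unlabeled batches (images only)
        probs = torch.softmax(model(x), dim=1)
        conf, pred = probs.max(dim=1)
        keep = conf > threshold                 # keep only confident predictions
        images.append(x[keep])
        labels.append(pred[keep])
    return torch.cat(images), torch.cat(labels) # merge with the labeled set for training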
Self-training
- Recap : Data efficient learning methods so far
- Data Augmentation
- Augment a dataset to make the dataset closer to real data distribution
- knowledge distillation
- train a student network to imitate a teacher network
- transfer the teacher network's knowledge to the student network
- semi-supervised learning (Pseudo label-based method)
- Pseudo-label an unlabeled dataset using a pre-trained model, then use for training
- Leveraging an unlabeled dataset for training
- self-training
- Augmentation + teacher-student networks + semi-supervised learning
- SOTA ImageNet Classification, 2019
- self-training with noisy student
- iteratively training noisy student network using teacher network
- brief overview of self-training
- Train initial teacher model with labeled data
- pseudo-label unlabeled data using teacher model
- train student model with both labeled and unlabeled data with augmentation
- set the student model as a new teacher, and set a new (bigger) model as a new student
- repeat 2~4 with new teacher/student models
Reference
- Data augmentation
- Yun et al., CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features, ICCV 2019
- Cubuk et al., RandAugment: Practical Automated Data Augmentation with a Reduced Search Space, CVPRW 2020
- Leveraging pre-trained information
- Ahmed et al., Fusion of local and global features for effective image extraction, Applied Intelligence 2017
- Oquab et al., Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks, CVPR 2014
- Hinton et al., Distilling the Knowledge in a Neural Network, NIPS Deep Learning Workshop 2015
- Li & Hoiem, Learning without Forgetting, TPAMI 2018
- Leveraging unlabeled dataset for training
- Lee, Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks, ICML Workshop 2013
- Xie et al., Self-training with Noisy Student improves ImageNet classification, CVPR 2020