03/04 - 03/10
Trained a large, deep convolutional neural network to classify 1.2 million high-resolution images into 1000 different classes.
Neural Network Has 🙌
- 60 million parameters
- 650,000 neurons
- 5 convolutional layers
- some followed by max-pooling layers
- 3 fully-connected layers ( final 1000-way softmax )
Neural Network Uses 👍
- non-saturating neurons ( faster )
- efficient GPU implementation of the convolution operation ( faster )
- dropout ( reduce overfitting )
To improve object recognition performance
Simple recognition tasks can be solved quite well with small datasets, especially if they are augmented with label-preserving transformations.
Objects in realistic settings exhibit considerable variability >> use larger training sets
Learn about many objects from images →
Need a model with large learning capacity ( for the immense complexity of object recognition ) →
Should also have lots of prior knowledge to compensate for all the data we don't have
Problem : CNNs are prohibitively expensive to apply at large scale to high-resolution images. But! current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training.
Specific contribution of this paper
They trained one of the largest convolutional neural networks to date on subsets of ImageNet and achieved the best results ever reported.
Network Contains.. 📦
- A number of new and unusual features
  - improve its performance
  - reduce its training time
- Several effective techniques
  - prevent overfitting
- Five convolutional & three fully-connected layers
  - removing any convolutional layer resulted in inferior performance
🧑💼 : Results can be improved simply by waiting for faster GPUs and bigger datasets to become available..
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories
Dataset 📀 : 1.2 million training images, 50,000 validation images, and 150,000 testing images
Need to report two error rates : top-1 & top-5
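A minimal NumPy sketch (my own, illustrative names) of how the top-1 / top-5 error rates can be computed from predicted class scores:

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes.

    scores : (N, 1000) array of predicted class scores / probabilities
    labels : (N,) array of ground-truth class indices
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores
    hit = (top_k == labels[:, None]).any(axis=1)      # True if the true label is in the top k
    return 1.0 - hit.mean()

# top-1 error : top_k_error(scores, labels, 1)
# top-5 error : top_k_error(scores, labels, 5)
```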
Their system required a constant input dimensionality >> down-sampled the images to a fixed resolution of 256 x 256.
Trained their network on the (centered) raw RGB values of the pixels
The architecture of the network contains eight learned layers - 5 convolutional & 3 fully-connected
Saturating nonlinearity : $f(x) = \tanh(x)$ or $f(x) = (1 + e^{-x})^{-1}$
Non-saturating nonlinearity : $f(x) = \max(0, x)$
Non-saturating nonlinearities train much faster with gradient descent than saturating nonlinearities
ReLU = Rectified Linear Units
Deep convolutional neural networks with ReLUs train faster than with tanh units
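A tiny NumPy illustration (mine, not from the paper) of why the distinction matters: tanh flattens out for large |x|, so its gradient vanishes, while ReLU keeps a constant gradient of 1 for positive inputs:

```python
import numpy as np

def tanh(x):            # saturating: output bounded in (-1, 1)
    return np.tanh(x)

def relu(x):            # non-saturating: f(x) = max(0, x)
    return np.maximum(0.0, x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh(x))   # [-0.9999 -0.7616  0.      0.7616  0.9999]  -> nearly flat at |x| = 5
print(relu(x))   # [ 0.      0.      0.      1.      5.    ]  -> slope 1 for x > 0
```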

🧑💼 : Faster learning has a great influence on the performance of large models trained on large datasets.
Single GPU limits the maximum size of the networks that can be trained on it → Spread the net across two GPUs
🧑💼's Scheme |
- Put half of the kernels (or neurons) on each GPU
- GPUs communicate only in certain layers
- Can precisely tune the amount of communication until it is an acceptable fraction of the amount of computation
Two-GPU net is faster to train than One-GPU net
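A rough modern analogy (my sketch, not the paper's code): in PyTorch, the restricted-connectivity layers of the two-GPU scheme can be mimicked with grouped convolutions, where `groups=2` means each half of the kernels sees only half of the input maps:

```python
import torch
import torch.nn as nn

# layer where the two "GPUs" do NOT communicate:
# each half of the 256 output kernels sees only half of the 96 input maps
split_conv = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

# layer where the "GPUs" DO communicate (full connectivity across all input maps)
full_conv = nn.Conv2d(256, 384, kernel_size=3, padding=1, groups=1)

x = torch.randn(1, 96, 27, 27)
print(split_conv(x).shape)   # torch.Size([1, 256, 27, 27])
```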
ReLUs don't require input normalization to prevent them from saturating
Response-normalized activity :
Denoting by $a^i_{x,y}$ the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity, the response-normalized activity $b^i_{x,y}$ is

$$b^i_{x,y} = a^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^j_{x,y}\right)^2 \right)^{\beta}$$

where the sum runs over $n$ "adjacent" kernel maps at the same spatial position, $N$ is the total number of kernels in the layer, and $k = 2$, $n = 5$, $\alpha = 10^{-4}$, $\beta = 0.75$.
Response Normalization creates competition for big activities amongst neuron outputs computed using different kernels
Apply the ReLU nonlinearity in certain layers >> Apply this normalization
🧑💼 : Ours would be termed "brightness normalization", since we do not subtract the mean activity
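A direct NumPy transcription (loop form, purely for clarity) of the normalization formula above; the hyper-parameters k, n, alpha, beta are the values reported in the paper:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a: (N_kernels, H, W) ReLU activities; returns response-normalized b of the same shape."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)                # sum over n "adjacent" kernel maps,
        hi = min(N - 1, i + n // 2)            # clipped at the ends of the kernel axis
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```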
Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map.
- $s$ = How far apart the pooling units are placed ( pixels )
- $z$ = The size of the neighborhood that each pooling unit observes ( $z \times z$ )
- If $s = z$ : obtain traditional local pooling as commonly employed in CNNs
- If $s < z$ : obtain overlapping pooling
🧑💼 : We used $s = 2$, $z = 3$. Models with overlapping pooling find it slightly more difficult to overfit
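A minimal PyTorch comparison (illustrative only): `kernel_size` plays the role of $z$ and `stride` the role of $s$:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)

traditional = nn.MaxPool2d(kernel_size=2, stride=2)   # s = z = 2 : non-overlapping
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)   # s = 2 < z = 3 : overlapping (used in the paper)

print(traditional(x).shape)   # torch.Size([1, 96, 27, 27])
print(overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
```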

Net contains 8 layers with weights.
- Layer 1~5 : convolutional layers
- Layer 6~8 : fully-connected
- Output : Fed to a 1000-way softmax which produces a distribution over the 1000 class labels
Maximizing the multinomial logistic regression objective =
Maximizing the average across training cases of the log-probability of the correct label under the prediction distribution
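In symbols (my notation, with $z$ the outputs of the last fully-connected layer and $\theta$ the network parameters):

$$\max_{\theta}\ \frac{1}{N}\sum_{n=1}^{N}\log p\left(y_n \mid x_n;\theta\right), \qquad p(y \mid x;\theta) = \mathrm{softmax}(z)_y = \frac{e^{z_y}}{\sum_{c=1}^{1000} e^{z_c}}$$

which is the same thing as minimizing the average cross-entropy loss.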
- Conv Layer 1 → ReLU → Response Normalization → Max Pooling → Conv Layer 2
- Conv Layer 2 → ReLU → Response Normalization → Max Pooling → Conv Layer 3
- Conv Layer 3 → ReLU → Conv Layer 4
- Conv Layer 4 → ReLU → Conv Layer 5
- Conv Layer 5 → ReLU → Max Pooling → Fully-connected Layer
🧑💼 : Our neural network architecture has 60 million parameters
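A hedged PyTorch sketch of the layer sequence above: a single-model approximation in which the two-GPU split is imitated with `groups=2` in the layers with restricted connectivity, and a 227 x 227 input is used so the convolution arithmetic works out exactly:

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Rough sketch of the 8-layer architecture described above (not the authors' code)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),              # Conv 1: (227-11)/4+1 = 55
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),  # PyTorch divides alpha by size,
            nn.MaxPool2d(kernel_size=3, stride=2),                       # so constants only approximate the paper
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),  # Conv 2 (GPUs do not communicate)
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # Conv 3 (GPUs communicate)
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2), # Conv 4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2), # Conv 5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # -> (256, 6, 6)
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096),                            # FC 6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),                                       # dropout on first two FC layers
            nn.Linear(4096, 4096),                                   # FC 7
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes),                            # FC 8 -> 1000-way softmax (in the loss)
        )

    def forward(self, x):
        x = self.features(x)            # (N, 256, 6, 6) for a 227x227 input
        x = torch.flatten(x, 1)
        return self.classifier(x)       # logits; softmax is applied inside the loss

model = AlexNetSketch()
print(model(torch.randn(1, 3, 227, 227)).shape)   # torch.Size([1, 1000])
```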
To reduce overfitting on image data : Artificially enlarge the dataset using label-preserving transformations
🧑💼 : Two distinct forms of data augmentation are employed
Both allow transformed images to be produced from the original with very little computation → don't need to be stored on disk
(1) Generating image translations and horizontal reflections
Extract random 224 x 224 patches & their horizontal reflections from the 256 x 256 images → Train network on these extracted patches.
At test time, the network makes a prediction by extracting five 224 x 224 patches (4 corner + 1 center) + their horizontal reflections = 10 patches. And average the predictions made by the network's softmax layer on the 10 patches
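A NumPy sketch (function names are mine) of the training-time random crops / reflections and the test-time 10-patch averaging:

```python
import numpy as np

def random_train_patch(img):                    # img: (256, 256, 3)
    """One random 224x224 crop, randomly flipped horizontally (training time)."""
    top, left = np.random.randint(0, 33, size=2)          # 256 - 224 + 1 = 33 positions
    patch = img[top:top + 224, left:left + 224]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]                             # horizontal reflection
    return patch

def ten_crop_prediction(img, predict):          # predict: patch -> (1000,) softmax vector
    """Test time: 4 corners + center, plus their reflections, predictions averaged."""
    offsets = [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]
    crops = [img[t:t + 224, l:l + 224] for t, l in offsets]
    crops += [c[:, ::-1] for c in crops]                   # add horizontal reflections -> 10 patches
    return np.mean([predict(c) for c in crops], axis=0)
```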
(2) Altering the intensities of the RGB channels in training images
For each training image, add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from $\mathcal{N}(0,\, 0.1^2)$. To each RGB image pixel $I_{xy} = [I^{R}_{xy}, I^{G}_{xy}, I^{B}_{xy}]^{T}$ they add the following quantity :

$$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3]\,[\alpha_1 \lambda_1,\; \alpha_2 \lambda_2,\; \alpha_3 \lambda_3]^{T}$$

where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the $3 \times 3$ covariance matrix of RGB pixel values, and $\alpha_i$ is the random variable (drawn once per image).
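A NumPy sketch (my own helper, assuming the PCA of the training set's RGB pixel values has already been computed) of the color augmentation above:

```python
import numpy as np

def pca_color_augment(img, eigvecs, eigvals, sigma=0.1):
    """img: (H, W, 3) RGB image.
    eigvecs: (3, 3) matrix whose columns are p_1, p_2, p_3; eigvals: (3,) lambdas
    of the 3x3 covariance matrix of RGB pixel values over the training set."""
    alpha = np.random.normal(0.0, sigma, size=3)      # drawn once per image
    shift = eigvecs @ (alpha * eigvals)               # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return img + shift                                # same shift added to every pixel

# eigvecs / eigvals would come from PCA over all training-set RGB pixels, e.g.:
# pixels = training_images.reshape(-1, 3)
# eigvals, eigvecs = np.linalg.eigh(np.cov(pixels.T))
```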
To reduce test errors : Combining the predictions of many different models → EXPENSIVE!
Dropout : Consists of setting to zero the output of each hidden neuron with probability 0.5
- The "dopped out" neuron don't contribute to the forward pass and don't participate in back-propagation.
Dropout reduces complex co-adaptations of neurons,since a neuron cannot rely on the presence of particular other neurons → Forced to learn more rubust features that are useful in conjunction with many different random subsets of the other neuron
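A minimal NumPy sketch of this form of dropout (train-time zeroing with probability 0.5, test-time scaling of the outputs by 0.5, as described in the paper; names are mine):

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """h: hidden-layer activations."""
    if train:
        mask = np.random.rand(*h.shape) >= p   # zero each neuron's output with probability p
        return h * mask
    return h * (1.0 - p)                       # test time: use all neurons, outputs scaled by 0.5
```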
Trained models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005
Weight decay : Reduces the model's training error
The update rule :

$$v_{i+1} := 0.9 \cdot v_i \;-\; 0.0005 \cdot \epsilon \cdot w_i \;-\; \epsilon \cdot \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}$$
$$w_{i+1} := w_i + v_{i+1}$$

- $i$ : iteration index
- $v$ : momentum variable
- $\epsilon$ : learning rate
- $\left\langle \frac{\partial L}{\partial w}\big|_{w_i}\right\rangle_{D_i}$ : average over the $i$-th batch $D_i$ of the derivative of the objective with respect to $w$, evaluated at $w_i$
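A one-function NumPy sketch of this exact update rule (the learning rate value here is just a placeholder; the paper starts at 0.01 and reduces it manually):

```python
import numpy as np

def sgd_step(w, v, grad, eps=0.01, momentum=0.9, weight_decay=0.0005):
    """One step of the update rule above; `grad` is already averaged over the batch D_i."""
    v = momentum * v - weight_decay * eps * w - eps * grad   # v_{i+1}
    w = w + v                                                # w_{i+1}
    return w, v
```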
🧑💼 : We trained the network for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.


^ Convolutional kernels learned by the network's two data-connected layers
Specialization occurs during every run and is independent of any particular random weight initialization.

^ What the network has learned by computing its top-5 predictions on eight test images
🧑💼 : Even off-center objects can be recognized by the net

Another way to probe the network's visual knowledge is to consider the feature activations induced by an image at the last, 4096-dimensional hidden layer.
🧑💼 : At the pixel level, the retrieved training images are generally not close in L2 to the query images in the first column. (Retrieved dogs and elephants appear in a variety of poses)
Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient
BUT! it could be made efficient by training an auto-encoder to compress these vectors to short binary codes
Network's performance degrades if a single convolutional layer is removed. So depth really is important for achieving these results.
🧑💼 : We did not use any unsupervised pre-training even though we expect that it will help. But we still have many orders of magnitude to go in order to match the infero-temporal pathway of the human visual system.