https://arxiv.org/abs/1506.02640
YOLO, a new approach to object detection.✨
We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image.
✨ We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.
Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
😀 YOLO is refreshingly simple (Figure 1).
A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance.
This unified model has several benefits over traditional methods of object detection.
😢 YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images, it struggles to precisely localize some objects, especially small ones.
We unify the separate components of object detection into a single neural network.
Our system divides the input image into an S × S grid.
Each grid cell predicts B bounding boxes and confidence scores for those boxes.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence.
Each grid cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
At test time we multiply the conditional class probabilities and the individual box confidence predictions, which gives us class-specific confidence scores for each box.
These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
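As a rough illustration of the multiplication described above, the sketch below computes class-specific scores for a single grid cell. The function name `class_specific_scores` and the example numbers are made up for illustration and do not come from the paper.

```python
def class_specific_scores(class_probs, box_confidences):
    """Return scores[b][c] = box_confidences[b] * class_probs[c] for one grid cell."""
    return [[conf * p for p in class_probs] for conf in box_confidences]


# Hypothetical cell with C = 3 classes and B = 2 boxes.
class_probs = [0.7, 0.2, 0.1]   # conditional class probabilities, shared by both boxes
box_confidences = [0.9, 0.3]    # one confidence prediction per box

scores = class_specific_scores(class_probs, box_confidences)

# Find the (box, class) pair with the highest class-specific score.
best_box, best_class = max(
    ((b, c) for b in range(len(scores)) for c in range(len(scores[b]))),
    key=lambda bc: scores[bc[0]][bc[1]],
)
print(best_box, best_class, scores[best_box][best_class])
```

Because the class probabilities are shared across the cell, the two boxes can only differ in how strongly they score each class, never in which class ranks highest.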
For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
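The depth of 30 follows from B × 5 + C. A minimal sketch of that arithmetic, where the helper name `yolo_output_shape` is hypothetical:

```python
def yolo_output_shape(S, B, C):
    """Return the prediction tensor shape for an S x S grid."""
    # Each cell holds B boxes of 5 numbers (x, y, w, h, confidence)
    # plus one shared set of C conditional class probabilities.
    depth = B * 5 + C
    return (S, S, depth)


shape = yolo_output_shape(S=7, B=2, C=20)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)
```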
We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset.
Our network architecture is inspired by GoogLeNet.
(The full network is shown in Figure 3.)
The final output of our network is the 7 × 7 × 30 tensor of predictions.
We pretrain our convolutional layers on the ImageNet 1000-class competition dataset. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer.
We achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo.
We use the Darknet framework for all training and inference.
We then convert the model to perform detection. We add four convolutional layers and two fully connected layers with randomly initialized weights.
Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
Our final layer predicts both class probabilities and bounding box coordinates.
We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
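The parametrization above can be sketched as follows. This assumes the ground-truth box is given in pixels as a center point plus width and height, and that the cell containing the box center is the one that encodes it. The function name `encode_box` and the example box are illustrative only.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a pixel-space box (center cx, cy; size w, h) as grid-relative targets.

    Returns (row, col, x, y, w_norm, h_norm), with x, y, w_norm, h_norm in [0, 1].
    """
    # Position of the box center measured in grid-cell units.
    gx = cx / img_w * S
    gy = cy / img_h * S
    # Cell containing the center; clamp so a center on the right or
    # bottom image edge still maps to a valid cell.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    # Offsets of the center relative to that cell's top-left corner.
    x = gx - col
    y = gy - row
    # Width and height are normalized by the full image size.
    return row, col, x, y, w / img_w, h / img_h


# Hypothetical box in a 448 x 448 image (each cell is 64 pixels wide).
row, col, x, y, w_norm, h_norm = encode_box(224, 100, 128, 64, 448, 448)
print(row, col, x, y, w_norm, h_norm)
```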
We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:
$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$
We optimize for sum-squared error in the output of our model, because it is easy to optimize.
💢 However, it does not perfectly align with our goal of maximizing average precision.
😀 To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects.
→ Using two parameters, $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$.
💢 Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes.
😀 To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
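A small numeric sketch of why the square root helps: the same absolute deviation is applied to a small and a large normalized width, and the squared error is compared with and without the square root. The sizes, the deviation, and the helper `size_error` are made-up values for illustration.

```python
import math


def size_error(true_size, pred_size, use_sqrt):
    """Squared error on a box dimension, optionally taken on square roots."""
    if use_sqrt:
        return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2
    return (true_size - pred_size) ** 2


deviation = 0.05                        # same absolute error for both boxes
sizes = {"small": 0.1, "large": 0.8}    # normalized widths (hypothetical)

raw = {name: size_error(t, t + deviation, use_sqrt=False) for name, t in sizes.items()}
rooted = {name: size_error(t, t + deviation, use_sqrt=True) for name, t in sizes.items()}

print("plain squared error:", raw)
print("square-root squared error:", rooted)
```

Without the square root both boxes are penalized essentially equally. With it, the small box receives the larger penalty, which is the intended direction, although the correction is only partial.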
💢 YOLO predicts multiple bounding boxes per grid cell, but at training time we only want one bounding box predictor to be responsible for each object. (We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth.)
😀 This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
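The assignment rule can be sketched as below, assuming boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)`. Both helper names and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes have no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(predicted_boxes, truth_box):
    """Index of the predicted box with the highest IOU against the ground truth."""
    ious = [iou(p, truth_box) for p in predicted_boxes]
    return max(range(len(ious)), key=ious.__getitem__)


truth = (10, 10, 50, 50)
predictions = [(0, 0, 30, 30), (12, 8, 52, 48)]   # B = 2 boxes from one cell
print(responsible_predictor(predictions, truth))
```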
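During training we optimize the following, multi-part loss function:
$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$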
where $\mathbb{1}_{i}^{\text{obj}}$ denotes if object appears in cell $i$ and $\mathbb{1}_{ij}^{\text{obj}}$ denotes that the $j$th bounding box predictor in cell $i$ is “responsible” for that prediction.
The loss function only penalizes classification error if an object is present in that grid cell.
It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
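To make the structure of the loss concrete, here is a sketch of one grid cell's contribution in plain Python. It is not the authors' implementation. The weights 5 and 0.5 are the values the paper reports for the coordinate and no-object terms. The data layout, the name `cell_loss`, and the choice to apply the no-object term to every predictor that is not responsible for an object are assumptions made for this example.

```python
import math

LAMBDA_COORD = 5.0   # weight on coordinate error (value reported in the paper)
LAMBDA_NOOBJ = 0.5   # weight on confidence error for boxes without objects


def cell_loss(pred_boxes, pred_classes, target=None):
    """Loss contribution of one grid cell (illustrative layout, see lead-in).

    pred_boxes:   list of B tuples (x, y, w, h, confidence), with w, h >= 0
    pred_classes: list of C class probabilities
    target:       None if no object falls in the cell, otherwise a dict with
                  "box" (x, y, w, h), "classes" (list of C values),
                  "responsible" (index of the responsible predictor) and
                  "confidence" (target confidence for that predictor)
    """
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if target is not None and j == target["responsible"]:
            tx, ty, tw, th = target["box"]
            # Coordinate terms; width and height are compared via square roots.
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            loss += LAMBDA_COORD * (
                (math.sqrt(w) - math.sqrt(tw)) ** 2
                + (math.sqrt(h) - math.sqrt(th)) ** 2
            )
            # Confidence term for the responsible predictor.
            loss += (conf - target["confidence"]) ** 2
        else:
            # Down-weighted confidence term pushing the confidence toward 0.
            loss += LAMBDA_NOOBJ * conf ** 2
    if target is not None:
        # Classification term, only for cells that contain an object.
        loss += sum((p - t) ** 2 for p, t in zip(pred_classes, target["classes"]))
    return loss


# Empty cell: only the no-object confidence term contributes.
empty = cell_loss([(0.5, 0.5, 0.2, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2, 0.4)], [0.5, 0.5])
print(empty)
```

Summing `cell_loss` over all S × S cells would give the total for one image under these assumptions.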
We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012.
When testing on 2012 we also include the VOC 2007 test data for training.
Throughout training we use a batch size of 64, a momentum of 0.9, and a decay of 0.0005.
learning rate schedule
For the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.
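The schedule can be written as a small function. The text does not say how many epochs the warm-up lasts or what shape the ramp has, so `warmup_epochs=5` and the linear ramp below are assumptions, as is counting the warm-up separately from the 75 + 30 + 30 epochs that follow.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch; warmup_epochs is a hypothetical value."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from 1e-3 toward 1e-2.
        fraction = epoch / warmup_epochs
        return 1e-3 + fraction * (1e-2 - 1e-3)
    epoch -= warmup_epochs
    if epoch < 75:
        return 1e-2      # 75 epochs at the peak rate
    if epoch < 75 + 30:
        return 1e-3      # next 30 epochs
    return 1e-4          # final 30 epochs


for e in (0, 4, 5, 79, 80, 109, 110):
    print(e, learning_rate(e))
```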
overfitting
✨ Just like in training, predicting detections for a test image only requires one network evaluation.
Often it is clear which grid cell an object falls into and the network only predicts one box for each object.
However, some large objects or objects near the border of multiple cells can be well localized by multiple cells.
Non-maximal suppression can be used to fix these multiple detections.
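A minimal sketch of greedy non-maximal suppression, assuming detections are `(score, box)` pairs with corner-format boxes. The IOU threshold of 0.5 is a placeholder rather than a value from the text, and in practice the procedure is usually run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box and drop boxes that overlap it too much; repeat."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard every remaining box whose overlap with the kept box exceeds the threshold.
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.9, (10, 10, 50, 50)),
    (0.8, (12, 8, 52, 48)),       # near-duplicate of the first box
    (0.7, (100, 100, 140, 140)),  # a separate object
]
kept = non_max_suppression(detections)
print(kept)
```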
1. Deformable parts models.
→ YOLO is a single convolutional neural network.✨
→ The unified architecture is faster and more accurate than DPM. ✨
Instead of static features, the network trains the features in-line and optimizes them for the detection task.
2. R-CNN.
R-CNN and its variants use region proposals to find objects in images.
Each stage of this complex pipeline must be precisely tuned independently, and the resulting system is very slow. 😵
→ YOLO puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object.✨
→ YOLO proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. ✨
→ YOLO combines these individual components into a single, jointly optimized model.
3. Other Fast Detectors
We introduce YOLO, a unified model for object detection.
🌸 YOLO is simple to construct
🌸 YOLO can be trained directly on full images.
🌸 YOLO is trained on a loss function that directly corresponds to detection performance.
🌸 The entire model is trained jointly.
🌸 Fast YOLO is the fastest general-purpose object detector in the literature.
🌸 YOLO pushes the state-of-the-art in real-time object detection.
🌸 YOLO also generalizes well to new domains, making it ideal for applications that rely on fast, robust object detection.