
Fast R-CNN revolutionizes object detection by addressing the key limitations of R-CNN. It introduces RoI Pooling, enabling shared feature maps and far more efficient computation, though its region proposals are still generated on the CPU, which limits its overall speed.
R-CNN brought significant advancements to object detection but still suffered from several key challenges.
1. Slow Training: Training R-CNN on large datasets like the PASCAL VOC dataset takes about 84 hours and consumes considerable disk space.
2. Slow Inference: Detection takes 47 seconds per image with VGG16.
3. Spatial Distortion: Warping region proposals to a fixed size (227x227) distorts objects and loses spatial information.
4. Inefficient Region Proposal Processing: Each of the 2000 region proposals is processed independently through the CNN, leading to redundant computations and slow performance.
Fast R-CNN addresses these issues by rethinking how region proposals and features are processed, unifying the pipeline into a single efficient stage and significantly improving both speed and accuracy.
Let's dig in.
Model Overview:


Fast R-CNN builds on the idea of Spatial Pyramid Pooling (SPP) from SPPnet. SPP divides the feature map into multiple pyramid levels to create fixed-size feature vectors, capturing multi-scale spatial information.
Fast R-CNN simplifies this by using RoI Pooling, which directly converts each RoI into a fixed-size 7x7 feature map. This single-level approach retains the key benefit of SPP, fixed-length features from variable-size regions, while significantly reducing computational complexity.

RoI Pooling is a key component of Fast R-CNN that converts region proposals into fixed-size feature maps. Dividing each RoI's portion of the feature map into a fixed grid and applying max pooling to each cell standardizes the output size, enabling consistent processing in the fully connected layers (FCLs). The steps are listed below, followed by a short code sketch.
Step 1) Feature Map Extraction via CNN Backbone
Step 2) Region Proposal Generation
Step 3) RoI Projection
Step 4) Grid Division
Step 5) Max Pooling
Output: Fixed-size Feature Map
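To make the steps concrete, here is a minimal NumPy sketch of RoI Pooling for a single RoI. The array shapes, the example box, and the 1/16 spatial scale (matching a stride-16 backbone like VGG16) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7), spatial_scale=1 / 16):
    """Max-pool one region proposal into a fixed-size (C, 7, 7) grid."""
    # Step 3) RoI Projection: map the image-space box onto the feature map.
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]          # (C, h, w)

    C, h, w = region.shape
    out_h, out_w = output_size
    pooled = np.empty((C, out_h, out_w), dtype=feature_map.dtype)

    # Step 4) Grid Division and Step 5) Max Pooling over each grid cell.
    for i in range(out_h):
        ys = int(np.floor(i * h / out_h))
        ye = max(int(np.ceil((i + 1) * h / out_h)), ys + 1)
        for j in range(out_w):
            xs = int(np.floor(j * w / out_w))
            xe = max(int(np.ceil((j + 1) * w / out_w)), xs + 1)
            pooled[:, i, j] = region[:, ys:ye, xs:xe].max(axis=(1, 2))
    return pooled

# Steps 1-2 are assumed done: a 512-channel feature map and one proposal box.
fmap = np.random.rand(512, 38, 50)
print(roi_pool(fmap, roi=(128, 96, 400, 320)).shape)       # -> (512, 7, 7)
```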
The fixed-size feature maps produced by RoI Pooling are flattened into feature vectors and passed through FCLs for:
- Classification: Softmax over K + 1 categories (K object classes plus one background class)
- Bounding-Box Regression: Category-specific bounding-box regressors
A Fast R-CNN network has two sibling output layers, designed to jointly handle classification and bounding-box regression. This multi-task loss framework ensures that both tasks are optimized simultaneously, providing efficient and accurate object detection.
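As a rough sketch of how the two sibling layers can be wired up (in PyTorch for convenience; the 4096-dimensional input and the class count are illustrative assumptions based on the VGG16 setting):

```python
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Two sibling output layers on top of each RoI's feature vector."""
    def __init__(self, in_features=4096, num_classes=20):   # K = 20 classes
        super().__init__()
        K = num_classes
        self.cls_score = nn.Linear(in_features, K + 1)       # softmax over K + 1
        self.bbox_pred = nn.Linear(in_features, 4 * (K + 1)) # per-class offsets

    def forward(self, x):                  # x: (num_rois, in_features)
        scores = self.cls_score(x)         # (num_rois, K + 1) class logits
        deltas = self.bbox_pred(x)         # (num_rois, 4(K + 1)) box offsets
        return scores, deltas
```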


p: Predicted class probabilities (a discrete distribution over the K + 1 categories)
u: Ground-truth class label (u = 0 for background)
t: Predicted bounding-box offsets
v: Ground-truth bounding-box regression targets
Combined Loss Function
The overall multi-task loss is defined as:

L(p, u, tᵘ, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(tᵘ, v)

where L_cls(p, u) = −log pᵤ is the log loss for the true class u, and L_loc is a smooth L1 loss between the predicted offsets tᵘ for class u and the targets v.
λ: Balancing factor for the two loss terms (commonly λ = 1)
[u ≥ 1]: Indicator ensuring that bounding-box regression is only applied to non-background RoIs (it equals 0 when u = 0, i.e., background)
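A minimal PyTorch sketch of this multi-task loss, assuming the head above outputs per-class offsets (function and variable names are my own, not the paper's):

```python
import torch
import torch.nn.functional as F

def multitask_loss(scores, deltas, labels, targets, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v)."""
    # L_cls: cross-entropy, i.e., -log p_u under the softmax over K + 1 classes.
    loss_cls = F.cross_entropy(scores, labels)

    # Pick each RoI's offsets t^u for its ground-truth class u.
    n = scores.size(0)
    t_u = deltas.view(n, -1, 4)[torch.arange(n), labels]

    # [u >= 1]: smooth L1 box loss only for non-background (u >= 1) RoIs.
    fg = labels >= 1
    loss_loc = (F.smooth_l1_loss(t_u[fg], targets[fg])
                if fg.any() else deltas.new_zeros(()))
    return loss_cls + lam * loss_loc
```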
For detection, the processing of a large number of RoIs significantly increases computational demands, with nearly half of the forward pass time spent on the fully connected layers.
To address this bottleneck, Fast R-CNN leverages Truncated Singular Value Decomposition (SVD) to compress these large fully connected layers, accelerating computations and reducing processing time effectively.
How Truncated SVD Works
1. Decomposition:
The u×v weight matrix W of a fully connected layer is approximated using its top t singular values:

W ≈ U Σₜ Vᵀ

U: A u×t matrix containing the first t left singular vectors.
Σₜ: A t×t diagonal matrix with the top t singular values.
V: A v×t matrix containing the first t right singular vectors.
2. Compression:
The original parameter count uv is reduced to t(u + v), where t ≪ min(u, v).
This compression significantly decreases the number of computations required in the fully connected layers, resulting in faster forward passes.

3. Layer Substitution:
The fully connected layer using W is replaced by two smaller layers:
First Layer: Uses Σₜ Vᵀ as weights.
Second Layer: Uses U as weights, along with the original biases.
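A small NumPy sketch of the whole procedure, using illustrative layer sizes rather than the actual VGG16 dimensions (the real fc6 layer is 4096 x 25088):

```python
import numpy as np

u, v, t = 512, 1024, 64            # illustrative sizes; t << min(u, v)
W = np.random.randn(u, v)          # original FC weight matrix
b = np.random.randn(u)             # original biases

# 1. Decomposition: keep only the top-t singular values/vectors.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W1 = np.diag(S[:t]) @ Vt[:t, :]    # first layer weights:  Sigma_t V^T (t x v)
W2 = U[:, :t]                      # second layer weights: U           (u x t)

# 3. Layer Substitution: y = W2 @ (W1 @ x) + b approximates W @ x + b.
x = np.random.randn(v)             # an RoI feature vector entering the layer
y_approx = W2 @ (W1 @ x) + b

# 2. Compression: uv parameters -> t(u + v).
print(W.size, W1.size + W2.size)   # 524288 vs. 98304
```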

Truncated SVD can reduce detection time by more than 30%, achieving significant computational efficiency with minimal impact on detection accuracy.
Fast R-CNN exhibits remarkable improvements in both accuracy and speed over R-CNN:
Mean Average Precision (mAP): Achieves about 66% mAP on PASCAL VOC 2012, compared with roughly 62% for R-CNN.
Speed Enhancement: Trains the VGG16 network 9x faster than R-CNN and runs inference 213x faster (with truncated SVD).
Revolutionary Improvements:
Fast R-CNN revolutionized object detection by addressing R-CNN's inefficiencies, significantly enhancing both training and inference speeds while maintaining strong detection accuracy.
Fast R-CNN marks a pivotal advancement in object detection by addressing the key limitations of its predecessor, R-CNN, while introducing innovative features like RoI Pooling and a one-stage training structure. Here's how Fast R-CNN overcomes the challenges faced by R-CNN:
| R-CNN Limitation | Fast R-CNN Solution |
|---|---|
| Slow Training: Training on large datasets takes about 84 hours and consumes significant disk space. | Jointly optimizes classification and bounding-box regression in a single training stage, drastically reducing training time. |
| Slow Inference: Detection takes 47 seconds per image with VGG16. | Shared computation of feature maps and RoI Pooling significantly reduce inference time (up to 213x faster). |
| Spatial Distortion: Warping region proposals to fixed size (227x227) causes spatial loss. | RoI Pooling extracts fixed-size features without distorting the input regions, preserving spatial information. |
| Inefficient Region Proposal Processing: Each of the 2,000 proposals is processed independently through the CNN, leading to redundancy. | Processes the entire image once through the CNN backbone to generate shared feature maps, avoiding redundant computations. |
While Fast R-CNN resolves many of R-CNN's inefficiencies, it introduces its own challenges:
Region Proposal Bottleneck: Region proposals are still generated by selective search, an external algorithm that takes roughly 2 seconds per image and now dominates total detection time.
GPU Utilization: Because selective search runs on the CPU, the pipeline cannot be executed end-to-end on the GPU, so the proposal stage remains the limiting factor in speed.
The most fascinating aspect of Fast R-CNN lies in how a seemingly simple change, introducing RoI Pooling to produce fixed-size feature maps at the right stage, has such a profound impact on performance. Though small in concept, this adjustment revolutionized object detection by enabling faster and more efficient computation, showing how minor innovations in architectural design can yield transformative results. It's a reminder of the elegance and power of optimizing processes at the right place.
Girshick, R. (2015). Fast R-CNN. 2015 IEEE International Conference on Computer Vision (ICCV).
https://doi.org/10.1109/iccv.2015.169
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.
https://doi.org/10.1109/tpami.2015.2389824
Brunton, S. L., & Kutz, J. N. (2019). Singular value decomposition (SVD). In Data-Driven Science and Engineering (pp. 3–46).
https://doi.org/10.1017/9781108380690.002
Li, F.-F., Johnson, J., & Yeung, S. (2017). Lecture 11: Detection and segmentation.
https://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
Grel, T. (2017, March 15). Region of interest pooling explained. deepsense.ai.
https://deepsense.ai/region-of-interest-pooling-explained