
Faster R-CNN combines a Region Proposal Network with Fast R-CNN, enabling a fully end-to-end, CNN-based approach to generating high-quality region proposals and performing object classification and bounding-box regression. This design significantly improves both accuracy and speed compared to traditional proposal methods like Selective Search.
Conventional two-stage detection approaches often rely on external region proposal algorithms (e.g., Selective Search, Edge Boxes) for candidate regions before classification. Although effective, these approaches are computationally expensive and relatively slow. Faster R-CNN addresses these shortcomings by learning region proposals directly via a CNN, offering an end-to-end and more efficient pipeline.
Model Overview:

The RPN is the cornerstone of Faster R-CNN, replacing traditional region proposal methods such as Selective Search. Because it shares its convolutional layers with Fast R-CNN, proposal generation adds very little computational overhead.
RPN Workflow
1. Sliding Window: A small n×n (e.g., 3×3) sliding window is applied to the shared feature map to extract local features.
2. Anchor Boxes: For each sliding window position, k anchor boxes (e.g., 9 anchors) with different scales and aspect ratios are generated.
3. Binary Classification: Each anchor is classified as “object” or “background” (2k scores per position).
4. Bounding-Box Regression: Each anchor is refined by predicting four offsets (tx, ty, tw, th) that shift and scale it to better fit a nearby object (4k outputs per position); a minimal sketch of this head follows.
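The sketch below expresses this head in PyTorch: one shared 3×3 convolution followed by two sibling 1×1 convolutions for objectness scores and box offsets. It is a minimal illustration of the layer shapes, not the official implementation, and the class and variable names are my own.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head sketch: a 3x3 conv over the shared feature map,
    followed by sibling 1x1 convs for objectness and box offsets."""
    def __init__(self, in_channels=512, k=9):  # k anchors per position
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.cls_logits = nn.Conv2d(in_channels, 2 * k, kernel_size=1)   # object vs. background
        self.bbox_deltas = nn.Conv2d(in_channels, 4 * k, kernel_size=1)  # (tx, ty, tw, th)

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls_logits(x), self.bbox_deltas(x)

# A 512-channel feature map of spatial size 38x50 (roughly a 600x800 image at stride 16):
feat = torch.randn(1, 512, 38, 50)
scores, deltas = RPNHead()(feat)
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) and (1, 36, 38, 50)
```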
Why RPN Matters
Because the RPN learns proposals from data and computes them on the already-shared feature map, proposal generation becomes nearly cost-free, and the entire detection pipeline can be trained with backpropagation, removing the main speed bottleneck of earlier two-stage detectors.
Non-Maximum Suppression (NMS) After RPN
Once the RPN outputs objectness scores and refined boxes, an NMS step is applied to remove highly overlapping proposals that likely refer to the same object. This reduces redundancy and yields a more compact set of candidate regions.
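torchvision ships a ready-made NMS operator; the toy example below shows a near-duplicate proposal being suppressed. The boxes and scores are made up for illustration, and the 0.7 IoU threshold is the one the paper uses on RPN proposals.

```python
import torch
from torchvision.ops import nms

# Proposal boxes in (x1, y1, x2, y2) form with their objectness scores.
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],     # heavy overlap with the first box
                      [100., 100., 160., 160.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.7)  # indices of surviving proposals
print(boxes[keep])  # the near-duplicate second box is suppressed
```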
Fast R-CNN Detection Head
Once the RPN generates high-quality proposals, the Fast R-CNN head processes them via RoI Pooling to produce fixed-size feature maps (sketched below).
It then performs:
Multi-Class Classification: Determines the object category (plus a background class).
Bounding-Box Regression: Further refines each proposal to better match the ground-truth coordinates.
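The fixed-size crop is what lets a single fully connected head serve proposals of arbitrary shape. Below is a minimal sketch using torchvision's roi_pool; the feature map and proposal are made up, and spatial_scale = 1/16 assumes a VGG16-style backbone with stride 16.

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 512, 38, 50)                    # shared feature map
proposals = [torch.tensor([[48., 48., 320., 320.]])]  # one (x1, y1, x2, y2) box in image coords

# spatial_scale maps image coordinates onto the feature map (1/16 for VGG16).
rois = roi_pool(feat, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(rois.shape)  # (1, 512, 7, 7): fixed size regardless of the proposal's size
```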
By combining the RPN for region proposals and Fast R-CNN for detection, Faster R-CNN forms a unified, fully CNN-based framework for end-to-end object detection.

What are Anchor Boxes?
Anchor boxes are predefined bounding boxes of various shapes and sizes placed at every position of the feature map.
In practice, 3 scales × 3 aspect ratios = 9 anchors per position.
This yields multi-scale predictions from single-scale features: each anchor shape has its own classification and regression outputs, so no image or filter pyramid is needed (a generation sketch follows).
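The function below generates the 9 base anchors in NumPy; these are then shifted to every feature-map position. The base size and scales match the common VGG16 setting (anchor areas of 128², 256², and 512² pixels), and the function name is my own.

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) base anchors (x1, y1, x2, y2)
    centered at the origin."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio = height / width
            area = (base_size * scale) ** 2      # keep the area fixed per scale
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4): 3 scales x 3 aspect ratios
```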
Positive and Negative Anchors
Anchors are labeled by their IoU overlap with the ground-truth boxes: an anchor is positive if it has the highest IoU with some ground-truth box or an IoU above 0.7, negative if its IoU is below 0.3 with every ground-truth box, and anchors in between are ignored during training.
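A compact version of this labeling rule, using torchvision's box_iou; the helper name is my own, while the 0.7/0.3 thresholds are the paper's.

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor 1 (positive), 0 (negative), or -1 (ignored) by IoU."""
    iou = box_iou(anchors, gt_boxes)                             # (num_anchors, num_gt)
    max_iou, _ = iou.max(dim=1)                                  # best overlap per anchor
    labels = torch.full((len(anchors),), -1, dtype=torch.long)   # ignored by default
    labels[max_iou < neg_thresh] = 0
    labels[max_iou >= pos_thresh] = 1
    labels[iou.argmax(dim=0)] = 1  # each ground truth promotes its best-matching anchor
    return labels
```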
The RPN loss function consists of two parts:
1. Classification Loss: a binary log loss over the object/background label of each sampled anchor.
2. Regression Loss: a smooth L1 loss on the predicted offsets (tx, ty, tw, th), aligning each positive anchor with its assigned ground-truth box.
Only positive anchors contribute to the regression loss, while all sampled anchors (positive + negative) are used for classification; the two terms are normalized and balanced by a weighting factor λ (a sketch of the smooth L1 term follows).
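Smooth L1 is quadratic for small errors and linear for large ones, which makes the regression less sensitive to outliers than an L2 loss. A minimal sketch (essentially what torch.nn.functional.smooth_l1_loss computes elementwise):

```python
import torch

def smooth_l1(pred, target, beta=1.0):
    """Quadratic below beta, linear above it."""
    diff = torch.abs(pred - target)
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

# Predicted vs. target offsets (tx, ty, tw, th) for one positive anchor:
pred = torch.tensor([0.1, 0.2, -0.3, 1.5])
target = torch.zeros(4)
print(smooth_l1(pred, target))  # small diffs stay quadratic, the large one turns linear
```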
Faster R-CNN can be trained in two main ways:
1) Alternating Training
RPN and Fast R-CNN are trained in alternation so the RPN learns to generate better proposals. In the paper's four-step scheme, the RPN is trained first and its proposals are used to train a separate Fast R-CNN; the shared convolutional layers are then frozen, and the layers unique to the RPN and to the detection head are fine-tuned in turn. This iterative process improves both networks in tandem.
2) Joint (End-to-End) Training
Alternatively, the RPN and detection losses can be combined so that the shared layers, the RPN, and the detection head are optimized jointly in a single network, which trains faster while reaching comparable accuracy.
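Joint training is how modern implementations usually expose the model. The sketch below uses torchvision's Faster R-CNN (a ResNet-50-FPN variant rather than the paper's VGG16): in training mode the model returns the RPN and detection-head loss terms together, and summing them gives a single joint objective.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights=None).train()  # randomly initialized for the sketch
images = [torch.rand(3, 600, 800)]
targets = [{"boxes": torch.tensor([[100., 100., 300., 400.]]),
            "labels": torch.tensor([1])}]

loss_dict = model(images, targets)  # RPN + detection-head loss terms
loss = sum(loss_dict.values())      # one joint objective over all shared layers
loss.backward()
```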
Faster R-CNN incorporates the RPN and Fast R-CNN into a unified, fully CNN-based pipeline, delivering significant improvements in both speed and accuracy for object detection. By sharing convolutional features for both proposal generation and classification, it eliminates the need for slow external proposals and achieves near real-time performance. This framework has laid the groundwork for many subsequent two-stage detectors. I believe this architectural design offers ample room for further studies and may serve as a guiding blueprint for future research.


Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.
https://doi.org/10.1109/TPAMI.2016.2577031
Li, F.-F., Johnson, J., & Yeung, S. (2017). Lecture 11: Detection and segmentation. https://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf