Background
1. The authors shed light on how global average pooling layer explicitly enables the CNN to have remarkable localization ability despite trained on image-level labels.
2. Recent work has shown that the convolutional units of various layers of CNN actually behave as object detectors despite no supervision on the location of the object was provided.
3. This localizable ability is lost when fully-connected layers are used for classification.
4. The authors found: with a little tweaking, the network can retain localization ability until the final layer.
5. This technique allows the classification-trained CNN to both classify the image and localize class-specific image regions in a single forward-pass.
Main points
It is very easy to implement the net architecture. Just add a fully-connected layer after global average pooling. The derivation process is ingenious. For more details, please refer to the original paper.