Paper link: https://arxiv.org/pdf/1311.2524.pdf
- Abstract
Our approach combines two key insights:
(1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects.
# "Bottom-up region proposals" presumably means proposals generated bottom-up from the image itself, somewhat similar in spirit to DFM.
(2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
# Labeled data is often scarce in detection tasks; supervised pre-training on an auxiliary task plus domain-specific fine-tuning yields a good boost.
- Introduction
Features matter.
# The past couple of decades saw the development of various algorithms based on SIFT and HOG; looking back at the canonical visual recognition tasks, progress slowed between 2010 and 2012.
# CNNs saw a wave of popularity in the 1990s but faded with the rise of the SVM. Then, in 2012, Krizhevsky won the ILSVRC task by a wide margin, rekindling interest in CNNs; that success owed much to 1.2 million labeled images and to LeCun's CNN.
The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?
# The focus of discussion then became: to what extent can the CNN's classification success on ImageNet carry over to object detection?
This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.
# This paper shows that a CNN has a huge advantage over HOG-based algorithms on the PASCAL VOC object detection task. To get there it focuses on two problems: localizing objects with a deep network, and training a high-capacity model with only a small amount of annotated detection data.
One approach frames localization as a regression problem. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image.
# One approach treats localization as a regression problem; a sliding-window approach was also considered, but the network here is deep, with large receptive fields (195×195) and strides (32×32), which makes precise localization within the sliding-window paradigm very challenging.
At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape.
# At test time, around 2000 category-independent region proposals are generated per image; a CNN extracts a fixed-length feature vector from each proposal, and category-specific linear SVMs classify each region. A simple technique (affine image warping) computes fixed-size CNN input data from each region proposal regardless of its shape.
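To make the pipeline concrete, here is a minimal Python sketch of test-time detection. `selective_search`, `warp_region`, `nms`, and the `cnn` object are assumed helper names (warping and NMS are sketched further below in these notes); the NMS threshold default is an assumption.

```python
import numpy as np

def detect(image, cnn, class_svms, nms_thresh=0.3):
    """Sketch of R-CNN test-time detection (helper names are assumptions).

    1. ~2000 category-independent proposals via selective search.
    2. Warp each proposal and extract a 4096-d CNN feature vector.
    3. Score every feature vector with each class-specific linear SVM.
    4. Greedy per-class non-maximum suppression over the scored boxes.
    """
    proposals = selective_search(image)                 # ~2000 boxes, "fast mode"
    feats = np.stack([cnn.forward(warp_region(image, box))
                      for box in proposals])            # shape (N, 4096)
    detections = {}
    for cls, (w, b) in class_svms.items():              # one linear SVM per class
        scores = feats @ w + b
        keep = nms(np.asarray(proposals), scores, nms_thresh)
        detections[cls] = [(proposals[i], float(scores[i])) for i in keep]
    return detections
```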
A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (e.g., [35]).
# The second challenge is that labeled data is so scarce that it is insufficient for training a large CNN. The conventional solution is unsupervised pre-training followed by supervised fine-tuning.
The second principle contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce.
# The paper's second principal contribution is showing that supervised pre-training on a large auxiliary dataset, followed by domain-specific fine-tuning on a small dataset, is an effective paradigm for training a large CNN when labeled data is scarce.
Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. [23].
# Understanding the failure modes of the method is also critical for improving it, so results from the detection analysis tool of Hoiem et al. are reported.
- Object detection with R-CNN
Our object detection system consists of three modules.
# Our object detection system consists of three modules.
The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector.
# The first module generates category-independent region proposals; these define the set of candidate detections available to the detector.
The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
# The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
The third module is a set of class-specific linear SVMs.
# The third module is a set of class-specific linear SVM classifiers.
2.1. Module design
a) Region proposals
A variety of recent papers offer methods for generating category-independent region proposals.
# Many recent papers offer methods for generating category-independent region proposals.
While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [39, 41]).
# R-CNN is agnostic to the particular region proposal method; selective search is used here to enable a controlled comparison with prior detection work.
b) Feature extraction
We extract a 4096-dimensional feature vector from each region proposal using the Caffe [24] implementation of the CNN described by Krizhevsky et al. [25].
# A 4096-dimensional feature vector is extracted from each region proposal using the Caffe-based implementation of the CNN described by Krizhevsky et al. in [25].
Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [24, 25] for more network architecture details.
# The feature vector is computed by mean-subtracting a 227×227 RGB image and forward-propagating it through five convolutional layers and two fully connected layers; see [24, 25] for more architectural details.
In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size).
# To compute features for a region proposal, the image data in the region must first be converted into a form the CNN accepts (the architecture requires a fixed 227×227 input resolution).
Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size.
# Of the many possible ways to transform arbitrarily shaped regions, the simplest is chosen: regardless of the candidate region's size or aspect ratio, all pixels in a tight bounding box around it are warped to the required size.
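As a concrete illustration, a minimal warp in Python with OpenCV. Appendix A of the paper additionally dilates the box so that 16 pixels of warped image context surround the original region; the mean-padding of context that falls outside the image is omitted here and simply clipped.

```python
import cv2

def warp_region(image, box, out_size=227, context_pad=16):
    """Crop a proposal and anisotropically warp it to the CNN input size.

    The box is dilated so that, after warping, about context_pad pixels of
    surrounding image context remain on each side (per Appendix A).
    """
    x1, y1, x2, y2 = box
    scale = out_size / float(out_size - 2 * context_pad)   # 227/195
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) * scale / 2.0, (y2 - y1) * scale / 2.0
    X1, Y1 = max(int(cx - half_w), 0), max(int(cy - half_h), 0)
    X2 = min(int(cx + half_w), image.shape[1])
    Y2 = min(int(cy + half_h), image.shape[0])
    return cv2.resize(image[Y1:Y2, X1:X2], (out_size, out_size))
```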
2.2. Test-time detection
At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments).
# At test time, selective search extracts around 2000 region proposals from the test image (the experiments use its "fast mode").
We warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class.
# Each proposal is warped and forward-propagated through the CNN to extract features. Then, for each class, every extracted feature vector is scored with the SVM trained for that class.
Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.
# Once all regions in an image are scored, greedy non-maximum suppression is applied (independently per class): a candidate region is rejected when its IoU overlap with a higher-scoring selected region exceeds a learned threshold.
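Greedy per-class NMS is simple enough to sketch in full; this NumPy version is a standard implementation, not the paper's code:

```python
import numpy as np

def nms(boxes, scores, iou_thresh):
    """Greedy per-class non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) SVM scores.
    Returns indices of the boxes to keep, highest score first.
    """
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU of the kept box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0]) *
                 (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + areas - inter)
        # drop every remaining box that overlaps the kept box too much
        order = rest[iou <= iou_thresh]
    return keep
```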
a) Run-time analysis.
Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings.
# Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are lower-dimensional than those of other common approaches, such as spatial pyramids with bag-of-visual-words encodings.
2.3. Training
a) Supervised pre-training.
We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data).
# The CNN is discriminatively pre-trained on a large auxiliary dataset (ILSVRC2012 classification) that has image-level annotations only (no bounding-box labels).
In brief, our CNN nearly matches the performance of Krizhevsky et al. [25], obtaining a top-1 error rate 2.2 percentage points higher on the ILSVRC2012 classification validation set. This discrepancy is due to simplifications in the training process.
# In brief, this CNN nearly matches the performance of Krizhevsky et al. [25], with a top-1 error rate 2.2 percentage points higher on the ILSVRC2012 classification validation set; the discrepancy comes from simplifications in the training process.
b) Domain-specific fine-tuning.
In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.
# In each SGD iteration, 32 positive windows (over all classes) and 96 background windows are sampled to build a mini-batch of size 128. Sampling is biased toward positive windows because they are extremely rare compared to background.
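A minimal sketch of this biased sampling (for fine-tuning, proposals with IoU ≥ 0.5 against a ground-truth box count as positives for that box's class, the rest as background):

```python
import numpy as np

def sample_finetune_batch(pos_idx, bg_idx, n_pos=32, n_bg=96,
                          rng=np.random.default_rng()):
    """Sample a 128-window SGD mini-batch: 32 positives + 96 background.

    pos_idx/bg_idx: indices of windows labeled positive (IoU >= 0.5 with a
    ground-truth box) or background. Sampling is biased toward positives
    because they are rare; replace=True guards against tiny pools.
    """
    pos = rng.choice(pos_idx, size=n_pos, replace=len(pos_idx) < n_pos)
    bg = rng.choice(bg_idx, size=n_bg, replace=len(bg_idx) < n_bg)
    return np.concatenate([pos, bg])
```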
c) Object category classifiers
Consider training a binary classifier to detect cars. It’s clear that an image region tightly enclosing a car should be a positive example. Similarly, it’s clear that a background region, which has nothing to do with cars, should be a negative example.
# Consider training a binary classifier to detect cars: a region tightly enclosing a car should clearly be a positive example, and a background region with no overlap with any car should clearly be a negative.
Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives.
# What is less clear is how to label a region that partially overlaps a car. This is resolved with an IoU overlap threshold: regions below the threshold are defined as negatives.
We found that selecting this threshold carefully is important. Setting it to 0.5, as in [39], decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by 4 points.
# Choosing this threshold carefully turns out to be important: setting it to 0.5, as in [39], decreased mAP by 5 points, and setting it to 0 decreased mAP by 4 points. (The paper settles on 0.3, found by grid search.)
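A sketch of the labeling rule for SVM training. Per the paper, positives are the ground-truth boxes themselves, negatives are proposals below the IoU threshold against every ground truth, and everything in between is ignored:

```python
def iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def svm_label(proposal, gt_boxes, neg_thresh=0.3):
    """Label one proposal for SVM training: -1 if it falls below the IoU
    threshold against every ground-truth box, else 0 (ignored). Positive
    examples are only the ground-truth boxes themselves."""
    best = max((iou(proposal, g) for g in gt_boxes), default=0.0)
    return -1 if best < neg_thresh else 0
```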
Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [17, 37].
# Once features are extracted and training labels applied, one linear SVM per class is optimized. Because the training data is too large to fit in memory, the standard hard negative mining method is adopted.
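A minimal sketch of standard hard negative mining, using scikit-learn as a stand-in for the paper's SVM solver (the C value, round count, and initial cache size are assumptions; in the real system negative features are streamed from disk rather than held in one array):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_detector_svm(pos_feats, neg_feats, C=1e-3, rounds=3, init_cache=5000):
    """Train one class-specific linear SVM with hard negative mining (sketch).

    Start from a random subset of negatives; after each round, add every
    negative the current model scores inside or above the margin
    (decision value > -1) and retrain on the enlarged cache.
    """
    rng = np.random.default_rng(0)
    idx = rng.choice(len(neg_feats), size=min(init_cache, len(neg_feats)),
                     replace=False)
    active = set(idx.tolist())
    svm = None
    for _ in range(rounds):
        cache = neg_feats[sorted(active)]
        X = np.vstack([pos_feats, cache])
        y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(cache))])
        svm = LinearSVC(C=C).fit(X, y)
        hard = np.where(svm.decision_function(neg_feats) > -1)[0]
        active.update(hard.tolist())
    return svm
```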
2.4. Results on PASCAL VOC 2010-12
We compare our method against four strong baselines, including SegDPM [18], which combines DPM detectors with the output of a semantic segmentation system [4] and uses additional inter-detector context and image-classifier rescoring.
# The method is compared against four strong baselines, including SegDPM [18], which combines DPM detectors with the output of a semantic segmentation system [4] and additionally uses inter-detector context and image-classifier rescoring.
The most germane comparison is to the UVA system from Uijlings et al. [39], since our systems use the same region proposal algorithm.
# The most relevant comparison is to the UVA system of Uijlings et al. [39], since both systems use the same region proposal algorithm.
2.5. Results on ILSVRC2013 detection
Most of the competing submissions (OverFeat, NEC-MU, UvA-Euvision, Toronto A, and UIUC-IFP) used convolutional neural networks, indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes.
# Most of the competing submissions (OverFeat, NEC-MU, UvA-Euvision, Toronto A, and UIUC-IFP) used convolutional neural networks, showing that there is significant nuance in how a CNN is applied to object detection, and these differences lead to greatly varying results.
- Visualization, ablation, and modes of error
3.1. Visualizing learned features
First-layer filters can be visualized directly and are easy to understand [25]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [42].
# First-layer filters can be visualized directly and are easy to understand: they capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging; Zeiler and Fergus present a visually attractive deconvolutional approach in [42].
We visualize units from layer pool5, which is the max-pooled output of the network’s fifth and final convolutional layer.
# Units are visualized from layer pool5, the max-pooled output of the network's fifth and final convolutional layer.
The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.
# The network seems to learn a representation that combines a small number of class-tuned features with a distributed representation of shape, texture, color, and material properties; the following fully connected layer fc6 can model a large set of compositions of these rich features.
3.2. Ablation studies
a) Performance layer-by-layer, without fine-tuning.
To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the CNN's last three layers.
# To understand which layers are critical for detection performance, results on the VOC 2007 dataset are analyzed for each of the CNN's last three layers.
We start by looking at results from the CNN without fine-tuning on PASCAL, i.e. all CNN parameters were pre-trained on ILSVRC 2012 only.
# Start with results from the CNN without fine-tuning on PASCAL, i.e. with all CNN parameters pre-trained only on ILSVRC 2012.
Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN's parameters can be removed without degrading mAP.
# Layer-by-layer comparison shows that fc7's features generalize worse than fc6's. This means that 29% of the CNN's parameters, about 16.8 million, can be removed without degrading mAP.
More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN’s parameters.
# More surprising still, removing both fc7 and fc6 also gives quite good results, even though pool5 features are computed using only 6% of the CNN's parameters.
Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers.
# Much of the CNN's representational power comes from its convolutional layers rather than from the much larger densely connected layers.
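The parameter bookkeeping behind these percentages is easy to reproduce (biases ignored; the ~58M total is a rough figure for the layers used here):

```python
pool5_units = 6 * 6 * 256            # pool5 output: a 9216-d feature
fc6_weights = pool5_units * 4096     # ~37.7M parameters
fc7_weights = 4096 * 4096            # ~16.8M parameters -> the "29%"
total = 58_000_000                   # rough total for the layers used here
print(fc7_weights / total)                      # ~0.29: dropped with fc7
print(1 - (fc6_weights + fc7_weights) / total)  # ~0.06: conv layers' share
```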
b) Performance layer-by-layer, with fine-tuning.
We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2 rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2%.
# Now consider the CNN after fine-tuning its parameters on VOC 2007 trainval. The improvement is striking: fine-tuning raises mAP by 8.0 percentage points to 54.2%.
The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.
# The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests the pool5 features learned from ImageNet are general, and that most of the improvement comes from learning domain-specific non-linear classifiers on top of them.
c) Comparison to recent feature learning methods.
All R-CNN variants strongly outperform the three DPM baselines (Table 2 rows 8-10), including the two that use feature learning.
# All R-CNN variants clearly outperform the three DPM baselines, including the two DPMs that use feature learning.
3.3. Network architectures
Most results in this paper use the network architecture from Krizhevsky et al. [25]. However, we have found that the choice of architecture has a large effect on R-CNN detection performance.
# Most results in the paper use the network architecture of Krizhevsky et al. [25]; however, the choice of architecture has a large effect on R-CNN detection performance.
In Table 3 we show results on VOC 2007 test using the 16-layer deep network recently proposed by Simonyan and Zisserman [43]. This network was one of the top performers in the recent ILSVRC 2014 classification challenge.
# Table 3 shows results on VOC 2007 test using the 16-layer deep network recently proposed by Simonyan and Zisserman in [43]; this network was one of the top performers in the ILSVRC 2014 classification challenge.
We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
# This network is called "O-Net" for OxfordNet, and the baseline "T-Net" for TorontoNet.
The results in Table 3 show that R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
# The results in Table 3 show that R-CNN with O-Net substantially outperforms R-CNN with T-Net, raising mAP from 58.5% to 66.0%.
3.4. Detection error analysis
We applied the excellent detection analysis tool from Hoiem et al. [23] in order to reveal our method's error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM.
# The excellent detection analysis tool of Hoiem et al. [23] is applied to reveal the method's error modes, understand how fine-tuning changes them, and compare the error types with DPM's.
3.5. Bounding-box regression
Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in DPM [17], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal.
# Based on the error analysis, a simple method is implemented to reduce localization errors. Inspired by the bounding-box regression used in DPM, a linear regression model is trained to predict a new detection window given the pool5 features of a selective search region proposal.
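The regression targets, as parameterized in the paper's Appendix C, are scale-invariant translations of the box center and log-space scalings of width and height; a ridge regressor maps pool5 features to these four targets. A minimal sketch:

```python
import numpy as np

def to_center(box):
    """[x1, y1, x2, y2] -> (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1

def regression_targets(proposal, gt):
    """Targets (t_x, t_y, t_w, t_h) mapping proposal P onto ground truth G:
    t_x = (G_x - P_x) / P_w,   t_y = (G_y - P_y) / P_h,
    t_w = log(G_w / P_w),      t_h = log(G_h / P_h)."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(gt)
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])
```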
3.6. Qualitative results
(skipped)
- The ILSVRC2013 detection dataset
4.1. Dataset overview
The ILSVRC2013 detection dataset is split into three sets: train (395,918), val (20,121), and test (40,152), where the number of images in each set is in parentheses.
# The ILSVRC2013 detection dataset is divided into three subsets: train (395,918), val (20,121), and test (40,152); the number in parentheses is the image count of each set.
The val and test splits are drawn from the same image distribution. These images are scene-like and similar in complexity (number of objects, amount of clutter, pose variability, etc.) to PASCAL VOC images.
# The val and test splits are drawn from the same image distribution. These images are scene-like and similar in complexity (number of objects, amount of clutter, pose variability, etc.) to PASCAL VOC images.
The val and test splits are exhaustively annotated, meaning that in each image all instances from all 200 classes are labeled with bounding boxes.
# The val and test splits are exhaustively annotated, meaning that in every image all instances from all 200 classes are labeled with bounding boxes.
The nature of these splits presents a number of choices for training R-CNN. The train images cannot be used for hard negative mining, because annotations are not exhaustive. Where should negative examples come from? Also, the train images have different statistics than val and test.
# The nature of these splits presents a number of choices for training R-CNN. The train images cannot be used for hard negative mining because their annotations are not exhaustive, so where should negative examples come from? Moreover, the train images have different statistics from val and test.
Our general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples.
# The general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples.
4.2. Region proposals
We followed the same region proposal approach that was used for detection on PASCAL. Selective search [39] was run in "fast mode" on each image in val1, val2, and test (but not on images in train). One minor modification was required to deal with the fact that selective search is not scale invariant and so the number of regions produced depends on the image resolution. ILSVRC image sizes range from very small to a few that are several mega-pixels, and so we resized each image to a fixed width (500 pixels) before running selective search.
# The same region proposal approach as on PASCAL is used: selective search in "fast mode" on every image in val1, val2, and test (but not on train). One minor modification was needed because selective search is not scale invariant, so the number of regions produced depends on image resolution; ILSVRC images range from very small to several megapixels, so each image is resized to a fixed width (500 pixels) before running selective search.
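For example, a small helper with OpenCV (a sketch; the interpolation default is an assumption):

```python
import cv2

def resize_to_width(image, width=500):
    """Rescale so selective search always sees a fixed-width image."""
    h, w = image.shape[:2]
    return cv2.resize(image, (width, int(round(h * width / float(w)))))
```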
4.3. Training data
For training data, we formed a set of images and boxes that includes all selective search and ground-truth boxes from val1 together with up to N ground-truth boxes per class from train (if a class has fewer than N ground-truth boxes in train, then we take all of them). We'll call this dataset of images and boxes val1+trainN.
# The training data consists of all selective search and ground-truth boxes from val1, plus up to N ground-truth boxes per class from train (if a class has fewer than N ground-truth boxes in train, all of them are taken). This dataset of images and boxes is called val1+trainN.
Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training.
# Training data is used in three R-CNN procedures: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training.
4.4. Validation and evaluation
Before submitting results to the evaluation server, we validated data usage choices and the effect of fine-tuning and bounding-box regression on the val2 set using the training data described above.
# Before submitting to the evaluation server, the data-usage choices and the effect of fine-tuning and bounding-box regression were validated on val2 using the training data described above.
All system hyperparameters (e.g., SVM C hyperparameters, padding used in region warping, NMS thresholds, bounding-box regression hyperparameters) were fixed at the same values used for PASCAL.
# All system hyperparameters (e.g., SVM C, the padding used in region warping, NMS thresholds, bounding-box regression hyperparameters) were kept at the values used for PASCAL.
Undoubtedly some of these hyperparameter choices are slightly suboptimal for ILSVRC, however the goal of this work was to produce a preliminary R-CNN result on ILSVRC without extensive dataset tuning.
# Some of these choices are undoubtedly slightly suboptimal for ILSVRC, but the goal was a preliminary R-CNN result on ILSVRC without extensive dataset tuning.
4.5. Ablation study
4.6. Relationship to OverFeat
- Semantic segmentation
a)CNN features for segmentation.
b)Results on VOC 2011
- Conclusion
- Appendix
A. Object proposal transformations
B. Positive vs. negative examples and softmax
C. Bounding-box regression
D. Additional feature visualizations
E. Per-category segmentation results
F. Analysis of cross-dataset redundancy
G. Document changelog