Paper link: https://arxiv.org/pdf/1506.02640.pdf
Code download: https://github.com/gliese581gg/YOLO_tensorflow
- Abstract
We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
- Introduction
Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions.
Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can't see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.
- Unified Detection
We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions, Pr(Class_i | Object) ∗ Pr(Object) ∗ IOU^truth_pred = Pr(Class_i) ∗ IOU^truth_pred, which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
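As a concrete illustration, here is a minimal NumPy sketch (not the reference implementation; the within-cell tensor layout is an assumption, since implementations order the box and class entries differently) of turning the 7 × 7 × 30 output into class-specific confidence scores:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def class_confidences(pred):
    """pred: (S, S, B*5 + C) array -> (S, S, B, C) class-specific scores.

    Assumed layout per cell: B boxes of (x, y, w, h, conf), then C
    conditional class probabilities Pr(Class_i | Object).
    """
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)
    box_conf = boxes[..., 4]              # Pr(Object) * IOU per box
    class_prob = pred[..., B * 5:]        # Pr(Class_i | Object) per cell
    # Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
    return box_conf[..., None] * class_prob[:, :, None, :]

scores = class_confidences(np.random.rand(S, S, B * 5 + C))
print(scores.shape)  # (7, 7, 2, 20)
```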
2.1 Network Design
We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
The final output of our network is the 7 × 7 × 30 tensor of predictions.
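To illustrate the 1 × 1 reduction followed by 3 × 3 convolution pattern, here is a hypothetical TensorFlow/Keras fragment (a sketch only, not the paper's Darknet configuration; the filter counts are merely indicative):

```python
import tensorflow as tf

def reduction_block(x, reduce_filters, conv_filters):
    """1x1 reduction layer followed by a 3x3 convolution, as in Lin et al. [22]."""
    x = tf.keras.layers.Conv2D(reduce_filters, 1, padding="same")(x)
    x = tf.keras.layers.LeakyReLU(0.1)(x)   # leaky activation, slope 0.1
    x = tf.keras.layers.Conv2D(conv_filters, 3, padding="same")(x)
    return tf.keras.layers.LeakyReLU(0.1)(x)

inputs = tf.keras.Input(shape=(448, 448, 3))  # detection input resolution
x = reduction_block(inputs, 256, 512)         # illustrative filter counts
```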
2.2 Training
We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

φ(x) = x if x > 0, and 0.1x otherwise.
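To make the coordinate parametrization above concrete, here is a minimal sketch (the helper name and argument conventions are hypothetical) of encoding one ground-truth box into this representation:

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a ground-truth box (center cx, cy and size w, h in pixels)
    as (grid row, grid col, x, y, w, h) with all targets in [0, 1]."""
    col = int(cx / img_w * S)        # grid cell containing the box center
    row = int(cy / img_h * S)
    x = cx / img_w * S - col         # center offset within the cell
    y = cy / img_h * S - row
    return row, col, x, y, w / img_w, h / img_h

print(encode_box(224, 224, 100, 50, 448, 448))
# (3, 3, 0.5, 0.5, 0.223..., 0.111...)
```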
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal.

Also, in every image many grid cells do not contain any object. This pushes the "confidence" scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, λcoord and λnoobj, to accomplish this. We set λcoord = 5 and λnoobj = .5.
Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly. For example, a 2-pixel error on a 100-pixel-wide box changes √w by about 0.1, while the same error on a 10-pixel-wide box changes √w by about 0.3, so the small box is penalized roughly three times as heavily.
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be "responsible" for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
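A minimal sketch (helper names hypothetical) of picking the responsible predictor by IOU:

```python
import numpy as np

def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2) corners."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def responsible_predictor(cell_boxes, gt_box):
    """Index of the predictor (among the B boxes of a cell) with highest IOU."""
    return int(np.argmax([iou(b, gt_box) for b in cell_boxes]))
```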
During training we optimize the following, multi-part loss function:
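The loss itself, reconstructed in LaTeX from the definitions above (1^obj_ij denotes that predictor j in cell i is "responsible" for an object, and 1^obj_i that an object appears in cell i):

```latex
\begin{aligned}
  & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+\;& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
+\;& \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left(C_i - \hat{C}_i\right)^2 \\
+\;& \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} \left(C_i - \hat{C}_i\right)^2 \\
+\;& \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```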
Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10^-3 to 10^-2. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10^-2 for 75 epochs, then 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs.
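A minimal sketch of this schedule together with the optimizer settings from above (the shape of the initial warm-up is an assumption; the paper only says the rate is raised slowly):

```python
def learning_rate(epoch, warmup_epochs=1):
    """Piecewise schedule from the paper; linear warm-up is an assumption."""
    if epoch < warmup_epochs:                 # slowly raise 1e-3 -> 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:            # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup_epochs + 105:           # then 30 epochs at 1e-3
        return 1e-3
    return 1e-4                               # final 30 epochs at 1e-4

# Matching SGD settings: batch size 64, momentum 0.9, weight decay 0.0005.
```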
To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
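A hedged sketch of this augmentation using OpenCV and NumPy (the uniform sampling distributions are assumptions beyond what the paper states, and the matching transform of the ground-truth boxes is omitted):

```python
import cv2
import numpy as np

def augment(img):
    """Random scaling/translation up to 20% of image size, plus random
    exposure and saturation scaling by up to a factor of 1.5 in HSV."""
    h, w = img.shape[:2]
    scale = np.random.uniform(0.8, 1.2)             # up to 20% scaling
    tx = np.random.uniform(-0.2, 0.2) * w           # up to 20% translation
    ty = np.random.uniform(-0.2, 0.2) * h
    M = np.float32([[scale, 0, tx], [0, scale, ty]])
    img = cv2.warpAffine(img, M, (w, h))

    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= np.random.uniform(1 / 1.5, 1.5)  # saturation
    hsv[..., 2] *= np.random.uniform(1 / 1.5, 1.5)  # exposure (value)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```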
2.3 Inference
Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
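A minimal greedy non-maximal suppression sketch (the overlap threshold is an assumption; it reuses the iou helper from the earlier sketch):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[i], boxes[best]) < iou_threshold]
    return keep
```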
2.4 Limitations of YOLO
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.
- Comparison to Other Detection Systems
Detection pipelines generally rely on classifiers or localizers to identify objects; these are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.
a). Deformable parts models
Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.
b). R-CNN
R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.
c). Other Fast Detectors
Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.
d). Deep MultiBox
Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.
e). OverFeat
Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.
f). MultiGrasp
Our work is similar in design to work on grasp detection by Redmon et al [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.
- Experiments
First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.
4.1 Comparison to Other Real-Time Systems
Many research efforts in object detection focus on making standard detection pipelines fast [5] [38] [31] [14] [17] [28]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don't reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.
Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.
We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38]. It also is limited by DPM's relatively low accuracy on detection compared to neural network approaches.

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from real-time.

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8] In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.
4.2 VOC 2007 Error Analysis
To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.
Figure 4: Error Analysis: Fast R-CNN vs. YOLO These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).
We use the methodology and tools of Hoiem et al. [19] For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error (see the sketch after this list):
• Correct: correct class and IOU > .5
• Localization: correct class, .1 < IOU < .5
• Similar: class is similar, IOU > .1
• Other: class is wrong, IOU > .1
• Background: IOU < .1 for any object
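A minimal sketch of this categorization (function and argument names are hypothetical):

```python
def error_type(pred_class, true_class, similar_classes, iou_value):
    """Categorize one detection following Hoiem et al. [19]."""
    if pred_class == true_class and iou_value > 0.5:
        return "correct"
    if pred_class == true_class and 0.1 < iou_value < 0.5:
        return "localization"
    if pred_class in similar_classes and iou_value > 0.1:
        return "similar"
    if iou_value > 0.1:
        return "other"        # wrong class, some overlap
    return "background"       # IOU < .1 for any object
```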
Figure 4 shows the breakdown of each error type averaged across all 20 classes.
YOLO struggles to localize objects correctly. Localization errors account for more of YOLO's errors than all other sources combined. Fast R-CNN makes much fewer localization errors but far more background errors. 13.6% of its top detections are false positives that don't contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.
4.3 Combining Fast R-CNN and YOLO
YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
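The paper describes this combination only qualitatively, so the following is a hedged sketch of the idea (the exact boost term is an assumption; the iou helper is from the earlier sketch):

```python
def rescore(frcnn_boxes, frcnn_scores, yolo_boxes, yolo_probs, iou_min=0.5):
    """Boost Fast R-CNN detections that YOLO agrees with."""
    new_scores = list(frcnn_scores)
    for i, fb in enumerate(frcnn_boxes):
        for yb, yp in zip(yolo_boxes, yolo_probs):
            overlap = iou(fb, yb)
            if overlap > iou_min:              # YOLO predicts a similar box
                new_scores[i] += yp * overlap  # assumed form of the boost
    return new_scores
```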
The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.
Table 2: Model combination experiments on VOC 2007. We examine the effect of combining various models with the best version of Fast R-CNN. Other versions of Fast R-CNN provide only a small benefit while YOLO provides a significant performance boost.
The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.
Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN.
4.4 VOC 2012 Results
On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.

Table 3: PASCAL VOC 2012 Leaderboard. YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the fourth highest scoring method, with a 2.3% boost over Fast R-CNN.

Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.
4.5 Generalizability: Person Detection in Artwork
Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.
Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC2012 while on People-Art they are trained on VOC2010.
R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.
DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn't degrade as much as R-CNN, it starts from a lower AP.

YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.
- Real-Time Detection In The Wild
YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.
The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.
- Conclusion
We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.