Progress and Prospect of target detection technology based on deep learning

Viola-Jones face detector

One of the more successful examples of object detection in the whole computer field is the Viola-Jones face detector that appeared around 2000,which makes it a more mature technique compared to the object detection.The basic idea of this method is sliding window type,using a fixed size window to slide in the input image,and the window frame will be sent to the classifier to judge whether it is a face window or a non human face window.The size of the sliding window is fixed,but the size of the face is varied.In size of the sliding is fixed,but the size of the face is varied.In order to detect the different size of the face,it is necessary to scale the input image to different sizes,so that the face of different sizes,so that the face of different sizes can match the size of the window at a certain scale.One obvious problem with this sliding window approach is that there are too many problem with this sliding window approach is that there are too many places to check to determine whether the face is human or not.
Judging whether it is a human face,this is the two classification problem.In 2000,the AdaBoost classifier was used.When classifying,the input of the classifying,the input of the classifier is a Haar feature,which is a very simple feature that can see a lot of small black and white blocks on the graph.The Haar feature is the sum of all pixel values of the black region minus the sum of all the pixels in the white area,with this difference as a feature,black block and white.Blocks have different sizes and relative positions,which form many different Haar features.AdaBoost classifier isa strong classifier which is composed of multiple weak classifiers.The Viola-Jones detector is composed of multiple AdaBoost classifiers.This cascade is an important function of acceleration.
When face detection technology is mature in 2000,some practital applications have appeared,such as the function of face focusing in digital cameras.When the camera is photographed,the camera will automatically detect the face,and the adjust the focal length to better according to the position of the face.

Deformable component model

After Viola-Jones face detector,another important method appeared in 2009:Deformable part model(DPM),that is,deformable part model.As far as face detection is concerned,a face can be roughly regarded as a rigid body,usually without very large deformation,such as the position of the mouth to the nose.But for other objects,such as the human body,people can lift their arms up and turn their legs up,which makes the body have a very large,very large non rigid transformation,and DPM can better deal with the transformation by modeling the components.At the beginning,people tried to try to test pedestrians with the Haar feature +AdaBoost classifier ,but find the results were not good.By 2009,there was a DPM to model different parts,such as the head with arms and knees,and then classifying the parts based on the local components and the whole.That's a lot better.DPM is relatively complex and the detection speed is relatively slow,but it has achieved some results in the task of face detection and pedestrian and vehicle detection.Later,some methods of speeding up DPM were introduced to improve the detection speed.DPM gets the modeling of components,which is a good method,but is covered by the light of the deep learning.Deep learning brought a great improvement in the accuracy of detection.so some people studying DPM are also rapidly moving to deep learning.

R-CNN series

For object detection methods based on deep learning,one of those is R-CNN series,another is the combination of traditional methods and deep learning methods.

The so-called R-CNN,is based on a very simple thinking,for the input image,through selective search and other methods,we first identify 2000 windows that are most likely to contain objects,and for these 2000 windows,we want it to be able to contain objects,and for these 2000 windows,we want it to be able to achieve a very high recall of the detected object.Then each of these 2000 is used to extract and classify the features from CNN.For these 2000 areas to run a CNN,then it is very slow,and even if it takes only 0.5 seconds each time,20000 windows will take 1000 seconds.In order to SPP-net,which is to run a CNN on the whole map,without doing each window alone,but there is a small difficulty is that the size of each of the 2000 candidate windows is different.In order to solve to solve this problem,SPP-net designs the spatial pyramid pooling,which makes the different large the large windows characteristic of the same dimension.This method makes it unnecessary to compute the convolution of each candidate window,but it is still not fast enough.It still takes several seconds to detect second to detect an image.

Fast R-CNN draws on the practice of SPP-net, convolutions in the entire graph, and then uses ROI-pooling to get the fixed feature vectors, such as the size of the window, converted to 7x7 so large. Fast R-CNN also introduces an important strategy to classify the window and return the frame of the object to make the detection box more accurate. In front of us, we say that the candidate window will have a very high recall rate, but the possible location of the frame is not very accurate. For example, a person's body frame may be a lack of arm and leg, then the detection frame can be calibrated by regression, and the initial position is refined. Fast R-CNN put classification and regression together and adopted a multi task collaborative learning approach.
Faster R-CNN brings a bigger change than Fast R-CNN, and the step that will generate the candidate window is also done with the depth network, and the network is shared with the Fast R-CNN classification network. The network that produces the candidate window is called the RPN, which is the core of Faster R-CNN. RPN replaced the very slow Selective Search before, and usually the number of candidate windows used is relatively small, only 300 is enough, which makes the back classification faster. In order to detect a variety of objects, RPN introduces the design of the so-called anchor box. Specifically, on the feature graph of the last coiling layer output, RPN is first used to obtain the eigenvectors of each position with the convolution of 3X3, and then based on the eigenvector to return to 9 windows of different sizes and the ratio of length to width, if the size of the feature graph is size. It's 40x60, so there will be about more than 20 thousand windows in total, sorting these windows by credibility, and taking the first 300 as a candidate window to do the final classification. By replacing Selective Search with RPN and using a shared coiling layer and reducing the number of candidate windows at the same time, Faster R-CNN has been significantly improved in speed, and the speed of 5fps can be reached on GPU.

Regression position-YOLO&SSD

In 2015,a method called YOLO was published,which was finally published in CVPR 2016.This is a very strange way.For a given input image,YOLO,regardless of any pictures,eventually divides to 7*7 grid,that is,to get 49 windows,and then to predict two rectangular boxes in each window.This prediction is done through a fully connected layer,and YOLO predicts the 4 parameter of each rectangular boxes in each window.This prediction is done through a fully a connected layer,and YOLOpredicts the 4 parameters of each rectangular and the reliability of the object it contains,as well as the probability that it belongs to each object category.The speed of YOLO is very fast,and it can reach 45fps on GPU.

After YOLO,in 2015,Liu Wei put forward a method named SSD.One of the obvious shortcomings of the YOLO mentioned above is that at most only 7*7=49 objects can be detected at most.If there are more than 49 objects in each grid.there will be some objects undetected,YOLO will only detect one object in each grid.If two objects are placed in a grid,one of them will be missed.

In contrast,SSD uses a mechanism similar to anchor box in RPN.YOLO uses global information to return the detection box in all locations on the whole feature graph,and SSD also uses convolutions to regress all positions based on local features,and SSD also uses the characteristics of different layers,before YOLO only use the last convolution.The feature on the layer is that it is difficult to detect small scale object,and the neurons in the last convolution layer are very large,and the small scale objects are very poor.From a speed point of view,in some cases,SSD wil even be faster than YOLO and achieve 58fps speed on GPU.

Cascade CNN

In the field of object detection, there used to be a phenomenon that we need to design and learn a separate detector for each object, such as face detection and vehicle detection. The two detectors are different and the classifier is different. For each class of objects, we need to try different features and classifiers. The combination. But now, whether the R-CNN series of methods, or YOLO and SDD, do not have any restrictions on the object category, it is a very important advantage to detect human faces and detect other categories of objects at the same time. However, there are still some special methods for the detection of specific objects, such as Cascade CNN for face detection, which replace the AdaBoost classifier with CNN. In order to ensure that the speed is fast enough, it uses a very simple CNN, for example, the number of convolution kernel is very little. On the front level of cascade, you need to process the sliding window very quickly, so the use of CNN will be very simple, fewer windows at the back level, more difficult classification, and a slightly more complex CNN. At present, Cascade CNN in the open face detection evaluation set FDDB, in the generation of 100 false detection, the recall rate can reach 85%.

The Unknown Word

The First Column	The Second Column
execution	[eksi'kjution]执行
resume	恢复
resume the script exection	恢复脚本执行
pedestrain	[pe'destrien]行人
vehicle	车辆[vi:ekl]
deformable	[di'fo:mebel]可变形的
graphics	图像['graefiks]
transactions	处理
graphics transactions	图像处理
spatial	空间的['speishel]
pyramid	金字塔['piremid]
regression	回归[ri'gretion]
grid	网格[grid]
anchor
mechanism

相关阅读:
LINUX_bash
【c++】必须在类初始化列表中初始化的几种情况
 Hadoop 学习笔记（八） hadoop2.2.0 测试环境部署及两种启动方式
 hadoop各版本下载
 mapreduce (六) MapReduce实现去重 NullWritable的使用
 hadoop 生态系统版本对应问题
 mapreduce (五) MapReduce实现倒排索引修改版 combiner是把同一个机器上的多个map的结果先聚合一次
 mapreduce (四) MapReduce实现Grep+sort
ctr预估模型
 mapreduce (七) 几个实例
原文地址：https://www.cnblogs.com/hugeng007/p/9410837.html