Preliminary

Let $I \in \mathcal{R}^{W \times H \times 3}$ be an input image of width $W$ and height $H$. Our aim is to produce a keypoint heatmap $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where $R$ is the output stride and $C$ is the number of keypoint types. Keypoint types include $C = 17$ human joints in human pose estimation [4, 55], or $C = 80$ object categories in object detection [30, 61]. We use the default output stride of $R = 4$ in literature [4, 40, 42]. The output stride downsamples the output prediction by a factor $R$. A prediction $\hat{Y}_{x,y,c} = 1$ corresponds to a detected keypoint, while $\hat{Y}_{x,y,c} = 0$ is background. We use several different fully-convolutional encoder-decoder networks to predict $\hat{Y}$ from an image $I$: a stacked hourglass network [30, 40], up-convolutional residual networks (ResNet) [22, 55], and deep layer aggregation (DLA) [58].
We train the keypoint prediction network following Law and Deng [30]. For each ground truth keypoint $p \in \mathcal{R}^2$ of class $c$, we compute a low-resolution equivalent $\tilde{p} = \lfloor \frac{p}{R} \rfloor$. We then splat all ground truth keypoints onto a heatmap $Y \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$ using a Gaussian kernel $Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$, where $\sigma_p$ is an object size-adaptive standard deviation [30]. If two Gaussians of the same class overlap, we take the element-wise maximum [4]. The training objective is a penalty-reduced pixelwise logistic regression with focal loss [33]:

$$L_k = \frac{-1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}) & \text{if } Y_{xyc} = 1 \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}) & \text{otherwise,} \end{cases}$$

where $\alpha$ and $\beta$ are hyper-parameters of the focal loss [33] and $N$ is the number of keypoints in image $I$.
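As a concrete illustration of this target construction, the following is a minimal numpy sketch (function name, argument layout, and shapes are ours, not the reference implementation): each keypoint is discretized to the stride-$R$ grid, rendered as a Gaussian with its size-adaptive $\sigma_p$, and overlapping Gaussians of the same class are merged with an element-wise maximum.

```python
import numpy as np

def splat_keypoints(keypoints, classes, sigmas, W, H, C, R=4):
    """Render ground-truth keypoints as Gaussians on a low-resolution heatmap.

    keypoints: list of (x, y) pixel coordinates in the input image.
    classes:   class index per keypoint.
    sigmas:    object size-adaptive standard deviation per keypoint.
    """
    Wr, Hr = W // R, H // R
    Y = np.zeros((Hr, Wr, C), dtype=np.float32)
    ys, xs = np.mgrid[0:Hr, 0:Wr]                   # coordinate grid of the output map
    for (px, py), c, sigma in zip(keypoints, classes, sigmas):
        px_t, py_t = int(px // R), int(py // R)     # low-resolution equivalent  p~ = floor(p / R)
        g = np.exp(-((xs - px_t) ** 2 + (ys - py_t) ** 2) / (2 * sigma ** 2))
        Y[:, :, c] = np.maximum(Y[:, :, c], g)      # element-wise max for overlapping Gaussians
    return Y
```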
To recover the discretization error caused by the output stride, we additionally predict a local offset $\hat{O} \in \mathcal{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ for each center point. All classes $c$ share the same offset prediction. The offset is trained with an L1 loss

$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right|.$$
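For illustration, a small PyTorch sketch of this offset supervision (the tensor shapes and helper name are assumptions on our part): the target at each keypoint is the discretization error $\frac{p}{R} - \tilde{p}$, and the L1 loss is evaluated only at the keypoint cells.

```python
import torch

def offset_targets_and_loss(o_hat, keypoints, R=4):
    """Hypothetical sketch: o_hat is the predicted offset map of shape (2, H/R, W/R),
    keypoints a float tensor of (x, y) pixel coordinates with shape (N, 2)."""
    p = keypoints / R                                  # continuous low-resolution coordinates p / R
    p_tilde = p.floor().long()                         # discretized locations  p~
    target = p - p_tilde.float()                       # discretization error  p/R - p~
    pred = o_hat[:, p_tilde[:, 1], p_tilde[:, 0]].t()  # gather predictions at the keypoint cells
    return torch.nn.functional.l1_loss(pred, target, reduction='mean')
```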
The supervision acts only at keypoint locations $\tilde{p}$; all other locations are ignored.
Objects as Points
Let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be the bounding box of object $k$ with category $c_k$. Its center point lies at $p_k = \left( \frac{x_1^{(k)} + x_2^{(k)}}{2}, \frac{y_1^{(k)} + y_2^{(k)}}{2} \right)$. We use our keypoint estimator $\hat{Y}$ to predict all center points. In addition, we regress to the object size $s_k = (x_2^{(k)} - x_1^{(k)}, y_2^{(k)} - y_1^{(k)})$ for each object $k$. To limit the computational burden, we use a single size prediction $\hat{S} \in \mathcal{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ for all object categories. We use an L1 loss at the center point similar to Objective 2:

$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|.$$
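A brief sketch of the corresponding size-target construction and center-point L1 loss, again with hypothetical tensor shapes ($\hat{S}$ as a $(2, H/R, W/R)$ tensor) and helper names:

```python
import torch

def size_targets(boxes, R=4):
    """boxes: float tensor (N, 4) of (x1, y1, x2, y2). Returns the integer
    low-resolution center cells and raw-pixel size targets s_k = (w, h)."""
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=1)
    centers_lr = (centers / R).floor().long()                # center cell on the stride-R map
    sizes = torch.stack([boxes[:, 2] - boxes[:, 0],
                         boxes[:, 3] - boxes[:, 1]], dim=1)  # raw pixel width/height, not normalized
    return centers_lr, sizes

def size_loss(s_hat, centers_lr, sizes):
    """L1 size loss evaluated only at the object centers."""
    pred = s_hat[:, centers_lr[:, 1], centers_lr[:, 0]].t()
    return torch.nn.functional.l1_loss(pred, sizes, reduction='mean')
```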
We do not normalize the scale and directly use the raw pixel coordinates. We instead scale the loss by a constant $\lambda_{size}$. The overall training objective is

$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}.$$
We set $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$ in all our experiments unless specified otherwise. We use a single network to predict the keypoints $\hat{Y}$, offset $\hat{O}$, and size $\hat{S}$. The network predicts a total of $C + 4$ outputs at each location. All outputs share a common fully-convolutional backbone network. For each modality, the features of the backbone are then passed through a separate 3 × 3 convolution, ReLU and another 1 × 1 convolution. Figure 4 shows an overview of the network output. Section 5 and supplementary material contain additional architectural details.
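The per-modality output heads can be sketched as follows; the channel widths here are illustrative placeholders rather than the exact configuration from the paper:

```python
import torch
from torch import nn

class CenterHeads(nn.Module):
    """Sketch of the output heads described above: the shared backbone feature map
    is passed through a separate 3x3 conv, ReLU and 1x1 conv per modality, giving
    C heatmap channels, 2 size channels and 2 offset channels (C + 4 in total)."""

    def __init__(self, feat_ch=64, head_ch=256, num_classes=80):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(nn.Conv2d(feat_ch, head_ch, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(head_ch, out_ch, 1))
        self.heatmap = head(num_classes)   # keypoint heatmap  Y^
        self.size = head(2)                # object size       S^
        self.offset = head(2)              # local offset      O^

    def forward(self, feat):
        # Sigmoid keeps the heatmap in [0, 1]; size and offset are raw regressions.
        return torch.sigmoid(self.heatmap(feat)), self.size(feat), self.offset(feat)
```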
Training schedule: By default, we train the keypoint estimation network for 140 epochs with a learning rate drop at 90 epochs. If we double the training epochs before dropping the learning rate, the performance further increases by 1.1 AP (Table 3d), at the cost of a much longer training schedule. To save computational resources (and polar bears), we use 140 epochs in ablation experiments, but stick with 230 epochs for DLA when comparing to other methods. Finally, we tried a multiple-“anchor” version of CenterNet by regressing to more than one object size. The experiments did not yield any success. See supplement.
Conclusion
In summary, we present a new representation for objects: as points. Our CenterNet object detector builds on successful keypoint estimation networks, finds object centers, and regresses to their size. The algorithm is simple, fast, accurate, and end-to-end differentiable without any NMS postprocessing. The idea is general and has broad applications beyond simple two-dimensional detection. CenterNet can estimate a range of additional object properties, such as pose, 3D orientation, depth and extent, in one single forward pass. Our initial experiments are encouraging and open up a new direction for real-time object recognition and related tasks.