Reading smooth_L1_loss_layer.cu: a first look at the Caffe source

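In Caffe, a layer's .cpp file holds the code that runs on the CPU, and the matching .cu file holds the code that runs on the GPU.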

This is the forward-pass part of smooth_L1_loss_layer.cu:
```
#include "caffe/fast_rcnn_layers.hpp"

namespace caffe {

template <typename Dtype>
__global__ void SmoothL1Forward(const int n, const Dtype* in, Dtype* out) {
  // f(x) = 0.5 * x^2    if |x| < 1
  //        |x| - 0.5    otherwise
  CUDA_KERNEL_LOOP(index, n) {
    Dtype val = in[index];
    Dtype abs_val = abs(val);
    if (abs_val < 1) {
      out[index] = 0.5 * val * val;
    } else {
      out[index] = abs_val - 0.5;
    }
  }
}

template <typename Dtype>
void SmoothL1LossLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  int count = bottom[0]->count();
  caffe_gpu_sub(
      count,
      bottom[0]->gpu_data(),
      bottom[1]->gpu_data(),
      diff_.mutable_gpu_data());    // d := b0 - b1
  if (has_weights_) {
    caffe_gpu_mul(
        count,
        bottom[2]->gpu_data(),
        diff_.gpu_data(),
        diff_.mutable_gpu_data());  // d := w * (b0 - b1)
  }
  SmoothL1Forward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
      count, diff_.gpu_data(), errors_.mutable_gpu_data());
  CUDA_POST_KERNEL_CHECK;

  Dtype loss;
  caffe_gpu_asum(count, errors_.gpu_data(), &loss);
  // Note: the divisor is bottom[0]->num(), not bottom[0]->count().
  top[0]->mutable_cpu_data()[0] = loss / bottom[0]->num();
}
```
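
To check the kernel's piecewise function by hand, here is a minimal CPU-side sketch of the same per-element formula. The function name smooth_l1 is my own and does not appear in Caffe; the kernel above applies this to every element of the difference in parallel.

```cpp
#include <cmath>

// CPU reference for the per-element function computed by SmoothL1Forward.
// smooth_l1 is a hypothetical helper name, not part of Caffe.
// f(x) = 0.5 * x^2    if |x| < 1
//        |x| - 0.5    otherwise
double smooth_l1(double x) {
  const double abs_x = std::fabs(x);
  return abs_x < 1.0 ? 0.5 * x * x : abs_x - 0.5;
}
```

For example, smooth_l1(0.5) falls in the quadratic branch and smooth_l1(-2.0) in the linear one; the two branches meet at |x| = 1, where both give 0.5.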
The main member variables of Blob:
```
shared_ptr<SyncedMemory> data_;
shared_ptr<SyncedMemory> diff_;
vector<int> shape_;
int count_;
int capacity_;
```
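Blob is just a basic data structure, so it has few member variables. data_ is a shared_ptr (the Boost smart pointer) to the memory that stores the data, which is mainly used in the forward pass. diff_ likewise stores the gradients used in the backward pass. shape_ stores the shape of the Blob. count_ is the number of elements currently in the Blob, that is, num * channels * height * width. capacity_ is the number of elements the allocated memory can hold; it can be larger than count_ because a Blob may be reshaped. The number of images taking part in one iteration is num(), not count(). data_ holds the activations as well as W and b; diff_ holds the residuals as well as dW and db.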
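
The relation between shape_, count_ and capacity_ can be sketched with a toy type. MiniBlob below is hypothetical and is not Caffe's Blob: a std::vector stands in for the SyncedMemory buffer, and the only behaviour mirrored is that count_ follows the shape while capacity_ grows only when the storage is too small.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy type, not Caffe's Blob.
struct MiniBlob {
  std::vector<int> shape_;
  int count_ = 0;            // elements implied by the current shape
  int capacity_ = 0;         // elements the allocated storage can hold
  std::vector<float> data_;  // stands in for the SyncedMemory buffer

  void Reshape(const std::vector<int>& shape) {
    shape_ = shape;
    count_ = 1;
    for (int dim : shape_) count_ *= dim;
    if (count_ > capacity_) {  // grow only when the storage is too small
      capacity_ = count_;
      data_.resize(static_cast<std::size_t>(capacity_));
    }
  }
};
```

Reshaping from 2x3x4x5 down to 1x3x4x5 in this sketch halves count_ but leaves capacity_ at 120, which is why the two numbers are tracked separately.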

Besides its data members, Blob has many member functions for working with the data. A few of the more important ones:

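    void Blob<Dtype>::Reshape(): changes the shape of the Blob and reallocates memory only when the memory allocated so far is too small for the new shape.

    const Dtype* Blob<Dtype>::cpu_data(): returns a pointer to the Blob's data_; the contents it points to cannot be modified through this pointer.

    const Dtype* Blob<Dtype>::cpu_diff(): returns a pointer to the Blob's diff_; the contents it points to cannot be modified through this pointer.

    Dtype* Blob<Dtype>::mutable_cpu_data(): returns a pointer to the Blob's data_ through which the contents can be modified.

    Dtype* Blob<Dtype>::mutable_cpu_diff(): returns a pointer to the Blob's diff_ through which the contents can be modified.

    void Blob<Dtype>::ShareData(const Blob& other): makes the current Blob share the data_ of the other Blob, so both use the same memory.

    void Blob<Dtype>::ShareDiff(const Blob& other): makes the current Blob share the diff_ of the other Blob.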
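
The difference between the const and mutable accessors, and what sharing means, can be illustrated with another toy type. SharedBlob is hypothetical: std::shared_ptr stands in for the Boost shared_ptr, std::vector<float> stands in for SyncedMemory, and Caffe's CPU/GPU synchronization is left out entirely.

```cpp
#include <memory>
#include <vector>

// Hypothetical toy type, not Caffe's Blob.
class SharedBlob {
 public:
  explicit SharedBlob(int count)
      : data_(std::make_shared<std::vector<float>>(count, 0.0f)) {}

  // Read-only view: the caller cannot write through this pointer.
  const float* cpu_data() const { return data_->data(); }
  // Writable view of the same storage.
  float* mutable_cpu_data() { return data_->data(); }

  // Point this blob's data at the other blob's storage (no copy is made).
  void ShareData(const SharedBlob& other) { data_ = other.data_; }

 private:
  std::shared_ptr<std::vector<float>> data_;
};
```

After b.ShareData(a) in this sketch, a write through either blob's mutable_cpu_data() is visible through the other, because both hold the same buffer.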

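The Blob class overloads count() several times. The overloads compute either the total volume of the blob or the size of a slice, that is, the product of the shape entries from start_axis up to, but not including, end_axis.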
```
inline int count(int start_axis, int end_axis)
```
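
A minimal sketch of what this overload computes, assuming the shape is passed in as a plain vector. count_axes is a made-up free function; Caffe's member function also validates the axis arguments, which is omitted here.

```cpp
#include <vector>

// Product of shape[start_axis .. end_axis - 1].
// Hypothetical free function; the bounds checks are left out.
int count_axes(const std::vector<int>& shape, int start_axis, int end_axis) {
  int count = 1;
  for (int i = start_axis; i < end_axis; ++i) count *= shape[i];
  return count;
}
```

For a shape of 2x3x4x5, the range (0, 4) covers the whole blob, while (1, 4) gives the number of elements in one sample.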
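int count = bottom[0]->count(); Here count() is called without arguments, so it returns the total number of elements in the input blob bottom[0]: every spatial position of every channel of every image in the iteration's batch.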

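top[0]->mutable_cpu_data()[0] = loss / bottom[0]->num(); num() returns the number of images taking part in one iteration (the first dimension of the blob), so this line computes the average loss per image over the batch.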

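caffe_gpu_asum(count, errors_.gpu_data(), &loss); caffe_gpu_asum computes the L1 norm of a vector, that is, the sum of the absolute values of its elements. The first argument is the number of elements to process. Every smooth L1 value is non-negative, so here the result is simply the sum of all per-element losses.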
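
The reduction and the division by num() can be sketched on the CPU as follows. Both function names are my own, and a std::vector stands in for the errors_ blob.

```cpp
#include <cmath>
#include <vector>

// Sum of absolute values (L1 norm) of a vector. Hypothetical helper.
double asum(const std::vector<double>& x) {
  double total = 0.0;
  for (double v : x) total += std::fabs(v);
  return total;
}

// Sum all per-element losses, then divide by the number of samples (num),
// not by the number of elements (count).
double mean_loss_per_sample(const std::vector<double>& errors, int num) {
  return asum(errors) / num;
}
```

With four per-element losses spread over two samples, the divisor is 2, not 4.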
```
  caffe_gpu_sub(
      count,
      bottom[0]->gpu_data(),
      bottom[1]->gpu_data(),
      diff_.mutable_gpu_data());    // d := b0 - b1
  if (has_weights_) {
    caffe_gpu_mul(
        count,
        bottom[2]->gpu_data(),
        diff_.gpu_data(),
        diff_.mutable_gpu_data());  // d := w * (b0 - b1)
  }
```
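caffe_gpu_sub and caffe_gpu_mul implement element-wise subtraction (y[i] = a[i] - b[i]) and element-wise multiplication (y[i] = a[i] * b[i]). Each output element is computed from the corresponding input elements only, so this is not a dot product. The first argument is the number of elements to process.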
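
As a CPU-side sketch of these two operations (the function names are hypothetical, chosen only to mirror the argument order used above):

```cpp
// y[i] = a[i] - b[i] for i in [0, n). Hypothetical CPU stand-in.
void elementwise_sub(int n, const float* a, const float* b, float* y) {
  for (int i = 0; i < n; ++i) y[i] = a[i] - b[i];
}

// y[i] = a[i] * b[i] for i in [0, n). Hypothetical CPU stand-in.
void elementwise_mul(int n, const float* a, const float* b, float* y) {
  for (int i = 0; i < n; ++i) y[i] = a[i] * b[i];
}
```

Because each output depends only on the inputs at the same index, the output buffer may be the same as an input buffer. That is how the layer writes w * (b0 - b1) back into diff_.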

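Summary: the smooth_L1_loss_layer loss subtracts the two inputs element by element (same image, same channel, same position), then applies smooth_L1(x) to each difference, choosing the branch by whether its absolute value is less than 1. Every element gets its own loss. All of these losses are then added up and divided by the number of images, which gives the box_loss per image.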
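
Putting the steps of the summary together, a CPU sketch of the whole forward pass might look like this. It is an illustration under assumptions, not the layer's code: the function name is made up, the inputs are flattened std::vector buffers, and an empty weights vector plays the role of has_weights_ being false.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU sketch of the forward pass described above.
// pred, target and weights are flattened [num x dim] buffers;
// pass an empty weights vector for the unweighted case.
double smooth_l1_loss(const std::vector<double>& pred,
                      const std::vector<double>& target,
                      const std::vector<double>& weights,
                      int num) {
  double sum = 0.0;
  for (std::size_t i = 0; i < pred.size(); ++i) {
    double d = pred[i] - target[i];         // d := b0 - b1
    if (!weights.empty()) d *= weights[i];  // d := w * (b0 - b1)
    const double abs_d = std::fabs(d);
    sum += abs_d < 1.0 ? 0.5 * d * d : abs_d - 0.5;  // smooth L1 per element
  }
  return sum / num;  // divide by the number of samples, not by count
}
```

A weight of 0 removes an element from the loss entirely, since smooth L1 of 0 is 0.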

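The two inputs of the loss are 1x84 vectors (this is the Fast R-CNN case; for the RPN in Faster R-CNN they are 36*w*h). The 84 values are dx, dy, dh, dw for each of the 21 classes. count is the total number of elements. The computation is applied to each pair of corresponding elements of the two inputs, which gives 84 losses, and these 84 losses are summed. That is for a single image. If the batch holds several images, the summed loss is divided by their number to get the average. (Strictly, num() is the size of the blob's first axis, which for the Fast R-CNN box branch is the number of RoIs.)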

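The smooth L1 used in Fast R-CNN and the one used in Faster R-CNN differ slightly, but not by much. Faster R-CNN uses smooth L1 in the RPN and also in its Fast R-CNN part, so its version should be compatible with the Fast R-CNN one.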
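
One difference I am aware of, stated here as an assumption and not taken from the code above, is that the py-faster-rcnn version of the layer adds a sigma parameter that moves the switch point between the quadratic and linear branches. A sketch of that variant, with a made-up function name:

```cpp
#include <cmath>

// Assumed form of the sigma-parameterized smooth L1 (hypothetical helper):
// f(x) = 0.5 * (sigma * x)^2    if |x| < 1 / sigma^2
//        |x| - 0.5 / sigma^2    otherwise
double smooth_l1_sigma(double x, double sigma) {
  const double sigma2 = sigma * sigma;
  const double abs_x = std::fabs(x);
  return abs_x < 1.0 / sigma2 ? 0.5 * x * x * sigma2 : abs_x - 0.5 / sigma2;
}
```

With sigma = 1 this reduces to the function in the kernel above, which fits the observation that the two versions are compatible.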
Original post: https://www.cnblogs.com/ymjyqsx/p/7086698.html