ncnn + PPYOLOv2 combined for the first time! The most detailed code walkthrough on the web (2) - Alibaba Cloud Developer Community

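The forward() function is the forward pass of the deformable convolution. bottom_blobs holds its inputs: three blobs (inputs, offset, mask) mean DCNv2, and two blobs (inputs, offset) mean DCNv1. The code first computes out_h and out_w, the height and width of the output feature map, then allocates the output tensor output and obtains weight_ptr and bias_ptr, the pointers to the layer's weights and biases. Finally it enters the for loops.

The 1st for loop slides the convolution window along h, out_h times. The 2nd for loop slides it along w, out_w times; the h_in and w_in computed there are the y and x coordinates of the window's top-left sampling point in the padded inputs (inputs does not actually need to be padded: as you will see, when a sampling point falls outside inputs, the sampled pixel is forced to 0). The 3rd for loop fills each channel of the output feature map, num_output times; sum starts at 0, or at bias_ptr[oc] (the oc-th bias) when bias is used.

The 4th, 5th and 6th for loops iterate over the kernel height, kernel width and input channels. For every kernel sampling point and every channel they multiply the weight by the pixel val taken from the corresponding position of inputs (obtained by bilinear interpolation) and accumulate the product into sum. offset_h and offset_w are the y and x offsets of the current sampling point, and mask_ is the importance of the interpolated val. The actual sampling position is

h_im = h_in + i * dilation_h + offset_h (window top-left y + in-kernel y offset + learned y offset)
w_im = w_in + j * dilation_w + offset_w (window top-left x + in-kernel x offset + learned x offset)

Next, the code computes h_low, w_low, h_high, w_high (h_im and w_im rounded down and up) and w1, w2, w3, w4 (the bilinear weights of the 4 neighbouring pixels). Note that these must not be computed inside for (int c_im = 0; c_im < in_c; c_im++){}: the sampling position h_im, w_im is the same for every input channel, so h_low, w_low, h_high, w_high, w1, w2, w3, w4 are the same too. Computing them once up front avoids repeating the work per input channel and makes the layer faster.

The 6th for loop visits each input channel and computes the sampled pixel val, which is 0 if the sampling position lies outside inputs. Comparing cond with v1_cond, v2_cond, v3_cond, v4_cond shows that the boundary of cond is slightly wider than theirs; for example, when h_im and w_im are both just above -1 (say -0.5), cond is true. This is because h_im and w_im are rounded down and up, and the rounded-up sampling point is (0, 0), which lies inside inputs, so cond has to accept a slightly wider range than the per-pixel conditions. Once val is known, val * mask_ * weight_ptr[((oc * in_c + c_im) * kernel_h + i) * kernel_w + j] is accumulated into sum.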
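The sampling step described above can be sketched in isolation. The following is a minimal illustration, not the ncnn source: the names BilinearTap, make_tap and sample are hypothetical, the feature map is assumed to be a single row-major channel, h_high and w_high are taken as h_low + 1 and w_low + 1, and cond is assumed to be the usual "strictly greater than -1 and strictly less than the size" test. It shows why the tap (rounded coordinates and the four weights) can be built once per sampling point and reused for every input channel.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper type: everything that depends only on (h_im, w_im),
// so it can be computed once and shared by all input channels.
struct BilinearTap
{
    int h_low, w_low, h_high, w_high;
    float w1, w2, w3, w4; // weights of (low,low), (low,high), (high,low), (high,high)
    bool cond;            // true if at least one neighbour may lie inside the map
};

BilinearTap make_tap(float h_im, float w_im, int in_h, int in_w)
{
    BilinearTap t;
    // Assumed boundary test: slightly wider than the per-pixel in-range checks.
    t.cond = h_im > -1.f && w_im > -1.f && h_im < in_h && w_im < in_w;
    t.h_low = static_cast<int>(std::floor(h_im));
    t.w_low = static_cast<int>(std::floor(w_im));
    t.h_high = t.h_low + 1;
    t.w_high = t.w_low + 1;
    const float lh = h_im - t.h_low; // distance from the upper row
    const float lw = w_im - t.w_low; // distance from the left column
    const float hh = 1.f - lh;
    const float hw = 1.f - lw;
    t.w1 = hh * hw;
    t.w2 = hh * lw;
    t.w3 = lh * hw;
    t.w4 = lh * lw;
    return t;
}

// Samples one channel (row-major, in_h x in_w); out-of-range pixels count as 0.
float sample(const std::vector<float>& channel, int in_h, int in_w, const BilinearTap& t)
{
    assert(static_cast<int>(channel.size()) == in_h * in_w);
    if (!t.cond)
        return 0.f;
    auto at = [&](int y, int x) -> float {
        const bool inside = y >= 0 && y < in_h && x >= 0 && x < in_w;
        return inside ? channel[y * in_w + x] : 0.f;
    };
    return t.w1 * at(t.h_low, t.w_low) + t.w2 * at(t.h_low, t.w_high)
         + t.w3 * at(t.h_high, t.w_low) + t.w4 * at(t.h_high, t.w_high);
}
```

In a full layer, make_tap would be called once inside the kernel-position loops and sample once per input channel, with the result multiplied by mask_ and the weight before being added to sum.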

PPYOLOv2 output decoding

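Decoding PPYOLOv2's output is somewhat more complex than decoding YOLOv3's, because it uses iou_aware and Grid Sensitive. YOLOv3 outputs 3 feature maps, representing predictions at 3 receptive-field sizes (large, medium, small); each cell of each feature map outputs 3 bboxes, which are decoded against 3 clustered anchors. With an 80-class dataset, each YOLOv3 feature map has 3 * (4+1+80) channels: 3 for the 3 bboxes per cell, 4 for the undecoded xywh, 1 for the undecoded objness, and 80 for the undecoded conditional probabilities of the 80 classes. PPYOLOv2 uses iou_aware, so each feature map has 3 * (1+4+1+80) channels, i.e. every bbox gains one extra ioup attribute. That gives 258 channels in total, but the first 3 channels hold the ioup of the 3 bboxes, and the remaining 255 channels are laid out exactly as in YOLOv3. Reading the IouAwareLoss code shows that ioup is trained with F.binary_cross_entropy_with_logits(), so it has to be passed through sigmoid() when decoding. Its supervision signal is the IoU between the current predicted box and the gt it is learning, so ioup is in effect a prediction of that IoU, and naturally the larger it is the better. In mmdet (ppdet), the output is decoded by a roundabout route: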

# mmdet/models/heads/yolov3_head.py

...

                if self.iou_aware:
                    na = len(self.anchors[i])
                    ioup, x = out[:, 0:na, :, :], out[:, na:, :, :]
                    b, c, h, w = x.shape
                    no = c // na
                    x = x.reshape((b, na, no, h * w))
                    ioup = ioup.reshape((b, na, 1, h * w))
                    obj = x[:, :, 4:5, :]
                    ioup = torch.sigmoid(ioup)
                    obj = torch.sigmoid(obj)
                    obj_t = (obj**(1 - self.iou_aware_factor)) * (
                        ioup**self.iou_aware_factor)
                    obj_t = _de_sigmoid(obj_t)
                    loc_t = x[:, :, :4, :]
                    cls_t = x[:, :, 5:, :]
                    y_t = torch.cat([loc_t, obj_t, cls_t], 2)
                    out = y_t.reshape((b, c, h, w))
                box, score = paddle_yolo_box(out, self._anchors[self.anchor_masks[i]], self.downsample[i],
                                             self.num_classes, self.scale_x_y, im_size, self.clip_bbox,
                                             conf_thresh=self.nms_cfg['score_threshold'])

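That is, ioup and obj are each passed through sigmoid, and obj_t = (obj ** (1 - self.iou_aware_factor)) * (ioup ** self.iou_aware_factor) becomes the new obj. The new obj is then mapped back to its undecoded state with the inverse of sigmoid, and this undecoded obj is written back into x. The final out has 255 channels, so it can be decoded just like the output of the original YOLOv3.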
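The fuse-then-undo step can be written out as a small sketch. The helper names below are hypothetical: de_sigmoid only stands in for the _de_sigmoid call in the Python listing (whose body is not shown in this article), and the clamping constant eps is an assumption added to keep the logarithm finite.

```cpp
#include <cassert>
#include <cmath>

inline float sigmoid(float x)
{
    return 1.f / (1.f + std::exp(-x));
}

// Inverse of sigmoid (the logit). Hypothetical stand-in for _de_sigmoid;
// eps is an assumed clamp that avoids log(0) and division by zero.
inline float de_sigmoid(float p, float eps = 1e-6f)
{
    p = std::fmin(std::fmax(p, eps), 1.f - eps);
    return std::log(p / (1.f - p));
}

// obj_t = obj^(1 - factor) * ioup^factor, with both inputs given as raw logits.
inline float fuse_objectness(float obj_logit, float ioup_logit, float iou_aware_factor)
{
    assert(iou_aware_factor >= 0.f && iou_aware_factor <= 1.f);
    const float obj = sigmoid(obj_logit);
    const float ioup = sigmoid(ioup_logit);
    return std::pow(obj, 1.f - iou_aware_factor) * std::pow(ioup, iou_aware_factor);
}
```

A decoder that applies sigmoid itself would be fed de_sigmoid(fuse_objectness(...)); a decoder that works on probabilities directly, like the ncnn layer later in this article, can use the fused value as is and skip the round trip.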

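The reason for doing it this way is that paddle_yolo_box() decodes the output of the original YOLOv3; by making full use of paddle_yolo_box(), there is no need to write separate decoding code. Hence the roundabout route.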

The takeaway is that ioup is merely combined with obj through the expression obj_t = (obj ** (1 - self.iou_aware_factor)) * (ioup ** self.iou_aware_factor) to produce a new obj; everything else is decoded exactly as in YOLOv3!
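To make "everything else is decoded as in YOLOv3" concrete, here is a minimal sketch of where each attribute of anchor k sits along the channel axis, with and without the ioup channels. The struct and function names are hypothetical; the layout itself follows the description above (all ioup channels first, then one YOLOv3-style group of x, y, w, h, obj and class scores per anchor).

```cpp
#include <cassert>

// Hypothetical description of the channel indices used by one anchor.
struct AnchorChannels
{
    int ioup; // -1 when the head has no iou_aware channels
    int x, y, w, h, obj;
    int cls0; // channel of class 0; class r is at cls0 + r
};

// c: total channels of the feature map, na: anchors per cell, k: anchor index.
AnchorChannels anchor_channels(int c, int na, int num_classes, int k)
{
    assert(k >= 0 && k < na);
    const int per_anchor = 4 + 1 + num_classes; // xywh + obj + classes
    const bool iou_aware = (c == na * (per_anchor + 1));
    assert(iou_aware || c == na * per_anchor);
    // With iou_aware, the first na channels are skipped before the YOLOv3 part.
    const int base = (iou_aware ? na : 0) + k * per_anchor;
    AnchorChannels a;
    a.ioup = iou_aware ? k : -1;
    a.x = base;
    a.y = base + 1;
    a.w = base + 2;
    a.h = base + 3;
    a.obj = base + 4;
    a.cls0 = base + 5;
    return a;
}
```

For an 80-class head with 3 anchors, c is 258 with iou_aware and 255 without; in the first case the last class channel of the last anchor is 257, the final channel of the tensor.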

So in ncnn, I implement PPYOLOv2 decoding like this:

// examples/test2_06_ppyolo_ncnn.cpp
...
class PPYOLODecodeMatrixNMS : public ncnn::Layer
{
public:
    PPYOLODecodeMatrixNMS()
    {
        // miemie2013: if num of input tensors > 1 or num of output tensors > 1, you must set one_blob_only = false
        // And ncnn will use forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) method
        // or forward_inplace(std::vector<Mat>& bottom_top_blobs, const Option& opt) method
        one_blob_only = false;
        support_inplace = false;
    }
    virtual int load_param(const ncnn::ParamDict& pd)
    {
        num_classes = pd.get(0, 80);
        anchors = pd.get(1, ncnn::Mat());
        strides = pd.get(2, ncnn::Mat());
        scale_x_y = pd.get(3, 1.f);
        iou_aware_factor = pd.get(4, 0.5f);
        score_threshold = pd.get(5, 0.1f);
        anchor_per_stride = pd.get(6, 3);
        post_threshold = pd.get(7, 0.1f);
        nms_top_k = pd.get(8, 500);
        keep_top_k = pd.get(9, 100);
        kernel = pd.get(10, 0);
        gaussian_sigma = pd.get(11, 2.f);
        return 0;
    }
    virtual int forward(const std::vector<ncnn::Mat>& bottom_blobs, std::vector<ncnn::Mat>& top_blobs, const ncnn::Option& opt) const
    {
        const ncnn::Mat& bottom_blob = bottom_blobs[0];
        const int tensor_num = bottom_blobs.size() - 1;
        const size_t elemsize = bottom_blob.elemsize;
        const ncnn::Mat& im_scale = bottom_blobs[tensor_num];
        const float scale_x = im_scale[0];
        const float scale_y = im_scale[1];
        int out_num = 0;
        for (size_t b = 0; b < tensor_num; b++)
        {
            const ncnn::Mat& tensor = bottom_blobs[b];
            const int w = tensor.w;
            const int h = tensor.h;
            out_num += anchor_per_stride * h * w;
        }
        ncnn::Mat bboxes;
        bboxes.create(4 * out_num, elemsize, opt.blob_allocator);
        if (bboxes.empty())
            return -100;
        ncnn::Mat scores;
        scores.create(num_classes * out_num, elemsize, opt.blob_allocator);
        if (scores.empty())
            return -100;
        float* bboxes_ptr = bboxes;
        float* scores_ptr = scores;
        // decode
        for (size_t b = 0; b < tensor_num; b++)
        {
            const ncnn::Mat& tensor = bottom_blobs[b];
            const int w = tensor.w;
            const int h = tensor.h;
            const int c = tensor.c;
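            // With iou_aware, the first anchor_per_stride channels hold ioup (one per anchor);
            // the YOLOv3-style groups (x, y, w, h, obj, classes) start after them.
            const bool use_iou_aware = (c == anchor_per_stride * (num_classes + 6));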
            const int channel_stride = use_iou_aware ? (c / anchor_per_stride) - 1 : (c / anchor_per_stride);
            const int cx_pos = use_iou_aware ? anchor_per_stride : 0;
            const int cy_pos = use_iou_aware ? anchor_per_stride + 1 : 1;
            const int w_pos = use_iou_aware ? anchor_per_stride + 2 : 2;
            const int h_pos = use_iou_aware ? anchor_per_stride + 3 : 3;
            const int obj_pos = use_iou_aware ? anchor_per_stride + 4 : 4;
            const int cls_pos = use_iou_aware ? anchor_per_stride + 5 : 5;
            float stride = strides[b];
            #pragma omp parallel for num_threads(opt.num_threads)
            for (int i = 0; i < h; i++)
            {
                for (int j = 0; j < w; j++)
                {
                    for (int k = 0; k < anchor_per_stride; k++)
                    {
                        float obj = tensor.channel(obj_pos + k * channel_stride).row(i)[j];
                        obj = static_cast<float>(1.f / (1.f + expf(-obj)));
                        if (use_iou_aware)
                        {
                            float ioup = tensor.channel(k).row(i)[j];
                            ioup = static_cast<float>(1.f / (1.f + expf(-ioup)));
                            obj = static_cast<float>(pow(obj, 1.f - iou_aware_factor) * pow(ioup, iou_aware_factor));
                        }
                        if (obj > score_threshold)
                        {
                            // Grid Sensitive
                            float cx = static_cast<float>(scale_x_y / (1.f + expf(-tensor.channel(cx_pos + k * channel_stride).row(i)[j])) + j - (scale_x_y - 1.f) * 0.5f);
                            float cy = static_cast<float>(scale_x_y / (1.f + expf(-tensor.channel(cy_pos + k * channel_stride).row(i)[j])) + i - (scale_x_y - 1.f) * 0.5f);
                            cx *= stride;
                            cy *= stride;
                            float dw = static_cast<float>(expf(tensor.channel(w_pos + k * channel_stride).row(i)[j]) * anchors[(b * anchor_per_stride + k) * 2]);
                            float dh = static_cast<float>(expf(tensor.channel(h_pos + k * channel_stride).row(i)[j]) * anchors[(b * anchor_per_stride + k) * 2 + 1]);
                            float x0 = cx - dw * 0.5f;
                            float y0 = cy - dh * 0.5f;
                            float x1 = cx + dw * 0.5f;
                            float y1 = cy + dh * 0.5f;
                            bboxes_ptr[((i * w + j) * anchor_per_stride + k) * 4] = x0 / scale_x;
                            bboxes_ptr[((i * w + j) * anchor_per_stride + k) * 4 + 1] = y0 / scale_y;
                            bboxes_ptr[((i * w + j) * anchor_per_stride + k) * 4 + 2] = x1 / scale_x;
                            bboxes_ptr[((i * w + j) * anchor_per_stride + k) * 4 + 3] = y1 / scale_y;
                            for (int r = 0; r < num_classes; r++)
                            {
                                float score = static_cast<float>(obj / (1.f + expf(-tensor.channel(cls_pos + k * channel_stride + r).row(i)[j])));
                                scores_ptr[((i * w + j) * anchor_per_stride + k) * num_classes + r] = score;
                            }
                        }else
                        {
                            bboxes_ptr[((i * w + j) * anchor_per_stride + k) * 4] = 0.f;
                            bboxes_ptr[((i * w + j) * anchor_per_stride + k) * 4 + 1] = 0.f;
                            bboxes_ptr[((i * w + j) * anchor_per_stride + k) * 4 + 2] = 1.f;
                            bboxes_ptr[((i * w + j) * anchor_per_stride + k) * 4 + 3] = 1.f;
                            for (int r = 0; r < num_classes; r++)
                            {
                                scores_ptr[((i * w + j) * anchor_per_stride + k) * num_classes + r] = -1.f;
                            }
                        }
                    }
                }
            }
            bboxes_ptr += h * w * anchor_per_stride * 4;
            scores_ptr += h * w * anchor_per_stride * num_classes;
        }

...

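Only obj needs special treatment; everything else is decoded as in YOLOv3. In addition, only bboxes with obj > score_threshold are decoded, while the remaining bboxes are filled with placeholder values (a dummy box and scores of -1), which speeds up post-processing. Grid Sensitive was proposed to handle gt centers that fall on grid lines during training; it allows the decoded x and y to go slightly outside the 0~1 range.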
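The effect of Grid Sensitive on the decoded centre can be checked with a few lines. This sketch mirrors the cx/cy expression in the listing above; the function name is hypothetical, and the value 1.05 used in the note below is only an illustrative scale_x_y, not a setting taken from this article.

```cpp
#include <cassert>
#include <cmath>

// Box centre in grid units for one axis.
// t: raw network output, cell: column j (for x) or row i (for y).
inline float grid_sensitive_center(float t, int cell, float scale_x_y)
{
    assert(scale_x_y >= 1.f);
    const float s = 1.f / (1.f + std::exp(-t)); // sigmoid, in (0, 1)
    // Stretch the sigmoid range by scale_x_y and re-centre it on 0.5.
    return scale_x_y * s - 0.5f * (scale_x_y - 1.f) + cell;
}
```

With scale_x_y = 1.05 the in-cell offset ranges over roughly (-0.025, 1.025) instead of (0, 1), so a centre lying exactly on a grid line is reachable with a finite network output; with scale_x_y = 1 the expression reduces to the plain YOLOv3 formula sigmoid(t) + cell. Multiplying the result by the stride gives the centre in input-image pixels.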


新智元
