A Long-Form Walkthrough of the YOLOv1-v5 Models (Part 2) - Alibaba Cloud Developer Community

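III, YOLOv3

The YOLOv3 paper is not especially well written. To understand it fully you still need to read the code, and if your C/C++ is weak, a PyTorch reimplementation is the recommended starting point. What follows is my condensed translation of the paper, my own reading of some of the harder points, and a walkthrough of some key code.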

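Abstract

We have updated YOLO once more, with a handful of small design changes and a better network architecture. At an input resolution of $320 \times 320$, YOLOv3 runs in 22 ms at 28.2 mAP, matching SSD in accuracy while being faster. Compared with other IOU thresholds, YOLOv3 does particularly well on the 0.5 IOU metric (that is, $AP_{50}$). On a Titan X, YOLOv3 reaches 57.9 $AP_{50}$ in 51 ms, whereas RetinaNet reaches only 57.5 $AP_{50}$ and needs 198 ms, about 3.8 times as long as YOLOv3.

As a rule of thumb, a detection model = feature extractor + detection head.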

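1, Introduction

This paper is really a technical report. First I will tell you what was updated (improved) in YOLOv3, then describe some things we tried that did not work, and finally sum up what this round of updates means.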

2, Improvements

Most of YOLOv3's deliberate improvements come from prior work; we also trained a new classifier network that is better than the others.

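2.1, Bounding box prediction

This part is almost the same as in YOLOv2, but it goes into more detail and uses somewhat different threshold values.

As in YOLOv2, we still use dimension clustering to pick anchor boxes as the priors for bounding box prediction. Each bounding box predicts 4 offset coordinates $(t_x, t_y, t_w, t_h)$. Let $(c_x, c_y)$ be the top-left corner of the grid cell, and let $p_w$ and $p_h$ be the width and height of the prior (anchor). The network outputs then relate to the actual box position as follows: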

Suppose a feature map at some level has size $13 \times 13$, so there are $13 \times 13$ grid cells. The grid cell in row $n$ and column $n$ (counting from 1) then has coordinates $(c_x, c_y) = (n-1, n-1)$.

$$
b_x = \sigma(t_x) + c_x \\
b_y = \sigma(t_y) + c_y \\
b_w = p_w e^{t_w} \\
b_h = p_h e^{t_h}
$$


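$b_x, b_y, b_w, b_h$ are the actual center coordinates, width, and height of the bounding box. During training we use a sum of squared error loss. From the equations above it follows directly that if the ground truth for a coordinate prediction is $\hat{t}_*$, the gradient is the ground truth minus the prediction, $\hat{t}_* - t_*$.

What is the benefit of the gradient taking the form $\hat{t}_* - t_*$?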

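Note that when computing the loss, the predicted $t_x, t_y$ must be wrapped in a sigmoid, otherwise the offsets would not lie in $(0, 1)$. Once the sigmoid is applied, the matching choice for backpropagation is a BCE loss, because its gradient with respect to the raw output then reduces to prediction minus target, $\sigma(t_x) - \hat{t}_x$. The predictions $(t_w, t_h)$ do not pass through a sigmoid, so their loss uses MSE.

$\hat{t}_x$ is the ground-truth value of the predicted coordinate offset.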
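The claim that sigmoid plus BCE yields a "prediction minus target" gradient can be checked numerically. The sketch below is plain Python rather than PyTorch so that it runs without dependencies; the values of `t_x` and `t_x_hat` are hypothetical, and the helper names are mine, not taken from any YOLOv3 codebase.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_sigmoid(z, y):
    """Binary cross-entropy between sigmoid(z) and a target y in [0, 1]."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def numeric_grad(f, z, eps=1e-6):
    """Central finite difference of f at z."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

t_x = 0.3      # hypothetical raw network output
t_x_hat = 0.8  # hypothetical ground-truth offset inside the cell

# Closed form: d(BCE)/d(t_x) = sigmoid(t_x) - t_x_hat
analytic = sigmoid(t_x) - t_x_hat
numeric = numeric_grad(lambda z: bce_with_sigmoid(z, t_x_hat), t_x)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places, which is the sense in which the sigmoid and the BCE loss "cancel" into a simple difference.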

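YOLOv3 uses logistic regression to predict an objectness score (confidence score) for each bounding box. If a prior overlaps the ground truth more (by IOU) than any other prior does, its score should be 1. Following the Faster R-CNN paper, if a prior is not the best one but its IOU with the ground truth still exceeds a threshold, the prediction is ignored; the threshold separating positive and negative samples is 0.5. The YOLOv3 detection system assigns only one bounding box prior to each ground truth object. If a prior (bounding box prior, that is, an anchor obtained by clustering) is not assigned to a ground truth object, it incurs no localization or classification loss, only a confidence loss (objectness only).

The code that encodes COCO dataset labels into $(t_x, t_y, t_w, t_h)$ form is as follows: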

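import math

import numpy as np
import torch

# Note: this is a method of the loss module in the reimplementation it was taken from;
# `bbox_iou` is a helper that must be defined elsewhere.
def get_target(self, target, anchors, in_w, in_h, ignore_threshold):
    """
    Encode labels as (tx, ty, tw, th) targets. The original author notes it may have problems.
    target: original COCO dataset labels, rows of (class, x, y, w, h) with x, y, w, h in [0, 1].
    anchors: anchor (w, h) pairs, expected in the same feature-map units as gw and gh below.
    in_w, in_h: feature map size.
    """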
    bs = target.size(0)
    mask = torch.zeros(bs, self.num_anchors, in_h, in_w, requires_grad=False)
    noobj_mask = torch.ones(bs, self.num_anchors, in_h, in_w, requires_grad=False)
    tx = torch.zeros(bs, self.num_anchors, in_h, in_w, requires_grad=False)
    ty = torch.zeros(bs, self.num_anchors, in_h, in_w, requires_grad=False)
    tw = torch.zeros(bs, self.num_anchors, in_h, in_w, requires_grad=False)
    th = torch.zeros(bs, self.num_anchors, in_h, in_w, requires_grad=False)
    tconf = torch.zeros(bs, self.num_anchors, in_h, in_w, requires_grad=False)
    tcls = torch.zeros(bs, self.num_anchors, in_h, in_w, self.num_classes, requires_grad=False)
    for b in range(bs):
        for t in range(target.shape[1]):
            if target[b, t].sum() == 0:
                continue
            # Convert to position relative to box
            gx = target[b, t, 1] * in_w
            gy = target[b, t, 2] * in_h
            gw = target[b, t, 3] * in_w
            gh = target[b, t, 4] * in_h
            # Get grid box indices
            gi = int(gx)
            gj = int(gy)
            # Get shape of gt box
            gt_box = torch.FloatTensor(np.array([0, 0, gw, gh])).unsqueeze(0)
            # Get shape of anchor box
            anchor_shapes = torch.FloatTensor(np.concatenate((np.zeros((self.num_anchors, 2)),
                                                                np.array(anchors)), 1))
            # Calculate iou between gt and anchor shapes
            anch_ious = bbox_iou(gt_box, anchor_shapes)
            # Where the overlap is larger than threshold set mask to zero (ignore)
            noobj_mask[b, anch_ious > ignore_threshold, gj, gi] = 0
            # Find the best matching anchor box
            best_n = np.argmax(anch_ious)
            # Masks
            mask[b, best_n, gj, gi] = 1
            # Coordinates
            tx[b, best_n, gj, gi] = gx - gi
            ty[b, best_n, gj, gi] = gy - gj
            # Width and height
            tw[b, best_n, gj, gi] = math.log(gw/anchors[best_n][0] + 1e-16)
            th[b, best_n, gj, gi] = math.log(gh/anchors[best_n][1] + 1e-16)
            # object
            tconf[b, best_n, gj, gi] = 1
            # One-hot encoding of label
            tcls[b, best_n, gj, gi, int(target[b, t, 0])] = 1
    return mask, noobj_mask, tx, ty, tw, th, tconf, tcls
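To make the encoding concrete, here is a minimal single-box sketch of the same steps: scale to grid units, pick the cell, pick the best anchor by shape-only IoU, then compute the offsets. It is plain Python rather than PyTorch; `shape_iou`, `encode_box`, and the sample box are hypothetical names and values, and the anchors are assumed to be the three largest COCO priors divided by a stride of 32.

```python
import math

def shape_iou(wh_a, wh_b):
    """IoU of two boxes sharing the same top-left corner, so only width and height matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def encode_box(box, anchors, grid_w, grid_h):
    """box: (x, y, w, h) normalized to [0, 1]; anchors: (w, h) pairs in grid units."""
    gx, gy = box[0] * grid_w, box[1] * grid_h
    gw, gh = box[2] * grid_w, box[3] * grid_h
    gi, gj = int(gx), int(gy)  # column and row of the responsible cell
    ious = [shape_iou((gw, gh), a) for a in anchors]
    best_n = max(range(len(anchors)), key=lambda n: ious[n])
    pw, ph = anchors[best_n]
    tx, ty = gx - gi, gy - gj  # targets for sigmoid(t_x), sigmoid(t_y)
    tw, th = math.log(gw / pw), math.log(gh / ph)
    return best_n, (gi, gj), (tx, ty, tw, th)

# Assumed anchors for the 13x13 branch: largest COCO priors divided by stride 32.
anchors_13 = [(116 / 32, 90 / 32), (156 / 32, 198 / 32), (373 / 32, 326 / 32)]
best_n, cell, t = encode_box((0.5, 0.5, 0.4, 0.5), anchors_13, 13, 13)
print(best_n, cell, t)
```

For this sample box the middle anchor wins, the responsible cell is (6, 6), and both center offsets are 0.5 because the box center sits in the middle of that cell.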

Another reimplementation handles the dataset labels with the following code:

import torch


def build_targets(p, targets, model):
    # Build targets for compute_loss(), input targets(image,class,x,y,w,h)
    na, nt = 3, targets.shape[0]  # number of anchors, targets #TODO
    tcls, tbox, indices, anch = [], [], [], []
    gain = torch.ones(7, device=targets.device)  # normalized to gridspace gain
    # Make a tensor that iterates 0-2 for 3 anchors and repeat that as many times as we have target boxes
    ai = torch.arange(na, device=targets.device).float().view(na, 1).repeat(1, nt)
    # Copy target boxes anchor size times and append an anchor index to each copy the anchor index is also expressed by the new first dimension
    targets = torch.cat((targets.repeat(na, 1, 1), ai[:, :, None]), 2)
    for i, yolo_layer in enumerate(model.yolo_layers):
        # Scale anchors by the yolo grid cell size so that an anchor with the size of the cell would result in 1
        anchors = yolo_layer.anchors / yolo_layer.stride
        # Write the number of yolo cells in this layer into the gain tensor
        # The gain tensor matches the columns of our targets (img id, class, x, y, w, h, anchor id)
        gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xywh gain
        # Scale targets by the number of yolo layer cells, they are now in the yolo cell coordinate system
        t = targets * gain
        # Check if we have targets
        if nt:
            # Calculate ratio between anchor and target box for both width and height
            r = t[:, :, 4:6] / anchors[:, None]
            # Select the ratios that have the highest divergence in any axis and check if the ratio is less than 4
            j = torch.max(r, 1. / r).max(2)[0] < 4  # compare #TODO
            # Only use targets that have the correct ratios for their anchors
            # That means we only keep ones that have a matching anchor and we lose the anchor dimension
            # The anchor id is still saved in the 7th value of each target
            t = t[j]
        else:
            t = targets[0]
        # Extract image id in batch and class id
        b, c = t[:, :2].long().T
        # We isolate the target cell associations.
        # x, y, w, h are already in the cell coordinate system meaning an x = 1.2 would be 1.2 times cellwidth
        gxy = t[:, 2:4]
        gwh = t[:, 4:6]  # grid wh
        # Cast to int to get an cell index e.g. 1.2 gets associated to cell 1
        gij = gxy.long()
        # Isolate x and y index dimensions
        gi, gj = gij.T  # grid xy indices
        # Convert anchor indexes to int
        a = t[:, 6].long()
        # Add target tensors for this yolo layer to the output lists
        # Add to index list and limit index range to prevent out of bounds
        indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))
        # Add to target box list and convert box coordinates from global grid coordinates to local offsets in the grid cell
        tbox.append(torch.cat((gxy - gij, gwh), 1))  # box
        # Add correct anchor for each target to the list
        anch.append(anchors[a])
        # Add class for each target to the list
        tcls.append(c)
    return tcls, tbox, indices, anch
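Unlike the IoU-based matching in the first listing, `build_targets` keeps a target for every anchor whose width and height ratios to the box stay below 4. The plain-Python sketch below isolates that rule for one box; `matches_anchor`, the sample box, and the anchors (assumed to be the three smallest COCO priors divided by a stride of 8) are illustrative, not part of the original code.

```python
def matches_anchor(gt_wh, anchor_wh, max_ratio=4.0):
    """True if neither side of the box is max_ratio times larger or smaller than the anchor."""
    worst = 0.0
    for g, a in zip(gt_wh, anchor_wh):
        r = g / a
        worst = max(worst, r, 1.0 / r)
    return worst < max_ratio

# Assumed anchors in grid units: smallest COCO priors divided by stride 8.
anchors = [(10 / 8, 13 / 8), (16 / 8, 30 / 8), (33 / 8, 23 / 8)]
gt_wh = (6.0, 3.0)  # hypothetical ground-truth width and height in grid units

matched = [i for i, a in enumerate(anchors) if matches_anchor(gt_wh, a)]
print(matched)
```

A single ground-truth box can therefore be matched to more than one anchor on the same layer, which is a design choice of that reimplementation rather than something stated in the paper.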

For more on reimplementing and understanding the model inference code, see this GitHub project.

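2.2, Class prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax, because we found it is unnecessary for good performance; instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.

The Open Images Dataset contains many overlapping labels. Using a softmax imposes the assumption that each box has exactly one class, which is often not the case. A multilabel approach models the data better.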
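The difference between the two formulations shows up on a box that carries two overlapping labels. The following plain-Python sketch uses a hypothetical three-class label set and made-up logits; it is an illustration of the idea, not code from any YOLOv3 implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_bce(logits, targets):
    """Sum of independent binary cross-entropy terms, one per class."""
    loss = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss

classes = ["person", "woman", "car"]  # hypothetical label set with overlapping labels
logits = [4.0, 3.5, -5.0]             # hypothetical raw class scores for one box
targets = [1.0, 1.0, 0.0]             # the box is both a person and a woman

soft = softmax(logits)                # probabilities forced to sum to 1
indep = [sigmoid(z) for z in logits]  # each class scored on its own
print(soft, indep, multilabel_bce(logits, targets))
```

With softmax the two correct labels compete for the same probability mass, so "woman" ends up below 0.5; with independent sigmoids both can be close to 1 at the same time.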

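2.3, Predictions across scales

YOLOv3 predicts boxes at 3 different scales. In short, it introduces FPN-like multi-scale feature map fusion, which strengthens small-object detection. Unlike the original FPN, YOLOv3's neck outputs only 3 branches, one per scale, and the upsampled feature map from the higher layer is fused with the lower-layer feature map by concatenation rather than by element-wise addition.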

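First, the detection system extracts features at different scales using a concept similar to feature pyramid networks [8] (FPN). We add several convolutional layers on top of the base feature extractor. The last of these predicts a 3-d tensor encoding the bounding box, objectness, and class predictions. In our COCO experiments we predict 3 boxes at each scale, so the output tensor is $N \times N \times [3 * (4 + 1 + 80)]$, covering 4 bounding box offsets, 1 objectness prediction (foreground vs. background), and 80 class predictions.

The objectness prediction is essentially a foreground/background prediction, somewhat like the confidence c in YOLOv2.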

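Next we take the feature map from 2 layers previous and upsample it by 2x, then merge it with a feature map from earlier (shallower) in the network by concatenation, joining the high- and low-resolution feature maps. This gives us both meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process the fused feature map and output a tensor twice the size of the previous, higher-level one.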
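The upsample-then-concatenate step can be shown on toy data. The sketch below stores a feature map as `[C][H][W]` nested lists and assumes nearest-neighbor upsampling; it is plain Python for illustration only, and the helper names and toy values are mine.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a feature map stored as [C][H][W] nested lists."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row vertically
        out.append(rows)
    return out

def concat_channels(a, b):
    """Concatenate two [C][H][W] maps along the channel axis; H and W must match."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b

def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))

deep = [[[1, 2], [3, 4]]]                                   # 1 channel, 2x2: coarse features
shallow = [[[0] * 4 for _ in range(4)] for _ in range(2)]   # 2 channels, 4x4: fine features

fused = concat_channels(upsample2x(deep), shallow)
print(shape(fused))
```

Concatenation adds the channel counts (1 + 2 = 3 here) and keeps both sets of features separate, whereas element-wise addition would require equal channel counts and would mix them.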

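We apply the same design once more to predict boxes for the final scale. The predictions for the third scale thus benefit from all the prior computation (the multi-scale feature fusion) as well as from fine-grained features from early in the network.

Clearly, feature maps from lower layers carry less semantic information but locate objects precisely, while feature maps from higher layers carry richer semantic information but locate objects only coarsely.

We still use k-means clustering to determine our bounding box priors (box priors, that is, the chosen anchors), but we choose 9 clusters and 3 scales (large, medium, and small anchor scales) and divide the clusters evenly across the scales. On the COCO dataset the 9 clusters are: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).

From the description above, the YOLOv3 detection head has 3 branches. For an input image of shape (3, 416, 416), the output tensor of each head branch has the following size: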
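Dividing the 9 clusters evenly across 3 scales can be sketched as sorting by area and taking groups of three. The assignment of the smallest group to the finest grid is an assumption here, chosen to agree with the statement in this article that the 32x-downsampled branch predicts large objects; the strides assume a 416×416 input.

```python
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
strides = [8, 16, 32]  # assumed: 52x52, 26x26, 13x13 branches for a 416x416 input

# Sort by area, then hand three anchors to each scale, smallest anchors to the finest grid.
by_area = sorted(anchors, key=lambda wh: wh[0] * wh[1])
per_scale = {}
for k, stride in enumerate(strides):
    group = by_area[3 * k: 3 * k + 3]
    # Express each anchor in grid-cell units of its own branch.
    per_scale[stride] = [(w / stride, h / stride) for w, h in group]

for stride, group in per_scale.items():
    print(stride, group)
```

Dividing by the stride puts each anchor in the same units as the feature map it is used on, which is what the label-encoding code needs.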

[13, 13, 3*(4+1+80)]
[26, 26, 3*(4+1+80)]
[52, 52, 3*(4+1+80)]

The 3 branches correspond to 32x, 16x, and 8x downsampling, and predict large, medium, and small objects respectively. Each point of the 32x-downsampled feature map has a larger receptive field, so that branch is used for large objects.

Every grid cell in each scale branch predicts 3 boxes, and each box predicts a 5-tuple plus an 80-class one-hot vector, so the total size is 3*(4+1+80).

From the above, YOLOv3 predicts $(13 \times 13 + 26 \times 26 + 52 \times 52) \times 3 = 10647$ bounding boxes in total, far more than the $13 \times 13 \times 5 = 845$ of YOLOv2.
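The branch shapes and the total box count follow directly from the input size and the strides. A small plain-Python check, where `head_shapes` is a hypothetical helper and the defaults assume a 416×416 input with 80 COCO classes:

```python
def head_shapes(input_size=416, strides=(32, 16, 8), boxes_per_cell=3, num_classes=80):
    """Output shape (H, W, channels) of each detection branch."""
    shapes = []
    for s in strides:
        n = input_size // s
        shapes.append((n, n, boxes_per_cell * (4 + 1 + num_classes)))
    return shapes

shapes = head_shapes()
total_boxes = sum(h * w * 3 for h, w, _ in shapes)  # 3 boxes per grid cell
yolov2_boxes = 13 * 13 * 5

print(shapes)
print(total_boxes, yolov2_boxes)
```

Changing `input_size` to another multiple of 32 (for example 608) shows how the grid sizes and the number of predicted boxes scale with resolution.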

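2.4, New feature extractor

We use a new network for feature extraction. It is a hybrid of Darknet-19 and the newer residual network approach: successive $3 \times 3$ and $1 \times 1$ convolutional layers with some shortcut connections added, and it is considerably larger overall. Because it has $53 = (1+2+8+8+4) \times 2 + 4 + 2 + 1$ convolutional layers, we call it Darknet-53.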
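One way to account for the number 53 is to count layers from the stage configuration `[1, 2, 8, 8, 4]`. The grouping below (stem, downsampling convolutions, residual convolutions, plus the final fully connected classification layer) is my own bookkeeping and differs from the grouping in the formula above, although the total is the same.

```python
blocks_per_stage = [1, 2, 8, 8, 4]  # residual blocks in each of the 5 stages

stem_convs = 1                               # first 3x3 convolution
downsample_convs = len(blocks_per_stage)     # one stride-2 convolution per stage
residual_convs = 2 * sum(blocks_per_stage)   # each block has a 1x1 and a 3x3 convolution

backbone_convs = stem_convs + downsample_convs + residual_convs
total_with_classifier = backbone_convs + 1   # assumed: the classifier layer is the 53rd

print(backbone_convs, total_with_classifier)
```

Under this accounting the detection backbone itself contains 52 convolutions, since the classification layer is dropped when the network is used for detection.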


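In short, Darknet-53 is fully convolutional: the pooling layers that did the downsampling in YOLOv2 are all replaced by convolution (3x3, stride=2) layers. It also introduces residual structure in place of a plain VGG-style stacked network, which makes it possible to train a deeper network, here reaching 53 convolutional layers. (A deeper network generally extracts better features.)

The PyTorch code for the Darknet-53 network is shown below.

Code source: here.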

import torch
import torch.nn as nn
import math
from collections import OrderedDict
__all__ = ['darknet21', 'darknet53']
class BasicBlock(nn.Module):
    """Basic residual block for Darknet53; the two convolutions are 1x1 and 3x3.
    """
    def __init__(self, inplanes, planes):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes[0], kernel_size=1,
                               stride=1, padding=0, bias=False)
        self.bn1 = nn.BatchNorm2d(planes[0])
        self.relu1 = nn.LeakyReLU(0.1)
        self.conv2 = nn.Conv2d(planes[0], planes[1], kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes[1])
        self.relu2 = nn.LeakyReLU(0.1)
    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)
        out += residual
        return out
class DarkNet(nn.Module):
    def __init__(self, layers):
        super(DarkNet, self).__init__()
        self.inplanes = 32
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(self.inplanes)
        self.relu1 = nn.LeakyReLU(0.1)
        self.layer1 = self._make_layer([32, 64], layers[0])
        self.layer2 = self._make_layer([64, 128], layers[1])
        self.layer3 = self._make_layer([128, 256], layers[2])
        self.layer4 = self._make_layer([256, 512], layers[3])
        self.layer5 = self._make_layer([512, 1024], layers[4])
        self.layers_out_filters = [64, 128, 256, 512, 1024]
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
    def _make_layer(self, planes, blocks):
        layers = []
        # Each stage starts with a downsampling conv, followed by the basic residual blocks for Darknet53
        layers.append(("ds_conv", nn.Conv2d(self.inplanes, planes[1], kernel_size=3,
                                stride=2, padding=1, bias=False)))
        layers.append(("ds_bn", nn.BatchNorm2d(planes[1])))
        layers.append(("ds_relu", nn.LeakyReLU(0.1)))
        #  blocks
        self.inplanes = planes[1]
        for i in range(0, blocks):
            layers.append(("residual_{}".format(i), BasicBlock(self.inplanes, planes)))
        return nn.Sequential(OrderedDict(layers))
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.layer1(x)
        x = self.layer2(x)
        out3 = self.layer3(x)
        out4 = self.layer4(out3)
        out5 = self.layer5(out4)
        return out3, out4, out5
def darknet21(pretrained, **kwargs):
    """Constructs a darknet-21 model.
    """
    model = DarkNet([1, 1, 2, 2, 1])
    if pretrained:
        if isinstance(pretrained, str):
            model.load_state_dict(torch.load(pretrained))
        else:
            raise Exception("darknet request a pretrained path. got [{}]".format(pretrained))
    return model
def darknet53(pretrained, **kwargs):
    """Constructs a darknet-53 model.
    """
    model = DarkNet([1, 2, 8, 8, 4])
    if pretrained:
        if isinstance(pretrained, str):
            model.load_state_dict(torch.load(pretrained))
        else:
            raise Exception("darknet request a pretrained path. got [{}]".format(pretrained))
    return model

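The 3 prediction branches, which predict at 3 scales (large, medium, small), are also fully convolutional.

With Darknet-53 as the backbone, YOLOv3's detection performance far exceeds that with Darknet-19, and it is also more efficient than ResNet-101 and ResNet-152. The comparison results are as follows: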

[Table: backbone comparison; image not available in the source]

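In the comparison, every network is trained and tested with identical settings. The FPS figures are measured on a Titan X with $256 \times 256$ input images. The table shows that Darknet-53 has fewer FLOPs and higher speed than state-of-the-art classifiers. It is more accurate than ResNet-101 and 1.5 times as fast; it is about as accurate as ResNet-152 and more than 2 times as fast.

Darknet-53 also achieves the highest measured floating point operations per second, which means the network structure makes better use of the GPU, making it more efficient to evaluate and thus faster. This is mostly because ResNets have too many layers and are not very efficient.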

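2.5, Training

As in YOLOv2, we still train on full images, with no hard negative mining or any of that stuff. We still use multi-scale training, plenty of data augmentation, batch normalization, and all the other standard operations. We use the Darknet neural network framework for training and testing [12].

The loss function is computed as follows.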

[Formula: YOLOv3 loss function; image not available in the source]

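YOLOv3 uses multilabel classification, replacing the softmax with multiple independent logistic classifiers that estimate how likely the input is to belong to each particular label. When computing the classification loss in training, YOLOv3 uses binary cross-entropy loss for each label.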

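How positive and negative samples are determined:

Positive samples: the box with the largest IOU with the GT.
Negative samples: boxes whose IOU with the GT is < 0.5.
Ignored samples: boxes whose IOU with the GT is > 0.5 but is not the largest.
The loss is computed using $t_x$ and $t_y$ (not $b_x$ and $b_y$).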

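Note: each GT object is associated with only one bounding box prior. A prior that is not assigned to any GT object incurs no classification or localization loss, only an objectness (confidence) loss.

The YOLOv3 network structure is shown below (the input image size here is 608*608; source: here).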
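The three-way split into positive, negative, and ignored samples can be written down for a single ground-truth box as follows. This is a plain-Python sketch with hypothetical IOU values; `assign_samples` is not a function from any of the listings in this article.

```python
def assign_samples(ious, ignore_threshold=0.5):
    """ious: IOU of each prior with one ground-truth box. Returns a label per prior."""
    best = max(range(len(ious)), key=lambda n: ious[n])
    labels = []
    for n, iou in enumerate(ious):
        if n == best:
            labels.append("positive")   # contributes box, class, and objectness loss
        elif iou > ignore_threshold:
            labels.append("ignored")    # contributes no loss at all
        else:
            labels.append("negative")   # contributes objectness loss only
    return labels

labels = assign_samples([0.30, 0.89, 0.62])  # hypothetical IOUs for 3 priors
print(labels)
```

The third prior overlaps the object well but is not the best match, so it is neither rewarded nor penalized.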

[Figure: YOLOv3 network structure; image not available in the source]

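2.6, Inference

Overall, the output feature map is still divided into an S*S grid (S here equals the feature map size). Grid cells are filtered by a confidence threshold: only cells above the threshold are considered to contain an object, that is, they output the object's confidence, bbox coordinates, and class information, and duplicate boxes are then filtered out by NMS.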
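The filtering described above can be sketched as confidence thresholding followed by greedy NMS. The version below is plain Python and class-agnostic for brevity; the threshold values and sample detections are hypothetical, and practical implementations usually run NMS per class.

```python
def iou_xyxy(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_and_nms(dets, conf_thres=0.5, nms_thres=0.45):
    """dets: list of (x1, y1, x2, y2, confidence). Returns the kept detections."""
    # Keep confident detections, highest confidence first.
    cands = sorted((d for d in dets if d[4] > conf_thres), key=lambda d: d[4], reverse=True)
    kept = []
    for d in cands:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou_xyxy(d, k) <= nms_thres for k in kept):
            kept.append(d)
    return kept

dets = [
    (100, 100, 200, 200, 0.90),
    (105, 105, 205, 205, 0.80),  # heavy overlap with the first box
    (300, 300, 400, 400, 0.70),
    (10, 10, 50, 50, 0.20),      # below the confidence threshold
]
kept = filter_and_nms(dets)
print(kept)
```

Here the second box is suppressed by the first and the fourth is dropped by the confidence threshold, leaving two detections.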

Note that the $xywh$ values of a bbox produced by the model are in feature-map scale, so they still need to be multiplied by the downsampling factor of the feature map.

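# Concatenate the bbox predictions, box confidences, and box class scores into one new tensor.
# Multiplying by _scale maps the predicted boxes to the original image size; _scale is the feature map's downsampling factor.
# For the large-object branch, pred_boxes.view(bs, -1, 4) has shape [1, 507, 4] and output has shape [1, 507, 85].
# bs is the batch size, i.e. how many images are processed per inference pass.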
output = torch.cat((pred_boxes.view(bs, -1, 4) * _scale,
                    conf.view(bs, -1, 1), pred_cls.view(bs, -1, self.num_classes)), -1)
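For a single box, decoding the raw outputs and scaling them back to input-image pixels might look like the sketch below. It applies the box equations from section 2.1 and then multiplies by the stride; `decode_box` and the sample values are hypothetical, and the prior is assumed to be the (116, 90) COCO anchor expressed in 13×13 grid units.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(t, cell, prior_wh, stride):
    """t: raw (tx, ty, tw, th); cell: (cx, cy); prior_wh: anchor size in grid units."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior_wh
    bx = sigmoid(tx) + cx   # center x in grid units
    by = sigmoid(ty) + cy   # center y in grid units
    bw = pw * math.exp(tw)  # width in grid units
    bh = ph * math.exp(th)  # height in grid units
    # Multiply by the downsampling factor to go from grid units to input-image pixels.
    return tuple(v * stride for v in (bx, by, bw, bh))

box = decode_box((0.0, 0.0, 0.0, 0.0), (6, 6), (116 / 32, 90 / 32), 32)
print(box)
```

With all raw outputs at zero the box sits at the center of cell (6, 6) and has exactly the size of its prior, which makes the example easy to verify by hand.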
