DL YoloV3: Translation and Commentary on the YoloV3 Paper "YOLOv3: An Incremental Improvement" (Part 2)

2021-10-29

2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 ∗ (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
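The depth of the output tensor follows directly from the formula above. The short sketch below just evaluates it. The 416×416 input size and the strides of 32, 16 and 8 are assumptions made for illustration (this section does not state them), and `prediction_shape` is a hypothetical helper name.

```python
def prediction_shape(grid_size, boxes_per_scale=3, num_classes=80):
    """Return the (N, N, depth) shape of the prediction tensor at one scale."""
    # Every box carries 4 offsets, 1 objectness score and one score per class.
    depth = boxes_per_scale * (4 + 1 + num_classes)
    return (grid_size, grid_size, depth)


# Assumed for illustration only: input resolution and downsampling strides.
INPUT_SIZE = 416
STRIDES = (32, 16, 8)

shapes = [prediction_shape(INPUT_SIZE // stride) for stride in STRIDES]
for stride, shape in zip(STRIDES, shapes):
    print(f"stride {stride:>2}: {shape}")
```

With the COCO settings from the text (3 boxes, 80 classes) the depth is 3 × 85 = 255 at every scale; only N changes.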

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size. We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), (373 × 326).
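The upsample-and-concatenate step can be sketched on toy data. This is a minimal pure-Python illustration, not the Darknet implementation: nearest-neighbour upsampling is an assumption (the text only says "upsample it by 2×"), the feature maps are tiny nested lists in height × width × channel order, and all function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an H x W x C nested list (assumed method)."""
    out = []
    for row in fmap:
        wide = []
        for cell in row:
            wide.extend([list(cell), list(cell)])   # repeat each column twice
        out.append(wide)
        out.append([list(cell) for cell in wide])   # repeat each row twice
    return out


def concat_channels(a, b):
    """Concatenate two feature maps of equal spatial size along the channel axis."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("spatial sizes must match before concatenation")
    return [[ca + cb for ca, cb in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def shape(fmap):
    """Return (height, width, channels) of a nested-list feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


# Toy sizes chosen only for illustration.
deep = [[[1.0], [2.0]], [[3.0], [4.0]]]                       # 2 x 2 x 1, coarse map
early = [[[0.0, 0.0] for _ in range(4)] for _ in range(4)]    # 4 x 4 x 2, finer map

upsampled = upsample2x(deep)
merged = concat_channels(upsampled, early)
print(shape(upsampled), shape(merged))
```

The merged map keeps the finer 4 × 4 grid and stacks the channels of both inputs, which is why the next prediction tensor has twice the spatial size.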

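Dividing the nine priors evenly across three scales can be sketched as follows. The prior sizes are the ones listed in the text. Sorting by area and giving the smallest priors to the finest grid is an assumption that matches common practice but is not stated in this section, and `split_priors` is a hypothetical helper.

```python
# Width x height of the 9 COCO priors quoted in the text.
PRIORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
          (59, 119), (116, 90), (156, 198), (373, 326)]


def split_priors(priors, num_scales=3):
    """Sort priors by area and cut them into equally sized groups, one per scale."""
    ordered = sorted(priors, key=lambda wh: wh[0] * wh[1])
    per_scale, remainder = divmod(len(ordered), num_scales)
    if remainder:
        raise ValueError("priors cannot be divided evenly across scales")
    return [ordered[i * per_scale:(i + 1) * per_scale] for i in range(num_scales)]


# Assumption: group 0 (smallest boxes) goes to the finest grid, group 2 to the coarsest.
groups = split_priors(PRIORS)
for index, group in enumerate(groups):
    print(f"group {index}: {group}")
```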

2.4. Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!
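To make the layer count concrete, the sketch below lists the convolutions of the backbone and counts them. The stage layout (1, 2, 8, 8 and 4 residual blocks, each stage preceded by a stride-2 convolution) comes from Table 1 of the paper, which is not part of this section, so treat it as an assumption here. The tuple format and function name are hypothetical.

```python
# Residual blocks per stage, as given in Table 1 of the paper (not shown in this section).
RESIDUAL_BLOCKS_PER_STAGE = (1, 2, 8, 8, 4)


def darknet53_conv_layers(blocks_per_stage=RESIDUAL_BLOCKS_PER_STAGE):
    """List the convolutional layers as (kernel_size, stride, has_shortcut) tuples."""
    layers = [(3, 1, False)]              # initial 3x3 convolution
    for num_blocks in blocks_per_stage:
        layers.append((3, 2, False))      # 3x3 stride-2 convolution, halves the resolution
        for _ in range(num_blocks):
            layers.append((1, 1, False))  # 1x1 convolution
            layers.append((3, 1, True))   # 3x3 convolution; the block input is added to its output
    return layers


conv_layers = darknet53_conv_layers()
num_conv = len(conv_layers)
num_shortcuts = sum(1 for _, _, has_shortcut in conv_layers if has_shortcut)
print(num_conv, num_shortcuts, num_conv + 1)
```

Under this layout there are 52 convolutions and 23 shortcut connections; the figure of 53 is reached when the final connected layer used for classification is counted as well.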

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.

Each network is trained with identical settings and tested at 256×256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster. Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.
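The speed claims can be checked with simple arithmetic: operations per image multiplied by images per second gives the achieved throughput. The per-network figures below are, to the best of our knowledge, the ones reported in Table 2 of the paper. The table itself is missing from this page, so treat them as assumptions and verify them against the paper before relying on them.

```python
# (billions of operations per image, frames per second).
# Assumed values: believed to match Table 2 of the paper, which this page omits.
BACKBONES = {
    "Darknet-19": (7.29, 171),
    "ResNet-101": (19.7, 53),
    "ResNet-152": (29.4, 37),
    "Darknet-53": (18.7, 78),
}


def throughput(name):
    """Billions of floating point operations per second = ops per image x images per second."""
    bn_ops, fps = BACKBONES[name]
    return bn_ops * fps


def speedup(fast, slow):
    """Ratio of frames per second between two backbones."""
    return BACKBONES[fast][1] / BACKBONES[slow][1]


for name in BACKBONES:
    print(f"{name}: {throughput(name):.0f} BFLOP/s")
print(f"vs ResNet-101: {speedup('Darknet-53', 'ResNet-101'):.2f}x")
print(f"vs ResNet-152: {speedup('Darknet-53', 'ResNet-152'):.2f}x")
```

With these inputs the ratios come out at roughly 1.5× and 2×, in line with the text, and Darknet-53 has the highest product of operations and frame rate.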

2.5. Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].
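Multi-scale training means the input resolution changes during training. The sketch below shows one way such a schedule could look. The specific choices (sizes that are multiples of 32 between 320 and 608, a new size every 10 batches) follow the description in the YOLOv2 paper and are assumptions here, since this section gives no details. `multiscale_schedule` is a hypothetical helper, not part of Darknet.

```python
import random


def multiscale_schedule(num_batches, min_size=320, max_size=608, step=32,
                        every=10, seed=0):
    """Return the input resolution to use for each batch.

    All defaults are assumptions borrowed from the YOLOv2 paper's description.
    """
    rng = random.Random(seed)                     # seeded for reproducibility
    sizes = list(range(min_size, max_size + 1, step))
    schedule = []
    current = None
    for batch in range(num_batches):
        if batch % every == 0:                    # pick a new resolution periodically
            current = rng.choice(sizes)
        schedule.append(current)
    return schedule


schedule = multiscale_schedule(30)
print(schedule[0], schedule[10], schedule[20])
```

Because the network is fully convolutional, the same weights can be applied at every resolution in the schedule; only the grid size N of the predictions changes.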
