DL: YoloV3 - Translation and Interpretation of the Paper "YOLOv3: An Incremental Improvement" (Part 1)


2021-10-29


Paper: https://arxiv.org/pdf/1804.02767.pdf

YOLOv3 Paper: Translation and Interpretation

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

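In plain terms: this is an update to YOLO. A set of small design changes and a newly trained, slightly larger network make it more accurate while keeping it fast. At 320 × 320 input, YOLOv3 takes 22 ms per image and reaches 28.2 mAP, matching SSD's accuracy at three times the speed. On the older AP50 metric (mAP at 0.5 IOU), it reaches 57.9 AP50 in 51 ms on a Titan X, versus 57.5 AP50 in 198 ms for RetinaNet: similar accuracy, 3.8× faster. The code is at https://pjreddie.com/yolo/.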

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little. Actually, that’s what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don’t have a source. So get ready for a TECH REPORT! The great thing about tech reports is that they don’t need intros, y’all know why we’re here. So the end of this introduction will signpost for the rest of the paper. First we’ll tell you what the deal is with YOLOv3. Then we’ll tell you how we do. We’ll also tell you about some things we tried that didn’t work. Finally we’ll contemplate what this all means.

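In plain terms: the author admits to coasting for a year ("phone it in" means doing the bare minimum, not making phone calls), with little research, a lot of time on Twitter, and some playing around with GANs. Momentum left over from the previous year's work [12] [1] produced a handful of small improvements to YOLO, none of them especially interesting. The tech report exists because another paper facing a camera-ready deadline [4] (the deadline for the final, print-ready version) needs to cite those YOLO updates, and there was no source to cite. The rest of the paper covers what YOLOv3 is, how well it performs, what was tried and did not work, and what it all means.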

2. The Deal

So here’s the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that’s better than the other ones. We’ll just take you through the whole system from scratch so you can understand it all.

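In plain terms: YOLOv3 mostly borrows good ideas from other work and adds a newly trained classifier network that the authors report is better than the alternatives. The paper then walks through the whole system from scratch.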

Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.

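Figure 1, in plain terms: the figure is adapted from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detectors of comparable accuracy. The timings come from either an M40 or a Titan X, which are essentially the same GPU.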

2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, $t_x$, $t_y$, $t_w$, $t_h$. If the cell is offset from the top left corner of the image by $(c_x, c_y)$ and the bounding box prior has width and height $p_w$, $p_h$, then the predictions correspond to:

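In plain terms: as in YOLO9000, bounding boxes are predicted with dimension clusters serving as anchor boxes [15]. For each box the network outputs four values, $t_x$, $t_y$, $t_w$, $t_h$. Given a grid cell offset from the image's top-left corner by $(c_x, c_y)$ and a prior of width $p_w$ and height $p_h$, these four values are mapped to the final box.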
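For reference, the mapping as given in Section 2.1 of the YOLOv3 paper, where $\sigma$ is the logistic sigmoid and $(b_x, b_y, b_w, b_h)$ are the center coordinates, width, and height of the predicted box:

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w e^{t_w} \\
b_h &= p_h e^{t_h}
\end{aligned}
```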

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$, our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* - t_*$. This ground truth value can be easily computed by inverting the equations above.

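In plain terms: training uses a sum-of-squared-errors loss on the coordinates. If $\hat{t}_*$ is the ground-truth value for a coordinate prediction $t_*$, the gradient is $\hat{t}_* - t_*$. The ground-truth value is obtained from the ground-truth box by inverting the box equations.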
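A minimal Python sketch of the box mapping and its inverse may make the "inverting the equations" step concrete. It assumes the standard YOLOv3 mapping (sigmoid offsets for the center, exponential scaling of the prior for width and height) with all quantities in grid-cell units. The function names `decode_box`, `encode_box`, and `coord_gradient`, the `eps` clamp, and the example numbers are illustrative and do not come from the paper or its code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, prior):
    """Map raw outputs (tx, ty, tw, th) to a box (bx, by, bw, bh)."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))

def encode_box(box, cell, prior, eps=1e-9):
    """Invert decode_box: target values (t-hat) for a ground-truth box."""
    bx, by, bw, bh = box
    cx, cy = cell
    pw, ph = prior
    # The center offset inside the cell must lie in (0, 1) for a finite logit;
    # the eps clamp is an assumption of this sketch.
    ox = min(max(bx - cx, eps), 1.0 - eps)
    oy = min(max(by - cy, eps), 1.0 - eps)
    logit = lambda p: math.log(p / (1.0 - p))
    return (logit(ox), logit(oy), math.log(bw / pw), math.log(bh / ph))

def coord_gradient(t_hat, t):
    """Ground-truth value minus prediction, per coordinate."""
    return tuple(a - b for a, b in zip(t_hat, t))

# Hypothetical example: a ground-truth box centered in cell (3, 5).
gt_box = (3.4, 5.7, 2.0, 1.5)
cell, prior = (3, 5), (1.6, 1.2)
t_hat = encode_box(gt_box, cell, prior)
round_trip = decode_box(t_hat, cell, prior)
grad = coord_gradient(t_hat, (0.0, 0.0, 0.0, 0.0))
```

Decoding the encoded targets should reproduce the ground-truth box up to floating-point error, which is a quick way to check that the two functions are inverses of each other.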

Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].

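Figure 2, in plain terms: bounding boxes with dimension priors and location prediction. Width and height are predicted as offsets from the cluster centroids; the center coordinates are predicted relative to the location where the filter is applied, using a sigmoid function. The figure is reused from the authors' own earlier paper [15].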

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.

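In plain terms: YOLOv3 predicts an objectness score for each box with logistic regression. The target is 1 for the prior that overlaps a ground-truth object more than any other prior does. A prior that is not the best match but still overlaps a ground-truth object by more than a threshold (0.5 here) is ignored, following [17]. Unlike [17], only one prior is assigned to each ground-truth object. A prior that is not assigned to any ground-truth object contributes no coordinate or class loss, only objectness loss.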
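The assignment rule described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' implementation: each prior is treated as an already-placed box in (x1, y1, x2, y2) form, and the names `iou` and `objectness_targets`, along with the use of `None` to mark ignored priors, are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def objectness_targets(priors, gt_boxes, ignore_thresh=0.5):
    """One label per prior: 1 (positive), 0 (negative), None (ignored)."""
    labels = [0] * len(priors)
    # Priors overlapping any ground truth above the threshold are ignored...
    for i, p in enumerate(priors):
        if any(iou(p, g) > ignore_thresh for g in gt_boxes):
            labels[i] = None
    # ...except the single best-matching prior for each ground truth,
    # which becomes the positive example.
    for g in gt_boxes:
        best = max(range(len(priors)), key=lambda i: iou(priors[i], g))
        labels[best] = 1
    return labels

# Hypothetical example with three priors and one ground-truth box.
priors = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
gt_boxes = [(0, 0, 10, 10)]
labels = objectness_targets(priors, gt_boxes)
```

In this example the first prior is the best match (label 1), the second overlaps with IOU 81/119 and is ignored, and the third does not overlap at all (label 0). Only priors labelled 1 would also receive coordinate and class loss.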

2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions. This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.

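In plain terms: each box predicts its classes with multilabel classification. Softmax is dropped because it turned out to be unnecessary for good performance; independent logistic classifiers are used instead, trained with binary cross-entropy loss. This helps on more complex datasets such as the Open Images Dataset [7], where labels often overlap (for example, Woman and Person). Softmax assumes each box has exactly one class, which is often not the case; a multilabel formulation models the data better.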
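To illustrate why independent logistic classifiers suit overlapping labels, here is a small Python sketch comparing them with softmax on the same scores. The three class names, the logit values, and the helper names `softmax` and `bce_loss` are made up for illustration; the loss is plain binary cross-entropy summed over classes, which is an assumption about the reduction rather than something the paper specifies.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bce_loss(logits, targets):
    """Binary cross-entropy summed over independent per-class outputs."""
    loss = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss

# Hypothetical scores for the classes (Woman, Person, Car).
logits = [4.0, 4.0, -4.0]
targets = [1, 1, 0]  # both Woman and Person apply to this box

independent = [sigmoid(z) for z in logits]  # each class scored on its own
coupled = softmax(logits)                   # classes compete for one unit of mass
multilabel_loss = bce_loss(logits, targets)
```

With independent sigmoids both Woman and Person can score close to 1 at once, whereas softmax forces the two to split the probability mass roughly in half, so neither label can be predicted with confidence.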
