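1. Square Inference: theory overview

To explain what Rectangular inference is, we first need to explain Square Inference.

YOLOv3 downsamples its input by a factor of 32, so the height and width fed to the network must both be multiples of 32; the most common resolution is 416. Square Inference means the input is a square. The procedure is: compute the ratio that scales the longer side to 416, resize both sides of the image by that ratio so that the longer side reaches 416, then pad the shorter side until it also reaches 416.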
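
The square procedure above can be sketched as plain arithmetic. This is only an illustration: the function name `square_letterbox_geometry` is made up for this example, and the 375x500 image size is borrowed from the comments in the code later in this article.

```python
def square_letterbox_geometry(h, w, size=416):
    """Return (scale, (new_h, new_w), (pad_h, pad_w)) for square inference."""
    # Scale factor that brings the longer side exactly to `size`
    scale = size / max(h, w)
    # Both sides are resized with the same factor, so the aspect ratio is kept
    new_h, new_w = round(h * scale), round(w * scale)
    # Total padding needed on each axis to reach a size x size square
    pad_h, pad_w = size - new_h, size - new_w
    return scale, (new_h, new_w), (pad_h, pad_w)


scale, resized, padding = square_letterbox_geometry(375, 500)
print(scale, resized, padding)
```

For a 375x500 (h x w) image the longer side becomes 416, the shorter side becomes 312, and 104 rows of padding are needed to fill the square.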

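2. Rectangular inference: theory overview

A square-padded image contains a large redundant area. A natural question is whether this padding can be removed while still keeping the height and width multiples of 32.

The procedure: set the longer side to the target size, 416/512… (it must be a multiple of 32), scale the shorter side by the same ratio, then add a small amount of padding to the shorter side so that it becomes a multiple of 32.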
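
A minimal sketch of this procedure, assuming the shorter side is simply rounded up to the next multiple of the stride. The helper name `rect_inference_shape` is hypothetical and the 375x500 input is just an example.

```python
import math


def rect_inference_shape(h, w, size=416, stride=32):
    """Return ((new_h, new_w), (padded_h, padded_w)) for rectangular inference."""
    scale = size / max(h, w)                     # longer side goes to `size`
    new_h, new_w = round(h * scale), round(w * scale)
    # Round each side up to the next multiple of the stride
    padded_h = math.ceil(new_h / stride) * stride
    padded_w = math.ceil(new_w / stride) * stride
    return (new_h, new_w), (padded_h, padded_w)


resized, padded = rect_inference_shape(375, 500)
# Fraction of pixels saved compared with a full 416 x 416 square input
saving = 1 - (padded[0] * padded[1]) / (416 * 416)
print(resized, padded, saving)
```

Here the shorter side grows only from 312 to 320 instead of 416, so the network processes roughly 23% fewer pixels for this image.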

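Padding logic (for a fixed target size):

Let P be the width/height ratio of the target size. If the image's width/height ratio is greater than P, resize so that the width matches the target width and pad black borders on the top and bottom. If it is smaller than P, resize so that the height matches the target height and pad black borders on the left and right.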
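
The rule can be written down directly. The function below is an illustrative sketch (the name `fit_to_target` and the sample sizes are assumptions, not part of the YOLOv3-SPP code); it returns the resized size and the total padding on each axis.

```python
def fit_to_target(img_w, img_h, target_w, target_h):
    """Return (new_w, new_h, pad_w_total, pad_h_total) for a fixed target size."""
    p = target_w / target_h                      # aspect ratio P of the target
    if img_w / img_h > p:
        # Image is relatively wider than the target: match widths, pad top/bottom
        scale = target_w / img_w
    else:
        # Image is relatively taller (or equal): match heights, pad left/right
        scale = target_h / img_h
    new_w, new_h = round(img_w * scale), round(img_h * scale)
    return new_w, new_h, target_w - new_w, target_h - new_h


wide = fit_to_target(500, 375, 416, 416)   # wider than P = 1
tall = fit_to_target(300, 600, 640, 384)   # narrower than P = 640 / 384
print(wide, tall)
```

The wide image ends up with padding only on the vertical axis, the tall image only on the horizontal axis.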

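In YOLOv3-SPP, training uses mosaic for data augmentation, while Rectangular inference is used only at inference time, with the goal of significantly reducing inference time. At inference time, Rectangular inference is implemented mainly through the letterbox function.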

3. Rectangular inference: implementation

YOLOv3-SPP code:

import cv2
import numpy as np


def letterbox(img: np.ndarray, new_shape=(416, 416), color=(114, 114, 114),
              auto=True, scale_fill=False, scale_up=True):
    """
    Resize and pad an image to the requested size.
    :param img: original image, hwc=(375,500,3)
    :param new_shape: target size (h, w); an int is treated as the size of the longest side
    :param color: padding color
    :param auto: True  minimum rectangle: scale the longest side to the target size, scale the
                       shorter side by the same ratio, and add only the padding left after
                       taking it modulo 64 (no distortion)
                 False scale as above, then pad the shorter side on both sides up to the full
                       new_shape (no distortion)
    :param scale_fill: True  stretch the image straight to new_shape, i.e. a plain resize with
                             no padding (distorts); only reached when auto is False
    :param scale_up: True  images may be scaled up as well as down
                     False images larger than new_shape are scaled down, smaller ones are not enlarged
    :return: img: letterboxed image, HWC
             ratio: wh ratios
             (dw, dh): padding applied on each side along w and h
    """
    shape = img.shape[:2]  # original size [h, w] = [375, 500]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)  # (512, 512)
    # scale ratio (new / old)   1.024
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scale_up:  # (for better test mAP) only shrink images larger than new_shape (r < 1); smaller ones (r > 1) keep their size
        r = min(r, 1.0)
    # compute padding
    ratio = r, r  # width, height ratios  (1.024, 1.024)
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))  # wh (512, 384), aspect ratio preserved
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding dw=0 dh=128
    if auto:  # minimum rectangle: keep the aspect ratio and scale the longest side to the target size
        # Taking the padding modulo 64 keeps the padded size congruent to new_shape modulo 64:
        # a multiple of 32 for 416x416, and a multiple of 64 for 512x512
        dw, dh = np.mod(dw, 64), np.mod(dh, 64)  # wh padding dw=0 dh=0
    elif scale_fill:  # stretch: resize straight to the target size
        dw, dh = 0, 0
        new_unpad = (new_shape[1], new_shape[0])  # cv2.resize expects (w, h)
        ratio = new_shape[1] / shape[1], new_shape[0] / shape[0]  # width, height ratios
    dw /= 2  # divide padding into 2 sides: top/bottom and left/right
    dh /= 2
    # shape:[h, w]  new_unpad:[w, h]
    if shape[::-1] != new_unpad:  # resize to new_unpad (same long side, same aspect ratio)
        img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))  # padding for top and bottom
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))  # padding for left and right
    img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border/pad
    return img, ratio, (dw, dh)
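# Custom dataset used in the inference stage
class LoadImagesAndLabels(Dataset):  # for training/testing
    def __init__(self,
                 path,   # path to data/my_train_data.txt or data/my_val_data.txt
                 # This sets the image size output by preprocessing
                 # For the training set: the maximum size used during (multi-scale) training
                 # For the validation set: the final network input size
                 img_size=416,
                 batch_size=16,
                 augment=False,  # True for the training set (augment_hsv), False for the validation set
                 hyp=None,  # hyperparameter dict, including the ones used for image augmentation
                 rect=False,  # whether to use rectangular training
                 cache_images=False,  # whether to cache images in memory
                 single_cls=False, pad=0.0, rank=-1):
    ...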
    # Rectangular Training https://github.com/ultralytics/yolov3/issues/232
    # If True, the network is fed rectangles close to the original aspect ratio
    # (with the longest side set to img_size) instead of img_size x img_size
    # Note: enabling rect disables mosaic by default
    if self.rect:
        # Sort by aspect ratio
        s = self.shapes  # wh
        # height/width ratio of every image
        ar = s[:, 1] / s[:, 0]  # aspect ratio
        # argsort returns the indices that sort the array in ascending order
        # Sorting by height/width ratio gives every batch images with similar ratios
        irect = ar.argsort()
        # Reorder the image files, label files and shapes to match the sorted order
        self.img_files = [self.img_files[i] for i in irect]
        self.label_files = [self.label_files[i] for i in irect]
        self.shapes = s[irect]  # wh
        ar = ar[irect]
        # set training image shapes
        # Compute the common shape used by each batch
        # In short: the longer side is set to the requested size and the shorter side is scaled proportionally
        shapes = [[1, 1]] * nb  # nb: number of batches
        for i in range(nb):
            ari = ar[bi == i]  # bi: batch index
            # smallest and largest height/width ratio in batch i
            mini, maxi = ari.min(), ari.max()
            # If every height/width ratio is below 1 (w > h), set w to img_size
            if maxi < 1:
                shapes[i] = [maxi, 1]
            # If every height/width ratio is above 1 (w < h), set h to img_size
            elif mini > 1:
                shapes[i] = [1, 1 / mini]
        # Network input shape for each batch (rounded up to a multiple of 32)
        self.batch_shapes = np.ceil(np.array(shapes) * img_size / 32. + pad).astype(int) * 32
    ...
    # Separate file names prevent rect=False/True caches from being mixed up, which would give a wrong mAP
    # When rect is True, self.images and self.labels are re-sorted
    if rect is True:
        np_labels_path = str(Path(self.label_files[0]).parent) + ".rect.npy"  # saved labels in *.npy file
    else:
        np_labels_path = str(Path(self.label_files[0]).parent) + ".norect.npy"
    ...
    def __getitem__(self, index):
        hyp = self.hyp
        # train: mosaic is used for data augmentation during training
        if self.mosaic:
            # load mosaic
            img, labels = load_mosaic(self, index)
            shapes = None
        # inference: Rectangular inference is used to speed up inference
        else:
            # load image
            img, (h0, w0), (h, w) = load_image(self, index)
            # letterbox pads a batch of similarly shaped images to a common shape
            shape = self.batch_shapes[self.batch[index]] if self.rect else self.img_size  # final letterboxed shape
            img, ratio, pad = letterbox(img, shape, auto=False, scale_up=self.augment)
            shapes = (h0, w0), ((h / h0, w / w0), pad)  # for COCO mAP rescaling
    ...
    # Processes one batch: called once after __getitem__ has been called batch_size times,
    # to pack the batch_size images and their labels together.
    @staticmethod
    def collate_fn(batch):
        """
        :param batch: batch_size tuples, one per __getitem__ call
        :return: img=[batch_size, 3, 736, 736]
                 label=[target_sums, 6]  6: index of the image the target belongs to + class + x + y + w + h
                 path     shapes      index
        """
        # img: a tuple of batch_size tensors, one tensor per image
        # label: a tuple of batch_size tensors, each holding all targets of one image
        #        label[object_num, 6]; the first of the 6 values is the image's position in the batch
        # path: a tuple of batch_size strings, one file path per image
        # index: a tuple (index1, index2, index3...) with the dataset index of each image in the batch
        img, label, path, shapes, index = zip(*batch)  # transposed
        for i, l in enumerate(label):
            l[:, 0] = i  # add target image index for build_targets()
        # Returned img=[batch_size, 3, 736, 736]
        #      torch.stack(img, 0): stacks batch_size tensors of shape [3, 736, 736] into [batch_size, 3, 736, 736]
        # label=[target_sums, 6]  6: index of the image the target belongs to + class + x + y + w + h
        #      torch.cat(label, 0): concatenates [n1,6], [n2,6], [n3,6]... into [n1+n2+n3+..., 6]
        # The two are combined differently because every image tensor has the same shape, [3, 736, 736],
        # whereas the label tensors can differ in shape, since each image may hold a different number
        # of targets (stack would be more convenient for labels too, but it cannot be used here)
        # If every image had the same number of targets, a custom collate_fn might not be needed
        return torch.stack(img, 0), torch.cat(label, 0), path, shapes, index
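
The ratio and padding returned by letterbox are kept (in `shapes`) so that predictions can later be rescaled for COCO mAP. A minimal sketch of that inverse mapping is given below. It is not the repository's own rescaling routine: the function name `box_to_original` is hypothetical, boxes are assumed to be in pixel `(x1, y1, x2, y2)` format, and the image is assumed to enter letterbox at its original size.

```python
def box_to_original(box, ratio, pad):
    """Map an (x1, y1, x2, y2) box from letterboxed pixels back to original pixels."""
    x1, y1, x2, y2 = box
    rw, rh = ratio          # width and height scale ratios returned by letterbox
    dw, dh = pad            # per-side padding along w and h returned by letterbox
    # Undo the padding offset first, then undo the scaling
    return ((x1 - dw) / rw, (y1 - dh) / rh, (x2 - dw) / rw, (y2 - dh) / rh)


# A 375x500 (h x w) image letterboxed to 416x416: scale 0.832, 52 rows of padding on top and bottom
ratio, pad = (0.832, 0.832), (0.0, 52.0)
# A box that covers the whole resized image inside the letterboxed frame
box = (0.0, 52.0, 416.0, 364.0)
orig = box_to_original(box, ratio, pad)
print(orig)
```

The box maps back to the full extent of the original image, approximately (0, 0, 500, 375).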

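If Rectangular Inference works, it is natural to ask whether training can be done the same way. Training is clearly more complicated, because we train in batches and every image in a batch must have the same shape. That is why Square training is the most common choice: YOLO takes 416x416 input by default, SSD takes 300x300, and so on. Faster RCNN-style detectors and RetinaNet do not use a fixed square input, so early Faster RCNN implementations usually set batch size = 1. Some current implementations support batch size > 1 by taking the maximum width and maximum height within the batch and padding every image to that max width and max height. This is still wasteful: if the images in a batch differ a lot in width and height, the small images end up mostly padding.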
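
To make the cost of pad-to-max batching concrete, the sketch below measures the fraction of pixels in a batch that are pure padding. The helper `padding_waste` and both sample batches are invented for illustration.

```python
def padding_waste(shapes):
    """Fraction of batch pixels that are padding when every image is padded
    to the batch's maximum height and maximum width. `shapes` is a list of (h, w)."""
    max_h = max(h for h, _ in shapes)
    max_w = max(w for _, w in shapes)
    used = sum(h * w for h, w in shapes)          # pixels that carry image content
    total = max_h * max_w * len(shapes)           # pixels actually fed to the network
    return 1 - used / total


mixed = [(416, 208), (208, 416), (416, 416), (312, 416)]     # very different aspect ratios
similar = [(312, 416), (320, 416), (304, 416), (312, 416)]   # similar aspect ratios
print(padding_waste(mixed), padding_waste(similar))
```

In this example the mixed batch spends about 31% of its pixels on padding, while the batch of similarly shaped images spends about 2.5%.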

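So when the images in a batch have the same or similar sizes, this approach has a big advantage. The trick in YOLOv3-SPP is mainly about putting images with similar aspect ratios into the same batch, so that much less padding is needed. This is implemented in the __init__ part of the custom dataset. According to the author, this is about 1/3 faster on the COCO dataset, which mixes many different aspect ratios.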
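
The batching logic in `__init__` can be replayed without NumPy to see what it produces. This is a simplified sketch under stated assumptions: the function name `rect_batch_shapes` and the eight sample image sizes are made up, and batches are formed by slicing the sorted list, which plays the role of the `bi` batch index in the original code.

```python
import math


def rect_batch_shapes(shapes_wh, img_size=416, batch_size=4, stride=32, pad=0.0):
    """Sort images by h/w ratio and compute one (h, w) input shape per batch."""
    ar = [h / w for w, h in shapes_wh]                        # height/width ratio per image
    order = sorted(range(len(ar)), key=lambda i: ar[i])       # same role as ar.argsort()
    ar = [ar[i] for i in order]
    batch_shapes = []
    for start in range(0, len(ar), batch_size):
        chunk = ar[start:start + batch_size]
        mini, maxi = min(chunk), max(chunk)
        shape = [1.0, 1.0]                                    # [h, w] relative to img_size
        if maxi < 1:                                          # every image is wider than tall
            shape = [maxi, 1.0]
        elif mini > 1:                                        # every image is taller than wide
            shape = [1.0, 1.0 / mini]
        # Round each side up to a multiple of the stride
        batch_shapes.append(tuple(math.ceil(s * img_size / stride + pad) * stride for s in shape))
    return order, batch_shapes


sizes_wh = [(500, 375), (375, 500), (640, 480), (480, 640),
            (500, 333), (333, 500), (300, 400), (400, 300)]
order, batch_shapes = rect_batch_shapes(sizes_wh)
print(order, batch_shapes)
```

With these sizes, the four landscape images land in the first batch with a 320x416 input and the four portrait images land in the second batch with a 416x320 input, instead of every image being padded to 416x416.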

The inference log with the batch size set to 4 is shown below. Full output:

Using cuda device training.
Using 4 dataloader workers
Caching labels (5823 found, 0 missing, 0 empty, 0 duplicate, for 5823 images): 1
Model Summary: 225 layers, 6.26756e+07 parameters, 6.26756e+07 gradients, 117.2 GFLOPS
loading eval info for coco tools.: 100%|██| 5823/5823 [00:00<00:00, 6763.89it/s]
creating index...
index created!
validation...: 100%|████████████████████████| 1456/1456 [02:47<00:00,  8.67it/s]
Accumulating evaluation results...
DONE (t=1.19s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.597
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.816
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.658
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.221
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.688
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.477
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.684
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.698
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.352
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.769
 aeroplane      : 0.8407435368970886
 bicycle        : 0.841055733098646
 bird           : 0.8191967600842852
 boat           : 0.6946511199916087
 bottle         : 0.7231628774949694
 bus            : 0.8887323806647173
 car            : 0.8511586393361052
 cat            : 0.9073166258716362
 chair          : 0.699768122727234
 cow            : 0.8731324121446061
 diningtable    : 0.7171696892135374
 dog            : 0.8655392489181959
 horse          : 0.8825577798170706
 motorbike      : 0.9061020838132525
 person         : 0.8901775880333556
 pottedplant    : 0.6108321500054713
 sheep          : 0.8503812741255414
 sofa           : 0.7055372676420307
 train          : 0.9066429729563691
 tvmonitor      : 0.8405858721013346

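4. Rectangular inference: sequential ordering

In the YOLO code, Rectangular inference is actually combined with another trick. The width and height of every image in the dataset are collected and the images are sorted by aspect ratio, which ensures that images with similar ratios end up in the same batch. The images in a batch are then padded to one common shape and stacked into a single large tensor. Because the ratios within a batch are similar, little padding is needed, so less time is spent on padding. The padding itself is done by the letterbox function.

One thing to note, however: because YOLO arranges the batches this way, with the data sorted by aspect ratio, the data must not be sampled randomly at inference time. It follows that the DataLoader's shuffle parameter should be False, and that the index argument of def __getitem__(self, index) should start from 0 and walk through the sorted data in order. I checked the code and stepped through it in a debugger, which confirmed this.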
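
Why the order matters can be shown with a few lines of plain Python. In this sketch `batch_of_index` and `batch_shapes` stand in for the dataset's `self.batch` and `self.batch_shapes`, the helper names are hypothetical, and a fixed permutation stands in for what `shuffle=True` might produce.

```python
def loader_batches(indices, batch_size):
    """Group sample indices into consecutive batches, as a DataLoader would."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def shapes_per_batch(batches, batch_of_index, batch_shapes):
    """For each loader batch, the set of distinct letterbox shapes its samples request."""
    return [{batch_shapes[batch_of_index[i]] for i in b} for b in batches]


n, batch_size = 8, 4
batch_of_index = [i // batch_size for i in range(n)]   # sample index -> precomputed batch id
batch_shapes = [(320, 416), (416, 320)]                # one (h, w) shape per precomputed batch

sequential = loader_batches(list(range(n)), batch_size)
seq_shapes = shapes_per_batch(sequential, batch_of_index, batch_shapes)

shuffled = loader_batches([0, 5, 2, 7, 4, 1, 6, 3], batch_size)
shuf_shapes = shapes_per_batch(shuffled, batch_of_index, batch_shapes)
print(seq_shapes, shuf_shapes)
```

With sequential sampling every loader batch requests exactly one shape, so its images can be stacked into one tensor. With the shuffled order a single batch mixes 320x416 and 416x320 images, which cannot be stacked.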

validation.py

# Dataset definition and loading for the inference stage
val_dataset = LoadImagesAndLabels(test_path, parser_data.img_size, batch_size,
                                  hyp=parser_data.hyp,
                                  # Resize each batch to a suitable shape to reduce computation (not the standard 512x512)
                                  rect=True)  # note that Rectangular inference is enabled here
val_dataset_loader = torch.utils.data.DataLoader(val_dataset,
                                                 batch_size=batch_size,
                                                 shuffle=False,    # note that this is set to False
                                                 num_workers=nw,
                                                 pin_memory=True,
                                                 collate_fn=val_dataset.collate_fn)

Debug result: the index starts at 0 and then proceeds in order, which confirms the reasoning above.

References:

https://blog.csdn.net/songwsx/article/details/102639770

https://blog.csdn.net/qq_38253797/article/details/116611767
