PP-YOLOE | PP-YOLOv2 Fully Upgraded to Anchor-Free, Surpassing YOLOX and YOLOv5 in Both Speed and Accuracy (Part 2)

2.1 The Idea Behind the ATSS Assigner

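The ATSS paper points out that the gap between one-stage anchor-based detectors and center-based anchor-free detectors comes mainly from how positive and negative samples are selected. Based on this observation, it proposes ATSS (Adaptive Training Sample Selection), which automatically selects suitable anchor boxes as positive samples according to the statistical characteristics of each GT. The method improves model performance substantially without adding computation or parameters.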


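The ATSS assignment proceeds as follows:

  1. Compute the IoU between each gt bbox and all anchors on the multi-scale output levels.
  2. Compute the L2 distance between the center of each gt bbox and the centers of all anchors on the multi-scale output levels.
  3. For each output level and each gt bbox, pick the topk anchors (a hyperparameter, 9 by default) with the smallest L2 distance on that level. With l output levels in total, every gt bbox ends up with topk×l candidate positions.
  4. For each gt bbox, compute the mean and the standard deviation of the IoUs of all its candidate positions. Their sum is the adaptive threshold for that gt bbox.
  5. For each gt bbox, select the candidate positions whose IoU is greater than the threshold. These positions are positive samples and are responsible for predicting that gt bbox.
  6. If topk is set too large, some positive positions may fall outside the gt bbox, so these positions are filtered out and treated as background samples.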
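The six steps above can be sketched for a single gt box in plain Python. This is only an illustrative sketch, not the PaddleDetection implementation: the helper names `iou`, `center`, and `atss_select` are hypothetical, boxes are assumed to be in (xmin, ymin, xmax, ymax) format, and the sample standard deviation is used for the threshold (which needs at least two candidates).

```python
import math
from statistics import mean, stdev


def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def center(box):
    """Center point (cx, cy) of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def atss_select(gt, anchors_per_level, topk=9):
    """Return (threshold, positive_anchors) for one gt box.

    anchors_per_level holds one list of anchor boxes per pyramid level.
    """
    gx, gy = center(gt)
    candidates = []
    for anchors in anchors_per_level:
        # Steps 2-3: keep the topk anchors closest to the gt center on this level.
        ranked = sorted(
            anchors,
            key=lambda a: math.hypot(center(a)[0] - gx, center(a)[1] - gy))
        candidates.extend(ranked[:topk])
    # Steps 1 and 4: IoU of every candidate, then threshold = mean + std.
    ious = [iou(gt, a) for a in candidates]
    threshold = mean(ious) + stdev(ious)
    positives = []
    for anchor, value in zip(candidates, ious):
        cx, cy = center(anchor)
        # Step 6: the anchor center must lie strictly inside the gt box.
        inside = gt[0] < cx < gt[2] and gt[1] < cy < gt[3]
        # Step 5: the IoU must exceed the adaptive threshold.
        if value > threshold and inside:
            positives.append(anchor)
    return threshold, positives


gt = (0.0, 0.0, 10.0, 10.0)
level = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 15.0, 15.0), (20.0, 20.0, 30.0, 30.0)]
threshold, positives = atss_select(gt, [level], topk=3)
```

In this toy case only the anchor that coincides with the gt survives: the other two candidates fall below the adaptive threshold, and the second one also has its center on the gt border rather than inside it.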


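This strategy has two benefits:

  1. It guarantees that all positive anchors lie around the ground truth.
  2. Most importantly, the threshold for positive samples is fine-tuned according to the characteristics of each pyramid level.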


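The main contributions of the ATSS paper are:

  • It points out that the essential difference between anchor-based and anchor-free detectors is how positive and negative training samples are defined;
  • It proposes adaptive training sample selection, which automatically selects positive and negative samples according to the statistical characteristics of the objects;
  • It shows that tiling multiple anchors at each image location does not help detection performance;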
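import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

# The helper functions used below (iou_similarity, batch_iou_similarity, bbox_center,
# check_points_inside_bboxes, compute_max_iou_anchor, compute_max_iou_gt,
# gather_topk_anchors) are assumed to come from PaddleDetection's bbox and
# assigner utilities.


class ATSSAssigner(nn.Layer):
    """Bridging the Gap Between Anchor-based and Anchor-free Detection
     via Adaptive Training Sample Selection
    """
    __shared__ = ['num_classes']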
    def __init__(self, topk=9, num_classes=80, force_gt_matching=False, eps=1e-9):
        super(ATSSAssigner, self).__init__()
        self.topk = topk
        self.num_classes = num_classes
        self.force_gt_matching = force_gt_matching
        self.eps = eps
    def _gather_topk_pyramid(self, gt2anchor_distances, num_anchors_list, pad_gt_mask):
        pad_gt_mask = pad_gt_mask.tile([1, 1, self.topk]).astype(paddle.bool)
        gt2anchor_distances_list = paddle.split(gt2anchor_distances, num_anchors_list, axis=-1)
        num_anchors_index = np.cumsum(num_anchors_list).tolist()
        num_anchors_index = [0, ] + num_anchors_index[:-1]
        is_in_topk_list = []
        topk_idxs_list = []
        for distances, anchors_index in zip(gt2anchor_distances_list, num_anchors_index):
            num_anchors = distances.shape[-1]
            topk_metrics, topk_idxs = paddle.topk(distances, self.topk, axis=-1, largest=False)
            topk_idxs_list.append(topk_idxs + anchors_index)
            topk_idxs = paddle.where(pad_gt_mask, topk_idxs, paddle.zeros_like(topk_idxs))
            is_in_topk = F.one_hot(topk_idxs, num_anchors).sum(axis=-2)
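            is_in_topk = paddle.where(is_in_topk > 1, paddle.zeros_like(is_in_topk), is_in_topk)
            # Collect the per-level mask so it can be concatenated after the loop
            is_in_topk_list.append(is_in_topk.astype(gt2anchor_distances.dtype))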
        is_in_topk_list = paddle.concat(is_in_topk_list, axis=-1)
        topk_idxs_list = paddle.concat(topk_idxs_list, axis=-1)
        return is_in_topk_list, topk_idxs_list
    def forward(self, anchor_bboxes, num_anchors_list, gt_labels, gt_bboxes, pad_gt_mask, bg_index, gt_scores=None, pred_bboxes=None):
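        """Assign ground-truth boxes to anchors with ATSS.

        The assignment is done in the following steps:
        1. Compute the IoU between all anchor boxes (from all pyramid levels) and each gt.
        2. Compute the center distance between all anchor boxes and each gt.
        3. On each pyramid level, for each gt, select the k anchors whose centers are
           closest to the gt center, giving k*l candidates per gt in total.
        4. Take the IoUs of these candidates, compute their mean and std, and use
           mean + std as the IoU threshold.
        5. Select the candidates whose IoU is greater than the threshold as positives.
        6. Restrict positive samples to anchors whose centers lie inside the gt.
        7. If an anchor is assigned to multiple gts, keep the gt with the highest IoU.

        Args:
            anchor_bboxes (Tensor, float32): pre-defined anchors, shape(L, 4),
                    "xmin, ymin, xmax, ymax" format
            num_anchors_list (List): num of anchors in each level
            gt_labels (Tensor, int64|int32): Label of gt_bboxes, shape(B, n, 1)
            gt_bboxes (Tensor, float32): Ground truth bboxes, shape(B, n, 4)
            pad_gt_mask (Tensor, float32): 1 means bbox, 0 means no bbox, shape(B, n, 1)
            bg_index (int): background index
            gt_scores (Tensor|None, float32) Score of gt_bboxes,
                    shape(B, n, 1), if None, then it will initialize with one_hot label
            pred_bboxes (Tensor, float32, optional): predicted bounding boxes, shape(B, L, 4)
        Returns:
            assigned_labels (Tensor): (B, L)
            assigned_bboxes (Tensor): (B, L, 4)
            assigned_scores (Tensor): (B, L, C), if pred_bboxes is not None, then output ious
        """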
        assert gt_labels.ndim == gt_bboxes.ndim and gt_bboxes.ndim == 3
        num_anchors, _ = anchor_bboxes.shape
        batch_size, num_max_boxes, _ = gt_bboxes.shape
        # 1. Compute the IoU between all anchor boxes and the gts, [B, n, L]
        ious = iou_similarity(gt_bboxes.reshape([-1, 4]), anchor_bboxes)
        ious = ious.reshape([batch_size, -1, num_anchors])
        # 2. Compute the center distance between all anchor boxes and the gts, [B, n, L]
        gt_centers = bbox_center(gt_bboxes.reshape([-1, 4])).unsqueeze(1)
        anchor_centers = bbox_center(anchor_bboxes)
        gt2anchor_distances = (gt_centers - anchor_centers.unsqueeze(0)).norm(2, axis=-1).reshape([batch_size, -1, num_anchors])
        # 3. On each pyramid level, for each gt, select the k anchors whose centers are
        # closest to the gt center (k*l candidates per gt in total), [B, n, L]
        is_in_topk, topk_idxs = self._gather_topk_pyramid(gt2anchor_distances, num_anchors_list, pad_gt_mask)
        # 4. Take the IoUs of the candidates and use mean + std as the IoU threshold
        iou_candidates = ious * is_in_topk
        iou_threshold = paddle.index_sample(iou_candidates.flatten(stop_axis=-2), topk_idxs.flatten(stop_axis=-2))
        iou_threshold = iou_threshold.reshape([batch_size, num_max_boxes, -1])
        iou_threshold = iou_threshold.mean(axis=-1, keepdim=True) + iou_threshold.std(axis=-1, keepdim=True)
        # 5. Keep only the candidates whose IoU is greater than the threshold
        is_in_topk = paddle.where(iou_candidates > iou_threshold.tile([1, 1, num_anchors]), is_in_topk, paddle.zeros_like(is_in_topk))
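        # 6. Restrict positive samples to anchors whose centers lie inside the gt, [B, n, L]
        is_in_gts = check_points_inside_bboxes(anchor_centers, gt_bboxes)
        # Select positive samples, [B, n, L]
        mask_positive = is_in_topk * is_in_gts * pad_gt_mask
        # 7. If an anchor is assigned to multiple gts, keep the one with the highest IoU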
        mask_positive_sum = mask_positive.sum(axis=-2)
        if mask_positive_sum.max() > 1:
            mask_multiple_gts = (mask_positive_sum.unsqueeze(1) > 1).tile([1, num_max_boxes, 1])
            is_max_iou = compute_max_iou_anchor(ious)
            mask_positive = paddle.where(mask_multiple_gts, is_max_iou, mask_positive)
            mask_positive_sum = mask_positive.sum(axis=-2)
        # 8. Optionally make sure every gt_bbox is matched to an anchor
        if self.force_gt_matching:
            is_max_iou = compute_max_iou_gt(ious) * pad_gt_mask
            mask_max_iou = (is_max_iou.sum(-2, keepdim=True) == 1).tile([1, num_max_boxes, 1])
            mask_positive = paddle.where(mask_max_iou, is_max_iou, mask_positive)
            mask_positive_sum = mask_positive.sum(axis=-2)
        assigned_gt_index = mask_positive.argmax(axis=-2)
        # Assigned targets
        batch_ind = paddle.arange(end=batch_size, dtype=gt_labels.dtype).unsqueeze(-1)
        assigned_gt_index = assigned_gt_index + batch_ind * num_max_boxes
        assigned_labels = paddle.gather(gt_labels.flatten(), assigned_gt_index.flatten(), axis=0)
        assigned_labels = assigned_labels.reshape([batch_size, num_anchors])
        assigned_labels = paddle.where(mask_positive_sum > 0, assigned_labels, paddle.full_like(assigned_labels, bg_index))
        assigned_bboxes = paddle.gather(gt_bboxes.reshape([-1, 4]), assigned_gt_index.flatten(), axis=0)
        assigned_bboxes = assigned_bboxes.reshape([batch_size, num_anchors, 4])
        assigned_scores = F.one_hot(assigned_labels, self.num_classes)
        if pred_bboxes is not None:
            # assigned iou
            ious = batch_iou_similarity(gt_bboxes, pred_bboxes) * mask_positive
            ious = ious.max(axis=-2).unsqueeze(-1)
            assigned_scores *= ious
        elif gt_scores is not None:
            gather_scores = paddle.gather(gt_scores.flatten(), assigned_gt_index.flatten(), axis=0)
            gather_scores = gather_scores.reshape([batch_size, num_anchors])
            gather_scores = paddle.where(mask_positive_sum > 0, gather_scores, paddle.zeros_like(gather_scores))
            assigned_scores *= gather_scores.unsqueeze(-1)
        return assigned_labels, assigned_bboxes, assigned_scores
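
Step 7 of the assigner (resolving anchors that were matched to several gts) can be hard to follow in tensor form. The following is a minimal plain-Python sketch of the same idea on nested lists of shape [n_gt][n_anchor]. The function names `resolve_multiple_gts` and `assigned_gt_index` are hypothetical and only serve this illustration. As in the code above, the winning gt is the one with the highest IoU over all gts for that anchor.

```python
def resolve_multiple_gts(mask_positive, ious):
    """Keep at most one gt per anchor.

    mask_positive and ious are nested lists of shape [n_gt][n_anchor].
    For an anchor claimed by more than one gt, only the gt with the
    highest IoU for that anchor stays positive.
    """
    n_gt = len(mask_positive)
    n_anchor = len(mask_positive[0])
    resolved = [row[:] for row in mask_positive]
    for j in range(n_anchor):
        owners = sum(1 for i in range(n_gt) if mask_positive[i][j])
        if owners > 1:
            best = max(range(n_gt), key=lambda i: ious[i][j])
            for i in range(n_gt):
                resolved[i][j] = 1 if i == best else 0
    return resolved


def assigned_gt_index(resolved):
    """Index of the matched gt per anchor, or -1 for background."""
    n_gt = len(resolved)
    n_anchor = len(resolved[0])
    result = []
    for j in range(n_anchor):
        owners = [i for i in range(n_gt) if resolved[i][j]]
        result.append(owners[0] if owners else -1)
    return result


mask = [[1, 1, 0],
        [0, 1, 0]]
ious = [[0.6, 0.3, 0.0],
        [0.1, 0.7, 0.2]]
resolved = resolve_multiple_gts(mask, ious)
indices = assigned_gt_index(resolved)
```

Here the middle anchor is claimed by both gts and goes to the second one, whose IoU (0.7) is higher. The last anchor has no owner and becomes background, which corresponds to filling `bg_index` in the assigner.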

2.2 The Idea Behind the Task-aligned Assigner (TOOD)

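TOOD proposes Task Alignment Learning (TAL) to explicitly pull the optimal anchors of the two tasks, classification and localization, closer together. It does this with a sample assignment strategy and a task-aligned loss. The sample assignment computes the degree of task alignment at each anchor, while the task-aligned loss gradually unifies the best anchors for classification and localization.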


Like other recent one-stage detectors, TOOD adopts a Backbone-FPN-Head architecture. For efficiency and simplicity, TOOD, like ATSS, places a single anchor at each location.

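As discussed, existing one-stage detectors suffer from task misalignment because the classification and localization tasks diverge. TOOD aligns the two tasks explicitly with T-head + TAL, as shown in the figure above. T-head and TAL work together to improve the alignment of the two tasks: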


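  • First, T-head makes classification and localization predictions on top of the FPN features;
  • Then, TAL computes task alignment signals based on the proposed task alignment metric;
  • Finally, T-head automatically adjusts its classification probabilities and localization predictions using the signals passed back from TAL.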

1. Task-Aligned Head





2. Task-Aligned Sample Assignment


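The anchor assignment should satisfy the following rules:

  • A well-aligned anchor should predict a high classification score together with precise localization;
  • A misaligned anchor should have a low classification score and be suppressed at the NMS stage.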


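Anchor alignment metric: Since the classification score and the IoU together characterize prediction quality, a high-order combination of the two is used to measure the degree of task alignment. It is defined as $t = s^{\alpha} \times u^{\beta}$, where $s$ is the classification score, $u$ is the IoU, and $\alpha$ and $\beta$ control the relative impact of the two terms.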
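
To make the metric concrete, here is a small sketch that evaluates the product of the classification score raised to alpha and the IoU raised to beta, mirroring `bbox_cls_scores.pow(self.alpha) * ious.pow(self.beta)` in the assigner code later in this section. The function name `alignment_metric` is hypothetical. The defaults alpha=1.0 and beta=6.0 are taken from the `TaskAlignedAssigner` constructor shown in this article.

```python
def alignment_metric(score, iou, alpha=1.0, beta=6.0):
    """Task alignment of one anchor: score**alpha * iou**beta."""
    return (score ** alpha) * (iou ** beta)


# Anchor A: confident classification but poor localization.
metric_a = alignment_metric(score=0.9, iou=0.5)
# Anchor B: moderate classification but accurate localization.
metric_b = alignment_metric(score=0.5, iou=0.9)
```

With beta much larger than alpha, the IoU term dominates, so anchor B ranks well above anchor A even though its classification score is lower.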


Training sample assignment: As prior work has shown, training sample assignment is crucial to the training of a detector.
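
As a rough illustration of how candidates are chosen for one gt, the sketch below zeroes the metric of anchors whose centers are outside the gt, keeps the top-k anchors by alignment metric, and then drops any selected anchor that is outside the gt. This follows the order of operations in the assigner code in this section, but the function name `select_topk_candidates` and the list-based inputs are assumptions made for this example.

```python
def select_topk_candidates(metrics, inside_gt, topk=13):
    """Return a 0/1 mask over anchors for a single gt.

    metrics: alignment metric of every anchor with respect to this gt.
    inside_gt: True where the anchor center lies inside the gt box.
    """
    # Anchors outside the gt get a metric of zero before ranking.
    masked = [m if ok else 0.0 for m, ok in zip(metrics, inside_gt)]
    order = sorted(range(len(masked)), key=lambda j: masked[j], reverse=True)
    chosen = set(order[:topk])
    # A selected anchor only counts as positive if it is inside the gt.
    return [1 if (j in chosen and inside_gt[j]) else 0
            for j in range(len(masked))]


metrics = [0.9, 0.2, 0.5, 0.7]
inside = [False, True, True, True]
mask_top2 = select_topk_candidates(metrics, inside, topk=2)
mask_default = select_topk_candidates(metrics, inside)
```

The first anchor has the largest raw metric but lies outside the gt, so it is never selected. When fewer than topk anchors exist, every anchor inside the gt becomes a candidate.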


class TaskAlignedAssigner(nn.Layer):
    def __init__(self, topk=13, alpha=1.0, beta=6.0, eps=1e-9):
        super(TaskAlignedAssigner, self).__init__()
        self.topk = topk
        self.alpha = alpha
        self.beta = beta
        self.eps = eps
    def forward(self, pred_scores, pred_bboxes, anchor_points, num_anchors_list, gt_labels, gt_bboxes, pad_gt_mask, bg_index, gt_scores=None):
        """Assign ground-truth boxes to anchors with the Task-Aligned Assigner.

        The assignment is done in the following steps:
        1. Compute the alignment metric between all predicted boxes and each gt.
        2. Select the top-k boxes as candidates for each gt.
        3. Restrict positive samples to anchors whose centers lie inside the gt
           (an anchor-free detector can only predict positive distances).
        4. If an anchor is assigned to multiple gts, keep the gt with the highest IoU.

        Args:
            pred_scores (Tensor, float32): predicted class probability, shape(B, L, C)
            pred_bboxes (Tensor, float32): predicted bounding boxes, shape(B, L, 4)
            anchor_points (Tensor, float32): pre-defined anchors, shape(L, 2), "cxcy" format
            num_anchors_list (List): num of anchors in each level, shape(L)
            gt_labels (Tensor, int64|int32): Label of gt_bboxes, shape(B, n, 1)
            gt_bboxes (Tensor, float32): Ground truth bboxes, shape(B, n, 4)
            pad_gt_mask (Tensor, float32): 1 means bbox, 0 means no bbox, shape(B, n, 1)
            bg_index (int): background index
            gt_scores (Tensor|None, float32) Score of gt_bboxes, shape(B, n, 1)
        Returns:
            assigned_labels (Tensor): (B, L)
            assigned_bboxes (Tensor): (B, L, 4)
            assigned_scores (Tensor): (B, L, C)
        """
        assert pred_scores.ndim == pred_bboxes.ndim
        assert gt_labels.ndim == gt_bboxes.ndim and gt_bboxes.ndim == 3
        batch_size, num_anchors, num_classes = pred_scores.shape
        _, num_max_boxes, _ = gt_bboxes.shape
        # Compute the IoU between the gts and the predicted boxes, [B, n, L]
        ious = iou_similarity(gt_bboxes, pred_bboxes)
        # Gather the predicted class score of every box for the class of each gt
        pred_scores = pred_scores.transpose([0, 2, 1])
        batch_ind = paddle.arange(end=batch_size, dtype=gt_labels.dtype).unsqueeze(-1)
        gt_labels_ind = paddle.stack([batch_ind.tile([1, num_max_boxes]), gt_labels.squeeze(-1)], axis=-1)
        bbox_cls_scores = paddle.gather_nd(pred_scores, gt_labels_ind)
        # Compute the alignment metric between the boxes and the gts, [B, n, L]
        alignment_metrics = bbox_cls_scores.pow(self.alpha) * ious.pow(self.beta)
        # check the positive sample's center in gt, [B, n, L]
        is_in_gts = check_points_inside_bboxes(anchor_points, gt_bboxes)
        # Select the top-k predicted boxes as candidates for each gt
        is_in_topk = gather_topk_anchors(alignment_metrics * is_in_gts, self.topk, topk_mask=pad_gt_mask.tile([1, 1, self.topk]).astype(paddle.bool))
        # select positive sample, [B, n, L]
        mask_positive = is_in_topk * is_in_gts * pad_gt_mask
        # If an anchor is assigned to multiple gts, keep the one with the highest IoU, [B, n, L]
        mask_positive_sum = mask_positive.sum(axis=-2)
        if mask_positive_sum.max() > 1:
            mask_multiple_gts = (mask_positive_sum.unsqueeze(1) > 1).tile([1, num_max_boxes, 1])
            is_max_iou = compute_max_iou_anchor(ious)
            mask_positive = paddle.where(mask_multiple_gts, is_max_iou, mask_positive)
            mask_positive_sum = mask_positive.sum(axis=-2)
        assigned_gt_index = mask_positive.argmax(axis=-2)
        # assigned target
        assigned_gt_index = assigned_gt_index + batch_ind * num_max_boxes
        assigned_labels = paddle.gather(gt_labels.flatten(), assigned_gt_index.flatten(), axis=0)
        assigned_labels = assigned_labels.reshape([batch_size, num_anchors])
        assigned_labels = paddle.where(mask_positive_sum > 0, assigned_labels, paddle.full_like(assigned_labels, bg_index))
        assigned_bboxes = paddle.gather(gt_bboxes.reshape([-1, 4]), assigned_gt_index.flatten(), axis=0)
        assigned_bboxes = assigned_bboxes.reshape([batch_size, num_anchors, 4])
        assigned_scores = F.one_hot(assigned_labels, num_classes)
        # rescale alignment metrics
        alignment_metrics *= mask_positive
        max_metrics_per_instance = alignment_metrics.max(axis=-1, keepdim=True)
        max_ious_per_instance = (ious * mask_positive).max(axis=-1, keepdim=True)
        alignment_metrics = alignment_metrics / (max_metrics_per_instance + self.eps) * max_ious_per_instance
        alignment_metrics = alignment_metrics.max(-2).unsqueeze(-1)
        assigned_scores = assigned_scores * alignment_metrics
        return assigned_labels, assigned_bboxes, assigned_scores
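
The final "rescale alignment metrics" step turns the raw metric into a soft classification target. The sketch below reproduces that normalization for a single gt in plain Python, so that the largest target among its positive anchors equals the largest IoU among them. The function name `rescale_alignment` and the list-based inputs are assumptions made for this illustration.

```python
def rescale_alignment(metrics, ious, mask_positive, eps=1e-9):
    """Normalize alignment metrics of one gt over its positive anchors.

    Each positive anchor gets metric / max_metric * max_iou, so the best
    anchor receives a target equal to the largest IoU among positives.
    Non-positive anchors get 0.
    """
    masked = [m * k for m, k in zip(metrics, mask_positive)]
    max_metric = max(masked)
    max_iou = max(u * k for u, k in zip(ious, mask_positive))
    return [v / (max_metric + eps) * max_iou for v in masked]


metrics = [0.2, 0.1, 0.4]
ious = [0.8, 0.6, 0.9]
mask_positive = [1, 1, 0]
targets = rescale_alignment(metrics, ious, mask_positive)
```

The third anchor has the highest raw metric but is not a positive sample, so it is ignored. The first anchor then receives a target of about 0.8 and the second about 0.4. In the assigner, the maximum over gts of these values scales the one-hot label of each anchor.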
