FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding

https://github.com/MegviiDetection/FSCE

Emerging interest has been brought to recognizing previously unseen objects given very few training examples, known as few-shot object detection (FSOD). Recent research demonstrates that good feature embedding is the key to favorable few-shot learning performance. We observe that object proposals with different Intersection-over-Union (IoU) scores are analogous to the intra-image augmentation used in contrastive approaches. We exploit this analogy and incorporate supervised contrastive learning to achieve more robust object representations in FSOD.


We present Few-Shot object detection via Contrastive proposals Encoding (FSCE), a simple yet effective approach to learning contrastive-aware object proposal encodings that facilitate the classification of detected objects. We notice that the degradation of average precision (AP) for rare objects mainly comes from misclassifying novel instances as confusable classes. We ease this misclassification by promoting instance-level intra-class compactness and inter-class variance via our contrastive proposal encoding loss (CPE loss). Our design outperforms current state-of-the-art works in any shot and all data splits, with up to +8.8% on the standard PASCAL VOC benchmark and +2.7% on the challenging COCO benchmark.


1. Introduction


The development of modern convolutional neural networks (CNNs) [1, 2, 3] has given rise to great advances in general object detection [4, 5, 6]. Deep detectors demand a large amount of annotated training data to saturate their performance [7, 8].

In few-shot learning scenarios, deep detectors suffer more severe over-fitting, and the gap between few-shot detection and general object detection is larger than the corresponding gap in few-shot image classification [9, 10, 11]. In contrast, a child can rapidly comprehend new visual concepts and recognize objects from a newly learned category given very few examples. Closing this gap is therefore an important step towards more successful machine perception [12].


Preceded by few-shot image classification, earlier attempts at few-shot object detection use meta-learning strategies [13, 14, 15]. Meta-learners are trained on episodes of individual tasks, where meta-task samples from common objects (base classes) are paired with rare objects (novel classes) to simulate few-shot detection tasks. Recently, the two-stage fine-tuning approach (TFA) has revealed more potential for improving few-shot detection. Baseline TFA [16] simply freezes all parameters trained on base classes and fine-tunes only the box classifier and box regressor with novel data, yet outperforms previous meta-learners. MPSR [17] improves upon TFA by alleviating the scale bias inherent to few-shot datasets, but its positive refinement branch demands manual selection, which is somewhat less neat. In this work, we observe and address the essential weakness of the fine-tuning based approach, namely constantly mislabeling novel instances as confusable categories, and improve few-shot detection performance to a new state of the art (SOTA).


Object detection involves localization and classification of the objects that appear in an image. In few-shot detection, one might naturally conjecture that the localization of novel objects will under-perform its base-category counterpart, with the concern that rare objects would be deemed background [14, 13, 18]. However, based on our experiments with Faster R-CNN [4], the commonly adopted detector in few-shot detection, the class-agnostic region proposal network (RPN) is able to make foreground proposals for novel instances, and the final box regressor can localize novel instances quite accurately. In comparison, as demonstrated in Figure 2, misclassifying detected novel instances as confusable base classes is indeed the main source of error. We visualize the pairwise cosine similarity between class prototypes [19, 20, 21] of a Faster R-CNN box classifier trained with PASCAL VOC [22, 23]. The cosine similarity between prototypes of resembling categories can be 0.39, whereas the similarity between objects and background is on average -0.21. In the few-shot setting, the similarity between cluster centers can go as high as 0.59, e.g., between sheep and cow or between bicycle and motorbike, making classification of similar objects error-prone. A calculation on baseline TFA shows that manually correcting misclassified yet accurately localized box predictions can increase novel class average precision (nAP) by over 20 points.


Figure 2. We find that in fine-tuning based few-shot object detectors, classification is more error-prone than localization. In the fine-tuning stage, the RPN is able to make good enough foreground proposals for novel instances, hence novel objects are often accurately localized but misclassified as confusable base classes. Shown here are the 20 top-scoring RPN proposals and example detection results from PASCAL VOC Split 1, wherein bird, sofa and cow are novel categories. The left panel shows the pairwise cosine similarity between the class prototypes learned in the bounding box classifier. For example, the similarity between bus and bird is -0.10, but the similarity between cow and horse is 0.39. Our goal is to decrease the instance-level similarity between similar objects that are from different categories.


A common approach to learning a well-separated decision boundary is to use a large-margin classifier [24], but in our trials, category-level positive-margin based classifiers do not work in this data-hungry setting [20, 25]. To learn instance-level discriminative feature representations, contrastive learning [26, 27] has demonstrated its effectiveness in tasks including recognition [28], identification [29] and the recent successful self-supervised models [30, 31, 32, 33]. In supervised contrastive learning for image classification [34], intra-image augmentations of images from the same class are used to enrich the positive example pairs. We think region proposals with different Intersection-over-Union (IoU) scores for an object are naturally analogous to intra-image augmentation by cropping, as illustrated in Figure 1. Therefore, in this work, we explore extending the supervised batch contrastive approach [34] to few-shot object detection. We believe that contrastively learned object representations, which are aware of intra-class compactness and inter-class difference, can ease the misclassification of unseen objects as similar categories.


Figure 1. Conceptualization of our contrastive object proposal encoding. We introduce a score function which measures the semantic similarity between region proposals. Positive proposals (x+) refer to region proposals from the same category or the same object. Negative proposals (x-) refer to proposals from different categories. We encourage the object encodings to have the property that score(f(x), f(x+)) >> score(f(x), f(x-)), such that our contrastively learned object proposals have smaller intra-class variance and larger inter-class difference.


We present Few-Shot object detection via Contrastive proposals Encoding (FSCE), a simple yet effective fine-tuning based approach for few-shot object detection. When transferring the base detector to few-shot novel data, we augment the primary Region-of-Interest (RoI) head with a contrastive branch, which measures the similarity between object proposal encodings. A supervised contrastive objective with specific considerations for detection is optimized to reduce the variance of object proposal embeddings from the same category, while pushing instances of different categories away from each other. The proposed contrastive objective, the contrastive proposal encoding (CPE) loss, is added to the original classification and localization objectives in a multi-task fashion. The end-to-end training of our proposed method is identical to that of vanilla Faster R-CNN.


To the best of our knowledge, we are the first to bring contrastive learning into few-shot object detection. Our simple design sets the new state of the art in any shot (1, 2, 3, 5, 10, and 30), with up to +8.8% on the standard PASCAL VOC benchmark and +2.7% on the challenging COCO benchmark.


2. Related Work


Few-shot learning. Few-shot learning aims to recognize new concepts given limited labeled examples. Meta-learning approaches aim at training a meta-model on episodes of individual tasks such that it can adapt to new tasks with few samples [35, 11, 36, 10, 37, 38, 39], known as "learning-to-learn". Deep metric-learning based approaches emphasize learning good feature representation embeddings that facilitate downstream tasks. The most intuitive metrics include cosine similarity [20, 40, 41, 21], Euclidean distance to the class center [19], and graph distances [42]. Interestingly, hallucinator-based methods address the data deficiency by learning to generate fake data [9]. Existing few-shot learners are mostly developed in the context of classification. In comparison, few-shot detection is more challenging, as it involves both classification and localization, yet it remains under-researched.


Few-shot object detection. There are two lines of work addressing the challenging few-shot object detection (FSOD) problem. First, meta-learning based approaches devise a stage-wise and periodic meta-training paradigm to train a meta-learner that helps transfer knowledge from base classes. Meta R-CNN [13] meta-learns a channel-wise attention layer for remodeling the RoI head. MetaDet [14] applies a weight prediction meta-model to dynamically transfer category-specific parameters from the base detector. FSIW [15] improves upon Meta R-CNN and FSRW [43] with more complex feature aggregation and meta-training on a balanced dataset. Second, with the balanced dataset introduced in TFA [16], fine-tuning based detectors are overtaking meta-learning based methods in performance. MPSR [17] sets the current state of the art by mitigating the scale scarcity in few-shot datasets, but its generalizability is limited because the positive refinement branch contains manual decisions. RepMet [44] attaches an embedding sub-net to the RoI head to model a posterior class distribution. It uses advanced tricks including OHEM [45] and SoftNMS [46] but fails to catch up with the current SOTA. We criticize complex algorithms, as they can easily overfit and exhibit poor test results in FSOD. Instead, our insight is that the degradation of average precision (AP) for novel categories mainly comes from misclassifying novel instances as confusable categories, and we resort to contrastive learning to learn discriminative object proposal representations without complicating the model.


Contrastive learning. The recent success of self-supervised models can be attributed to the renewed interest in exploring contrastive learning [47, 30, 48, 49, 32, 50, 33, 51]. Optimizing the contrastive objectives [48, 20, 21, 34] simultaneously maximizes the agreement between similar instances, defined as positive pairs, and encourages the difference among dissimilar instances, or negative pairs. With contrastive learning, the algorithm learns to build representations that do not concentrate on pixel-level details but instead encode high-level features effective enough to distinguish different images [33, 32, 50, 51]. Supervised contrastive learning [34] extends the batch contrastive approach to the supervised setting, but only for image classification.


To the best of our knowledge, this work is the first to integrate supervised contrastive learning [29, 34] into few-shot object detection. The state-of-the-art few-shot detection performance in any shot and on all benchmarks demonstrates the effectiveness of our proposed method.


3. Method


Our proposed method FSCE involves a simple two-stage training. First, the standard Faster R-CNN detection model is trained with abundant base-class data (D_train = D_base). Then, the base detector is transferred to novel data through fine-tuning on a balanced dataset [8] with novel instances and randomly sampled base instances (D_train = D_novel ∪ D_base). The backbone feature extractor is frozen during fine-tuning, while the RoI feature extractor is supervised by a contrastive objective. We jointly optimize the proposed contrastive proposal encoding (CPE) loss with the original classification and regression objectives in a multi-task fashion. An overview of our method is shown in Figure 3.


Figure 3. Overview of our proposed FSCE. In our method, we jointly fine-tune the FPN pathway and the RPN while fixing the backbone. We find this effective in coordinating backbone feature maps to activate on novel objects while still avoiding the risk of overfitting. To learn contrastive object proposal encodings, we introduce a contrastive branch that guides the RoI features to learn contrastive-aware proposal embeddings. We design a contrastive objective to maximize within-category agreement and cross-category disagreement.


3.1. Preliminary


Rethinking the two-stage fine-tuning approach. The original TFA [16] fine-tunes only the last two fc layers (the box classifier and the box regressor) with novel data; the remaining structures are frozen and taken as a fixed feature extractor. This can be viewed as a way to counter over-fitting on the limited novel data. However, it is counter-intuitive that the Feature Pyramid Network (FPN [52]), the RPN, and especially the RoI feature extractor, which contain semantic information learned from base classes only, could be transferred directly to novel classes without any form of training.


In baseline TFA, unfreezing the RPN and the RoI feature extractor leads to degraded results for novel classes. However, we find this behavior is reversible and can benefit novel detection results if the network is trained properly. We propose a stronger baseline that adapts much better to novel data, with jointly fine-tuned feature extractors and box predictors.


Strong baseline. We establish our strong baseline from the following observations. Initially, the detection performance for novel classes decreases as more network components are fine-tuned with novel shots. However, we notice a significant gap in the key RPN and RoI statistics between the data-abundant base training stage and the novel fine-tuning stage. As shown in Figure 4, the number of proposals from positive anchors in novel fine-tuning is only 1/4 of its base training counterpart, and the number of foreground proposals decreases as a consequence. We observe that, especially at the beginning of fine-tuning, the positive anchors for novel objects receive comparatively low scores from the RPN. Due to the low objectness scores, fewer positive anchors pass non-max suppression (NMS) and become proposals that provide actual learning opportunities in the RoI head for novel objects. Our insight is to rescue the low-objectness positive anchors that are suppressed. Besides, re-balancing the fraction of foreground proposals is also critical to prevent the diffuse yet easy backgrounds from dominating the gradient descent for novel instances in fine-tuning.


Figure 4. Key detection statistics. The left panel shows the average number of positive anchors per image in the RPN during base training and novel fine-tuning. The right panel shows the average number of foreground proposals per image during fine-tuning. On the left, the orange line shows the original TFA setting, which uses the same specs as base training. On the right, the blue line shows the effect of doubling the number of anchors kept after NMS in the RPN, and the gray line shows the effect of halving the RoI head batch size.


We use an unfrozen RPN and RoI head with two modifications: (1) doubling the maximum number of proposals kept after NMS, which brings more foreground proposals for novel instances, and (2) halving the number of sampled proposals in the RoI head used for loss computation, since in the fine-tuning stage the discarded half contains only backgrounds (the standard RoI batch size is 512, and the number of foreground proposals is far less than half of it). As shown in Table 1, our strong baseline boosts baseline TFA by non-trivial margins. Moreover, the tunable RoI feature extractor opens up room for realizing our proposed contrastive object proposal encoding.
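
The two modifications can be sketched as a small configuration transform. This is only an illustration: the key names and the base value of 1000 post-NMS proposals are hypothetical assumptions, not taken from the paper or from any specific detection library; only the RoI batch size of 512 comes from the text.

```python
# Hypothetical base-training settings. The key names and the value 1000 are
# illustrative assumptions; only the RoI batch size of 512 is from the text.
base_cfg = {
    "rpn_post_nms_top_k": 1000,        # assumed: proposals kept after NMS
    "roi_batch_size_per_image": 512,   # standard RoI batch size
    "freeze_rpn": True,                # baseline TFA keeps these frozen
    "freeze_roi_feature_extractor": True,
}


def strong_baseline_cfg(cfg):
    """Return fine-tuning settings with the two strong-baseline changes."""
    tuned = dict(cfg)
    # (1) keep twice as many proposals after NMS
    tuned["rpn_post_nms_top_k"] = cfg["rpn_post_nms_top_k"] * 2
    # (2) sample half as many proposals for the RoI head loss
    tuned["roi_batch_size_per_image"] = cfg["roi_batch_size_per_image"] // 2
    # the RPN and the RoI feature extractor are fine-tuned, not frozen
    tuned["freeze_rpn"] = False
    tuned["freeze_roi_feature_extractor"] = False
    return tuned


finetune_cfg = strong_baseline_cfg(base_cfg)
```

In a real codebase these values would map onto whatever options the detector exposes for the post-NMS proposal count and the RoI sampling batch size.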


Table 1. Novel detection performance of our strong baseline on PASCAL VOC Novel Split 1.


3.2. Contrastive object proposal encoding


In two-stage detection frameworks, the RPN takes backbone feature maps as inputs and generates region proposals; the RoI head then classifies each region proposal and regresses a bounding box if the proposal is predicted to contain an object. In the Faster R-CNN pipeline, the RoI head feature extractor first pools the region proposals to a fixed size and then encodes them as vector embeddings x ∈ R^{D_R}, known as the RoI features. Typically D_R = 1024 in Faster R-CNN with FPN. General detectors fail to establish robust feature representations for region proposals from limited shots, resulting in mislabeled localized objects and low average precision. The idea is to learn more discriminative object proposal embeddings, but according to our experiments, the category-level positive-margin classifier [20, 25] does not work in this data-hungry setting. In order to learn more robust object feature representations from fewer shots, we propose to apply batch contrastive learning [34] to explicitly model instance-level intra-class similarity and inter-class distinction [29, 26] of object proposal embeddings.


To incorporate contrastive representation learning into the Faster R-CNN framework, we introduce a contrastive branch to the primary RoI head, parallel to the classification and regression branches. The RoI feature vector x contains post-ReLU [53] activations and is thus truncated at zero, so the similarity between two proposal embeddings cannot be measured directly. Therefore, the contrastive branch applies a 1-layer multi-layer perceptron (MLP) head with negligible cost to encode the RoI feature into a contrastive feature z ∈ R^{D_C}, with D_C = 128 by default. Subsequently, we measure similarity scores between object proposal representations on the MLP-head encoded RoI features and optimize a contrastive objective to maximize the agreement between object proposals from the same category and promote the distinctiveness of proposals from different categories. The proposed contrastive loss for object detection is described in the next section.
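
A toy example may help show why post-ReLU features are a poor space for measuring similarity: all components are non-negative, so the cosine similarity of two RoI features can never be negative, whereas a linear projection restores the full range. The vectors, the weight matrix, and the helper names below are hypothetical and use tiny dimensions (3 and 2) instead of D_R = 1024 and D_C = 128.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def linear(x, weight, bias):
    """One fully connected layer: weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Toy post-ReLU RoI features: every component is >= 0.
x1 = [1.0, 0.0, 2.0]
x2 = [0.0, 3.0, 1.0]

# Hypothetical contrastive-head weights (3 -> 2 dimensions).
weight = [[1.0, -1.0, 0.0],
          [0.0, 1.0, -1.0]]
bias = [0.0, 0.0]

raw_sim = cosine(x1, x2)           # cannot be negative for post-ReLU inputs
z1 = linear(x1, weight, bias)      # [1.0, -2.0]
z2 = linear(x2, weight, bias)      # [-3.0, 2.0]
proj_sim = cosine(z1, z2)          # negative here: the full range is usable
```

In this toy case the raw features look mildly similar, while the projected features are strongly dissimilar, which is the kind of separation a contrastive objective needs to be able to express.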


We adopt a cosine similarity based bounding box classifier, where the logit to predict the i-th instance as the j-th class is computed by the scaled cosine similarity between the RoI feature x_i and the class weight w_j in the hypersphere,

$$\mathrm{logit}_{i,j} = \alpha \, \frac{x_i^{\top} w_j}{\lVert x_i \rVert \cdot \lVert w_j \rVert}$$

where α is a scaling factor to enlarge the gradient. We empirically fix α = 20 in our experiments. The proposed contrastive branch guides the RoI head to learn contrastive-aware object proposal embeddings, which ease the discrimination between different categories. In the cosine projected hypersphere, our contrastive object proposal embeddings form tighter clusters with enlarged distances between different clusters, thereby increasing the generalizability of the detection model in the few-shot setting.
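
As a minimal sketch of this classifier, the logit for each class is α times the cosine similarity between the RoI feature and that class's weight vector. The function name and the toy 2-dimensional inputs are illustrative assumptions; α = 20 is the value stated in the text.

```python
import math


def cosine_logits(x, class_weights, alpha=20.0):
    """Scaled cosine-similarity logits of one RoI feature against all classes.

    x: RoI feature vector (list of floats, non-zero).
    class_weights: one weight vector per class (each non-zero).
    alpha: scaling factor; 20 is the value reported in the text.
    """
    x_norm = math.sqrt(sum(v * v for v in x))
    logits = []
    for w in class_weights:
        w_norm = math.sqrt(sum(v * v for v in w))
        dot = sum(a * b for a, b in zip(x, w))
        logits.append(alpha * dot / (x_norm * w_norm))
    return logits


# Toy example: the third class weight points in the same direction as x,
# so it receives the maximum possible logit, alpha.
logits = cosine_logits([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
```

Because the logits depend only on direction, rescaling a feature or a class weight leaves the prediction unchanged, so every logit is bounded by ±α.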


3.3. Contrastive Proposal Encoding (CPE) Loss


Inspired by supervised contrastive objectives in classification [34] and identification [29], our CPE loss is defined as follows, with considerations tailored for detection. Concretely, consider a mini-batch of N RoI box features $\{z_i, u_i, y_i\}_{i=1}^{N}$, where z_i is the contrastive head encoded RoI feature for the i-th region proposal, u_i denotes its Intersection-over-Union (IoU) score with the matched ground truth bounding box, and y_i denotes the label of the ground truth,

$$L_{CPE} = \frac{1}{N} \sum_{i=1}^{N} f(u_i) \cdot L_{z_i}$$

$$L_{z_i} = \frac{-1}{N_{y_i} - 1} \sum_{j=1, j \neq i}^{N} \mathbb{I}\{y_i = y_j\} \cdot \log \frac{\exp(\tilde{z}_i \cdot \tilde{z}_j / \tau)}{\sum_{k=1}^{N} \mathbb{I}\{k \neq i\} \cdot \exp(\tilde{z}_i \cdot \tilde{z}_k / \tau)}$$

Here N_{y_i} is the number of proposals with the same label as y_i, and τ is the temperature hyper-parameter.

In the above formula, $\tilde{z}_i = z_i / \lVert z_i \rVert$ denotes the normalized features, hence $\tilde{z}_i \cdot \tilde{z}_j$ measures the cosine similarity between the i-th and j-th proposals in the projected hypersphere. Optimizing the above loss function increases the instance-level similarity between object proposals with the same label and pushes proposals with different labels apart in the projection space. As a result, instances from each category form a tighter cluster, and the margins around the periphery of the clusters are enlarged. The effectiveness of our CPE loss has been confirmed by t-SNE visualization, as shown in Figure 5 (a) and (b).


Proposal consistency control. Unlike image classification, where semantic information comes from the entire image, classification signals in detection come from region proposals. We propose to use an IoU threshold to assure the consistency of the proposals that are contrasted, with the consideration that low-IoU proposals deviate too much from the center of the regressed objects and therefore might contain irrelevant semantics. In the formula above, f(u_i) controls the consistency of proposals and is defined with a proposal consistency threshold Φ and a re-weighting function g(·),

$$f(u_i) = \mathbb{I}\{u_i \geq \Phi\} \cdot g(u_i)$$

where g(·) assigns different weight coefficients to object proposals with different levels of IoU scores. We find Φ = 0.7 is a good cut-off, such that the contrastive head is trained with the most centered object proposals. Ablations regarding Φ and g are shown in Sec. 4.3.
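
The CPE loss with proposal consistency control can be sketched in plain Python as below. This is a readable reference under stated assumptions, not the authors' implementation: the temperature default of 0.2, the constant re-weighting g(u) = 1, and the choice to skip proposals that have no positive partner in the batch are all assumptions made for illustration; the IoU threshold of 0.7 comes from the text.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def cpe_loss(z, ious, labels, tau=0.2, iou_threshold=0.7,
             reweight=lambda u: 1.0):
    """Contrastive proposal encoding loss over one mini-batch.

    z: contrastive-head features, one list of floats per proposal.
    ious: IoU of each proposal with its matched ground-truth box.
    labels: ground-truth class label of each proposal.
    tau: temperature (0.2 is an assumed default).
    reweight: the function g; a constant weight is assumed here.
    """
    n = len(z)
    z_tilde = [l2_normalize(v) for v in z]
    # Pairwise cosine similarities divided by the temperature.
    sim = [[sum(a * b for a, b in zip(z_tilde[i], z_tilde[j])) / tau
            for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        if ious[i] < iou_threshold:
            continue  # f(u_i) = 0: low-IoU proposals are not contrasted
        positives = [j for j in range(n)
                     if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # assumption: no same-class partner, no loss term
        log_denom = math.log(sum(math.exp(sim[i][k])
                                 for k in range(n) if k != i))
        # Average negative log-probability of the positives for anchor i.
        loss_i = -sum(sim[i][j] - log_denom for j in positives) / len(positives)
        total += reweight(ious[i]) * loss_i
    return total / n


# Two tight clusters; the loss is small when labels follow the clusters
# and large when same-label proposals sit in different clusters.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
high_iou = [0.9, 0.9, 0.9, 0.9]
loss_clustered = cpe_loss(features, high_iou, [0, 0, 1, 1])
loss_mixed = cpe_loss(features, high_iou, [0, 1, 0, 1])
```

A training implementation would vectorize these loops over tensors and let the framework compute gradients; the loop form is kept here so each term can be matched to the definition.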


Training objectives. In the first stage, the base detector is trained with the standard Faster R-CNN loss [4]: a binary cross-entropy loss L_rpn to make foreground proposals from anchors, a cross-entropy loss L_cls for the bounding box classifier, and a smoothed-L1 loss L_reg for the box regression deltas. When transferring to novel data in the fine-tuning stage, we find the contrastive loss can be added to the primary Faster R-CNN loss in a multi-task fashion without destabilizing the training,

$$L = L_{rpn} + L_{cls} + L_{reg} + \lambda L_{CPE}$$

where λ is set to 0.5 to balance the scale of the losses.


4. Experiments


Extensive experiments are performed on both the PASCAL VOC [22, 23] and COCO [55] benchmarks. Our FSCE forms an upper envelope for all fine-tuning based methods and memory-inefficient meta-learners, with large margins in any shot and all data splits. We strictly follow the consistent few-shot detection data construction and evaluation protocol [43, 16, 17, 15] to ensure fair and direct comparison. In this section, we first describe the few-shot detection settings, then provide complete comparisons with contemporary few-shot detection works on the PASCAL VOC and COCO benchmarks, and finally provide ablation studies.


Implementation Details. For the detection model, we use Faster R-CNN [4] with ResNet-101 [1] and Feature Pyramid Network [52]. All experiments are run on 8 GPUs with a standard batch size of 16. The solver is standard SGD with momentum 0.9 and weight decay 1e-4. Naturally, we scale the number of training steps with the number of training shots. Every detail will be open-sourced in a self-contained codebase to facilitate future research.


4.1. Few-shot detection benchmarks


PASCAL VOC. The 20 categories in PASCAL VOC are divided into 15 base categories and 5 novel categories. All base category data from the PASCAL VOC 07+12 trainval sets are considered available, and K shots of novel instances are randomly sampled from previously unseen novel classes for K = 1, 2, 3, 5, and 10. Following existing works [16, 43, 15], we consider the same three random partitions of base and novel categories and the samplings introduced in [43], referred to as Novel Split 1, 2, and 3. We report AP50 for novel predictions (nAP50) on the PASCAL VOC 2007 test set. Note that this is different from the N-way K-shot settings commonly used in meta-learning based methods [44]. The huge variance between different random runs makes the N-way K-shot evaluation protocol unsuitable for few-shot object detection. For methods that provide results over 10 random seeds, we provide the corresponding results for comparison.
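
The K-shot sampling step can be illustrated with a short sketch. It is a simplification under stated assumptions: the annotation format (dictionaries with a `category` key), the function name, and sampling at the instance level are all hypothetical; the benchmark itself uses the fixed samplings from [43], which are not reproduced here.

```python
import random
from collections import defaultdict


def sample_k_shot(annotations, novel_classes, k, seed=0):
    """Pick up to k annotated instances for each novel class.

    annotations: list of dicts, each with a "category" key (assumed format).
    novel_classes: class names treated as novel.
    k: number of shots per class.
    seed: fixes the random draw so a split can be reproduced.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann["category"]].append(ann)
    shots = {}
    for name in novel_classes:
        pool = by_class[name]
        shots[name] = rng.sample(pool, min(k, len(pool)))
    return shots


# Toy annotation list: 6 instances for each novel class of Novel Split 1.
toy_annotations = [{"category": name, "instance_id": i}
                   for name in ("bird", "sofa", "cow") for i in range(6)]
three_shot = sample_k_shot(toy_annotations, ["bird", "sofa", "cow"], k=3)
```

Fixing the seed is what makes results over several random draws comparable across methods, as in the 10-seed averages mentioned above.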


MS COCO. Similarly, of the 80 categories in COCO, the 20 categories shared with PASCAL VOC are reserved as novel classes, and the remaining 60 categories are used as base classes. Detection performance for K = 10 and 30 shots is evaluated on 5K images from the COCO 2014 val set; by convention, COCO-style AP and AP75 for novel categories are reported.


4.2. Few-shot detection results


0f8c7cdd7e214a048e46c8d636c7e67b.png


Table 2. Performance evaluation (nAP50) of existing few-shot detection methods on the three PASCAL VOC Novel Split sets. + marks meta-learning based methods. ★ represents the average over 10 random seeds. + marks methods that use N-way K-shot meta-testing, which is a different evaluation protocol; see Sec. 4.1.


PASCAL VOC Results. Results for all three random novel splits of PASCAL VOC are shown in Table 2. Our FSCE outperforms all existing works in any shot and on all splits, which demonstrates the effectiveness of our method. We are the first to achieve >50 nAP50 on Split 2 and Split 3, with up to +8.8 nAP50 above the current SOTA on Split 3. At the same time, FSCE, powered by our contrastive proposal encodings, preserves the low base-forgetting property of TFA, as demonstrated in Table 4.


2dfeefef7bb94dce8a55e3333bd135ea.png


Table 4. Base forgetting comparisons on PASCAL VOC Split 1. Before fine-tuning, the base AP50 after base training is 80.8.


COCO Results. Few-shot detection results for COCO are shown in Table 3. Our FSCE sets a new state of the art for all shots under the same testing protocol and the same metrics. Our proposed method gains +1.7 nAP and +2.7 nAP75 over the current SOTA, which is larger than the gap between any two previous advancements.


f38f2d8b0b6740b79bcbd42923d7e618.png


Table 3. Few-shot detection evaluation results on COCO. ★ represents the average over 10 random seeds. + marks meta-learning based methods.


4.3. Ablation


Components of our proposed FSCE. First, with our modified training specification for the fine-tuning stage, the class-agnostic RPN and RoI head can be transferred directly to novel data and yield a large performance gain. This is because we utilize more low-quality RPN proposals that would normally be suppressed by NMS, and provide more foregrounds to learn from given the limited optimization opportunity in the few-shot setting. The jointly fine-tuned FPN top-down convolution and RoI feature extractor also open up room for better representation learning. Second, our CPE loss guides the RoI feature extractor to establish contrastive-aware object embeddings; intra-class compactness and inter-class variance ease the classification task and rescue misclassifications. The whole system benefits from proposal consistency control, which contrasts only high-IoU region proposals that deviate less from object centers. All ablation studies are done on PASCAL VOC Novel Split 1 unless otherwise specified.


Ablation for contrastive branch hyper-parameters. The primary RoI feature vector contains post-ReLU activations truncated at zero. We therefore encode the RoI feature with a contrastive head into z ∈ R^{D_C} so that similarity can be meaningfully measured. Based on our ablations, few-shot detection performance is not sensitive to the contrastive head dimension. Among the temperatures commonly used in contrastive objectives [34, 32, 50], a medium temperature τ = 0.2 works better than the relatively small value 0.07 and the large value 0.5.
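To make the role of the temperature τ concrete, here is a minimal Python sketch of a supervised contrastive objective over L2-normalized encodings. It follows the generic supervised contrastive form and is not the authors' exact CPE loss: it omits the IoU-based proposal weighting, and the name `supcon_loss` and the toy vectors are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale v to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(encodings, labels, tau=0.2):
    """Generic supervised contrastive loss with temperature tau.

    For each anchor i, same-label encodings are positives and every other
    encoding (positive or negative) appears in the softmax denominator.
    """
    z = [l2_normalize(v) for v in encodings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / tau for j in range(n)]
        log_denom = math.log(sum(math.exp(sims[k]) for k in range(n) if k != i))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy 2-D encodings (hypothetical): two tight clusters.
toy_encodings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(toy_encodings, [0, 0, 1, 1]))  # labels agree with the clusters
print(supcon_loss(toy_encodings, [0, 1, 0, 1]))  # labels cut across the clusters
```

A smaller τ sharpens the softmax over similarities, so hard negatives dominate the loss; a larger τ flattens it. That trade-off is what the temperature ablation above explores.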


Ablation for Proposal Consistency Control. In equations (3) and (4), we propose a compound proposal consistency control mechanism, comprising an indicator function with an IoU cut-off threshold φ and a function g(·) for re-weighting proposals with different levels of IoU. It turns out that re-weighting is not necessary and a simple high-IoU cut-off works best for 5 and 10 shots. When the number of shots is low, however, simply filtering out proposals with IoU less than φ becomes less favorable because the data sparsity is too severe. In low-shot cases, keeping all proposals but down-weighting the low-IoU ones makes more sense, and empirically, exponential decay (easy mining) does worse than a simple linear weighting.
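The mechanism described above, an indicator with a cut-off threshold combined with a re-weighting function g(·), can be sketched in a few lines of Python. Equations (3) and (4) are not reproduced in this section, so the concrete forms of `linear` and `exp_decay`, the decay rate `alpha`, and the threshold values used below are assumptions chosen only to illustrate the three options being compared.

```python
import math


def constant(iou):
    """No re-weighting: every proposal that survives the cut-off counts equally."""
    return 1.0


def linear(iou):
    """Weight grows linearly with IoU (assumed form)."""
    return iou


def exp_decay(iou, alpha=5.0):
    """Weight decays exponentially as IoU drops below 1 (assumed form and rate)."""
    return math.exp(-alpha * (1.0 - iou))


def consistency_weight(iou, phi, g):
    """Compound control: indicator 1{iou >= phi} multiplied by g(iou)."""
    return g(iou) if iou >= phi else 0.0


# Hard high-IoU cut-off (the option reported to work best for 5 and 10 shots).
print([consistency_weight(u, 0.7, constant) for u in (0.5, 0.7, 0.9)])
# Keep all proposals but down-weight low-IoU ones (the low-shot option).
print([consistency_weight(u, 0.0, linear) for u in (0.5, 0.7, 0.9)])
print([round(consistency_weight(u, 0.0, exp_decay), 3) for u in (0.5, 0.7, 0.9)])
```

With these assumed forms, the exponential variant suppresses low-IoU proposals far more aggressively than the linear one, which is one plausible reading of why it leaves too few useful samples in low-shot settings.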


Visual inspections and analysis. Figure 5 shows visual inspections of our proposed FSCE. We find that in data-abundant general detection, the saturated performance of the fc classifier and the cosine classifier are essentially equal, since an fc layer can learn a sophisticated decision boundary from enough data. Both the existing literature and our experiments confirm that the cosine box classifier excels in few-shot object detection. This can be attributed to the explicitly modeled similarity, which helps form tighter instance clusters on the projected unit hypersphere. The intuition of spacing different categories apart is straightforward, but per our experiments, well-established margin-based classifiers [20, 21] do not work in this data-hungry setting (2 nAP below FSCE at 10 shots, and worse at lower shots). Instead of adding a margin to the classifier, FSCE models the instance-level intra-class similarity and inter-class distance via the CPE loss and guides the RoI head to learn contrastive-aware object proposal representations. The t-SNE [56] visualization of object proposal embeddings affirms the effectiveness of our CPE loss in reducing intra-class variance and forming more defined decision boundaries, which aligns well with our proposition. Figure 5 (c) shows example bad cases from TFA that are rescued by our FSCE, including missed detections of novel instances, low confidence scores for novel instances, and pervasive misclassifications.
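For readers unfamiliar with the cosine box classifier mentioned above, the following Python sketch shows the basic idea: logits are scaled cosine similarities between the RoI feature and per-class weight vectors, so only direction matters, not magnitude. The function name `cosine_logits`, the scale factor of 20, and the toy weights are assumptions made for illustration, not values taken from the paper.

```python
import math


def _unit(v):
    """Return v scaled to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_logits(feature, class_weights, scale=20.0):
    """Scaled cosine similarity between one feature and each class weight vector."""
    f = _unit(feature)
    return [scale * sum(a * b for a, b in zip(f, _unit(w))) for w in class_weights]


# Toy example (hypothetical): two classes with orthogonal weight vectors.
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
print(cosine_logits([2.0, 0.5], toy_weights))
print(cosine_logits([20.0, 5.0], toy_weights))  # same direction, same logits
```

Because both the feature and the class weights are projected onto the unit hypersphere, instances of the same class are pushed to share a direction, which is the clustering behavior the paragraph above refers to.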


4756cd0ee114457db7895811d0455be2.png


Figure 5. Conceptual and t-SNE visualization of the object proposal embeddings learned with and without our CPE loss. Our CPE loss explicitly models the within-class similarity and cross-class distance. The t-SNE plot shows the proposal encodings from 200 randomly selected PASCAL VOC images. The right panel shows bad cases rescued by our contrastive-aware representations.


5. Conclusion


In this work, we propose a new perspective on solving FSOD via contrastive proposal encoding. By effectively saving accurately localized objects from being misclassified, our method achieves state-of-the-art results in any shot on both benchmarks, with up to +8.8% on PASCAL VOC and +2.7% on COCO. Our proposed contrastive proposal encoding head has negligible cost and is generally applicable: it can be plugged into any two-stage detector without interfering with the training pipeline. We also provide a strong baseline comparable to contemporary SOTA to facilitate future research in FSOD. More broadly, FSOD is of great value considering the vast number of objects in the real world. Our work demonstrates the feasibility of incorporating contrastive learning into object detection frameworks. We hope it can inspire more research in contrastive visual embedding and few-shot object detection.


Acknowledgement. This work was supported by grants from the National Key R&D Program of China. Grant number: 52019YFB1600500.


Supplementary materials


1. Average results over random runs


Few-shot object detection performance is inherently unstable and depends heavily on the randomly sampled training shots. Hence, [1] suggests evaluating few-shot detection performance over a series of random runs to obtain statistically reliable comparisons. In this supplementary material, we provide full benchmark results over n random runs for PASCAL VOC and COCO. We report the averaged AP, AP50, and AP75 for novel classes (nAP, nAP50, nAP75) on all three splits of PASCAL VOC. For COCO, we also report the averaged AP for small (nAPs), medium (nAPm), and large (nAPl) novel objects. Following the practice in TFA [1], we calculate and report the 95% confidence interval (CI) for each reported metric. The 95% CI is given by


$$\text{95\% CI} = Z_{0.95} \cdot \frac{\sigma}{\sqrt{n}}$$


where Z_{0.95} = 1.96 is the Z-score for a 95% CI, σ is the standard deviation, and n is the number of random runs. In our experiments, we perform n = 10 random runs for both the PASCAL VOC and COCO datasets.
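The interval described above is easy to compute from per-run scores. The Python sketch below uses the sample standard deviation from the standard library; whether the authors use the sample or the population standard deviation is not stated, so that choice is an assumption, as are the function name `ci95_half_width` and the made-up scores.

```python
import math
import statistics


def ci95_half_width(values, z=1.96):
    """Half-width of the 95% confidence interval: z * sigma / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate the standard deviation")
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    return z * sigma / math.sqrt(n)


# Made-up scores from 10 hypothetical random runs (not results from the paper).
toy_scores = [40.1, 42.3, 39.8, 41.0, 43.2, 40.7, 41.9, 38.9, 42.0, 41.4]
mean = statistics.mean(toy_scores)
half = ci95_half_width(toy_scores)
print(f"{mean:.1f} ± {half:.1f}")
```

Since the half-width shrinks with the square root of n, quadrupling the number of runs halves the reported interval.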


2. Results for PASCAL VOC and COCO


We present the complete few-shot object detection benchmark results of our proposed FSCE over random runs. The main baseline we compare against is the two-stage fine-tuning approach (TFA [1]). As shown in Table 1 for PASCAL VOC and Table 2 for COCO, FSCE significantly outperforms the TFA baseline and other methods in almost all shots on all data splits, with up to +9% nAP50 on PASCAL VOC and +3.2% nAP on COCO. Results averaged over repeated runs with randomly selected training shots are statistically stable and reliable, and they demonstrate the state-of-the-art few-shot object detection performance of our proposed method.


c32cf48a578c4432acd814f502f2e6e5.png


Table 1. Averaged few-shot object detection performance on PASCAL VOC. For each metric, we report the mean and 95% confidence interval over 10 random runs.


524d4be055d341a2881484f7852783c7.png


Table 2. Averaged few-shot object detection performance on COCO. For each metric, we report the mean and 95% confidence interval over 10 random runs. “n/a” indicates that the confidence intervals are not reported by the authors.
