1. Introduction
The network of a two-stage object detector splits into two parts: one generates region proposals, and the other predicts boxes from the RoIs (the head). For two-stage detectors such as Faster R-CNN and R-FCN, the second part involves expensive operations and a large memory footprint, so it is called a "heavy head". Light-Head R-CNN modifies this second part to eliminate much of that cost, hence the name "light head".
2. Structure Analysis
2.1 Faster R-CNN
In Faster R-CNN, a classification backbone first produces a 2048-channel feature map. An RoI Pooling layer then takes two inputs — this feature map and the RoIs output by the RPN — and produces a fixed-size feature map for each RoI. Two fully-connected layers follow, ending in two branches that perform object classification and bounding-box regression respectively.
Because each RoI must be fed individually through the subsequent R-CNN subnet, Faster R-CNN repeats a large amount of computation, which makes it slow; in addition, the fully-connected layers consume considerable memory.
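A rough back-of-the-envelope count shows why this per-RoI head is expensive. The sizes below are illustrative assumptions for the sketch (a 7x7 RoI grid over a 2048-channel map feeding a single 2048-d fc layer), not measurements from the paper:

```python
# Illustrative head sizes (assumptions for this sketch).
k, c_in, fc_dim, num_rois = 7, 2048, 2048, 300

fc_params = k * k * c_in * fc_dim      # weights in the first fc layer alone
per_roi_macs = fc_params               # one multiply-add per weight, per RoI
total_macs = per_roi_macs * num_rois   # the fc cost repeats for every RoI

print(f"fc parameters: {fc_params:,}")    # 205,520,896
print(f"total MACs   : {total_macs:,}")   # 61,656,268,800 for 300 RoIs
```

The key point is that the fully-connected cost is multiplied by the number of RoIs, unlike the backbone computation, which is shared.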
2.2 R-FCN
R-FCN mainly removes the repeated per-RoI computation that follows RoI Pooling in Faster R-CNN. The backbone first extracts a 2048-channel feature map; a convolution with k²(C+1) output channels (k = 7 and C = 80 on COCO, i.e. 3969 channels) then produces the position-sensitive score maps — the colored feature maps in B — which form the classification branch.
A PSRoI Pooling layer then takes these score maps together with the RoIs generated by the RPN and produces, for each RoI, a (C+1)-channel feature map of spatial size k × k. Finally, global average pooling reduces this to a (C+1)-dimensional vector — a voting step — whose C+1 values are the class scores for that RoI.
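The voting step can be sketched with NumPy: after PSRoI pooling, each RoI is a (C+1) × k × k score map, and global average pooling over the k × k grid yields one score per class (shapes follow the COCO setting above; the random scores are placeholders):

```python
import numpy as np

C, k = 80, 7  # COCO classes and the PSRoI pooling grid size

# One RoI after PSRoI pooling: a (C+1) x k x k position-sensitive score map.
roi_scores = np.random.rand(C + 1, k, k)

# The "vote": global average pooling over the k x k spatial grid,
# yielding one score per class for this RoI.
class_scores = roi_scores.mean(axis=(1, 2))
assert class_scores.shape == (C + 1,)
```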
However, because R-FCN must generate an extra score map with a large number of channels as the input to PSRoI Pooling, it still incurs considerable memory and time costs.
3. Light-Head R-CNN
Light-Head R-CNN is essentially a modification of R-FCN. Because the R-FCN score maps carry a large number of channels, Light-Head R-CNN instead uses a large separable convolution to generate thinner feature maps, replacing the k²(C+1) channels with k² × 10 — roughly a drop from 3969 to 490 channels on COCO. This reduces the cost of the subsequent pooling and all later operations.
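The channel reduction is easy to verify with the COCO numbers (C = 80 classes, k = 7, and 10 channels per spatial bin in the thin map):

```python
C, k, alpha = 80, 7, 10  # classes, pooling grid size, Light-Head channels per bin

rfcn_channels = k * k * (C + 1)     # R-FCN position-sensitive score map
lighthead_channels = k * k * alpha  # Light-Head "thin" feature map

print(rfcn_channels)       # 3969
print(lighthead_channels)  # 490
print(f"~{rfcn_channels / lighthead_channels:.1f}x fewer channels")  # ~8.1x
```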
The authors also used two different backbones, demonstrating the flexibility of the two-stage design: one takes ResNet-101 as the backbone, and the other is a small 145M Xception-like network designed by the authors. The Xception structure is as follows:
```python
import math

import torch
import torch.nn as nn

# `_Block` and `_fasterRCNN` are defined elsewhere in the Light-Head R-CNN repo.


class Xception(nn.Module):
    def __init__(self, num_classes=1000):
        super(Xception, self).__init__()
        self.num_classes = num_classes
        self.conv1 = nn.Conv2d(3, 24, kernel_size=3, stride=2, padding=1,
                               bias=False)  # 224 x 224 -> 112 x 112
        self.bn1 = nn.BatchNorm2d(24)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=0,
                                    ceil_mode=True)  # -> 56 x 56
        # Stage 2
        self.block1 = _Block(24, 144, 1 + 3, 2,
                             start_with_relu=False, grow_first=True)  # -> 28 x 28
        # Stage 3
        self.block2 = _Block(144, 288, 1 + 7, 2,
                             start_with_relu=True, grow_first=True)   # -> 14 x 14
        # Stage 4
        self.block3 = _Block(288, 576, 1 + 3, 2,
                             start_with_relu=True, grow_first=True)   # -> 7 x 7
        self.avgpool = nn.AvgPool2d(7)
        self.fc = nn.Linear(576, num_classes)

        # ------- init weights --------
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
        # -----------------------------

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


class xception(_fasterRCNN):
    def __init__(self, classes, pretrained=False, class_agnostic=False,
                 lighthead=True):
        self.dout_base_model = 576  # output channels at Stage 4
        self.dout_lh_base_model = 576
        self.class_agnostic = class_agnostic
        self.pretrained = pretrained
        _fasterRCNN.__init__(self, classes, class_agnostic, lighthead,
                             compact_mode=True)

    def _init_modules(self):
        xception = Xception()
        # Load pretrained weights if requested.
        if self.pretrained:
            print("Loading pretrained weights from %s" % (self.model_path))
            if torch.cuda.is_available():
                state_dict = torch.load(self.model_path)
            else:
                state_dict = torch.load(
                    self.model_path,
                    map_location=lambda storage, loc: storage)
            xception.load_state_dict({
                k: v for k, v in state_dict.items()
                if k in xception.state_dict()
            })

        # Build the xception-like backbone.
        self.RCNN_base = nn.Sequential(
            xception.conv1, xception.bn1, xception.relu,
            xception.maxpool,  # Conv1
            xception.block1, xception.block2, xception.block3)
        self.RCNN_top = nn.Sequential(nn.Linear(490 * 7 * 7, 2048),
                                      nn.ReLU(inplace=True))
        self.RCNN_cls_score = nn.Linear(2048, self.n_classes)
        if self.class_agnostic:
            self.RCNN_bbox_pred = nn.Linear(2048, 4)
        else:
            self.RCNN_bbox_pred = nn.Linear(2048, 4 * self.n_classes)

        # Freeze the backbone blocks when using pretrained weights.
        if self.pretrained:
            for layer in range(len(self.RCNN_base)):
                for p in self.RCNN_base[layer].parameters():
                    p.requires_grad = False

    def _head_to_tail(self, pool5):
        pool5 = pool5.view(pool5.size(0), -1)
        fc7 = self.RCNN_top(pool5)  # or two large fully-connected layers
        return fc7
```
3.1 Large Separable Convolution Layers
As the figure shows, this structure borrows the idea from Inception of replacing a k × k convolution with a k × 1 convolution followed by a 1 × k convolution: the effective receptive field stays the same, while the number of parameters and the amount of computation drop substantially.
```python
import torch


class LightHead(torch.nn.Module):
    def __init__(self, in_, backbone, mode="L"):
        super(LightHead, self).__init__()
        self.backbone = backbone
        # "L" is the large setting (C_mid = 256); otherwise the small one (64).
        if mode == "L":
            self.out = 256
        else:
            self.out = 64
        # First branch: 15x1 followed by 1x15 convolution,
        # producing the 490-channel (10 x 7 x 7) thin feature map.
        self.conv1 = torch.nn.Conv2d(in_channels=in_, out_channels=self.out,
                                     kernel_size=(15, 1), stride=1,
                                     padding=(7, 0), bias=True)
        self.relu = torch.nn.ReLU(inplace=False)
        self.conv2 = torch.nn.Conv2d(in_channels=self.out,
                                     out_channels=10 * 7 * 7,
                                     kernel_size=(1, 15), stride=1,
                                     padding=(0, 7), bias=True)
        # Second branch: the complementary 1x15 followed by 15x1 order.
        self.conv3 = torch.nn.Conv2d(in_channels=in_, out_channels=self.out,
                                     kernel_size=(1, 15), stride=1,
                                     padding=(0, 7), bias=True)
        self.conv4 = torch.nn.Conv2d(in_channels=self.out,
                                     out_channels=10 * 7 * 7,
                                     kernel_size=(15, 1), stride=1,
                                     padding=(7, 0), bias=True)

    def forward(self, input):
        x_backbone = self.backbone(input)
        # Branch 1
        x = self.conv1(x_backbone)
        x = self.relu(x)
        x = self.conv2(x)
        x_relu_2 = self.relu(x)
        # Branch 2
        x = self.conv3(x_backbone)
        x = self.relu(x)
        x = self.conv4(x)
        x_relu_4 = self.relu(x)
        # Sum the two branches element-wise.
        return x_relu_2 + x_relu_4
```
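A quick parameter count illustrates the savings from this factorization. Assuming the branch takes a 2048-channel backbone feature (an assumption for this sketch; the exact width depends on the backbone) and produces the 490-channel thin map with k = 15 and C_mid = 256:

```python
k, c_in, c_mid, c_out = 15, 2048, 256, 490

direct_params = k * k * c_in * c_out      # a full 15x15 convolution
separable_params = (k * c_in * c_mid      # 15x1 convolution
                    + k * c_mid * c_out)  # followed by 1x15 convolution

print(f"{direct_params:,}")     # 225,792,000
print(f"{separable_params:,}")  # 9,745,920
print(f"~{direct_params / separable_params:.0f}x fewer parameters")  # ~23x
```

The same ratio applies per output pixel to the multiply-add count, since both layouts slide over the same spatial grid.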
4. Experiments
4.1 Metric comparison across methods
4.2 Speed/accuracy comparison across methods
4.3 Light-Head R-CNN L detection results
4.4 Light-Head R-CNN S detection results