「CaDDN: Categorical Depth Distribution Network for Monocular 3D Object Detection」 is a CVPR 2021 paper on 3D object detection from a single monocular camera.
The paper and official repo are linked below:
- Paper:
https://arxiv.org/pdf/2103.01100.pdf
- Official repo:
https://github.com/TRAILab/CaDDN
Vision-based 3D detection algorithms can be divided, by the number of input cameras, into monocular and multi-camera methods (the latter typically using a six-view surround rig). Judging by recent publication trends, multi-camera (surround-view) methods are more mainstream, since they can project images from all views into a shared BEV space and perceive the full environment around the ego vehicle. However, the way monocular methods predict object depth can still inform multi-camera methods, which is why this post walks through CaDDN, a monocular 3D detection algorithm.
Figure 1 shows the overall pipeline of CaDDN; this post follows that diagram to walk through the algorithm.
Figure 1: Overall pipeline of CaDDN
As the flowchart shows, the CaDDN model consists of four parts:
- Frustum Feature Network: builds the camera frustum features;
- Frustum to Voxel Transform: converts camera frustum coordinates into voxel coordinates;
- Voxel Collapse: removes the Z axis from the voxel features to build the BEV feature map;
- 3D Object Detector: performs 3D detection on the resulting BEV features.
The next sections cover these four steps in order, drawing on both the source code and the paper.
1. Frustum Feature Network
This network turns the input monocular image into camera frustum features. As Figure 1 shows fairly clearly, the Frustum Feature Network consists of three submodules: the 「Image Backbone」, 「Image Channel Reduce」, and the 「Depth Distribution Network」.
The following walks through how these three submodules produce the frustum features for a single image. For consistency, the input image tensor is written as Tensor([bs, 3, H, W]), where bs is the batch size and H, W are the height and width of the input image.
1.1 Image Backbone
The paper uses ResNet-101 as the backbone to extract multi-scale features from the input image. The output feature maps are:
- Tensor([bs, 2048, H / 8, W / 8]): the 8x-downsampled map, fed into the Depth Distribution Network to predict depth information;
- Tensor([bs, 256, H / 4, W / 4]): the 4x-downsampled map, fed into the Image Channel Reduce module, which reduces its channels to produce the semantic features.
1.2 Image Channel Reduce
This module reduces the channel dimension of the 4x-downsampled feature map to obtain the semantic features (a minimal sketch follows the printout). Its structure is:
```
Input:  Tensor([bs, 256, H / 4, W / 4])
Output: Tensor([bs, 64, H / 4, W / 4])

BasicBlock2D(
  (conv): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
)
```
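For reference, here is a minimal sketch of what such a Conv-BN-ReLU reduction block might look like, assuming the structure printed above (not the verbatim repo implementation):

```python
import torch.nn as nn

class BasicBlock2D(nn.Module):
    """1x1 Conv-BN-ReLU channel reduction, mirroring the printout above."""
    def __init__(self, in_channels=256, out_channels=64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # (bs, 256, H/4, W/4) -> (bs, 64, H/4, W/4)
        return self.relu(self.bn(self.conv(x)))
```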
1.3 Depth Distribution Network
This module estimates depth information from the 8x-downsampled feature map, using an ASPP-style structure (a step-by-step breakdown and a compact sketch follow this list). Concretely:
- Extract features with the ASPP module to enlarge the receptive field (ASPP is a set of parallel branches);
- Concatenate the outputs of all ASPP branches;
- Reduce the channel dimension of the concatenated features;
- Extract features from the reduced map with a 3x3 convolution;
- Predict the depth distribution from those features;
- Upsample the predicted depth map 2x so its resolution matches the output of the Image Channel Reduce module.
Step 1. The ASPP branches and their outputs (note all branches take the 2048-channel backbone feature as input):

```
Branch 1: b0 = (bs, 2048, H / 8, W / 8) -> (bs, 256, H / 8, W / 8)
Sequential(
  (0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
)

Branch 2: b1 = (bs, 2048, H / 8, W / 8) -> (bs, 256, H / 8, W / 8)
ASPPConv(
  (0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(1, 1), padding=(12, 12), dilation=(12, 12), bias=False)
  (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
)

Branch 3: b2 = (bs, 2048, H / 8, W / 8) -> (bs, 256, H / 8, W / 8)
ASPPConv(
  (0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(1, 1), padding=(24, 24), dilation=(24, 24), bias=False)
  (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
)

Branch 4: b3 = (bs, 2048, H / 8, W / 8) -> (bs, 256, H / 8, W / 8)
ASPPConv(
  (0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(1, 1), padding=(36, 36), dilation=(36, 36), bias=False)
  (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
)

Branch 5: b4 = (bs, 2048, H / 8, W / 8) -> (bs, 256, H / 8, W / 8)
ASPPPooling(
  (0): AdaptiveAvgPool2d(output_size=1)
  (1): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (3): ReLU()
)
```

Step 2. Concatenate all ASPP branch outputs:

```
concat(b0, b1, b2, b3, b4) = (bs, 1280, H / 8, W / 8)
```

Step 3. Project the concatenated features down:

```
(bs, 1280, H / 8, W / 8) -> (bs, 256, H / 8, W / 8)
Sequential(
  (0): Conv2d(1280, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
  (3): Dropout(p=0.5, inplace=False)
)
```

Step 4. Extract features with Conv3x3 + BN + ReLU:

```
(bs, 256, H / 8, W / 8) -> (bs, 256, H / 8, W / 8)
Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
```

Step 5. Predict the depth distribution from the extracted features:

```
(bs, 256, H / 8, W / 8) -> (bs, 81, H / 8, W / 8)   # 81 = 80 depth bins + 1 beyond-range bin
Conv2d(256, 81, kernel_size=(1, 1), stride=(1, 1))
```

Step 6. Upsample the depth map 2x to match the 4x-downsampled semantic features from the backbone:

```
(bs, 81, H / 8, W / 8) -> (bs, 81, H / 4, W / 4)
F.interpolate(x, size=feat_shape, mode='bilinear', align_corners=False)
```
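To make the five-branch structure concrete, here is a compact, hedged sketch of an ASPP module with the rates used above (12/24/36). It follows the standard DeepLabV3 pattern; names and defaults are assumptions, not the repo's verbatim code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPSketch(nn.Module):
    """Five-branch ASPP: 1x1 conv, three dilated 3x3 convs (rates 12/24/36),
    and a global-pooling branch, followed by concat + 1x1 projection."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(12, 24, 36)):
        super().__init__()
        def conv_bn_relu(k, dilation=1):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=0 if k == 1 else dilation,
                          dilation=dilation, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
            )
        self.branches = nn.ModuleList(
            [conv_bn_relu(1)] + [conv_bn_relu(3, r) for r in rates])
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * 5, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Dropout(0.5),
        )

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        # Global branch: pool to 1x1, then upsample back to the feature-map size
        g = F.interpolate(self.pool(x), size=x.shape[-2:],
                          mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))  # (bs, 256, H/8, W/8)
```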
Building the camera frustum features from the Image Channel Reduce and Depth Distribution Network outputs
CaDDN builds its camera frustum features in essentially the same way as LSS (see my write-up of LSS in this column). The main difference in depth estimation is that CaDDN supervises depth explicitly, while LSS learns it implicitly.
- Apply Softmax() to the depth logits from the Depth Distribution Network to obtain a probability distribution along the depth axis;
- Take the outer product of these depth probabilities with the semantic features from Image Channel Reduce to obtain the frustum features.
The overall pipeline for building the frustum features is below (a toy shape check follows the code):
```python
import torch.nn.functional as F

def create_frustum_features(self, image_features, depth_logits):
    """
    Create image depth feature volume by multiplying image features
    with depth distributions
    Args:
        image_features: (N, C, H, W), Image features
        depth_logits: (N, D+1, H, W), Depth classification logits
    Returns:
        frustum_features: (N, C, D, H, W), Image depth features
    """
    channel_dim = 1
    depth_dim = 2

    # Add singleton dims so the two tensors broadcast against each other
    image_features = image_features.unsqueeze(depth_dim)  # (N, C, 1, H, W)
    depth_logits = depth_logits.unsqueeze(channel_dim)    # (N, 1, D+1, H, W)

    # Apply softmax along the depth axis and remove the last depth
    # category (> Max Range)
    depth_probs = F.softmax(depth_logits, dim=depth_dim)
    depth_probs = depth_probs[:, :, :-1]                  # (N, 1, D, H, W)

    # Multiply to form the image depth feature volume
    frustum_features = depth_probs * image_features       # (N, C, D, H, W)
    return frustum_features
```
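To see the broadcast at work, here is a toy run with assumed sizes (C = 64 semantic channels, D = 80 depth bins, a 94 x 312 feature map), written inline rather than through the method above:

```python
import torch
import torch.nn.functional as F

image_features = torch.randn(1, 64, 94, 312)  # (N, C, H, W)
depth_logits = torch.randn(1, 81, 94, 312)    # (N, D + 1, H, W)

# Same steps as create_frustum_features: softmax over depth, drop last bin,
# then outer product with the semantic features via broadcasting
probs = F.softmax(depth_logits.unsqueeze(1), dim=2)[:, :, :-1]  # (N, 1, D, H, W)
frustum = probs * image_features.unsqueeze(2)                   # (N, C, D, H, W)
print(frustum.shape)  # torch.Size([1, 64, 80, 94, 312])
```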
2. Frustum to Voxel Transform
This module first builds a 3D voxel grid in the BEV (LiDAR) frame from the point-cloud range and voxel size, then transforms those voxel coordinates into the camera frustum frame and samples the frustum features at the transformed locations, producing the BEV-space feature volume.
- Generating the 3D voxel coordinates in the BEV frame (a worked shape check follows the code below):
Point-cloud range in the source code: [2, -30.08, -3.0, 46.8, 30.08, 1.0]
Voxel size in the source code: [0.16, 0.16, 0.16]

```python
import torch
from torch import Tensor

def create_meshgrid3d(width, height, depth, device=None, dtype=None):
    xs: Tensor = torch.linspace(0, width - 1, width, device=device, dtype=dtype)
    ys: Tensor = torch.linspace(0, height - 1, height, device=device, dtype=dtype)
    zs: Tensor = torch.linspace(0, depth - 1, depth, device=device, dtype=dtype)
    # Generate the grid by stacking coordinates
    base_grid = torch.stack(torch.meshgrid([zs, xs, ys], indexing="ij"), dim=-1)  # DxWxHx3
    return base_grid.permute(0, 2, 1, 3).unsqueeze(0)  # 1xDxHxWx3
```
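Plugging the range and voxel size into this grid construction gives the voxel counts per axis, which also explains shapes seen later (the 376 x 280 BEV map in Section 4.1 and the 1600 = 64 x 25 input channels in Section 3). A quick check, reusing the create_meshgrid3d sketch above:

```python
pc_range = [2.0, -30.08, -3.0, 46.8, 30.08, 1.0]  # [x_min, y_min, z_min, x_max, y_max, z_max]
voxel_size = [0.16, 0.16, 0.16]

nx = round((pc_range[3] - pc_range[0]) / voxel_size[0])  # (46.8 - 2.0)    / 0.16 = 280
ny = round((pc_range[4] - pc_range[1]) / voxel_size[1])  # (30.08 + 30.08) / 0.16 = 376
nz = round((pc_range[5] - pc_range[2]) / voxel_size[2])  # (1.0 + 3.0)     / 0.16 = 25

grid = create_meshgrid3d(width=nx, height=ny, depth=nz)
print(nx, ny, nz)  # 280 376 25
print(grid.shape)  # torch.Size([1, 25, 376, 280, 3])
```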
- Transforming the BEV-space coordinates into camera frustum coordinates. For the image-plane axes the transform follows the usual projection. The continuous depth of each voxel, however, must be mapped onto the discrete depth bins, and CaDDN uses LID (linear-increasing discretization) for this; the visualization and formula are shown below, with the relation also written out after the figures.
Figure 2: The LID formula (source: https://arxiv.org/pdf/2005.13423.pdf)
Figure 3: Visualization of LID discretization
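Since Figures 2 and 3 are images, here is the LID relation itself as given in the CaDDN paper: with depth range $[d_{min}, d_{max}]$ and $D$ bins, the bin width grows linearly with the index, and the continuous depth $d_c$ of bin $d_i$ is

$$
d_c = d_{min} + \frac{d_{max} - d_{min}}{D(D+1)} \cdot d_i \,(d_i + 1)
$$

Inverting this gives the (continuous) bin index for a depth value. A small sketch of that inverse, with illustrative parameter values (the 2.0 m to 46.8 m range matches the X extent of the point-cloud range above, and D = 80 matches the 81 = 80 + 1 depth logits):

```python
import torch

def depth_to_lid_bin(depth, d_min=2.0, d_max=46.8, num_bins=80):
    # Inverse of the LID relation: solve d_i * (d_i + 1) * bin_size / 2 = depth - d_min
    # for d_i, where bin_size = 2 * (d_max - d_min) / (num_bins * (num_bins + 1)).
    # A sketch with assumed parameter values, not the verbatim repo function.
    bin_size = 2 * (d_max - d_min) / (num_bins * (num_bins + 1))
    return -0.5 + 0.5 * torch.sqrt(1 + 8 * (depth - d_min) / bin_size)

print(depth_to_lid_bin(torch.tensor([2.0, 46.8])))  # [0., 80.] (bin edges)
```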
- Sampling the frustum features at the transformed frustum coordinates to obtain the BEV-space features. The sampling uses F.grid_sample(); the pipeline looks like this (a toy sketch follows):
```
frustum_features: Tensor([bs, 64, D, H / 4, W / 4])  # camera frustum features
frustum_grid:     Tensor([bs, X, Y, Z, 3])           # transformed frustum coordinates
output_features = F.grid_sample(frustum_features, frustum_grid)
output_features:  Tensor([bs, 64, X, Y, Z])          # BEV-space feature volume
```
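Note that for a 5D input, F.grid_sample expects a sampling grid of shape (N, D_out, H_out, W_out, 3) with coordinates normalized to [-1, 1]. A toy sketch with assumed sizes (D = 80 depth bins, the voxel grid sizes computed above):

```python
import torch
import torch.nn.functional as F

bs, C, D, H, W = 1, 64, 80, 94, 312  # assumed frustum feature sizes
X, Y, Z = 280, 376, 25               # voxel grid sizes from above

frustum_features = torch.randn(bs, C, D, H, W)
# Stand-in for the transformed voxel coordinates, already normalized to [-1, 1]
frustum_grid = torch.rand(bs, X, Y, Z, 3) * 2 - 1

voxel_features = F.grid_sample(frustum_features, frustum_grid, align_corners=False)
print(voxel_features.shape)  # torch.Size([1, 64, 280, 376, 25])
```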
3. Voxel Collapse
The BEV-space features from the previous step still have a Z axis, which must be collapsed away. The paper uses a Conv2DCollapse module for this (the channel arithmetic is spelled out after the printout):
```
Input:  Tensor([bs, 64 * Z, X, Y])
Output: Tensor([bs, 64, X, Y])

Conv2DCollapse(
  (block): BasicBlock2D(
    (conv): Conv2d(1600, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
  )
)
```
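The 1600 input channels come from folding the Z axis into the channel dimension: with Z = 25 height bins and 64 channels per voxel, 64 x 25 = 1600. A minimal sketch of the collapse, assuming the reshape-then-1x1-conv structure shown above (not the verbatim repo code):

```python
import torch
import torch.nn as nn

voxel_features = torch.randn(2, 64, 280, 376, 25)  # (bs, C, X, Y, Z), assumed sizes

bs, C, X, Y, Z = voxel_features.shape
# Fold Z into channels: (bs, C, X, Y, Z) -> (bs, C * Z, X, Y)
bev_input = voxel_features.permute(0, 1, 4, 2, 3).reshape(bs, C * Z, X, Y)

collapse = nn.Sequential(                           # stand-in for Conv2DCollapse
    nn.Conv2d(C * Z, 64, kernel_size=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
print(collapse(bev_input).shape)  # torch.Size([2, 64, 280, 376])
```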
4. 3D Object Detector
As Figure 1 shows, the 3D Object Detector consists of two submodules:
- BEV Backbone: further fuses the BEV features from the previous step;
- 3D detection head: produces the final 3D detections.
4.1 BEV Backbone
The BEV Backbone is structured as follows; the repeated Conv-BN-ReLU stacks are summarized for readability, and a shape check of the final concat follows the printout:
```
concat(deblocks[0], deblocks[1], deblocks[2])

self.blocks[0]: Tensor([bs, 64, 376, 280]) -> Tensor([bs, 64, 188, 140])
Sequential(
  (0): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0)
  (1): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), bias=False)   # downsample
  (2): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  (3): ReLU()
  (4)-(33): 10 x [Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                  -> BatchNorm2d(64, eps=0.001, momentum=0.01) -> ReLU()]
)

self.deblocks[0]: Tensor([bs, 64, 188, 140]) -> Tensor([bs, 128, 188, 140])
Sequential(
  (0): ConvTranspose2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
  (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  (2): ReLU()
)

self.blocks[1]: Tensor([bs, 64, 188, 140]) -> Tensor([bs, 128, 94, 70])
Sequential(
  (0): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0)
  (1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), bias=False)  # downsample
  (2): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  (3): ReLU()
  (4)-(33): 10 x [Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                  -> BatchNorm2d(128, eps=0.001, momentum=0.01) -> ReLU()]
)

self.deblocks[1]: Tensor([bs, 128, 94, 70]) -> Tensor([bs, 128, 188, 140])
Sequential(
  (0): ConvTranspose2d(128, 128, kernel_size=(2, 2), stride=(2, 2), bias=False)
  (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  (2): ReLU()
)

self.blocks[2]: Tensor([bs, 128, 94, 70]) -> Tensor([bs, 256, 47, 35])
Sequential(
  (0): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0)
  (1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), bias=False)  # downsample
  (2): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  (3): ReLU()
  (4)-(33): 10 x [Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
                  -> BatchNorm2d(256, eps=0.001, momentum=0.01) -> ReLU()]
)

self.deblocks[2]: Tensor([bs, 256, 47, 35]) -> Tensor([bs, 128, 188, 140])
Sequential(
  (0): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(4, 4), bias=False)
  (1): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
  (2): ReLU()
)
```
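As a final shape check: the three deblock outputs all land at (bs, 128, 188, 140), so concatenating them along the channel axis gives the fused BEV feature passed to the detection head:

```python
import torch

# Assumed deblock outputs, each upsampled to the same spatial size
d0 = torch.randn(1, 128, 188, 140)
d1 = torch.randn(1, 128, 188, 140)
d2 = torch.randn(1, 128, 188, 140)

fused = torch.cat([d0, d1, d2], dim=1)
print(fused.shape)  # torch.Size([1, 384, 188, 140])
```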
4.2 3D Detection Head
As Figure 1 shows, the detection head predicts the object class, the box attributes, and the object orientation.