Hi everyone, this is 极智视界. This post walks through the TensorRT Fully Connected operator.
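Fully Connected, i.e. the fully connected layer, is typically used as a classification head or a feature head. It is a classic layer and not a complicated one: without a bias it is a single matrix multiply, and with a bias it is a matrix multiply followed by a matrix add. Below we look at several ways to implement Fully Connected in TensorRT.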
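To make the "matrix multiply, optionally followed by a matrix add" description concrete, here is a minimal NumPy sketch. It is only a reference calculation, not TensorRT code; the function name `fully_connected` and the small test values are made up for illustration.

```python
import numpy as np

def fully_connected(x, weight, bias=None):
    """Reference fully connected layer (hypothetical helper).

    x:      (batch, in_features)
    weight: (out_features, in_features)
    bias:   (out_features,) or None
    """
    y = x @ weight.T      # the matrix multiply
    if bias is not None:
        y = y + bias      # the matrix add, broadcast over the batch
    return y

# Small illustrative values, not taken from the article
x = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
weight = np.array([[1.0, 0.0, -1.0],
                   [0.5, 0.5, 0.5]], dtype=np.float32)
bias = np.array([10.0, 20.0], dtype=np.float32)

y_no_bias = fully_connected(x, weight)      # matrix multiply only
y_bias = fully_connected(x, weight, bias)   # matrix multiply plus add
print(y_no_bias)
print(y_bias)
```

Storing the weight as (out_features, in_features) and transposing it inside the multiply mirrors the weight layout used in the examples in this post.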
1 Implementation with the native TensorRT operator
Using TensorRT's native Fully Connected operator is the most convenient approach. The key steps are:
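    placeHolder = np.zeros(1, dtype=np.float32)
    # Add the fully connected layer with placeholder values
    fullyConnectedLayer = network.add_fully_connected(inputT0, 1, placeHolder, placeHolder)
    # Reset the number of output channels
    fullyConnectedLayer.num_output_channels = cOut
    # Reset the fully connected weights
    fullyConnectedLayer.kernel = weight
    # Reset the fully connected bias; bias is an optional parameter that defaults to None
    fullyConnectedLayer.bias = bias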
Here is a complete example:
    import numpy as np
    from cuda import cudart
    import tensorrt as trt

    # Input tensor, NCHW
    nIn, cIn, hIn, wIn = 1, 3, 4, 5
    # Output tensor, C
    cOut = 2
    # Input data
    data = np.arange(cIn * hIn * wIn, dtype=np.float32).reshape(cIn, hIn, wIn)
    # Fully connected weights
    weight = np.ones(cIn * hIn * wIn, dtype=np.float32)
    weight = np.concatenate([weight, -weight], 0).reshape(cOut, cIn, hIn, wIn)
    # Fully connected bias
    bias = np.zeros(cOut, dtype=np.float32)

    np.set_printoptions(precision=8, linewidth=200, suppress=True)
    cudart.cudaDeviceSynchronize()

    logger = trt.Logger(trt.Logger.ERROR)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    inputT0 = network.add_input('inputT0', trt.DataType.FLOAT, (nIn, cIn, hIn, wIn))
    #-----------------------------------------------------------------------# section to replace
    # Add the fully connected layer
    fullyConnectedLayer = network.add_fully_connected(inputT0, cOut, weight, bias)
    #-----------------------------------------------------------------------# section to replace
    network.mark_output(fullyConnectedLayer.get_output(0))
    engineString = builder.build_serialized_network(network, config)
    engine = trt.Runtime(logger).deserialize_cuda_engine(engineString)
    context = engine.create_execution_context()
    _, stream = cudart.cudaStreamCreate()

    # Host buffers; binding 0 is the input and binding 1 is the output
    inputH0 = np.ascontiguousarray(data.reshape(-1))
    outputH0 = np.empty(context.get_binding_shape(1), dtype=trt.nptype(engine.get_binding_dtype(1)))
    # Device buffers
    _, inputD0 = cudart.cudaMallocAsync(inputH0.nbytes, stream)
    _, outputD0 = cudart.cudaMallocAsync(outputH0.nbytes, stream)

    # Copy the input to the device, run inference, copy the output back
    cudart.cudaMemcpyAsync(inputD0, inputH0.ctypes.data, inputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
    context.execute_async_v2([int(inputD0), int(outputD0)], stream)
    cudart.cudaMemcpyAsync(outputH0.ctypes.data, outputD0, outputH0.nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
    cudart.cudaStreamSynchronize(stream)

    print("inputH0 :", data.shape)
    print(data)
    print("outputH0:", outputH0.shape)
    print(outputH0)

    # Release the stream and the device buffers
    cudart.cudaStreamDestroy(stream)
    cudart.cudaFree(inputD0)
    cudart.cudaFree(outputD0)
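- Input tensor shape: (1,3,4,5)
- Output tensor shape: (1,2,1,1)
- Computation: output = X · W^T + bias, where X is the input flattened to shape (1,60) and W is the weight flattened to shape (2,60). With the data above (inputs 0 to 59, weights of all 1 and all -1, zero bias), the two outputs are 1770 and -1770.
- In Dynamic Shape mode, the last three dimensions must be build-time constants and cannot be -1.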
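The shapes and values listed above can be reproduced with plain NumPy. The sketch below is only a reference calculation and says nothing about how TensorRT computes the layer internally. Writing the contraction with `np.einsum` is a choice made here for illustration: it shows that the layer sums over the last three dimensions (C, H, W) of the input, which is consistent with those dimensions having to be known at build time.

```python
import numpy as np

# Same sizes and data as the complete example above
nIn, cIn, hIn, wIn = 1, 3, 4, 5
cOut = 2

x = np.arange(cIn * hIn * wIn, dtype=np.float32).reshape(nIn, cIn, hIn, wIn)
w = np.ones(cIn * hIn * wIn, dtype=np.float32)
w = np.concatenate([w, -w], 0).reshape(cOut, cIn, hIn, wIn)
b = np.zeros(cOut, dtype=np.float32)

# Contract over C, H, W: one dot product per (batch, output channel) pair
out = np.einsum("nchw,ochw->no", x, w) + b
# Append two trailing unit dimensions to match the listed output shape
out = out.reshape(nIn, cOut, 1, 1)

print(out.shape)
print(out.reshape(-1))
```

The first output channel is the sum 0 + 1 + ... + 59 = 1770, and the second uses the negated weights, so it is -1770.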
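2 Implementation with TensorRT matrix multiply and add
A fully connected layer can also be viewed as a matrix multiply followed by a matrix add. Here is how to build it that way: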
    # Flatten the input from (nIn, cIn, hIn, wIn) to (nIn, cIn * hIn * wIn)
    shuffleLayer = network.add_shuffle(inputT0)
    shuffleLayer.reshape_dims = (nIn, cIn * hIn * wIn)
    # Matrix multiply: weights stored as (cOut, cIn * hIn * wIn) and transposed by the layer
    weight2D = np.ascontiguousarray(weight.reshape(cOut, -1))
    constantLayer0 = network.add_constant(weight2D.shape, weight2D)
    matrixMultiplyLayer = network.add_matrix_multiply(shuffleLayer.get_output(0), trt.MatrixOperation.NONE, constantLayer0.get_output(0), trt.MatrixOperation.NONE)
    matrixMultiplyLayer.op0 = trt.MatrixOperation.NONE
    matrixMultiplyLayer.op1 = trt.MatrixOperation.TRANSPOSE
    # Matrix add (bias), shaped (1, cOut) so that it broadcasts over the batch dimension
    bias2D = np.ascontiguousarray(bias.reshape(1, cOut))
    constantLayer1 = network.add_constant(bias2D.shape, bias2D)
    biasLayer = network.add_elementwise(matrixMultiplyLayer.get_output(0), constantLayer1.get_output(0), trt.ElementWiseOperation.SUM)
    # Get the output, shape (nIn, cOut)
    output = biasLayer.get_output(0)
This implements the Fully Connected operator using TensorRT's matrix multiply and add.
That's it for this walkthrough of the TensorRT Fully Connected operator. I hope it helps with your learning.