XuanTie C908 Accelerates AI with Software and Hardware Fusion


1. Introduction

XuanTie C908 is the latest RISC-V processor released by T-Head Semiconductor. It runs at up to 2 GHz, which makes it suitable for visual AI, intelligent interaction, and other advanced applications. This article covers the processor micro-architecture, the convolution acceleration algorithms, and the operators optimized for XuanTie C908. It also showcases, for the first time, the AI inference performance of XuanTie C908 measured with the T-Head open source AI deployment kit.


2. AI acceleration through hardware and software integration

2.1 Processor micro-architecture

  • Supports instruction fusion technology
  • Complies with RISC-V vector extension 1.0
  • Supports a configurable vector register bit width (VLEN) of 128 or 256
  • The vector execution unit supports FP16/BFP16/FP32 floating point and INT8/INT32/INT64 integer operations
  • Supports INT8/INT4 vector dot product operations
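
To illustrate why a vector dot product operation helps int8 inference, the sketch below models one such step in plain Python: each 32-bit accumulator lane adds the sum of several int8 products in a single operation. The helper name `dot4_accumulate` and the grouping of four products per lane are assumptions made for illustration, not the C908 instruction definition.

```python
def dot4_accumulate(acc, a, b):
    """Model of an int8 dot-product step with int32 accumulators.

    acc:  list of int32 accumulator lanes
    a, b: lists of int8 values, four per accumulator lane
    (Hypothetical helper; the 4-to-1 grouping is an assumption.)
    """
    assert len(a) == len(b) == 4 * len(acc)
    out = []
    for lane, total in enumerate(acc):
        # Four multiply-adds are folded into one lane update.
        for k in range(4):
            total += a[4 * lane + k] * b[4 * lane + k]
        out.append(total)
    return out


acc = dot4_accumulate([0, 10], [1, 2, 3, 4, -1, -2, -3, -4], [5, 6, 7, 8, 1, 1, 1, 1])
print(acc)  # [70, 0]
```

Compared with a plain widening multiply-accumulate, this kind of operation performs several multiply-adds per accumulator element in one step.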


2.2 Software algorithm optimization

Structure of Heterogeneous Library (SHL) is a set of neural network library APIs for the XuanTie CPU platform that abstracts common neural network operator interfaces. For the newly released XuanTie C908, SHL accelerates inference for multiple data types (fp32/fp16/int8). Its core neural network operators are deeply optimized in assembly to take advantage of the processor pipeline, instruction fusion, and the cache.


Convolution is the most crucial operator in CNN models. SHL currently supports im2col + GEMM and Winograd to accelerate convolution calculations. The main steps of Winograd are:


  • Input padding
  • Input transformation
  • Input reordering
  • Batch GEMM operations
  • Output transformation
  • Output cropping
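
The steps listed above can be sketched on the smallest case, a 1-D Winograd F(2,3) that produces two outputs of a 3-tap filter per tile. This is a generic textbook formulation in plain Python, not SHL's implementation; the function names are hypothetical. With a single channel the multiply stage is element-wise; with many channels that stage is what becomes the batch GEMM, and input reordering lays the data out for it.

```python
def winograd_f23_1d(signal, kernel):
    """Valid 1-D correlation with a 3-tap kernel using Winograd F(2,3)."""
    g0, g1, g2 = kernel
    # Kernel transform (can be done once, ahead of time).
    u = [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]

    out_len = len(signal) - 2
    tiles = (out_len + 1) // 2          # each tile yields 2 outputs
    # Input padding: extend with zeros so the last tile is complete.
    padded = list(signal) + [0.0] * (2 * tiles + 2 - len(signal))

    out = []
    for t in range(tiles):
        d0, d1, d2, d3 = padded[2 * t:2 * t + 4]
        # Input transformation.
        v = [d0 - d2, d1 + d2, d2 - d1, d1 - d3]
        # Multiply stage (a batch GEMM when there are many channels).
        m = [ui * vi for ui, vi in zip(u, v)]
        # Output transformation.
        out += [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]
    return out[:out_len]                # Output cropping


def direct_corr3(signal, kernel):
    """Reference: direct 3-tap correlation, 3 multiplies per output."""
    return [sum(signal[i + k] * kernel[k] for k in range(3))
            for i in range(len(signal) - 2)]


signal, kernel = [1, 2, 3, 4, 5, 6, 7], [0.5, 1.0, -2.0]
print(winograd_f23_1d(signal, kernel))  # [-3.5, -4.0, -4.5, -5.0, -5.5]
print(direct_corr3(signal, kernel))     # [-3.5, -4.0, -4.5, -5.0, -5.5]
```

Each tile uses 4 multiplications for 2 outputs, where the direct form uses 6; that saving is the reason for the extra transform steps.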


The core computation of both algorithms is GEMM. The following figure uses vlen128/fp16 as an example to show the GEMM calculation process.

[Figure: GEMM calculation process, vlen128/fp16 example]

Vector load (vle) is used for weight data, while scalar load (flh) is used for input data. The design uses a 16x12 register block and computes the matrix product as a sequence of outer products to improve computational efficiency. We manually arrange the instruction flow to remove read-after-write and write-after-write data dependencies. Last but not least, we apply instruction fusion to get the most performance out of XuanTie C908. (The arrows in the figure indicate the order in which the data is arranged in memory.)
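
The outer-product scheme described above can be modeled in a few lines of plain Python. This is a minimal sketch, not SHL's assembly kernel: `micro_kernel` and its packed layouts are hypothetical, and it assumes the 16 dimension runs along the vector-loaded weights and the 12 dimension along the scalar-loaded inputs. For reference, at VLEN = 128 one vector register holds 128 / 16 = 8 fp16 values.

```python
MR, NR = 16, 12  # block shape quoted in the text; orientation is an assumption


def micro_kernel(packed_w, packed_x, depth, mr=MR, nr=NR):
    """Accumulate an mr x nr output block as a sum of `depth` outer products.

    packed_w: depth * mr weights, mr contiguous values per k step
    packed_x: depth * nr inputs, nr contiguous values per k step
    Returns acc with acc[j][i] = sum_k packed_w[k*mr + i] * packed_x[k*nr + j].
    """
    acc = [[0.0] * mr for _ in range(nr)]      # one accumulator "vector" per column
    for k in range(depth):
        w_vec = packed_w[k * mr:(k + 1) * mr]  # models a vector load of weights
        for j in range(nr):
            x = packed_x[k * nr + j]           # models a scalar load of one input
            # Models one vector-scalar multiply-accumulate.
            acc[j] = [c + w * x for c, w in zip(acc[j], w_vec)]
    return acc


# Tiny 2 x 3 block with depth 2 to keep the numbers readable.
print(micro_kernel([1, 2, 3, 4], [1, 0, -1, 2, 1, 0], depth=2, mr=2, nr=3))
# [[7.0, 10.0], [3.0, 4.0], [-1.0, -2.0]]
```

Every accumulator is updated once per k step and never read back in between, so the nr updates inside one step are independent of each other; that independence is what leaves room to reorder instructions around data dependencies.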


The list of optimized operators supported by SHL for XuanTie C908 is as follows:


  • conv2d
  • depthwiseconv2d
  • maxpool2d
  • avgpool2d
  • global_maxpool2d
  • global_avgpool2d
  • fullyconnected
  • relu
  • relu6
  • leaky_relu
  • prelu
  • sigmoid
  • softmax
  • concat
  • pad
  • elementwise_add
  • elementwise_mul
  • sum


2.3 Model deployment

Heterogeneous Honey Badger (HHB) has been adapted to the latest XuanTie C908 processor. It supports int8 quantization with symmetric weights and asymmetric activations, as well as fp16 quantization. A single command generates the C code model file for inference on XuanTie C908. The generated code calls the SHL high-performance inference library for XuanTie C908 to get the best model inference performance on the processor.

hhb -C --calibrate-dataset ./cat.jpg --model-file ./mobilenetv1.prototxt ./mobilenetv1.caffemodel --data-scale 0.017 --data-mean '104 117 124' --output . --board c908 --quantization-scheme="int8_asym_w_sym" --pixel-format BGR --fuse-conv-relu --channel-quantization
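
The int8 scheme named in this section (symmetric weights, asymmetric activations) can be illustrated with generic quantization arithmetic. The sketch below is not HHB's code; the function names and the min/max calibration are assumptions chosen for brevity.

```python
def quantize_weights_symmetric(values, qmax=127):
    """Symmetric int8: zero point is fixed at 0, range is [-qmax, qmax]."""
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantize_activations_asymmetric(values, qmin=-128, qmax=127):
    """Asymmetric int8: a zero point shifts the range to fit the data."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Map quantized integers back to approximate real values."""
    return [(x - zero_point) * scale for x in q]


wq, w_scale = quantize_weights_symmetric([-1.0, 0.0, 1.0])
aq, a_scale, a_zp = quantize_activations_asymmetric([0.0, 1.5, 6.0])
print(wq)        # [-127, 0, 127]
print(aq, a_zp)  # [-128, -64, 127] -128
```

For activations that are never negative (for example after relu or relu6), the asymmetric form spends all 256 levels on the range actually used, whereas a symmetric range would leave half of them unused.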


3. Performance

We tested the AI inference performance of some common CNN models on XuanTie C908 using HHB and SHL. Adding the int8 vector dot product instruction improved XuanTie C908 performance by 3.35 times on mobilenet. Expanding the vector length to 256 bits results in a speedup ratio of 1.55 to 1.68. The AI performance of XuanTie C908 (@vlen128) is 3.75 to 4.57 times that of the previous generation XuanTie C906 (@D1).

[Figures: AI inference performance results for XuanTie C908]

4. Conclusion

XuanTie C908 greatly improves AI computing power and performance. It follows the standard RISC-V vector extension 1.0, supports int8/int4 vector dot product extensions, and offers a configurable 256-bit vector register width. This article described how the micro-architecture and instruction features of the XuanTie C908 processor are used to accelerate the convolution operator in CNNs, and introduced the GEMM optimization approach of the SHL high-performance computing library along with its list of optimized operators. It also compared the AI performance of XuanTie C908 with that of the previous generation XuanTie C906, highlighting the potential and advantages of the RISC-V processor architecture in the AIoT field through joint software and hardware optimization.



