XuanTie C908 Accelerates AI with Software and Hardware Fusion


1. Introduction

XuanTie C908 is the latest RISC-V processor released by T-Head Semiconductor. It runs at frequencies of up to 2 GHz, which allows it to be widely used in visual AI, intelligent interaction, and other advanced applications. This article covers a range of topics, from the processor micro-architecture, to convolution acceleration algorithms, to the operators optimized for XuanTie C908. It also showcases, for the first time, the AI inference performance of XuanTie C908 measured with T-Head's open-source AI deployment toolkit.


2. AI acceleration through hardware-software integration

2.1 Processor micro-architecture

  • Supports instruction fusion technology
  • Compliant with RISC-V vector extension 1.0
  • Supports a configurable vector register width (VLEN) of 128 or 256 bits
  • Vector execution unit supports FP16/BF16/FP32 floating-point and INT8/INT32/INT64 integer operations
  • Supports INT8/INT4 vector dot-product operations
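
To make the INT8 dot-product feature concrete, the scalar C snippet below shows the arithmetic that one dot-product step performs per 32-bit accumulator lane: four signed 8-bit products summed into a 32-bit accumulator, with a vector instruction doing this for every lane in parallel. This is only an illustration of the operation's semantics, not XuanTie instruction syntax or intrinsics.

#include <stdint.h>

/* Semantics of one INT8 dot-product step for a single 32-bit lane:
 * acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
 * A vector dot-product instruction applies this to every 32-bit lane at once. */
static inline int32_t dot4_i8(int32_t acc, const int8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

Packing four INT8 multiply-accumulates into each 32-bit lane is the main source of the INT8 speedup reported in Section 3.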


2.2 Software algorithm optimization

Structure of Heterogeneous Library (SHL) is a set of neural network library APIs for the XuanTie CPU platform. It abstracts common neural network operator interfaces. For the newly released XuanTie C908, SHL provides inference acceleration for multiple data types (FP32/FP16/INT8). Combined with the processor pipeline, instruction fusion, and high-speed cache technology, it offers deeply optimized assembly implementations of the core operators in neural networks.


Convolution is the most important and most computationally intensive operator in CNN models. SHL currently supports two algorithms to accelerate convolution: im2col + GEMM (a minimal sketch follows the list below) and Winograd. The main steps of the Winograd path are:


  • Input padding
  • Input transformation
  • Input reordering
  • Batch GEMM operations
  • Output transformation
  • Output cropping
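
As a reference for the im2col + GEMM path mentioned above, the following is a minimal, unoptimized C sketch that expands a convolution input into an im2col matrix and computes the output with a plain GEMM. It only illustrates the data-layout idea; SHL's production kernels use blocked, hand-written assembly (see the GEMM kernel description below). The layout here is assumed to be CHW with batch size 1.

/* im2col: expand an input feature map (C x H x W) into a matrix of shape
 * (C*KH*KW) x (OH*OW), so that convolution becomes a single matrix multiply. */
static void im2col(const float *in, int C, int H, int W,
                   int KH, int KW, int stride, int pad,
                   float *col, int OH, int OW) {
    for (int c = 0; c < C; c++) {
        for (int kh = 0; kh < KH; kh++) {
            for (int kw = 0; kw < KW; kw++) {
                int row = (c * KH + kh) * KW + kw;
                for (int oh = 0; oh < OH; oh++) {
                    for (int ow = 0; ow < OW; ow++) {
                        int ih = oh * stride - pad + kh;
                        int iw = ow * stride - pad + kw;
                        float v = 0.0f;
                        if (ih >= 0 && ih < H && iw >= 0 && iw < W)
                            v = in[(c * H + ih) * W + iw];
                        col[row * (OH * OW) + oh * OW + ow] = v;
                    }
                }
            }
        }
    }
}

/* Convolution as GEMM: weights (OC x K, K = C*KH*KW) times the im2col
 * matrix (K x N, N = OH*OW) gives the output (OC x N). */
static void conv_as_gemm(const float *weight, const float *col, float *out,
                         int OC, int K, int N) {
    for (int oc = 0; oc < OC; oc++)
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += weight[oc * K + k] * col[k * N + n];
            out[oc * N + n] = acc;
        }
}

A production implementation would typically tile the col matrix so the working set fits in the cache, but the algorithm is the same.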


The core computation of both algorithms is GEMM. The following figure uses VLEN=128 with FP16 data as an example to show the GEMM calculation process.

[Figure: GEMM computation with VLEN=128 and FP16 data]

Vector loads (vle) are used for the weight data, while scalar loads (flh) are used for the input data. The kernel works on a 16x12 register block and accumulates it as an outer product to improve computational efficiency. We manually schedule the instruction stream to remove read-after-write and write-after-write data dependencies, and we arrange instruction sequences to take advantage of XuanTie C908's instruction fusion to fully exploit its performance. (The arrows in the figure indicate the order in which the data is arranged in memory.)
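
The blocking scheme can be sketched in scalar C as follows. The real kernel holds the 16 weight values for each k step in vector registers loaded with vle, loads the 12 input values one by one as scalars with flh, and updates a 16x12 accumulator block kept in registers as an outer product; the sketch reproduces that structure (using float for portability, whereas the actual kernel works on FP16 data). It is a reference for the blocking only, not the hand-scheduled assembly.

/* Scalar reference for the 16x12 outer-product GEMM micro-kernel.
 * A: weights, shape (16 x K), packed so the 16 values for a given k are
 *    contiguous (one vector load, vle, in the real kernel).
 * B: input,   shape (K x 12); each of the 12 values for a given k is loaded
 *    as a scalar (flh) and broadcast against the weight vector.
 * C: 16x12 output block, kept entirely in registers in the real kernel. */
static void gemm_kernel_16x12(const float *A, const float *B, float *C, int K) {
    float acc[16][12] = {{0.0f}};            /* accumulator block */
    for (int k = 0; k < K; k++) {
        for (int j = 0; j < 12; j++) {       /* scalar input value (flh) */
            float b = B[k * 12 + j];
            for (int i = 0; i < 16; i++) {   /* vector of 16 weights (vle) */
                acc[i][j] += A[k * 16 + i] * b;   /* outer-product multiply-accumulate */
            }
        }
    }
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 12; j++)
            C[i * 12 + j] = acc[i][j];
}

In the vector version the inner loop over the 16 weights disappears: a single vector multiply-accumulate updates all 16 lanes of one accumulator column at once, which is why removing read-after-write and write-after-write dependencies between consecutive multiply-accumulates matters for throughput.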


The list of optimized operators supported by SHL for XuanTie C908 is as follows:


  • conv2d
  • depthwiseconv2d
  • maxpool2d
  • avgpool2d
  • global_maxpool2d
  • global_avgpool2d
  • fullyconnected
  • relu
  • relu6
  • leaky_relu
  • prelu
  • sigmoid
  • softmax
  • concat
  • pad
  • elementwise_add
  • elementwise_mul
  • sum


2.3 Model deployment

Heterogeneous Honey Badger (HHB) has been adapted to the latest XuanTie C908 processor. It supports INT8 quantization (symmetric for weights, asymmetric for activations) as well as FP16 quantization. A single command is enough to generate a C-code model file for inference on XuanTie C908. The generated code calls the SHL high-performance inference library for XuanTie C908, delivering the best model inference performance on the processor.

hhb -C --calibrate-dataset ./cat.jpg --model-file ./mobilenetv1.prototxt ./mobilenetv1.caffemodel --data-scale 0.017 --data-mean '104 117 124' --output . --board c908 --quantization-scheme="int8_asym_w_sym" --pixel-format BGR --fuse-conv-relu --channel-quantization
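
For reference, the int8_asym_w_sym scheme selected above corresponds to standard affine quantization: weights are quantized symmetrically (zero point fixed at 0) and activations asymmetrically (with a zero point). The snippet below is a generic sketch of that mapping, not SHL's or HHB's actual implementation; rounding and clamping details may differ.

#include <math.h>
#include <stdint.h>

/* Asymmetric (activation) quantization: q = round(x / scale) + zero_point */
static inline int8_t quantize_asym(float x, float scale, int32_t zero_point) {
    int32_t q = (int32_t)lroundf(x / scale) + zero_point;
    if (q < -128) q = -128;
    if (q > 127)  q = 127;
    return (int8_t)q;
}

/* Symmetric (weight) quantization: zero point fixed at 0 */
static inline int8_t quantize_sym(float x, float scale) {
    return quantize_asym(x, scale, 0);
}

/* Dequantization back to float */
static inline float dequantize(int8_t q, float scale, int32_t zero_point) {
    return scale * (float)(q - zero_point);
}

With --channel-quantization, the weight scale is computed per output channel rather than per tensor, which is the usual meaning of per-channel quantization.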


3. Performance

We tested the AI inference performance of several common CNN models on XuanTie C908 using HHB and SHL. With the INT8 vector dot-product instructions, XuanTie C908 performance on MobileNet improves by 3.35 times. Expanding the vector length to 256 bits brings a further speedup of 1.55 to 1.68 times. Overall, the AI performance of XuanTie C908 (@VLEN=128) is 3.75 to 4.57 times that of the previous-generation XuanTie C906 (@D1).


4. Conclusion

XuanTie C908 greatly improves AI computing power and performance. It follows the standard RISC-V vector extension 1.0, supports INT8/INT4 vector dot-product extensions, and offers a configurable 256-bit vector register width. This article describes how the micro-architecture and instruction-set characteristics of the XuanTie C908 processor are used to accelerate the convolution operator in CNNs, and presents the GEMM optimization ideas of the SHL high-performance computing library along with its list of optimized operators. We also compare the AI performance of XuanTie C908 with that of the previous-generation XuanTie C906, further highlighting the potential and advantages of the RISC-V processor architecture in the AIoT field through joint software and hardware optimization.



