XuanTie C908 Accelerates AI with Software and Hardware Fusion

简介: XuanTie C908 Accelerates AI with Software and Hardware Fusion

1. Introduction

XuanTie C908 is the latest RISC-V processor released by T-Head Semiconductor, It has a frequency of up to 2 GHz. which allows it to be widely used in visual AI, intelligent interaction, and other advanced technologies. This article focuses on an array of topics from processor micro-architecture, to convolution acceleration algorithm, to optimized operators for XuanTie C908. We are also showcasing the AI inference performance of XuanTie C908 by using the T-Head open source AI deployment kit for the first time.


2. AI acceleration of hardware and software integration

2.1 Processor micro-architecture

  • Support instruction fusion technology
  • Compliant with RISC-V vector extension 1.0
  • Support 128/256 configurable vector register bit width VLEN
  • The vector execution unit supports FP16/BFP16/FP32 floating point and INT8/INT32/INT64 integer operations
  • Support INT8/INT4 vector dot product operations.


2.2 Software algorithm optimization

Structure of Heterogeneous Library (SHL) is a set of neural network library APIs for XuanTie CPU platform. It abstracts common neural network operator interfaces. For the newly released XuanTie C908, SHL provides the inference acceleration of multiple data types (fp32/fp16/int8). Combined with the processor pipeline, instruction fusion, and high-speed cache technology, it offers deep assembly optimization for core operators in neural networks.


Convolution has been the most crucial operator in CNN models. Currently, im2col + GEMM and Winograd are supported in SHL to accelerate convolution calculations. The main steps of Winograd are:


  • Input padding
  • Input transformation
  • Input reordering
  • Batch GEMM operations
  • Output transformation
  • Output cropping


The core computing of the two algorithms is gemm. The following figure uses vlen128/fp16 as an example to show the calculation process of gemm.

image.png

Vector load (vle) is used for weight data, while scalar load (flh) for input data. This design takes 16*12 register blocks to improve computational efficiency by performing outer product matrix. We manually remove read-after-write and write-after-write data dependencies to adjust instruction flow. Last but not least, we have incorporated advanced instruction fusion technology to fully optimize performance of XuanTie C908. (The arrows in the figure indicate the arrangement order of the data in the memory.)


The list of optimized operators supported by SHL for XuanTie C908 is as follows:


  • conv2d
  • depthwiseconv2d
  • maxpool2d
  • avgpool2d
  • global_maxpool2d
  • global_avgpool2d
  • fullyconnected
  • relu
  • relu6
  • leaky_relu
  • prelu
  • sigmoid
  • softmax
  • concat
  • pad
  • elementwise_add
  • elementwise_mul
  • sum


2.3 Model deployment

Heterogenous Honey Badger (HHB) has been adapted to the latest XuanTie C908 processor. It supports weight symmetric, activation asymmetric int8 data type quantization and fp16 data type quantization. One only needs a simple command to generate the C code model file for inference on XuanTie C908. While calling on the SHL XuanTie C908 high-performance inference computing library, you can achieve the best performance experience of model inference on XuanTie C908.

hhb -C –calibrate-dataset ./cat.jpg –model-file ./mobilenetv1.prototxt ./mobilenetv1.caffemodel –data-scale 0.017 –data-mean ‘104 117 124’ –output . –board c908 –quantization-scheme=”int8_asym_w_sym” –pixel-format BGR –fuse-conv-relu –channel-quantization


3. Performance

We tested the AI inference performance of some common CNN models on XuanTie C908 using HHB and SHL. After adding the int8 vector dot product instruction, we improved XuanTie C908 performance by 3.35 times on mobilenet. This step enables us to expand the vector length to 256 results in a speedup ratio of 1.55 to 1.68. The AI performance provided by XuanTie C908 (@vlen128) has been increased by 3.75 to 4.57 times compared with that of the previous generation XuanTie C906 (@D1).

image.pngimage.pngimage.png

4. Conclusion

XuanTie C908 greatly improves AI computing power and performance. We have followed the standard RISC-V vector extension 1.0 and supported int8/int4 vector dot product extensions. Thus, we have provided 256-bit wide vector register configurable options for Xuantie C908. This article describes the specific steps of integration with micro-architecture and instruction characteristics of the XuanTie C908 processor. By doing so, we are able to accelerate the convolution operator in CNN and introduces the SHL high-performance computing library GEMM optimization ideas and the list of optimized operators. Moreover, we have compared the AI performance of XuanTie C908 and the previous generation XuanTie C906, further highlighting the potential and advantages of the RISIC-V processor architecture in the field of AIOT through software and hardware joint optimization.




相关文章
|
6天前
|
人工智能 定位技术 SEO
我学 GEO 第 15 天:终于知道AI GEO该如何做?
我是暴走的莉莉酱,边旅行边研究AI GEO的数字游民。专注普通人如何提升“AI可见度”——让AI在回答用户问题时准确识别、理解并推荐你。不讲玄学,只做可测、可调、可持续的GEO实践。
421 125
|
8天前
|
机器学习/深度学习 人工智能 调度
🐴 HappyHorse 1.1 现已上线阿里云百炼!快来查收模型使用指南,现在调用享 6 折~
HappyHorse 1.1 是新一代视频生成大模型,全面升级动态表现力、角色一致性、指令遵循、视觉质感与音画协同能力。支持I2V/T2V/R2V三类生成,适配短剧、电商广告、品牌营销等场景,提供高质、流畅、可控的AI视频生产力。
713 5
🐴 HappyHorse 1.1 现已上线阿里云百炼!快来查收模型使用指南,现在调用享 6 折~
|
6天前
|
缓存 人工智能 运维
阿里云618百炼大模型Qwen3.7-Max功能、免费试用、订阅计费、配置接入详解
Qwen3.7-MAX是阿里云百炼平台推出的通义千问3.7系列旗舰大语言模型,专为智能体时代复杂任务打造,依托阿里云全域算力与自研技术,在逻辑推理、长文本处理、代码工程、长周期自主执行等领域达到行业顶尖水平。2026年618期间,该模型推出多重免费试用权益、按量计费5折、订阅套餐优惠等专属福利,覆盖个人开发者、团队与企业全场景需求,以下从核心功能、免费试用、订阅计费、配置接入四方面展开详细解析。
416 123
|
4天前
|
人工智能 自然语言处理 API
阿里云Token Plan团队版解析:功能、三档套餐与省钱订阅指南
阿里云百炼平台推出的Token Plan团队版,是面向企业与团队的AI大模型订阅服务,以Credits为统一计量单位,整合文本与图像生成模型,提供团队管理、数据安全、多工具兼容等核心能力,解决团队零散订阅AI服务的管理混乱、成本失控、数据安全等痛点。本文将从核心定位、套餐详情、计费规则、团队管理、工具兼容、便宜订阅技巧等方面,全面解析Token Plan团队版,帮助企业与团队高效、低成本地使用AI服务。
309 108
|
5天前
|
存储 人工智能 数据可视化
别再手动复制 Skill 了:多 Agent 时代的 Skill 管理方案
多 Agent 场景下 Skill 的统一管理与同步。
261 123
|
19天前
|
缓存 测试技术 API
Qwen 3.7 Plus 与 Max 实测:性价比与多模态能力差异解析(2026)
2026 年 6 月 1 日,阿里悄无声息地发布了 Qwen 3.7 Plus,距 Qwen 3.7 Max 上线刚好 11 天。同样的 1M 上下文,同样的 35 小时自治上限。但价格才是头条:Plus 是 0.40/M输入,Max是 2.50/M——便宜约 6 倍——并且还能看图、看视频。Vision Arena 上 Plus 已经排到 #16。所以这周真正值得讨论的问题不是”要不要为视觉能力买单”,而是”Max 凭什么用 6 倍价格换来 2 个百分点的 benchmark 领先”。
|
12天前
|
缓存 人工智能 运维
GLM 5.2自托管全流程实战:硬件选型、vLLM/SGLang部署与成本盈亏测算
2026年智谱发布GLM 5.2超大混合专家模型,区别于以往仅开放API的闭源大模型,该模型权重以MIT开源协议对外发布,企业与开发者可完整下载、本地审计、私有化部署,实现数据不出环境、自定义微调、自主调度推理资源。GLM 5.2拥有753B总参数,原生支持百万级上下文窗口,在代码生成、长文档推理、数学逻辑等多项基准测试中对标国际顶尖商用模型,是首款可完整自托管的前沿代码向大模型。
939 0
|
13天前
|
Linux 程序员 数据格式
【2026最新】Notepad++下载、安装和使用一篇搞定(附中文版安装包)
Notepad++ 是一款免费开源、轻量高效的 Windows 文本编辑器,支持 C/Python/HTML 等 80+ 语言语法高亮、代码折叠、正则替换、编码转换及插件扩展,专为程序员与文本处理用户打造,完美替代系统记事本。(239字)