How to Deploy a Neural Network on TH1520

简介: How to Deploy a Neural Network on TH1520

Introduction

T-Head has recently introduced a high-performance SoC prototyping, i.e. TH1520, which is built on the Wujian600 chip development platform. With a quad-core XuanTie C910 CPU withbuilt-in 4-TOPS NPU, TH1520 engenders a new combination of CPU and AI computing.


In this blog, we will describe the process of how to deploy a neural network model on C910 and on C910 and NPU simultaneously.


Tools

T-Head offers two open-source deployment tools that enable seamless, highly efficient integration of NN frameworks and underlying hardware:

  • Heterogeneous Honey Badge)(HHB): It supports models from different NN frameworks, and provides quantization and graph optimization.
  • Structure of Heterogeneous Library (SHL): It is a common interface that is compatible with all hardware types, whil offering a reference schedule that facilitates software portability.

image.png

HHB

HHB is a collection of tools provided by T-Head to deploy neural network models on XuanTie processors. These tools can be incorporated for compilation, profiling, and simulation.


Its framework is based on Apache TVM, which is an end-to-end machine learning compiler structure. We have shared the source code on GitHub.


HHB supports models such as Caffe, TensorFlow, ONNX, and TensorFlow Lite. It can convert these models into unified intermediate expressions for graphing performance optimization.


In addition, HHB supports multiple quantization methods to handle various data types. This framework can automatically provide the optimal scheme for the specified XuanTie CPU platform. After quantization, HHB generates a graph structure in C code from the intermediate expression. Each node of the graph structure is constructed by calling the CSI-NN2 API.


Here is an example to use HHB in deploying MobileNet model on TH1520. The sample code shows the hhb command to compile the model:



hhb -C --board light --calibrate-dataset ./cat.jpg --model-file ./mobilenetv1.prototxt ./mobilenetv1.caffemodel --data-mean "103.94 116.98 123.68" --data-scale 0.007843 --output . --quantization-scheme="int8_asym" --pixel-format BGR


The following content describes the parameter options:

  • C: specifies to execute the main command until C code is generated.
  • board: emphasizes as the destination platform; light is an alias of TH1520.
  • calibrate-dataset: specifies the calibration image used for quantization.
  • model-file: specifies a MobileNet model downloaded to the current directory. A Caffe model is divided into two files. The files following the option are not sequence-sensitive.
  • data-mean: defines a mean.
  • data-scale: defines a scale.
  • output: describes the current directory as the path to store files that you need to generate.
  • quantization-scheme: identifies a quantization scheme.
  • pixel-format: identifies the input image format required by the model training.

After the command is executed, multiple files such as main.c and model.c are generated in the current directory:

  • main.c: the reference entry to the sample program.
  • model.c: a model structure file that describes the model.
  • hhb.hm: the weights converted to int8.
  • io.c: the helper function for reading and writing files.
  • io.h: the declaration of the helper function for above files.
  • process.c: the image preprocessing function.
  • process.h: the declaration of the above function.


After the HHB command generates code, the gcc command performs binary encoding.



riscv64-unknown-linux-gnu-gcc -O0 -g3 -march=rv64gcv0p7_zfh_xtheadc -mabi=lp64d -I/home -I/home/install_nn2/include -I/home/decode/install/include -o c_runtime  main.c model.c io.c process.c -L/home/install_nn2/lib -L/home/decode/install/lib/rv -ljpeg -lpng -lz -lstdc++ -lshl_rvv -lm -static -Wl,--gc-sections

The following content describes the parameter options:

  • O0 -g3: specifies the optimization option. In this example, you can use the debug-level O0 only.
  • march: identifies the architecture option for C910.
  • mabi: identifies the application binary interface (ABI) option for C910.
  • I: describes the location of the header file that is used during compilation.
  • o: describes the name of the executable file needed to generate.
  • main.c model.c io.c process.c: the source file yu for compilation.
  • L: specifies the path to store the specified library.
  • ljpeg: links to a JPEG decoding library.
  • lpng: links to a PNG decoding library.
  • lz: links to a zlib.
  • lstdc++: links to a standard C++ library.
  • lshl_rvv: links to an optimized version library of C910 in SHL.
  • lm: links to a standard math library.
  • static: a static link.
  • Wl,–gc-sections: recycles unused sections during linking.


After the compilation is complete, the c_runtime file is created under the current directory. Copy the hhb.bm file and the cat.jpg image that are generated by incorporating the hhb command and the c_runtime file to the development board of C910 to execute at a time:



./c_runtime hhb.bm cat.jpg


You can view the top 5 execution results on the terminal.


SHL

SHL, previously called CSI-NN2,  is a neural network acceleration library.


It abstracts various common neural network operators to form unified interfaces. SHL also implements an acceleration library for XuanTie CPU. This interface offers optimization code at the assembly level for the RISC-V Vector extension. The acceleration library has adapted to multiple data types of quantization schemes.


Combined with the automatic quantization function of HHB, SHL can quickly change the original model from the single-precision floating-point data type to optimal. As a result,the model can deliver the best performance on the development board.

The source code of SHL has been made available on GitHub.

SHL shares the specifications of RISCV-V Vector extension V0.7.1 in the implementation of the neural network operator on XuanTie C910. Considering the features of the CPU hardware (such as pipeline dependence, branch prediction, or cache), SHL fully excavates the parallel capabilities of the fp16 data format in the algorithm.

To balance performance and accuracy, some SoCs may have an NPU to accelerate some int8 neural network operators. SHL provides one reference schedule module to find the best processor for  operators.

image.png

C910 Performance

XuanTie C910 is a 64-bit high-performance processor based on the 64-bit RISC-V architecture. This processor adopts a state-of-the-art 12-stage and out-of-order multiple issue superscalar pipeline. On TH1520, it can clock up to 2.5GHz. It is also equipped with 128-bit vector operation units to deliver optimized performance.

The vector operation units of XuanTie C910 are designed following version 0.7.1 of RISC-V Vector Extension. C910 supports wide-ranging data formats, including int8, int16, int32, int64, bf16, fp16, fp32, and fp64. fp16 is the default format for deploying network models, with which Xuantie C910 can achieve its best performance.

We have tested various typical image classification models. The table below presents the performance of our deployment software on C910 at 1.85 GHz.


image.png

Conclusion

This article describes in details on how to deploy a neural network model on TH1520, We have also presented optimal performance of TH1520 in basic image classification tasks.

TH1520 has already been incorporated inside Alibaba’s ecosystem, which demonstrates the feasibility of RISC-V-based high-performance devices to deploy neural network models. In addition, the source code of deployment tools, HHB and SHL, has been open-sourced and shared on GitHub.

相关文章
|
机器学习/深度学习 人工智能 算法
|
10月前
|
开发工具
【HarmonyOS 5】Integrating WeChat Sharing into HarmonyOS Applications
【HarmonyOS 5】Integrating WeChat Sharing into HarmonyOS Applications
375 9
|
算法 测试技术 AI芯片
CPU反超NPU,llama.cpp生成速度翻5倍!LLM端侧部署新范式T-MAC开源
【9月更文挑战第7天】微软研究院提出了一种名为T-MAC的创新方法,旨在解决大型语言模型在资源受限的边缘设备上高效部署的问题。T-MAC通过查表法在CPU上实现低比特LLM的高效推理,支持混合精度矩阵乘法,无需解量化。其通过位级查表实现统一且可扩展的解决方案,优化数据布局和重用率,显著提升了单线程和多线程下的mpGEMV及mpGEMM性能,并在端到端推理吞吐量和能效方面表现出色。然而,表量化和快速聚合技术可能引入近似和数值误差,影响模型准确性。论文详见:[链接](https://www.arxiv.org/pdf/2407.00088)。
1018 10
|
传感器 vr&ar Android开发
基于Arduino的智能眼镜
基于Arduino的智能眼镜
833 0
|
SQL Java Apache
实时计算 Flink版操作报错合集之本地启动时,如何处理报错:The file STDOUT does not exist on the TaskExecutor
在使用实时计算Flink版过程中,可能会遇到各种错误,了解这些错误的原因及解决方法对于高效排错至关重要。针对具体问题,查看Flink的日志是关键,它们通常会提供更详细的错误信息和堆栈跟踪,有助于定位问题。此外,Flink社区文档和官方论坛也是寻求帮助的好去处。以下是一些常见的操作报错及其可能的原因与解决策略。
|
Ubuntu Linux Shell
树莓派ubuntu镜像备份详细教程
本文是关于树莓派Ubuntu镜像备份的详细教程,包括如何获取镜像、使用PiShrink工具减小镜像体积的步骤和命令,以及相关参数的说明。
685 0
|
缓存 并行计算 API
龚大视频学习笔记:上帝视角看GPU(4)-完整的软件栈
龚大视频学习笔记:上帝视角看GPU(4)-完整的软件栈
737 0
|
消息中间件 存储 缓存
Kafka概念及组件介绍
1、分布式消息队列系统,先入先出,同时提供数据分布式缓存功能 2、消息持久化:数据读取速度可以达到O(1)——预读,后写(按顺序,ABCDE,正读A,预读B;尾部追加写)对磁盘的顺序访问比内存访问还快)
953 1
Kafka概念及组件介绍
|
弹性计算 Ubuntu 关系型数据库
服务器迁移最佳实践
越来越多的企业上云后,在窗口期内快速、平滑迁移到阿里云上,用最优的方案来实践以及实现业务价值。
1256 0
服务器迁移最佳实践
|
人工智能 负载均衡 安全
玄铁RISC-V处理器入门与实战-平头哥玄铁CPU 系统-YoC
玄铁RISC-V处理器入门与实战-平头哥玄铁CPU 系统
883 1

热门文章

最新文章

下一篇
开通oss服务