How to Deploy a Neural Network on TH1520


Introduction

T-Head recently introduced TH1520, a high-performance SoC prototype built on the Wujian600 chip development platform. With a quad-core XuanTie C910 CPU and a built-in 4-TOPS NPU, TH1520 offers a new combination of CPU and AI computing.


In this blog, we describe how to deploy a neural network model on the C910 alone, and on the C910 and the NPU together.


Tools

T-Head offers two open-source deployment tools that enable seamless, highly efficient integration of NN frameworks and underlying hardware:

  • Heterogeneous Honey Badger (HHB): It supports models from different NN frameworks, and provides quantization and graph optimization.
  • Structure of Heterogeneous Library (SHL): It is a common interface that is compatible with all hardware types, while offering a reference schedule that facilitates software portability.


HHB

HHB is a collection of tools provided by T-Head to deploy neural network models on XuanTie processors. The tools cover compilation, profiling, and simulation.


HHB is based on Apache TVM, an end-to-end machine learning compiler framework. We have shared the source code on GitHub.


HHB supports models from frameworks such as Caffe, TensorFlow, ONNX, and TensorFlow Lite. It converts these models into a unified intermediate representation for graph-level performance optimization.


In addition, HHB supports multiple quantization methods to handle various data types, and it can automatically select the optimal scheme for the specified XuanTie CPU platform. After quantization, HHB generates a graph structure in C code from the intermediate representation. Each node of the graph structure is constructed by calling the CSI-NN2 API.
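
To make the idea of asymmetric int8 quantization more concrete, here is a minimal Python sketch of the generic technique: a float range is mapped onto the int8 range through a scale and a zero point. This is an illustration only. The function names are hypothetical, and the sketch does not claim to reproduce the calibration algorithm that HHB actually uses.

```python
def asym_int8_params(values):
    """Derive (scale, zero_point) that map the observed float range onto [-128, 127]."""
    # The range is widened to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 if every value is zero
    zero_point = int(round(-128 - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point):
    """Map floats to int8 codes, clamping to the representable range."""
    return [max(-128, min(127, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(q - zero_point) * scale for q in codes]


# Hypothetical calibration data standing in for activations seen on a calibration image.
samples = [-1.0, 0.0, 0.5, 2.0]
scale, zero_point = asym_int8_params(samples)
codes = quantize(samples, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
```

The round trip loses at most about half a quantization step per value, which is why the choice of calibration data matters: it determines the range, and therefore the step size.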


Here is an example of using HHB to deploy a MobileNet model on TH1520. The following hhb command compiles the model:



hhb -C --board light --calibrate-dataset ./cat.jpg --model-file ./mobilenetv1.prototxt ./mobilenetv1.caffemodel --data-mean "103.94 116.98 123.68" --data-scale 0.007843 --output . --quantization-scheme="int8_asym" --pixel-format BGR


The options are as follows:

  • -C: runs the main command up to the point where C code is generated.
  • --board: specifies the target platform; light is an alias of TH1520.
  • --calibrate-dataset: specifies the calibration image used for quantization.
  • --model-file: specifies the MobileNet model files, downloaded to the current directory. A Caffe model consists of two files (.prototxt and .caffemodel), which can be listed in either order.
  • --data-mean: specifies the mean values used to normalize the input.
  • --data-scale: specifies the scale factor used to normalize the input.
  • --output: specifies the directory in which to store the generated files; here, the current directory.
  • --quantization-scheme: specifies the quantization scheme.
  • --pixel-format: specifies the input pixel format that the model was trained with.
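
The mean, scale, and pixel-format options together describe the input normalization. The Python sketch below shows what that normalization could look like for a single pixel, using the values from the command above. It rests on two assumptions that the text does not confirm: that the formula is (pixel - mean) * scale, and that the mean values are listed in the same channel order as the pixel format. The function name is hypothetical.

```python
MEAN_BGR = (103.94, 116.98, 123.68)  # values passed to --data-mean (assumed to be in B, G, R order)
SCALE = 0.007843                     # value passed to --data-scale, roughly 1 / 127.5


def preprocess_pixel(rgb):
    """Convert one 8-bit RGB pixel to normalized BGR floats.

    Assumes normalization is (pixel - mean) * scale, applied per channel.
    """
    r, g, b = rgb
    bgr = (b, g, r)  # reorder to match --pixel-format BGR
    return tuple((channel - mean) * SCALE for channel, mean in zip(bgr, MEAN_BGR))


# An orange-ish pixel: strong red, medium green, no blue.
normalized = preprocess_pixel((255, 128, 0))
```

With these values, 8-bit inputs land roughly in the range [-1, 1], which is the range the quantizer then sees during calibration.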

After the command is executed, multiple files such as main.c and model.c are generated in the current directory:

  • main.c: the reference entry point of the sample program.
  • model.c: a model structure file that describes the model.
  • hhb.bm: the model weights converted to int8.
  • io.c: helper functions for reading and writing files.
  • io.h: the declarations of the helper functions in io.c.
  • process.c: the image preprocessing functions.
  • process.h: the declarations of the functions in process.c.


After the hhb command generates the code, use the GCC cross-compiler to build it into an executable binary.



riscv64-unknown-linux-gnu-gcc -O0 -g3 -march=rv64gcv0p7_zfh_xtheadc -mabi=lp64d -I/home -I/home/install_nn2/include -I/home/decode/install/include -o c_runtime  main.c model.c io.c process.c -L/home/install_nn2/lib -L/home/decode/install/lib/rv -ljpeg -lpng -lz -lstdc++ -lshl_rvv -lm -static -Wl,--gc-sections

The options are as follows:

  • -O0 -g3: specifies the optimization and debug levels. This example uses only the debug-level -O0.
  • -march: specifies the architecture option for C910.
  • -mabi: specifies the application binary interface (ABI) option for C910.
  • -I: specifies the location of the header files used during compilation.
  • -o: specifies the name of the executable file to generate.
  • main.c model.c io.c process.c: the source files to compile.
  • -L: specifies the search path for the libraries to link.
  • -ljpeg: links the JPEG decoding library.
  • -lpng: links the PNG decoding library.
  • -lz: links zlib.
  • -lstdc++: links the standard C++ library.
  • -lshl_rvv: links the SHL library optimized for C910.
  • -lm: links the standard math library.
  • -static: links statically.
  • -Wl,--gc-sections: discards unused sections during linking.


After compilation completes, the c_runtime executable is created in the current directory. Copy c_runtime, the hhb.bm file generated by the hhb command, and the cat.jpg image to the C910 development board, then run:



./c_runtime hhb.bm cat.jpg


The terminal then shows the top 5 classification results.
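
For readers unfamiliar with top 5 reporting, the Python sketch below shows the general idea: turn raw class scores into probabilities and keep the five most likely classes. It is a generic illustration with a hypothetical function name and made-up scores. It does not reproduce the code in the generated main.c, and whether that code applies a softmax itself is not stated here.

```python
import math


def top_k(scores, k=5):
    """Return the k (class_index, probability) pairs with the highest softmax probability."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Made-up scores for a seven-class toy model.
scores = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2, 0.0]
top5 = top_k(scores)
for class_index, probability in top5:
    print(f"class {class_index}: {probability:.4f}")
```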


SHL

SHL, previously called CSI-NN2, is a neural network acceleration library.


It abstracts common neural network operators into a unified interface. SHL also implements an acceleration library for XuanTie CPUs, which provides assembly-level optimized code for the RISC-V Vector extension. The acceleration library supports the data types used by multiple quantization schemes.


Combined with the automatic quantization function of HHB, SHL can quickly convert the original model from the single-precision floating-point data type to the optimal data type. As a result, the model can deliver the best performance on the development board.

The source code of SHL has been made available on GitHub.

SHL follows version 0.7.1 of the RISC-V Vector extension specification in implementing neural network operators on XuanTie C910. Taking the features of the CPU hardware into account (such as pipeline dependencies, branch prediction, and cache), SHL fully exploits the parallelism available with the fp16 data format.

To balance performance and accuracy, some SoCs include an NPU that accelerates certain int8 neural network operators. SHL provides a reference schedule module that finds the best processor for each operator.
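
One simple way to picture such a schedule is as a partitioning pass over the operator graph: operators the NPU supports go to the NPU, everything else falls back to the CPU, and neighboring operators on the same device are grouped to limit data transfers. The Python sketch below illustrates that idea only. The operator names, the supported set, and the function name are all hypothetical, and SHL's actual scheduling logic is not described here.

```python
from itertools import groupby

# Hypothetical set of int8 operators that an NPU might accelerate.
NPU_INT8_OPS = {"conv2d", "depthwise_conv2d", "relu", "pooling"}


def partition(graph_ops, npu_ops=NPU_INT8_OPS):
    """Split a linear operator sequence into runs that execute on the same device."""
    placed = [(op, "NPU" if op in npu_ops else "CPU") for op in graph_ops]
    return [
        (device, [op for op, _ in run])
        for device, run in groupby(placed, key=lambda pair: pair[1])
    ]


# A toy classification network, listed in execution order.
plan = partition(["conv2d", "relu", "depthwise_conv2d", "pooling", "softmax"])
for device, ops in plan:
    print(device, ops)
```

A real scheduler would also have to weigh transfer costs and accuracy, but the grouping step captures why a fallback path on the CPU keeps unsupported operators from blocking deployment.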


C910 Performance

XuanTie C910 is a high-performance 64-bit processor based on the RISC-V architecture. It adopts a state-of-the-art 12-stage, out-of-order, multiple-issue superscalar pipeline. On TH1520, it can run at up to 2.5 GHz. It is also equipped with 128-bit vector operation units to deliver optimized performance.

The vector operation units of XuanTie C910 follow version 0.7.1 of the RISC-V Vector extension. C910 supports a wide range of data formats, including int8, int16, int32, int64, bf16, fp16, fp32, and fp64. fp16 is the default format for deploying network models, and it is the format with which XuanTie C910 achieves its best performance.
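
Part of the appeal of fp16 follows from simple arithmetic: a narrower element type packs more values into each 128-bit vector operation, at the cost of precision. The Python sketch below works through both sides of that trade-off. It is a back-of-the-envelope illustration that ignores register grouping and other microarchitectural details, and the helper names are hypothetical.

```python
import struct

VLEN_BITS = 128  # width of the C910 vector operation units, as stated above
ELEMENT_BITS = {"int8": 8, "fp16": 16, "fp32": 32, "fp64": 64}


def lanes(dtype):
    """Number of elements of the given type that fit in one 128-bit vector."""
    return VLEN_BITS // ELEMENT_BITS[dtype]


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


fp16_lanes = lanes("fp16")    # elements per vector with fp16
fp32_lanes = lanes("fp32")    # elements per vector with fp32
rounded = to_fp16(0.1)        # 0.1 is not exactly representable in fp16
rounding_error = abs(rounded - 0.1)
```

Twice as many fp16 elements fit in a vector as fp32 elements, while the rounding error for a value such as 0.1 stays small, which is usually acceptable for image classification.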

We have tested various typical image classification models. The table below presents the performance of our deployment software on C910 at 1.85 GHz.


image.png

Conclusion

This article describes in detail how to deploy a neural network model on TH1520. We have also presented the performance of TH1520 on basic image classification tasks.

TH1520 has already been integrated into Alibaba's ecosystem, which demonstrates that RISC-V-based high-performance devices are a feasible target for deploying neural network models. In addition, the source code of the deployment tools, HHB and SHL, is available on GitHub.
