CoreBolt: BOLT Optimization with Coresight on Yitian


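1. Introduction

CoreBolt is a performance optimization solution for the Yitian platform. It uses Coresight to collect execution information while a program runs, separates the program's hot code from its cold code, and then uses BOLT to reorder the code sections. This improves code locality, reduces the performance loss caused by CPU iCache misses and iTLB misses, and raises the program's overall performance. CoreBolt relies on the Coresight hardware tracing capability provided by the Alibaba Cloud Linux 3 operating system and on the ability of Alibaba Cloud Compiler to optimize ARM binaries with BOLT. For detailed introductions to Coresight and BOLT, see:

"Introduction to Arm Coresight" (《Arm Coresight 介绍》)

"BOLT Binary Feedback Optimization" (《BOLT 二进制反馈优化技术》)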

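Applicable scenarios

CoreBolt depends on Yitian hardware features, so the optimization process must run on Yitian. The optimized binary conforms to the ELF standard and can run on most ARM platforms.

CoreBolt applies to most workloads, but the gain varies by application: the higher the iCache miss rate, iTLB miss rate, or front-end stall rate, the larger the improvement.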
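One way to compare candidate workloads is to normalize raw miss counts to misses per thousand instructions (MPKI). The sketch below is only an illustration: the helper name mpki is hypothetical, and the example counts are invented rather than measured. On a real system the raw counts would come from a profiler such as perf stat.

```shell
#!/bin/sh
# mpki MISSES INSTRUCTIONS -> misses per 1000 instructions.
# Hypothetical helper; the input counts below are invented for illustration.
mpki() {
    awk -v misses="$1" -v insns="$2" \
        'BEGIN { printf "%.2f\n", misses * 1000 / insns }'
}

# Example: 2,500,000 iCache misses over 100,000,000 instructions.
mpki 2500000 100000000   # prints 25.00
```

A workload whose instruction-side MPKI is high relative to others is, by the criterion above, the more promising target for CoreBolt.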

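2. Using BOLT and Coresight

Building the program

Make the following changes to the build scripts of the target program.

  • Disable sanitizers such as ASan.

  • Pass the extra linker options -Wl,--build-id=sha1 -Wl,--emit-relocs.

  • If the compiler is gcc (gcc 8 or later), add the compile option -fno-reorder-blocks-and-partition.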
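The build rules above can be collected in one place so that every target picks up the same options. This is a minimal sketch, not part of any official tooling: the function name bolt_flags is hypothetical, and for simplicity it adds the gcc-only option for any gcc, which assumes gcc 8 or later.

```shell
#!/bin/sh
# bolt_flags COMPILER -> extra flags for a BOLT-ready build.
# Hypothetical helper; assumes any "gcc*" compiler is gcc 8 or later.
bolt_flags() {
    # Needed for every toolchain: a SHA-1 build-id and relocations
    # kept in the linked output.
    flags="-Wl,--build-id=sha1 -Wl,--emit-relocs"
    case "$1" in
        gcc*) flags="-fno-reorder-blocks-and-partition $flags" ;;
    esac
    echo "$flags"
}

# Print (do not run) a compile line for a single-file program.
echo "gcc -O3 $(bolt_flags gcc) sort.c -o sort"
```

In a larger project the same string would typically be appended to the existing CFLAGS and LDFLAGS rather than passed on one command line.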

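Sampling environment

Purchase a Yitian bare-metal ECS instance running Alibaba Cloud Linux 3.2104 LTS 64-bit for ARM. Instances of this type purchased after this document was written (2023-12-22) all support Coresight sampling.

Use the sampling environment for offline sampling only; avoid sampling directly in production.

Environment setup

Load the drivers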

modprobe coresight

modprobe coresight-catu

modprobe coresight-funnel

modprobe coresight-tmc

modprobe coresight-cti

modprobe coresight-replicator

modprobe coresight-etm4x

modprobe coresight-tpiu

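Take cores 64-127 offline

Save the following script as offline.sh:

#!/bin/sh
# Usage: sh offline.sh FIRST_CPU LAST_CPU STATE  (STATE: 0 = offline, 1 = online)
for i in $(seq "$1" "$2")
do
    echo "$3" > "/sys/bus/cpu/devices/cpu$i/online"
done

Then run it to take CPUs 64 through 127 offline:

sh offline.sh 64 127 0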

Install ACC (Alibaba Cloud Compiler)

yum install -y alibaba-cloud-compiler

Sampling with perf

perf record -e cs_etm//u ./app

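For more ways to use Coresight through perf, see "Arm Coresight" (《Arm Coresight》).

Storing and converting perf data

perf inject first processes perf.data into inj.x.data, and perf2bolt then converts that file into the fdata format.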

perf inject -i perf.data -o inj.x.data --itrace=i300000il128
/opt/alibaba-cloud-compiler/bin/perf2bolt -p inj.x.data -o perf.fdata libjvm.so

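Optimizing with BOLT

  • On aarch64, pass -no-scan when applying BOLT to only a subset of functions.

  • -split-all-cold usually gives better results when the profile data is sufficient.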

/opt/alibaba-cloud-compiler/bin/llvm-bolt libjvm.so -o libjvm.bolt.so -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -dyno-stats
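The four steps of this section (record, inject, perf2bolt, llvm-bolt) can be strung together per binary. The sketch below only prints the commands so they can be reviewed before anything runs. The function name corebolt_plan and the output naming (BINARY.bolt) are assumptions made for illustration; the tool paths and options are the ones used in this section.

```shell
#!/bin/sh
# corebolt_plan BINARY [WORKLOAD COMMAND...]
# Prints, without running, the CoreBolt command sequence for one binary.
# Hypothetical helper; the "<binary>.bolt" output name is an assumption.
ACC_BIN=/opt/alibaba-cloud-compiler/bin

corebolt_plan() {
    [ $# -ge 1 ] || { echo "usage: corebolt_plan BINARY [COMMAND...]" >&2; return 1; }
    bin=$1
    shift
    # With no workload given, run the binary itself under perf.
    [ $# -gt 0 ] || set -- "./$bin"
    echo "perf record -e cs_etm//u $*"
    echo "perf inject -i perf.data -o inj.x.data --itrace=i300000il128"
    echo "$ACC_BIN/perf2bolt -p inj.x.data -o perf.fdata $bin"
    echo "$ACC_BIN/llvm-bolt $bin -o $bin.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -dyno-stats"
}

corebolt_plan sort
```

For a shared library, the workload that loads it is passed after the binary name, for example corebolt_plan libjvm.so java -version.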

3. A worked example: bubble sort

Code

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define ARRAY_LEN 30000

static struct timeval tm1;

static inline void start() {
    gettimeofday(&tm1, NULL);
}

static inline void stop() {
    struct timeval tm2;
    gettimeofday(&tm2, NULL);
    unsigned long long t = 1000 * (tm2.tv_sec - tm1.tv_sec) +\
                           (tm2.tv_usec - tm1.tv_usec) / 1000;
    printf("%llu ms\n", t);
}

void bubble_sort (int *a, int n) {
    int i, t, s = 1;
    while (s) {
        s = 0;
        for (i = 1; i < n; i++) {
            if (a[i] < a[i - 1]) {
                t = a[i];
                a[i] = a[i - 1];
                a[i - 1] = t;
                s = 1;
            }
        }
    }
}

void sort_array() {
    printf("Bubble sorting array of %d elements\n", ARRAY_LEN);
    int data[ARRAY_LEN], i;
    for(i=0; i<ARRAY_LEN; ++i){
        data[i] = rand();
    }
    bubble_sort(data, ARRAY_LEN);
}

int main(){
    start();
    sort_array();
    stop();
    return 0;
}

Compile and run

root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# gcc -Wl,--build-id=sha1 -Wl,--emit-relocs -O3 sort.c -o sort

root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# ./sort

Bubble sorting array of 30000 elements

939 ms

The run takes 939 ms.

Collect a trace with Coresight

root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# perf record -m ,16M -e cs_etm//u ./sort

Bubble sorting array of 30000 elements

941 ms

[ perf record: Woken up 2 times to write data ]

Warning:

AUX data lost 2 times out of 2!

[ perf record: Captured and wrote 32.012 MB perf.data ]

Convert the perf data into BOLT data; the conversion can take a long time

root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# perf inject -i perf.data -o perf.x.data --itrace=i300000il64

perf2bolt converts the injected data (perf.x.data) into the fdata format

root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# /opt/alibaba-cloud-compiler/bin/perf2bolt -p perf.x.data -o perf.fdata sort

PERF2BOLT: Starting data aggregation job for perf.x.data

PERF2BOLT: spawning perf job to read branch events

PERF2BOLT: spawning perf job to read mem events

PERF2BOLT: spawning perf job to read process events

PERF2BOLT: spawning perf job to read task events

BOLT-INFO: Target architecture: aarch64

BOLT-INFO: BOLT version:

BOLT-INFO: first alloc address is 0x400000

BOLT-INFO: creating new program header table at address 0x600000, offset 0x200000

BOLT-INFO: enabling relocation mode

BOLT-INFO: disabling -align-macro-fusion on non-x86 platform

BOLT-INFO: enabling strict relocation mode for aggregation purposes

BOLT-INFO: pre-processing profile using perf data aggregator

BOLT-INFO: binary build-id is: b9d4933d67e120c60a56b7f96fbf93e5a2961f98

PERF2BOLT: spawning perf job to read buildid list

PERF2BOLT: matched build-id and file name

PERF2BOLT: waiting for perf mmap events collection to finish...

PERF2BOLT: parsing perf-script mmap events output

PERF2BOLT: waiting for perf task events collection to finish...

PERF2BOLT: parsing perf-script task events output

PERF2BOLT: input binary is associated with 1 PID(s)

PERF2BOLT: waiting for perf events collection to finish...

PERF2BOLT: parse branch events...

PERF2BOLT: read 485570 samples and 30968485 LBR entries

PERF2BOLT: 0 samples (0.0%) were ignored

PERF2BOLT: traces mismatching disassembled function contents: 0 (0.0%)

PERF2BOLT: out of range traces involving unknown regions: 61 (0.0%)

PERF2BOLT: waiting for perf mem events collection to finish...

PERF2BOLT: processing branch events...

PERF2BOLT: wrote 15 objects and 0 memory objects to perf.fdata

Apply the BOLT optimization

root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# /opt/alibaba-cloud-compiler/bin/llvm-bolt sort -o sort.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -dyno-stats

BOLT-INFO: Target architecture: aarch64

BOLT-INFO: BOLT version:

BOLT-INFO: first alloc address is 0x400000

BOLT-INFO: creating new program header table at address 0x600000, offset 0x200000

BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.

BOLT-INFO: enabling relocation mode

BOLT-INFO: disabling -align-macro-fusion on non-x86 platform

BOLT-INFO: pre-processing profile using branch profile reader

BOLT-INFO: Simple Rate Report

Simple Rate: 12 / 21 = 57.14%

Simple Profile data Rate: 0 / 0 = nan%

BOLT-INFO: number of removed linker-inserted veneers: 0

BOLT-INFO: 1 out of 15 functions in the binary (6.7%) have non-empty execution profile

BOLT-INFO: basic block reordering modified layout of 1 functions (100.00% of profiled, 4.76% of total)

BOLT-INFO: 0 Functions were reordered by LoopInversionPass

BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

           15446 : executed forward branches

            7274 : taken forward branches

           15446 : executed backward branches

           15446 : taken backward branches

               0 : executed unconditional branches

               0 : all function calls

               0 : indirect calls

               0 : PLT calls

          108762 : executed instructions

               0 : executed load instructions

               0 : executed store instructions

               0 : taken jump table branches

               0 : taken unknown indirect branches

           30892 : total branches

           22720 : taken branches

            8172 : non-taken conditional branches

           22720 : taken conditional branches

           30892 : all conditional branches

               0 : linker-inserted veneer calls

           15446 : executed forward branches (=)

               0 : taken forward branches (-100.0%)

           15446 : executed backward branches (=)

            7274 : taken backward branches (-52.9%)

            8043 : executed unconditional branches (+804200.0%)

               0 : all function calls (=)

               0 : indirect calls (=)

               0 : PLT calls (=)

          116805 : executed instructions (+7.4%)

               0 : executed load instructions (=)

               0 : executed store instructions (=)

               0 : taken jump table branches (=)

               0 : taken unknown indirect branches (=)

           38935 : total branches (+26.0%)

           15317 : taken branches (-32.6%)

           23618 : non-taken conditional branches (+189.0%)

            7274 : taken conditional branches (-68.0%)

           30892 : all conditional branches (=)

               0 : linker-inserted veneer calls (=)

BOLT-INFO: Starting stub-insertion pass

BOLT-INFO: Inserted 0 stubs in the hot area and 0 stubs in the cold area. Shared 0 times, iterated 1 times.

BOLT-INFO: padding code to 0xa00000 to accommodate hot text

BOLT-INFO: setting _end to 0xa00368

BOLT-INFO: setting __hot_start to 0x800000

BOLT-INFO: setting __hot_end to 0x800058

BOLT-INFO: patched build-id (flipped last bit)
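In the llvm-bolt log above, the second dyno-stats listing shows a percentage in parentheses next to each counter. These match the relative change from the first listing, which can be checked by hand. A minimal sketch, with pct_change as a hypothetical helper name; it does not handle a baseline of zero, so it will not reproduce entries such as the +804200.0% line.

```shell
#!/bin/sh
# pct_change BEFORE AFTER -> signed relative change, formatted like the log.
# Hypothetical helper; BEFORE must be non-zero.
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%+.1f%%\n", (after - before) / before * 100 }'
}

pct_change 22720 15317     # taken branches: prints -32.6%
pct_change 108762 116805   # executed instructions: prints +7.4%
```

So although the optimized layout executes about 7.4% more instructions in the profile, it takes about a third fewer branches, which is the effect the block reordering aims for.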

Run the optimized binary

root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# ./sort.bolt

Bubble sorting array of 30000 elements

685 ms

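Result

In this example, BOLT reduced the run time of the sort program by (941 - 685)/941 ≈ 0.27, or about 27%. The baseline here is the 941 ms measured during the perf run; the 939 ms standalone run gives the same rounded figure.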
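The same calculation can be scripted for repeated measurements. A minimal sketch; reduction_pct is a hypothetical helper name, and the inputs are the timings reported above.

```shell
#!/bin/sh
# reduction_pct BASELINE_MS OPTIMIZED_MS -> run-time reduction in percent.
# Hypothetical helper for comparing two wall-clock timings.
reduction_pct() {
    awk -v base="$1" -v opt="$2" \
        'BEGIN { printf "%.1f\n", (base - opt) / base * 100 }'
}

reduction_pct 941 685   # prints 27.2
```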
