简介

本文探讨了如何为运行时程序添加剖析支持以优化性能。

该例子问题是一个程序需要找出平方根为整数且逆序平方等于其逆序的数字。我们首先展示了原始代码，然后使用runtime库进行优化，将计算和调用功能分离，同时记录CPU和内存使用情况，以此作为优化的依据。

1 在执行程序中加入剖析支持

我们有一个需求，需要找到10亿以内的某些特别的数字，它们满足以下特征：

            其平方根值为整数
            其平方根值的逆序的平方，等于其逆序

        
          
        
        
        
          
          AI 代码解读

例如，121 正好满足这个特点。

其平方根为 11，为整数.

平方根的逆序的平方 = 11 * 11 = 121

2 使用runtime去优化

某人实现了如下程序，可以完成千万级的计算，但是无法完成亿级别的运算。我们打算优化它，并且按照标准库runtime的指示去优化。

           func counts(n float64) {

        var ms interface{}
        for j := 1.0; j <= n; j++ {

            sj := fmt.Sprintf("%s", reverse(fmt.Sprintf("%v", j))) 
            ms = math.Sqrt(j)
            ins := fmt.Sprintf("%v", ms)  
            if strings.Count(ins, ".") > 0 {
                continue
            }

            switch ms.(type) {
            default:
                continue

            case float64: 
                sms := reverse(ins) 

                 sms str to int
                intsms, _ := strconv.ParseInt(sms, 10, 64) 

                j sj str to int
                intsj, _ := strconv.ParseInt(sj, 10, 64) 

        
          
        
        
        
          
          AI 代码解读

如果逆序平方等于原数字的逆序，输出到控制台

     if intsms*intsms == intsj {
                    fmt.Println("reverse j", intsj, "sqrt:=", intsms)
                }
            }
        }
    }

        
          
        
        
        
          
          AI 代码解读

在main函数调用这个计算函数count就可以进行计算。比如我们需要查找十亿内的全部特征数：

    func main() {
        counts(1000000000.0)
    }

        
          
        
        
        
          
          AI 代码解读

程序在执行一小会之后便停止了。为什么？

在这里做一些简单分析。我们不希望从工具链的test入口查看程序性能。因此我们选择在程序执行时嵌入代码，将性能数据输出到我们指定的文件。

首先这里把计算函数分裂为两个，

一个执行函数和一个调用函数，并且使用标准库的 strcov作为转换函数，而不是使用 fmt.Sprintf

    func reverse(s string) string {
            r := []rune(s)
            for i, j := 0, len(r)-1; i < len(r)/2; i, j = i+1, j-1 {
                    r[i], r[j] = r[j], r[i]
            }
            return string(r)
    }

    // sj 原数的逆序数
    // ins 原数的平方根的逆
    func CalcFloat(sj, ins string) {
            sms := reverse(ins)
            intsms, _ := strconv.ParseInt(sms, 10, 64)
            intsj, _ := strconv.ParseInt(sj, 10, 64)
            if intsms*intsms == intsj { 
                    println("find new:", intsj, "sqrt:=", intsms)
                    runtime.GC()
            }
    }

        
          
        
        
        
          
          AI 代码解读

然后我们定义两个指标，cpu使用和mem使用，并且输出到对应文件。

      cpuprofile = flag.String("cpuprofile", "./cpu.prof", "write cpu profile to `file`")
  memprofile = flag.String("memprofile", "./mem.prof", "write memory profile to `file`")

        
          
        
        
        
          
          AI 代码解读

在启动调用执行函数前，输出cpu性能数据

    flag.Parse()
if *cpuprofile != "" {
    f, err := os.Create(*cpuprofile)
    if err != nil {
        log.Fatal("could not create CPU profile: ", err)
    }
    defer f.Close()
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal("could not start CPU profile: ", err)
    }
    defer pprof.StopCPUProfile()
}

        
          
        
        
        
          
          AI 代码解读

然后调用计算函数，

Counts(1000000000.0)

        
          
        
        
        
          
          AI 代码解读

这样再输出内存使用数据,在runtime调用GC回收内存时，就获得最新的统计数据，

if *memprofile != "" {

  f, err := os.Create(*memprofile)
  if err != nil {
      log.Fatal("could not create memory profile: ", err)
  }
  defer f.Close() 
  runtime.GC()   
  if err := pprof.WriteHeapProfile(f); err != nil {
      log.Fatal("could not write memory profile: ", err)
  }

        
          
        
        
        
          
          AI 代码解读

}

最后执行该独立程序

 go run _pprof_main.go

        
          
        
        
        
          
          AI 代码解读

等待执行完成后，将获得两个性能数据文件，cpu.prof, mem.prof

3 分析性能

先看内存数据

go tool pprof .\mem.prof
File: pprof_main.test
Build ID:./pprof_main.test 2023-02-06 12:44:53.3670595 +0400 +04
Type: inuse_space
Time: Feb 6, 2023 at 12:45pm (+04)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.72MB, 100% of 1.72MB total
      flat  flat%   sum%        cum   cum%
    1.72MB   100%   100%     1.72MB   100%  runtime/pprof.StartCPUProfile
         0     0%   100%     1.72MB   100%  pprof_main.TestMain
         0     0%   100%     1.72MB   100%  main.main
         0     0%   100%     1.72MB   100%  runtime.main

        
          
        
        
        
          
          AI 代码解读

内存占用没有太大问题。

我们再看cpu性能

top
Showing nodes accounting for 160ms, 100% of 160ms total
Showing top 10 nodes out of 62
      flat  flat%   sum%        cum   cum%
      50ms 31.25% 31.25%       50ms 31.25%  runtime.cgocall
      20ms 12.50% 43.75%       20ms 12.50%  fmt.(*pp).free
      20ms 12.50% 56.25%       30ms 18.75%  runtime.slicerunetostring
      10ms  6.25% 62.50%      150ms 93.75%  GolangTrain/41_graphs/19_pprof_main.counts
      10ms  6.25% 68.75%       30ms 18.75%  fmt.(*pp).printArg
      10ms  6.25% 75.00%       10ms  6.25%  runtime.convT64
      10ms  6.25% 81.25%       10ms  6.25%  runtime.encoderune
      10ms  6.25% 87.50%       10ms  6.25%  runtime.writeHeapBits.flush
      10ms  6.25% 93.75%       10ms  6.25%  strconv.computeBounds (inline)
      10ms  6.25%   100%       20ms 12.50%  strconv.ryuFtoaShortest

        
          
        
        
        
          
          AI 代码解读

我们有三个计算函数，某些函数调用没有在这个列表中，因此，我们希望看到消耗性能最高的前10位

查看消耗cpu时间性能的最高前10的函数

     top -cum
    Showing nodes accounting for 60ms, 37.50% of 160ms total
    Showing top 10 nodes out of 62
          flat  flat%   sum%        cum   cum% 
             0     0%     0%      160ms   100%  main.main
             0     0%     0%      160ms   100%  runtime.main
          10ms  6.25%  6.25%      150ms 93.75%  _pprof_main.counts
             0     0%  6.25%       60ms 37.50%  fmt.Sprintf
          50ms 31.25% 37.50%       50ms 31.25%  runtime.cgocall
             0     0% 37.50%       50ms 31.25%  syscall.SyscallN
             0     0% 37.50%       40ms 25.00%  log.(*Logger).Output
             0     0% 37.50%       40ms 25.00%  log.(*Logger).Println
             0     0% 37.50%       30ms 18.75%  _pprof_main.reverse (inline)

        
          
        
        
        
          
          AI 代码解读

可以很明显看到，我们能够控制的主执行函数 TestMain 占比100%，

继续查，__pprof_main.counts 占比93%，其次为 fmt.Sprintf 38%，逆序函数 reverse 占比 18%。

因此可以确定重点在 count是函数和 fmt.Sprintf 转换函数。

我们知道go的类型转换成本，比如在string 和byte之间，直接使用强制转换成本较高。

    Showing top 10 nodes out of 36
    flat  flat%   sum%        cum   cum%
    12.14s 87.28% 87.28%     12.25s 88.07%  runtime.cgocall
     0.09s  0.65% 87.92%      0.09s  0.65%  runtime.step
     0.09s  0.65% 88.57%      0.09s  0.65%  sync.(*Mutex).Unlock
     0.08s  0.58% 89.14%      0.23s  1.65%  runtime.gentraceback
     0.07s   0.5% 89.65%      0.12s  0.86%  runtime.makeslice
     0.07s   0.5% 90.15%      0.07s   0.5%  runtime.nextFreeFast (inline)
     0.06s  0.43% 90.58%      0.34s  2.44%  runtime.systemstack
     0.04s  0.29% 90.87%      0.31s  2.23%  runtime.callers
     0.04s  0.29% 91.16%      0.08s  0.58%  runtime.writeHeapBits.flush
     0.04s  0.29% 91.45%     12.33s 88.64%  syscall.SyscallN

        
          
        
        
        
          
          AI 代码解读

这一次我们查看前20个消耗cpu最高的函数

top 20 -cum
Showing nodes accounting for 12.52s, 90.01% of 13.91s total
Dropped 137 nodes (cum <= 0.07s)
Showing top 20 nodes out of 36
      flat  flat%   sum%        cum   cum% 
         0     0%     0%     13.80s 99.21%  main.main
         0     0%     0%     13.80s 99.21%  runtime.main
         0     0%     0%     13.77s 98.99%  _advice_math.counts
     0.03s  0.22%  0.22%     13.44s 96.62%  log.(*Logger).Println
     0.03s  0.22%  0.43%     13.28s 95.47%  log.(*Logger).Output
     0.01s 0.072%   0.5%     12.60s 90.58%  os.(*File).Write
         0     0%   0.5%     12.59s 90.51%  internal/poll.(*FD).Write
         0     0%   0.5%     12.59s 90.51%  os.(*File).write (inline)
     0.03s  0.22%  0.72%     12.46s 89.58%  internal/poll.(*FD).writeConsole
     0.02s  0.14%  0.86%     12.34s 88.71%  syscall.Syscall6
     0.01s 0.072%  0.93%     12.34s 88.71%  syscall.WriteConsole
     0.04s  0.29%  1.22%     12.33s 88.64%  syscall.SyscallN
    12.14s 87.28% 88.50%     12.25s 88.07%  runtime.cgocall
     0.01s 0.072% 88.57%      0.49s  3.52%  runtime.Caller
     0.06s  0.43% 89.00%      0.34s  2.44%  runtime.systemstack
     0.04s  0.29% 89.29%      0.31s  2.23%  runtime.callers
     0.02s  0.14% 89.43%      0.25s  1.80%  runtime.mallocgc
         0     0% 89.43%      0.23s  1.65%  runtime.callers.func1
     0.08s  0.58% 90.01%      0.23s  1.65%  runtime.gentraceback

        
          
        
        
        
          
          AI 代码解读

现在可以完成 10亿次计算任务，采样时间 470.35s

    top 20 -cum
    Showing nodes accounting for 313.78s, 66.71% of 470.35s total
    Dropped 402 nodes (cum <= 2.35s)
    Showing top 20 nodes out of 76
          flat  flat%   sum%        cum   cum% 
             0     0%     0%    416.83s 88.62%  main.main
             0     0%     0%    416.83s 88.62%  runtime.main
         6.81s  1.45%  1.45%    416.80s 88.61%  _advice_math.counts
         4.23s   0.9%  2.35%    302.43s 64.30%  strconv.FormatFloat       
        26.35s  5.60%  7.95%    178.17s 37.88%  strconv.genericFtoa       
        17.79s  3.78% 11.73%    108.66s 23.10%  strconv.ryuFtoaShortest   
        52.53s 11.17% 22.90%    102.12s 21.71%  runtime.mallocgc
         7.24s  1.54% 24.44%     96.01s 20.41%  _advice_math.reverse (inline)
        25.67s  5.46% 29.90%     70.37s 14.96%  strconv.ryuDigits
        33.36s  7.09% 36.99%     65.56s 13.94%  runtime.slicerunetostring 
         7.56s  1.61% 38.60%     62.49s 13.29%  runtime.slicebytetostring 
         3.56s  0.76% 39.35%     56.14s 11.94%  runtime.makeslice
         5.99s  1.27% 40.63%     42.16s  8.96%  strconv.formatDigits      
        39.08s  8.31% 48.94%     39.17s  8.33%  strconv.ryuDigits32       
         0.26s 0.055% 48.99%     37.02s  7.87%  runtime.systemstack       
        31.69s  6.74% 55.73%     36.12s  7.68%  strconv.fmtF
        29.90s  6.36% 62.09%     29.90s  6.36%  runtime.encoderune        
        21.70s  4.61% 66.70%     23.21s  4.93%  runtime.stringtoslicerune 
         0.06s 0.013% 66.71%     19.25s  4.09%  runtime.gcBgMarkWorker.func2

        
          
        
        
        
          
          AI 代码解读

4 评估fmt包的执行性能

我们使用以下简单程序，查看fmt的各个输出函数，对同一个字符串的输出性能

    var (
            benchStr = "hello, this is golang."
            sbe      = []byte(benchStr)
    )

    func FmtPrintfBytes(s []byte) {

            fmt.Printf("fmt print:%v", s)
    }

    func FmtPrintlnBytes(s []byte) {

            fmt.Println("fmt print:", s)
    }

    func PrintlnBytes(s []byte) {

            println("fmt print:", s)
    }

    func FmtSrintf(s []byte) {
            newStr := fmt.Sprintf("%v", s)
            fmt.Printf("%v", newStr)
    }

    func BenchmarkFmtPrintfBytes(b *testing.B) {

            for n := 0; n < b.N; n++ {
                    FmtPrintfBytes(sbe)
            }
    }
    func BenchmarkFmtPrintlnBytes(b *testing.B) {
            for n := 0; n < b.N; n++ {
                    FmtPrintlnBytes(sbe)
            }
    }

    func BenchmarkPrintlnBytes(b *testing.B) {
            for n := 0; n < b.N; n++ {
                    PrintlnBytes(sbe)
            }
    }

    func BenchmarkFmtSrintf(b *testing.B) {
            for n := 0; n < b.N; n++ {
                    FmtSrintf(sbe)
            }
    }

        
          
        
        
        
          
          AI 代码解读

执行这个程序性能测试

    go test -cpuprofile cpu.prof -memprofile mem.prof -bench .

        
          
        
        
        
          
          AI 代码解读

查看性能，内存使用几乎相同

          0     0%   100%   512.09kB 10.97%  command-line-arguments.BenchmarkFmtPrintfBytes 
     0     0%   100%   512.09kB 10.97%  command-line-arguments.BenchmarkFmtPrintlnBytes         0     0%   100%   512.09kB 10.97%  command-line-arguments.FmtPrintfBytes (inline) 
     0     0%   100%   512.09kB 10.97%  command-line-arguments.FmtPrintlnBytes (inline)

        
          
        
        
        
          
          AI 代码解读

cpu使用率各有差别，但是不大

      0.01s  0.19% 75.53%      1.57s 30.49%  command-line-arguments.BenchmarkFmtPrintlnBytes         0     0% 75.53%      1.56s 30.29%  command-line-arguments.FmtPrintlnBytes (inline)         0     0% 75.53%      1.56s 30.29%  fmt.Fprintln
     0     0% 75.53%      1.55s 30.10%  fmt.Println (inline)
 0.02s  0.39% 75.92%      1.43s 27.77%  command-line-arguments.BenchmarkFmtSrintf      
     0     0% 75.92%      1.40s 27.18%  command-line-arguments.FmtSrintf
     0     0% 75.92%      1.08s 20.97%  command-line-arguments.BenchmarkFmtPrintfBytes 
     0     0% 75.92%      1.08s 20.97%  command-line-arguments.FmtPrintfBytes (inline) 
     0     0% 75.92%      0.84s 16.31%  command-line-arguments.BenchmarkPrintlnBytes   
     0     0% 75.92%      0.84s 16.31%  command-line-arguments.PrintlnBytes (inline) 

        
          
        
        
        
          
          AI 代码解读

由以上性能数据可以得出简单结论，在输出同样的字节时，输出函数的性能相差不大，计算资源cpu消耗

    fmt.Println >  fmt.Sprintf + fmt.Printf  > fmt.Printf > println

        
          
        
        
        
          
          AI 代码解读

有一些简单的cpu资源使用率区别，但是区别不大。

5 小结

我们在这里介绍了如何不通过工具链的 test 入口，对独立程序嵌入性能数据提取。

并做了一个简单的对比，知道格式化输出的成本 Println 的接口转换有一定的成本，但是区别不大。

使用go的内置运行时库调试和优化程序

简介

1 在执行程序中加入剖析支持

2 使用runtime去优化

3 分析性能

4 评估fmt包的执行性能

5 小结

云原生

热门文章

最新文章

相关课程

相关电子书

相关实验场景

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

使用go的内置运行时库调试和优化程序

简介

1 在执行程序中加入剖析支持

2 使用runtime去优化

3 分析性能

4 评估fmt包的执行性能

5 小结

云原生

热门文章

最新文章

相关课程

相关电子书

相关实验场景