简介
本文探讨了如何为运行时程序添加剖析支持以优化性能。
该例子问题是一个程序需要找出平方根为整数且逆序平方等于其逆序的数字。我们首先展示了原始代码,然后使用runtime
库进行优化,将计算和调用功能分离,同时记录CPU和内存使用情况,以此作为优化的依据。
1 在执行程序中加入剖析支持
我们有一个需求,需要找到10亿以内的某些特别的数字,它们满足以下特征:
其平方根值为整数
其平方根值的逆序的平方,等于其逆序
例如,121 正好满足这个特点。
其平方根为 11,为整数.
平方根的逆序的平方 = 11 * 11 = 121
2 使用runtime去优化
某人实现了如下程序,可以完成千万级的计算,但是无法完成亿级别的运算。我们打算优化它,并且按照标准库runtime的指示去优化。
func counts(n float64) {
var ms interface{}
for j := 1.0; j <= n; j++ {
sj := fmt.Sprintf("%s", reverse(fmt.Sprintf("%v", j)))
ms = math.Sqrt(j)
ins := fmt.Sprintf("%v", ms)
if strings.Count(ins, ".") > 0 {
continue
}
switch ms.(type) {
default:
continue
case float64:
sms := reverse(ins)
sms str to int
intsms, _ := strconv.ParseInt(sms, 10, 64)
j sj str to int
intsj, _ := strconv.ParseInt(sj, 10, 64)
如果逆序平方等于 原数字的逆序,输出到控制台
if intsms*intsms == intsj {
fmt.Println("reverse j", intsj, "sqrt:=", intsms)
}
}
}
}
在main函数调用这个计算函数count就可以进行计算。 比如我们需要查找 十亿内的全部特征数:
func main() {
counts(1000000000.0)
}
程序在执行一小会之后便停止了。 为什么?
在这里做一些简单分析。 我们不希望从 工具链的test入口查看程序性能。 因此我们选择在程序执行时嵌入代码,将性能数据输出到我们指定的文件。
- 首先 这里把计算函数分裂为两个,
一个执行函数和一个调用函数,并且使用标准库的 strcov作为转换函数,而不是使用 fmt.Sprintf
func reverse(s string) string {
r := []rune(s)
for i, j := 0, len(r)-1; i < len(r)/2; i, j = i+1, j-1 {
r[i], r[j] = r[j], r[i]
}
return string(r)
}
// sj 原数的逆序数
// ins 原数的平方根的逆
func CalcFloat(sj, ins string) {
sms := reverse(ins)
intsms, _ := strconv.ParseInt(sms, 10, 64)
intsj, _ := strconv.ParseInt(sj, 10, 64)
if intsms*intsms == intsj {
println("find new:", intsj, "sqrt:=", intsms)
runtime.GC()
}
}
然后我们定义两个指标,cpu使用和mem使用,并且输出到对应文件。
cpuprofile = flag.String("cpuprofile", "./cpu.prof", "write cpu profile to `file`") memprofile = flag.String("memprofile", "./mem.prof", "write memory profile to `file`")
在启动调用执行函数前,输出cpu性能数据
flag.Parse()
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal("could not create CPU profile: ", err)
}
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil {
log.Fatal("could not start CPU profile: ", err)
}
defer pprof.StopCPUProfile()
}
然后调用计算函数,
Counts(1000000000.0)
这样再输出内存使用数据,在runtime调用GC回收内存时,就获得最新的统计数据 ,
if *memprofile != "" {
f, err := os.Create(*memprofile) if err != nil { log.Fatal("could not create memory profile: ", err) } defer f.Close() runtime.GC() if err := pprof.WriteHeapProfile(f); err != nil { log.Fatal("could not write memory profile: ", err) }
}
最后执行该独立程序
go run _pprof_main.go
等待执行完成后,将获得两个性能数据文件,cpu.prof, mem.prof
3 分析性能
先看内存数据
go tool pprof .\mem.prof
File: pprof_main.test
Build ID:./pprof_main.test 2023-02-06 12:44:53.3670595 +0400 +04
Type: inuse_space
Time: Feb 6, 2023 at 12:45pm (+04)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.72MB, 100% of 1.72MB total
flat flat% sum% cum cum%
1.72MB 100% 100% 1.72MB 100% runtime/pprof.StartCPUProfile
0 0% 100% 1.72MB 100% pprof_main.TestMain
0 0% 100% 1.72MB 100% main.main
0 0% 100% 1.72MB 100% runtime.main
内存占用没有太大问题。
我们再看cpu性能
top
Showing nodes accounting for 160ms, 100% of 160ms total
Showing top 10 nodes out of 62
flat flat% sum% cum cum%
50ms 31.25% 31.25% 50ms 31.25% runtime.cgocall
20ms 12.50% 43.75% 20ms 12.50% fmt.(*pp).free
20ms 12.50% 56.25% 30ms 18.75% runtime.slicerunetostring
10ms 6.25% 62.50% 150ms 93.75% GolangTrain/41_graphs/19_pprof_main.counts
10ms 6.25% 68.75% 30ms 18.75% fmt.(*pp).printArg
10ms 6.25% 75.00% 10ms 6.25% runtime.convT64
10ms 6.25% 81.25% 10ms 6.25% runtime.encoderune
10ms 6.25% 87.50% 10ms 6.25% runtime.writeHeapBits.flush
10ms 6.25% 93.75% 10ms 6.25% strconv.computeBounds (inline)
10ms 6.25% 100% 20ms 12.50% strconv.ryuFtoaShortest
我们有三个计算函数,某些函数调用没有在这个列表中,因此,我们希望看到消耗性能最高的前10位
查看消耗cpu时间性能的最高前10的函数
top -cum
Showing nodes accounting for 60ms, 37.50% of 160ms total
Showing top 10 nodes out of 62
flat flat% sum% cum cum%
0 0% 0% 160ms 100% main.main
0 0% 0% 160ms 100% runtime.main
10ms 6.25% 6.25% 150ms 93.75% _pprof_main.counts
0 0% 6.25% 60ms 37.50% fmt.Sprintf
50ms 31.25% 37.50% 50ms 31.25% runtime.cgocall
0 0% 37.50% 50ms 31.25% syscall.SyscallN
0 0% 37.50% 40ms 25.00% log.(*Logger).Output
0 0% 37.50% 40ms 25.00% log.(*Logger).Println
0 0% 37.50% 30ms 18.75% _pprof_main.reverse (inline)
可以很明显看到,我们能够控制的 主执行函数 TestMain 占比100%,
继续查,__pprof_main.counts 占比93%,其次为 fmt.Sprintf 38%,逆序函数 reverse 占比 18%。
因此可以确定重点在 count是函数和 fmt.Sprintf 转换函数。
我们知道go的类型转换成本,比如在string 和byte之间,直接使用强制转换成本较高。
Showing top 10 nodes out of 36
flat flat% sum% cum cum%
12.14s 87.28% 87.28% 12.25s 88.07% runtime.cgocall
0.09s 0.65% 87.92% 0.09s 0.65% runtime.step
0.09s 0.65% 88.57% 0.09s 0.65% sync.(*Mutex).Unlock
0.08s 0.58% 89.14% 0.23s 1.65% runtime.gentraceback
0.07s 0.5% 89.65% 0.12s 0.86% runtime.makeslice
0.07s 0.5% 90.15% 0.07s 0.5% runtime.nextFreeFast (inline)
0.06s 0.43% 90.58% 0.34s 2.44% runtime.systemstack
0.04s 0.29% 90.87% 0.31s 2.23% runtime.callers
0.04s 0.29% 91.16% 0.08s 0.58% runtime.writeHeapBits.flush
0.04s 0.29% 91.45% 12.33s 88.64% syscall.SyscallN
这一次我们查看前20个消耗cpu最高的函数
top 20 -cum
Showing nodes accounting for 12.52s, 90.01% of 13.91s total
Dropped 137 nodes (cum <= 0.07s)
Showing top 20 nodes out of 36
flat flat% sum% cum cum%
0 0% 0% 13.80s 99.21% main.main
0 0% 0% 13.80s 99.21% runtime.main
0 0% 0% 13.77s 98.99% _advice_math.counts
0.03s 0.22% 0.22% 13.44s 96.62% log.(*Logger).Println
0.03s 0.22% 0.43% 13.28s 95.47% log.(*Logger).Output
0.01s 0.072% 0.5% 12.60s 90.58% os.(*File).Write
0 0% 0.5% 12.59s 90.51% internal/poll.(*FD).Write
0 0% 0.5% 12.59s 90.51% os.(*File).write (inline)
0.03s 0.22% 0.72% 12.46s 89.58% internal/poll.(*FD).writeConsole
0.02s 0.14% 0.86% 12.34s 88.71% syscall.Syscall6
0.01s 0.072% 0.93% 12.34s 88.71% syscall.WriteConsole
0.04s 0.29% 1.22% 12.33s 88.64% syscall.SyscallN
12.14s 87.28% 88.50% 12.25s 88.07% runtime.cgocall
0.01s 0.072% 88.57% 0.49s 3.52% runtime.Caller
0.06s 0.43% 89.00% 0.34s 2.44% runtime.systemstack
0.04s 0.29% 89.29% 0.31s 2.23% runtime.callers
0.02s 0.14% 89.43% 0.25s 1.80% runtime.mallocgc
0 0% 89.43% 0.23s 1.65% runtime.callers.func1
0.08s 0.58% 90.01% 0.23s 1.65% runtime.gentraceback
现在可以完成 10亿次计算任务,采样时间 470.35s
top 20 -cum
Showing nodes accounting for 313.78s, 66.71% of 470.35s total
Dropped 402 nodes (cum <= 2.35s)
Showing top 20 nodes out of 76
flat flat% sum% cum cum%
0 0% 0% 416.83s 88.62% main.main
0 0% 0% 416.83s 88.62% runtime.main
6.81s 1.45% 1.45% 416.80s 88.61% _advice_math.counts
4.23s 0.9% 2.35% 302.43s 64.30% strconv.FormatFloat
26.35s 5.60% 7.95% 178.17s 37.88% strconv.genericFtoa
17.79s 3.78% 11.73% 108.66s 23.10% strconv.ryuFtoaShortest
52.53s 11.17% 22.90% 102.12s 21.71% runtime.mallocgc
7.24s 1.54% 24.44% 96.01s 20.41% _advice_math.reverse (inline)
25.67s 5.46% 29.90% 70.37s 14.96% strconv.ryuDigits
33.36s 7.09% 36.99% 65.56s 13.94% runtime.slicerunetostring
7.56s 1.61% 38.60% 62.49s 13.29% runtime.slicebytetostring
3.56s 0.76% 39.35% 56.14s 11.94% runtime.makeslice
5.99s 1.27% 40.63% 42.16s 8.96% strconv.formatDigits
39.08s 8.31% 48.94% 39.17s 8.33% strconv.ryuDigits32
0.26s 0.055% 48.99% 37.02s 7.87% runtime.systemstack
31.69s 6.74% 55.73% 36.12s 7.68% strconv.fmtF
29.90s 6.36% 62.09% 29.90s 6.36% runtime.encoderune
21.70s 4.61% 66.70% 23.21s 4.93% runtime.stringtoslicerune
0.06s 0.013% 66.71% 19.25s 4.09% runtime.gcBgMarkWorker.func2
4 评估fmt包的执行性能
我们使用以下简单程序,查看fmt的各个输出函数,对同一个字符串的输出性能
var (
benchStr = "hello, this is golang."
sbe = []byte(benchStr)
)
func FmtPrintfBytes(s []byte) {
fmt.Printf("fmt print:%v", s)
}
func FmtPrintlnBytes(s []byte) {
fmt.Println("fmt print:", s)
}
func PrintlnBytes(s []byte) {
println("fmt print:", s)
}
func FmtSrintf(s []byte) {
newStr := fmt.Sprintf("%v", s)
fmt.Printf("%v", newStr)
}
func BenchmarkFmtPrintfBytes(b *testing.B) {
for n := 0; n < b.N; n++ {
FmtPrintfBytes(sbe)
}
}
func BenchmarkFmtPrintlnBytes(b *testing.B) {
for n := 0; n < b.N; n++ {
FmtPrintlnBytes(sbe)
}
}
func BenchmarkPrintlnBytes(b *testing.B) {
for n := 0; n < b.N; n++ {
PrintlnBytes(sbe)
}
}
func BenchmarkFmtSrintf(b *testing.B) {
for n := 0; n < b.N; n++ {
FmtSrintf(sbe)
}
}
执行这个程序性能测试
go test -cpuprofile cpu.prof -memprofile mem.prof -bench .
查看性能,内存使用几乎相同
0 0% 100% 512.09kB 10.97% command-line-arguments.BenchmarkFmtPrintfBytes
0 0% 100% 512.09kB 10.97% command-line-arguments.BenchmarkFmtPrintlnBytes 0 0% 100% 512.09kB 10.97% command-line-arguments.FmtPrintfBytes (inline)
0 0% 100% 512.09kB 10.97% command-line-arguments.FmtPrintlnBytes (inline)
cpu使用率各有差别,但是不大
0.01s 0.19% 75.53% 1.57s 30.49% command-line-arguments.BenchmarkFmtPrintlnBytes 0 0% 75.53% 1.56s 30.29% command-line-arguments.FmtPrintlnBytes (inline) 0 0% 75.53% 1.56s 30.29% fmt.Fprintln
0 0% 75.53% 1.55s 30.10% fmt.Println (inline)
0.02s 0.39% 75.92% 1.43s 27.77% command-line-arguments.BenchmarkFmtSrintf
0 0% 75.92% 1.40s 27.18% command-line-arguments.FmtSrintf
0 0% 75.92% 1.08s 20.97% command-line-arguments.BenchmarkFmtPrintfBytes
0 0% 75.92% 1.08s 20.97% command-line-arguments.FmtPrintfBytes (inline)
0 0% 75.92% 0.84s 16.31% command-line-arguments.BenchmarkPrintlnBytes
0 0% 75.92% 0.84s 16.31% command-line-arguments.PrintlnBytes (inline)
由以上性能数据可以得出简单结论,在输出同样的字节时,输出函数的性能相差不大,计算资源cpu消耗
fmt.Println > fmt.Sprintf + fmt.Printf > fmt.Printf > println
有一些简单的cpu资源使用率区别,但是区别不大。
5 小结
我们在这里介绍了如何不通过工具链的 test 入口,对独立程序嵌入性能数据提取。
并做了一个简单的对比,知道格式化输出的成本 Println 的接口转换有一定的成本,但是区别不大。