A Practical Guide to optimise Go programs for performance

This blog post will look at how we go about optimizing Go code through a real life example



This post explains some techniques you can use to significantly improve the performance of your program with little effort. I am deliberately ignoring techniques that require significant effort, or large changes to program structure.
. . .

1. Enable pprof


What is pprof?
Pprof is the standard way to profile go applications that comes built in to go. The pprof package can expose via HTTP both memory and CPU usage across your application, as well as the running command line and contention information.


How to enable pprof
You can enable pprof by adding the line import _ “net/http/pprof” into your application and starting the default http server.  Another option, which is what we do at SignalFx in order to explicitly control where the handler is attached, is to enable specific pprof endpoints explicitly in your code with something like the following.
import "github.com/gorilla/mux" import "net/http/pprof" var handler *mux.Router // ... handler.PathPrefix("/debug/pprof/profile").HandlerFunc(pprof.Profile) handler.PathPrefix("/debug/pprof/heap").HandlerFunc(pprof.Heap)


2. Find something to optimize 


What to execute
curl http://ingest58:6060/debug/pprof/profile > /tmp/ingest.profile go tool pprof ingest /tmp/ingest.profile (pprof) top7


What this does
Go includes a local pprof tool to visualize the output of pprof. The endpoint /debug/pprof/profile will collect a 30 second profile. What I do above is first save that output to a local file, then run pprof on the saved output. Note that the last parameter of pprof can be a URL directly, rather than a local file. The command top7 will show the 7 functions that consume the most CPU.
Result
12910ms of 24020ms total (53.75%) Dropped 481 nodes (cum <= 120.10ms) Showing top 30 nodes out of 275 (cum >= 160ms)      flat  flat%   sum%        cum   cum%    1110ms  4.62%  4.62%     2360ms  9.83%  runtime.mallocgc     940ms  3.91%  8.53%     1450ms  6.04%  runtime.scanobject     830ms  3.46% 11.99%      830ms  3.46%  runtime.futex     800ms  3.33% 15.32%      800ms  3.33%  runtime.mSpan_Sweep.func1     750ms  3.12% 18.44%      750ms  3.12%  runtime.cmpbody     720ms  3.00% 21.44%      720ms  3.00%  runtime.xchg  580ms  2.41% 23.86%      580ms  2.41%  runtime._ExternalCode


What this means
None of these functions are directly my own code. However, mallocgc, scanobject, and mSpan_Sweep all point to CPU usage during garbage collection. I could go deeper into each of these functions, but rather than optimize Go’s garbage collector itself, it’s a better use of time to instead optimize my use of Go’s garbage collector. In this case, creation of objects on the heap.


3. Investigate the causes of GC


What to execute
curl http://ingest58:6060/debug/pprof/heap > /tmp/heap.profile go tool pprof -alloc_objects /tmp/ingest /tmp/heap.profile (pprof) top3


What this does
Note that the endpoint I download is a similar URL to the profiler above, but instead ends in /heap. This will give me a summary of heap usage on the machine. I again save it to a file to later compare. The parameter -alloc_objects will visualize the number of allocated objects across the lifetime of the application.


Result
4964437929 of 7534904879 total (65.89%) Dropped 541 nodes (cum <= 37674524) Showing top 10 nodes out of 133 (cum >= 321426216)      flat  flat%   sum%        cum   cum% 853721355 11.33% 11.33%  859078341 11.40%  github.com/signalfuse/sfxgo/ingest/tsidcache/tsiddiskcache.(*DiskKey).EncodeOld 702927011  9.33% 20.66%  702927011  9.33%  reflect.unsafe_New 624715067  8.29% 28.95%  624715067  8.29%  github.com/signalfuse/sfxgo/ingest/bus/rawbus.(*Partitioner).Partition


What this means
This means 11.33% of my object allocations happened during the function EncodeOld of the object DiskKey, which was expected behavior. However, it was very unexpected that Partition() would take 8.29% of my total allocations since it should just be doing basic math, so I will focus my search there.


4. Discover why partitioner is using so much memory


What to execute
(pprof) list Partitioner.*Partition


What this does
This prints the source lines of the function I’m interested in, along with counts inside the function of which statements caused heap allocations. This is one of many commands you can execute inside pprof.

Another very useful one is peek to see callers and callees of functions. You can get a full list by typing “help” and experimenting with what you see.


Result
Total: 11323262665 ROUTINE ======================== github.com/signalfuse/sfxgo/ingest/bus/rawbus.(*Partitioner).Partition in /opt/jenkins/workspace/ingest/gopath/src/github.com/signalfuse/sfxgo/ingest/bus/rawbus/partitioner.go 927405893  927405893 (flat, cum)  8.19% of Total         .          .     64: if ringSize == 0 {         .          .     65: return 0, ErrUnsetRingSize         .          .     66: }         .          .     67: var b [8]byte         .          .     68: binary.LittleEndian.PutUint64(b[:], uint64(message.Key.(*partitionPickingKey).tsid)) 239971917  239971917     69: logherd.Debug2(log, "key", message.Key, "numP", numPartitions, "Partitioning")         .          .     70: murmHash := murmur3.Sum32(b[:])         .          .     71:         .          .     72: // 34026 => 66         .          .     73: setBits := uint(16)         .          .     74: setSize := uint32(1 << setBits)         .          .     75: shortHash := murmHash & (setSize - 1)         .          .     76: smallIndex := int32(shortHash) * int32(k.ringSize) / int32(setSize) 687433976  687433976     77: logherd.Debug3(log, "smallIndex", smallIndex, "murmHash", murmHash, "shortHash", shortHash, "Sending to partition")         .          .     78: return smallIndex, nil         .          .     79:}         .          .     80:



What this means
This shows calling debug logging is causing variables to escape from the stack to the heap.  Since this debug logging isn’t strictly needed, I could just remove those lines. But first, let’s confirm this hypothesis.

The function “logherd.Debug2” is a wrapper that looks like the following and exists to remove the heap overhead of logrus WithField objects in the event debug isn’t even enabled.
// Debug2 to logger 2 key/value pairs and message.  Intended to save the mem alloc that WithField creates func Debug2(l *logrus.Logger, key string, val interface{}, key2 string, val2 interface{}, msg string) {    if l.Level >= logrus.DebugLevel {         l.WithField(key, val).WithField(key2, val2).Debug(msg)    } }

It appears from this pprof inspection that passing the integers into Debug2 is causing allocations, but let’s confirm this hypothesis.


5. Discover why log statements are causing so many allocations


What to execute
go build -gcflags='-m' . 2>&1 | grep partitioner.go


What this does
The -m parameter to build will print to stderr optimization decisions made by the compiler. This includes if the compiler is able to allocate a variable on the stack or if it must move it to the heap. Go will move variables to the heap if the compiler can’t guarantee that a pointer to the variable won’t live somewhere else.

Result
./partitioner.go:63: &k.ringSize escapes to heap ./partitioner.go:62: leaking param: k ./partitioner.go:70: message.Key escapes to heap ./partitioner.go:62: leaking param content: message ./partitioner.go:70: numPartitions escapes to heap ./partitioner.go:77: smallIndex escapes to heap ./partitioner.go:77: murmHash escapes to heap ./partitioner.go:77: shortHash escapes to heap ./partitioner.go:68: (*Partitioner).Partition b does not escape ./partitioner.go:71: (*Partitioner).Partition b does not escape


What this means
The important parts here are on line 77 showing that smallIndex, murmHash, and shortHash all escape to the heap. Because the compiler is doing short lived heap allocations, I’m creating more objects on the heap than I really need.


6. Benchmark partition function


What to write
func BenchmarkPartition(b *testing.B) {      r := rand.New(rand.NewSource(0))      k := partitionPickingKey{}      msg := sarama.ProducerMessage {           Key: &k,      }      p := Partitioner{           ringSize: 1024,           ringName: "quantizer.ring",      }      num_partitions := int32(1024)      for i := 0; i < b.N; i++ {           k.tsid = r.Int63()           part, err := p.Partition(&msg, num_partitions)           if err != nil {                panic("Error benchmarking")           }           if part < 0 || part >= num_partitions {                panic("Bench failure")           }      } }


What this does
This benchmark simply creates B.N objects and does a basic check on the return value to make sure it’s not simply optimized away. Writing benchmarks are a recommended “before” step when programmers go on optimization hunts, to ensure you’re progressing in the correct direction.


7. Benchmark memory allocations in partition function


What to execute
go test -v -bench . -run=_NONE_ -benchmem BenchmarkPartition



What this does
The bench argument to test runs benchmarks that match the regex “.” and -benchmem will also have benchmarks track heap usage on average per iteration.  By passing _NONE_ to -run, I am saving myself a bit of time by telling test to only run unit tests with the “_NONE_” string inside them. In other words, run none of the unit tests but all of the benchmarks.


Result
PASS BenchmarkPartition-8 10000000       202 ns/op      64 B/op       4 allocs/op


What this means
Each loop of my benchmark took on average 202ns and, more importantly, allocated 4 objects per operation.

8. Remove the logging statements


What to write 
@@ -66,7 +65,6 @@ func (k *Partitioner) Partition(message *sarama.ProducerMessage, numPartitions i        }        var b [8]byte        binary.LittleEndian.PutUint64(b[:], uint64(message.Key.(*partitionPickingKey).tsid)) -       logherd.Debug2(log, "key", message.Key, "numP", numPartitions, "Partitioning")        murmHash := murmur3.Sum32(b[:])        // 34026 => 66 @@ -74,7 +72,6 @@ func (k *Partitioner) Partition(message *sarama.ProducerMessage, numPartitions i        setSize := uint32(1 << setBits)        shortHash := murmHash & (setSize - 1)        smallIndex := int32(shortHash) * int32(k.ringSize) / int32(setSize) -       logherd.Debug3(log, "smallIndex", smallIndex, "murmHash", murmHash, "shortHash", shortHash, "Sending to partition")        return smallIndex, nil }


What this does
My fix here was to remove the log lines. I had added them during testing/debugging and they never got removed. In this case, just removing the log lines is a simple enough solution.

9. Rerun build to evaluate if variables escape to heap


What to execute 
go build -gcflags='-m' . 2>&1 | grep partitioner.go

Result
./partitioner.go:62: &k.ringSize escapes to heap ./partitioner.go:61: leaking param: k ./partitioner.go:61: (*Partitioner).Partition message does not escape ./partitioner.go:67: (*Partitioner).Partition b does not escape ./partitioner.go:68: (*Partitioner).Partition b does not escape


What this means
Notice that smallIndex, murmHash, and shortHash no longer give messages about escaping to the heap.

10 Rerun benchmark to investigate new allocs per operation


What to execute
go test -v -bench . -run=_NONE_ -benchmem BenchmarkPartition


Result
PASS BenchmarkPartition-8 30000000        40.5 ns/op       0 B/op       0 allocs/op ok   github.com/signalfuse/sfxgo/ingest/bus/rawbus 1.267s



What this means
Notice that now each operation takes only 40ns, but more importantly, there are no allocations per operation. Since it was heap usage I was trying to optimize, this is the important part for me.

Conclusion

Pprof is a very useful tool to dissect performance issuing with go code. By combining this exploration with go’s built in benchmarks, you can get real numbers about how much code changes help or hurt.

Unfortunately, performance regressions tend to creep up over time. This makes a next step, and exercise for the reader, to store benchmark results in a database so you can view performance per commit over time.


On a mission to build Next-Gen Community Platform for Developers