Translation of an article about how a company moved its infrastructure to Go 1.5 and reduced garbage collector pauses from ~279 ms to ~10 ms.
Customer-focused marketing systems depend on collecting and analyzing as many relevant events as possible. Customers are literally everywhere, and the amount of data grows exponentially. The Go language plays an important role in our data collection system. Today FLXone handles 3+ billion requests per day with an application we wrote from scratch.
Our path to this level of performance began with identifying the key challenges where marketing and advertising meet technology:
- huge amounts of data must be collected and processed
- customers can generate millions of events, increasing our workload in seconds
- responsiveness (latency) is the key to real-time data analysis
In 2013 we decided that Go (then still at version 1.1) looked promising. We wrote the first version of our application in less than 5 days, with only 2 programmers working on it. Language features such as goroutines and channels greatly simplified writing highly concurrent code. Reaching thousands of requests per second on a MacBook Pro with minimal optimization looked very promising.
In essence, the application does the following: it accepts requests carrying a large number of URL parameters, about 1 KB each on average. The server parses each request and sends a message to a distributed queue. Once that is done, it returns an empty response to the client.
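The request flow described above can be sketched in a few lines of Go. This is only an illustration, not FLXone's actual code: the `event` type, the `newCollector` and `handleOnce` helpers, the `/collect` path, the buffered channel standing in for the distributed queue, and the choice of status codes (204 on success, 503 when the queue is full) are all assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// event is a hypothetical message type; the article does not show the real one.
type event struct {
	Params url.Values
}

// newCollector returns a handler that parses the URL parameters, hands the
// event to queue, and answers with an empty response.
func newCollector(queue chan<- event) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ev := event{Params: r.URL.Query()}
		select {
		case queue <- ev:
			w.WriteHeader(http.StatusNoContent) // empty body
		default:
			// Assumption: shed load instead of blocking when the queue is full.
			w.WriteHeader(http.StatusServiceUnavailable)
		}
	})
}

// handleOnce runs a single in-memory GET request through h and returns the
// status code, so the sketch can be exercised without opening a socket.
func handleOnce(h http.Handler, target string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", target, nil))
	return rec.Code
}

func main() {
	// The buffered channel stands in for the distributed queue.
	queue := make(chan event, 1024)
	status := handleOnce(newCollector(queue), "/collect?user=42&action=click")
	ev := <-queue
	fmt.Println(status, ev.Params.Get("user"), ev.Params.Get("action"))
}
```

In a real service the handler would be passed to `http.ListenAndServe`, and a separate goroutine would drain the channel and publish to the actual queue.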
We grow further
As our business grew, we saw response times increase. We had an SLA of about 100 ms per request, and the bigger we grew, the more of a problem this became. At first we assumed it was somehow related to the network connections to our servers, but even though we generated terabytes of data daily, the problem lay elsewhere.
Then we began to analyze the behavior of our Go program. On average, the application spent ~2 ms per request, which was great! That left 98 ms for network overhead, SSL handshakes, DNS lookups, and everything else that keeps the Internet afloat.
Unfortunately, the standard deviation of the response time was large, about 100 ms. Meeting our SLA had become a gamble. Using Go's “runtime” package, we profiled our application and found that the problem was garbage collection, which resulted in a 95th-percentile response time of 279 milliseconds ...
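The article does not say which part of the runtime package was used. One way to get at pause times, shown here as a sketch under that caveat, is `runtime.ReadMemStats`, whose `PauseNs` field is a circular buffer of the most recent 256 stop-the-world pauses. The helper names `recentPausesNs` and `percentileNs`, and the nearest-rank percentile method, are choices made for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"sort"
	"time"
)

// recentPausesNs copies the recorded GC pauses (in nanoseconds) out of
// MemStats. PauseNs holds at most the last 256 pauses.
func recentPausesNs(m *runtime.MemStats) []uint64 {
	n := int(m.NumGC)
	if n > len(m.PauseNs) {
		n = len(m.PauseNs)
	}
	out := make([]uint64, n)
	copy(out, m.PauseNs[:n])
	return out
}

// percentileNs returns the p-th percentile (1..100) of ns using the
// nearest-rank method. It returns 0 for an empty input and does not
// modify ns.
func percentileNs(ns []uint64, p int) uint64 {
	if len(ns) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), ns...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (p*len(sorted) + 99) / 100 // ceil(p*n/100)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

var sink []byte // keeps the allocations below from being optimized away

func main() {
	// Produce some garbage, then force at least one collection.
	for i := 0; i < 200; i++ {
		sink = make([]byte, 1<<20)
	}
	runtime.GC()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	p95 := time.Duration(percentileNs(recentPausesNs(&m), 95))
	fmt.Printf("GC cycles: %d, p95 pause: %v\n", m.NumGC, p95)
}
```

The numbers printed depend entirely on the machine, the Go version, and the workload, so they are not comparable to the figures in the article.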
We decided to rewrite large chunks of our application so that they generated no garbage at all. This greatly reduced how often the garbage collector stopped the entire application to do its magic. But the response-time problems did not go away, so we decided to add more nodes to stay within our SLA. At peak loads of 80K requests per second, even minimal garbage can be a serious problem.
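The article does not describe which techniques were used to avoid garbage. A common one in Go, shown here purely as an illustration, is to reuse buffers through `sync.Pool` instead of allocating a new one per request. The `encodeEvent` function, the `bufPool` variable, and the `key=value` encoding are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that the hot path does not allocate
// a fresh buffer for every request.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// encodeEvent writes "key=value" into a pooled buffer and passes the bytes
// to emit. emit must not keep the slice after it returns, because the
// buffer goes back to the pool and will be overwritten.
func encodeEvent(key, value string, emit func([]byte)) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.WriteString(key)
	buf.WriteByte('=')
	buf.WriteString(value)
	emit(buf.Bytes())
	bufPool.Put(buf)
}

func main() {
	encodeEvent("user", "42", func(b []byte) {
		fmt.Printf("%s\n", b)
	})
}
```

The trade-off is the ownership rule in the comment: pooled memory is only safe as long as no caller holds on to it after the buffer is returned.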
And then the day came
There has been a lot of talk about Go 1.5 in recent months. The compiler was completely rewritten from C to Go, which reminded me of the movie “Inception”. But more than that, the garbage collector was completely redone.
Last night (August 19), the moment finally arrived. The stable version of Go 1.5 was released with this statement:
The collector's “stop the world” pause will almost always be under 10 ms, and usually much less.
Just a few hours after the release, we rebuilt our application with Go 1.5 and ran our unit and functional tests; everything went smoothly. It looked too good to be true, so we also checked the functionality manually. After a few hours, we decided it would be safe to roll out this build to one node in production.
We let it run for 12 hours and then analyzed the new response times: for the entire request, for the application alone, and, no less important, the garbage collector pause times. In the graph below, you can see how both the variance and the average response time decreased:
Two histograms of application response time (the only thing that matters to us). X axis: response time, Y axis: number of requests. Left: server running Go 1.4; right: server running Go 1.5. The difference is visible to the naked eye.
The new version of Go reduced our 95th-percentile garbage collection pause from 279 ms to only 10 ms. This is a fantastic 96% reduction in pause time, and it is exactly what was stated in the release notes.
Garbage collection pauses decreased by 96%
We decided to roll out the new version to the rest of our infrastructure (12 data centers in 7 geographic regions) and saw our average response time drop by 53%. This meant that we could easily stay within our 100 ms SLA, and each node can now handle a larger load.
Thanks to the efficiency and flexibility of our team, the release of Go 1.5 greatly improved our performance, and all of this happened within 24 hours.