
About the pros and cons of Go

In this article I want to share the experience gained while rewriting a single project from Perl to Go. It will be more about the disadvantages than the advantages: so much has already been said about Go's merits, while the pitfalls awaiting new developers are often something you can only learn from your own bruises. This post in no way aims to bash Go, although, I confess, I would have been glad not to have to write some of these things. It also covers a relatively small slice of the whole platform; in particular, there will be nothing about templates, regexps, packing/unpacking of data, and similar functionality commonly used in web programming.

Since this post is not in the self-promotion hub, I will describe the project only briefly. It is a high-load web application currently handling about 600M hits per day (peak load over 10k requests per second). About 80% of the requests can be served from the cache, the rest must be fully processed. The working data mostly lives in a PostgreSQL database, part of it in binary files with a flat structure (essentially an array, only in a file rather than in memory). The Perl cluster consisted of eight 24-core machines with their performance headroom practically exhausted; the Go cluster consists of just six, with a confirmed more-than-threefold margin. And the bottleneck is not so much the CPU as the OS and the rest of the hardware and software: processing 10k non-trivial requests per second on one machine is physically hard, no matter how fast the backend software is.

Development speed

My experience with Go before this rewrite was minimal. For more than a year I had been watching the language, I studied the specification from cover to cover, worked through the learning materials on the official website and beyond, and felt ready to roll up my sleeves and get to work. The initial estimate was 3 to 6 weeks. The working beta was ready right at the end of the 6th week, although towards the end I began to think I would not make it in time. Ironing out bugs and optimizing performance took another month.

At first it was especially hard, but over time I had to dip into the specification less and less, and the code came out cleaner. If at the start a piece of functionality I could code in Perl in an hour took a whole day in Go, that gap later shrank significantly. Still, programming in Go takes much longer than in Perl: you have to think through the structures, data types and interfaces you need, write all of that out in code, take care of initializing slices, maps and channels, and write checks for nil... In Perl all of this is much simpler: hashes serve as structures, fields do not need to be declared in advance, and there is far more syntactic sugar. Take sorting: in Go you cannot just pass a single comparison closure for the data; you have to write a separate method returning the length and, in addition to the index-based comparison function, yet another method for swapping two elements of the array. Why? Because there are no generics, and it is easier for the sort function to call an explicitly declared Swap(i, j) than to figure out what was passed to it and which offsets to use to perform that exchange itself.
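For illustration, here is a minimal sketch of what sorting a slice by a struct field looked like before generics (the item type is invented for the example):

 import "sort"

 type item struct{ cnt int }

 type byCnt []item

 func (s byCnt) Len() int           { return len(s) }
 func (s byCnt) Less(i, j int) bool { return s[i].cnt < s[j].cnt }
 func (s byCnt) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

 // usage: sort.Sort(byCnt(data))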
Besides sorting, I was also struck by the absence of Perl's for/while () {...} continue {...} construction (the continue block runs even if the current iteration is cut short by the next operator). In Go you have to use the non-kosher goto for this, which moreover forces you to put all variable declarations before it, even ones not used after the jump label:
 var cnt int
 for {
 	goto NEXT
 	a := int(0) // ./main.go:16: goto NEXT jumps over declaration of a at ./main.go:17
 	cnt += a
 NEXT:
 	cnt++
 }

Also, the unified syntax for pointers and non-pointers does not carry through completely: for structs the compiler lets us use the same syntax in both cases, but for a map you already have to dereference explicitly and use parentheses, although the compiler could figure it all out itself:
 type T struct {
 	cnt int
 }

 s := T{}
 p := new(T)
 s.cnt++
 p.cnt++
but
 m := make(map[int]T)
 p := new(map[int]T)
 *p = make(map[int]T)

 m[1] = T{}
 (*p)[1] = T{}
 p[1] = T{} // ./main.go:13: invalid operation: p[1] (type *map[int]T does not support indexing)

Towards the end of the work I still had to spend time rewriting the part of the functionality implemented at the very beginning, because the original architecture turned out to be wrong. The experience gained suggests new architectural approaches, but that experience has to be acquired first.

By the way, the final amount of code in characters almost coincided (except that the Perl code was indented with two spaces and the Go code with one tab), but there turned out to be 20% more lines in Go. True, the functionality differs slightly: Go, for example, gained code for dealing with the GC, while the Perl side also counts a separate library for caching SQL queries in an external file cache (accessed via mmap()). Overall the code volume is nearly equal, with Perl still somewhat more compact. But Go has fewer brackets and semicolons, so the code looks laconic and is easier to read.

In general, Go code gets written quite quickly and precisely, much faster than, say, C/C++, but for simple tasks without special performance requirements I will keep using Perl.

Performance

Let's be honest: I have no particular complaints about Go's performance, but I expected more. The difference from Perl (it depends a lot on the kind of computation; in arithmetic, for example, Perl does not shine at all) is about 5 to 10 times. I had no chance to try gccgo, because it does not build easily on FreeBSD, which is a pity. But the backend software has now ceased to be the bottleneck: it consumes about 50% of a single core, and as the load grows, problems will start first with nginx, PostgreSQL and the OS.

While optimizing performance, the profiler showed that besides my code a substantial share of the CPU is consumed by the runtime (and not just the runtime package).
Here is one example of top10 --cum:
 Total: 1945 samples
    0   0.0%   0.0%   1309  67.3%  runtime.gosched0
    1   0.1%   0.1%   1152  59.2%  bitbucket.org/mjl/scgi.func·002
    1   0.1%   0.1%   1151  59.2%  bitbucket.org/mjl/scgi.serve
    0   0.0%   0.1%    953  49.0%  net/http.HandlerFunc.ServeHTTP
    3   0.2%   0.3%    952  48.9%  main.ProcessHttpRequest
    1   0.1%   0.3%    535  27.5%  main.ProcessHttpRequestFromCache
    0   0.0%   0.3%    418  21.5%  main.ProcessHttpRequestFromDb
   16   0.8%   1.1%    387  19.9%  main.(*RequestRecord).SelectServerInDc
    0   0.0%   1.1%    367  18.9%  System
    0   0.0%   1.1%    268  13.8%  GC

As we can see, only 49% of the CPU consumed by the handler goes to processing the actual scgi request, while a full 33% goes to System + GC.

And here is the plain top20 from the same profile:
 Total: 1945 samples
  179   9.2%   9.2%    186   9.6%  syscall.Syscall
  117   6.0%  15.2%    117   6.0%  runtime.MSpan_Sweep
  114   5.9%  21.1%    114   5.9%  runtime.kevent
   93   4.8%  25.9%     96   4.9%  runtime.cgocall
   93   4.8%  30.6%     93   4.8%  runtime.sys_umtx_op
   67   3.4%  34.1%    152   7.8%  runtime.mallocgc
   63   3.2%  37.3%     63   3.2%  runtime.duffcopy
   56   2.9%  40.2%     99   5.1%  hash_insert
   56   2.9%  43.1%     56   2.9%  scanblock
   53   2.7%  45.8%     53   2.7%  runtime.usleep
   39   2.0%  47.8%     39   2.0%  markonly
   36   1.9%  49.7%     41   2.1%  runtime.mapaccess2_fast32
   28   1.4%  51.1%     28   1.4%  runtime.casp
   25   1.3%  52.4%     34   1.7%  hash_init
   23   1.2%  53.6%     23   1.2%  hash_next
   22   1.1%  54.7%     22   1.1%  flushptrbuf
   22   1.1%  55.8%     22   1.1%  runtime.xchg
   21   1.1%  56.9%     29   1.5%  runtime.mapaccess1_fast32
   21   1.1%  58.0%     21   1.1%  settype
   20   1.0%  59.0%     31   1.6%  runtime.mapaccess1_faststr

My own code's computations are simply lost against the background of the work the runtime has to do (as it should be, really: there is no heavy math in my code).

IMHO, there is still a huge reserve for optimizing the compiler and libraries. For example, I saw no sign of inlining, and all my mutexes are clearly visible in goroutine stack dumps. Compiler optimization does not stand still (not long ago, for example, Dmitry Vyukov presented a significantly faster channel implementation), but fundamental changes are rare. After switching from Go 1.2 to Go 1.3, for instance, I saw practically no difference in performance at all.

During optimization I even had to abandon the math/rand package. The thing is, pseudo-random numbers were often needed while processing a request, but tied to the data, and rand.Seed() ate too much CPU (the profiler showed 13% of the total). Whoever needs it can build seeded generation themselves, but still: for cryptographic purposes there is the crypto/rand package, and math/rand did not have to try so hard at high-quality bit mixing during initialization.
By the way, I ended up with the following algorithm:
 func RandFloat64(seed uint64) float64 {
 	// xorshift-style bit mixing plus a multiplicative step;
 	// deterministic for a given seed, which is the whole point
 	seed ^= seed >> 12
 	seed ^= seed << 25
 	seed ^= seed >> 27
 	return float64((seed*2685821657736338717)&0x7fffffffffffffff) / (1 << 63)
 }
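For per-request determinism it gets called with a data-derived seed, e.g. (customerId is an assumption for the example):

 weight := RandFloat64(uint64(customerId)) // same id, same value, no Seed()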


It is very convenient that all computation happens in one process: in Perl, separate worker processes were used and a shared cache had to be organized, partly through memcached, partly through a file. In Go this is much simpler and more natural. But now, in the absence of an external cache, the cold start problem appears, and here I had to tinker a bit. First I tried to limit the number of simultaneous requests to the upstream on the nginx side (so that a hundred thousand goroutines would not start at once and bring everything down) with the module https://github.com/cfsego/nginx-limit-upstream , but it did not work very well (once the connection pool got clogged, it had trouble returning to normal mode even after the load was removed). In the end I patched the scgi module a little and added a limiter on the number of simultaneously executing requests: until some of the current requests finish processing, a new one will not be Accept()-ed:
 func ServeLimited(l net.Listener, handler http.Handler, limit int) error {
 	if limit <= 0 {
 		return Serve(l, handler)
 	}
 	if l == nil {
 		var err error
 		l, err = net.FileListener(os.Stdin)
 		if err != nil {
 			return err
 		}
 		defer l.Close()
 	}
 	if handler == nil {
 		handler = http.DefaultServeMux
 	}
 	sem := make(chan struct{}, limit)
 	for {
 		sem <- struct{}{} // blocks while `limit` requests are in flight
 		rw, err := l.Accept()
 		if err != nil {
 			return err
 		}
 		go func(rw net.Conn) {
 			serve(rw, handler)
 			<-sem
 		}(rw)
 	}
 }

The scgi module was also chosen for performance reasons: net/http/fcgi for some reason turned out to be slower than plain net/http (and did not support persistent connections), while net/http additionally loads the OS with generating TCP packets and maintaining internal TCP connections (although it can technically be made to listen on a unix socket), and if you can get rid of that, why not? Using nginx as a frontend brings its own advantages: control of timeouts, logging, forwarding failed requests to other servers in the cluster, all with minimal extra load on the server. Another advantage of this approach: when netstat -Lan shows the accept queue growing on the scgi socket, it means we have an overload somewhere and something needs to be done.

Code quality and debugging

The net/http/pprof package is a magical thing! It is something like Apache's server-status module, but for a Go daemon. By the way, I would not recommend enabling it in production if you use DefaultServeMux as your http handler, since the package becomes available to everyone at /debug/pprof/. I have the opposite problem: to get access to the package's functions over http, I run a separate mini-server on localhost:
 go func() {
 	log.Println(http.ListenAndServe("127.0.0.1:8081", nil))
 }()

Besides CPU and memory profiles, this module lets you browse the stacks of all currently running goroutines: the whole chain of functions being executed and their states. /debug/pprof/goroutine?debug=1 gives a list of distinct goroutines and their states, and /debug/pprof/goroutine?debug=2 gives a list of all launched goroutines, including duplicates (i.e. those in completely identical states). Here is an example of one of them:
 goroutine 85 [IO wait]:
 net.runtime_pollWait(0x800c71b38, 0x72, 0x0)
 	/usr/local/go/src/pkg/runtime/netpoll.goc:146 +0x66
 net.(*pollDesc).Wait(0xc20848daa0, 0x72, 0x0, 0x0)
 	/usr/local/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
 net.(*pollDesc).WaitRead(0xc20848daa0, 0x0, 0x0)
 	/usr/local/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
 net.(*netFD).accept(0xc20848da40, 0x8df378, 0x0, 0x800c6c518, 0x23)
 	/usr/local/go/src/pkg/net/fd_unix.go:409 +0x343
 net.(*UnixListener).AcceptUnix(0xc208273880, 0x8019acea8, 0x0, 0x0)
 	/usr/local/go/src/pkg/net/unixsock_posix.go:293 +0x73
 net.(*UnixListener).Accept(0xc208273880, 0x0, 0x0, 0x0, 0x0)
 	/usr/local/go/src/pkg/net/unixsock_posix.go:304 +0x4b
 bitbucket.org/mjl/scgi.ServeLimited(0x800c7ec58, 0xc208273880, 0x800c6c898, 0x8df178, 0x1f4, 0x0, 0x0)
 	/home/user/go/src/bitbucket.org/mjl/scgi/scgi.go:177 +0x20d
 main.func·008()
 	/home/user/repo/main.go:264 +0x90
 created by main.main
 	/home/user/repo/main.go:265 +0x1f5c

This helped me uncover a bug with locks (under certain conditions RUnlock() was called twice, which must never be done): in the stack dump I saw a whole pack of blocked goroutines and the line numbers of the RUnlock() call.

The CPU profile is also quite good; I recommend installing gv (ghostview) and looking at the graph of transitions between functions with counters: you can immediately see what deserves attention and optimization.

go vet is a useful utility, but my main use of it boiled down to warnings about wrong format specifiers in printf() calls, which the compiler cannot detect. To obviously bad code like

 if UintValue < 0 {
 	DoSomething()
 }

vet does not react.

The main work of checking the code is done by the compiler. It duly complains about unused variables and packages, but neither the compiler nor vet reacts to unused fields in structures (not even with a warning), although those are worth paying attention to as well.

You should be careful with the := operator. I had a case where I needed to compute the difference of two uint values, including correctly treating a negative difference as negative, and the code

  var a, b uint
  ...
  diff := a - b

computes not what you might expect: you need a cast to a signed type (or to avoid unsigned types altogether).
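A minimal sketch of the fix, casting to a signed type before subtracting:

 var a, b uint = 3, 5
 signed := int64(a) - int64(b) // -2, as intended
 wrapped := a - b              // 18446744073709551614 on a 64-bit platform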

There is also a good practice of giving the same underlying data type different names for different purposes. For example:
 type ServerIdType uint32
 type CustomerIdType uint32

 var ServerId ServerIdType
 var CustomerId CustomerIdType
Now the compiler will not let you simply assign the ServerId value to the CustomerId variable (without a type conversion), even though both are uint32 inside. This protects against all sorts of typos, although you do end up using type conversions more often, especially when initializing variables.

Packages, libraries and interaction with C

An important role in Go's popularity was played by a convenient (alas, not in terms of performance; there are still some problems with that) mechanism for interacting with C libraries. By and large, a significant portion of Go libraries are just wrappers over their C counterparts. For example, the github.com/abh/geoip and github.com/jbarham/gopgsqldriver packages are compiled with -lGeoIP and -lpq respectively (though I use the native Go PostgreSQL driver, github.com/lib/pq).

As an example, consider the almost standard crypt() function from unistd.h. This function comes out of the box in many languages; in nginx's Perl module, for instance, it can be used without loading any additional modules, which is handy. But not in Go: here you have to bridge to C yourself. This is done trivially (in the example, the salt is cut off from the result):
 // #cgo LDFLAGS: -lcrypt
 // #include <unistd.h>
 // #include <stdlib.h>
 import "C"

 import (
 	"sync"
 	"unsafe"
 )

 var cryptMutex sync.Mutex

 func Crypt(str, salt string) string {
 	cryptStr := C.CString(str)
 	cryptSalt := C.CString(salt)
 	cryptMutex.Lock()
 	key := C.GoString(C.crypt(cryptStr, cryptSalt))[len(salt):]
 	cryptMutex.Unlock()
 	C.free(unsafe.Pointer(cryptStr))
 	C.free(unsafe.Pointer(cryptSalt))
 	return key
 }
The lock is needed because crypt() returns the same char* pointing to internal state: the resulting string must be copied out, otherwise it will be overwritten by the next call. In other words, the function is not thread-safe.

database/sql

For every Db handle in use, I recommend setting a maximum connection limit and some non-zero limit on idle connections:
 db.SetMaxOpenConns(30)
 db.SetMaxIdleConns(8)
The first avoids overloading the database and keeps it in its peak-performance mode (as the number of simultaneous connections grows, database performance at some point starts to fall; there is an optimal number of concurrent queries), and the second removes the need to open a new connection on every request, which for PostgreSQL with its fork() model is especially important. Of course, with PostgreSQL you can also use pgpool or pgbouncer, but that is all extra overhead of pushing data through the kernel plus additional delays, so it is best to ensure connection persistence right at the application level.

To avoid the overhead of parsing the query and building a plan, use prepared statements instead of direct queries. But keep in mind that in some cases the query planner may not pick the most optimal plan, since the plan is built when the query is parsed (not when it is executed), and the planner does not always have enough data to know which index is preferable. By the way, placeholders for variables in the Go PostgreSQL driver are written $1, $2, and so on, rather than ? as in Perl.
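A minimal sketch of a prepared statement with PostgreSQL-style placeholders (the table, column and variable names are invented for the example):

 stmt, err := db.Prepare("SELECT name FROM servers WHERE id = $1 AND dc = $2")
 if err != nil {
 	log.Fatal(err)
 }
 defer stmt.Close()

 var name string
 // the prepared plan is reused on every call
 if err := stmt.QueryRow(serverId, dcId).Scan(&name); err != nil {
 	log.Fatal(err)
 }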

sql.Rows.Scan() has one peculiarity: it does not understand renamed string types, such as type DomainNameType string. You have to declare a temporary variable of plain string type, load the data from the database into it, and then do an assignment with a type conversion. With renamed numeric types, for some reason, there is no such problem.
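The workaround sketched out (DomainNameType as in the text, the rows variable assumed to come from a query):

 type DomainNameType string

 var domain DomainNameType
 var tmp string // Scan() wants a plain string here
 if err := rows.Scan(&tmp); err != nil {
 	return err
 }
 domain = DomainNameType(tmp)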

Channels and sync

There is a somewhat mistaken opinion that since Go has channels, one should use them and nothing but them. That is not quite true: every task has its own tool. Channels are great for passing messages of various kinds, but it is perfectly legitimate to use mutexes for shared resources, for example an SQL cache. To work with the cache through channels, you would have to write a request manager, which ultimately limits cache access performance to a single core, adds yet more work for the scheduler, and adds the overhead of copying data into and out of the channel; on top of that, you need to create a temporary reply channel each time for the answer to get back to the calling function. Code using channels also often ends up considerably larger and more complex than code with mutexes (oddly enough). But with mutexes you must be extremely careful not to end up in a deadlock.
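For comparison, a minimal sketch of such a mutex-guarded cache (the names are invented for the example); unlike a channel-based manager goroutine, concurrent readers here can proceed in parallel on different cores:

 type SqlCache struct {
 	sync.RWMutex
 	data map[string]string
 }

 func (c *SqlCache) Get(key string) (string, bool) {
 	c.RLock()
 	v, ok := c.data[key]
 	c.RUnlock()
 	return v, ok
 }

 func (c *SqlCache) Put(key, value string) {
 	c.Lock()
 	c.data[key] = value
 	c.Unlock()
 }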

Go has a curious thing called struct{}: a completely empty structure, with no fields at all. It occupies zero space; an array of such structures of any size also occupies zero space; and a buffered channel of empty structures likewise occupies zero space (apart from its internal bookkeeping, of course). In effect, a buffered channel of empty structures is a semaphore, and the compiler even has a separate code path for it. If you need a semaphore with Go syntax, use chan struct{}.

The sync package is a bit saddening. For example, there are no spinlocks, even though they are very useful because they are fast (although with a GC around, spinlocks become a risky business). And mutex operations do not get inlined (as far as I can tell). Even more frustrating is the inability to upgrade an RWMutex lock: if the lock is held as RLock and it turns out changes are needed, you must RUnlock(), then Lock(), and then re-check whether the changes are still needed or someone has already made them. There is also no non-blocking TryLock(), and again it is unclear why, since in some cases it is badly needed. Here the language developers, with their "we know better how you should program", have IMHO gone too far.
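For what it is worth, the missing TryLock can be emulated with a one-slot channel; this is a sketch of the absent primitive, not a standard library API:

 type TryMutex chan struct{}

 func NewTryMutex() TryMutex { return make(TryMutex, 1) }

 func (m TryMutex) Lock()   { m <- struct{}{} }
 func (m TryMutex) Unlock() { <-m }

 // TryLock returns immediately: true if the lock was taken, false if busy.
 func (m TryMutex) TryLock() bool {
 	select {
 	case m <- struct{}{}:
 		return true
 	default:
 		return false
 	}
 }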

In some cases sync/atomic and its atomic operations help avoid mutexes. For example, my code frequently uses the current timestamp as a uint32: I keep it in a global variable and at the start of each request simply store the current value into it atomically. I know it is a slightly dirty approach and a helper function could have been written, but in the fight for performance one sometimes makes such sacrifices: I can now use this variable in arithmetic expressions without special restrictions.
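Roughly like this (the variable and function names are assumptions):

 var currentTime uint32 // global unix timestamp, seconds

 // called at the start of every request
 func refreshTime() {
 	atomic.StoreUint32(&currentTime, uint32(time.Now().Unix()))
 }

 // reads elsewhere are plain, which is the admittedly dirty part:
 // age := currentTime - record.createdAt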

There is another good optimization technique for cases when some shared data is updated in only one place (for example, periodically) and is otherwise used read-only. The point is that there is no need for RLock()/RUnlock() on reads (or Lock()/Unlock() on updates): the update function can load the data into a new memory area and then atomically replace the pointer to the old data with a pointer to the new. True, in Go the atomic pointer store requires the unsafe.Pointer type, and you end up with a construction like this:
 atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&Data)), unsafe.Pointer(&newData)) 
But you can then use this data in any expression without worrying about locks. This matters especially in Go, because seemingly short locks can in fact last very long, all because of the GC.
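Here is a minimal sketch of the whole pattern, with the reader side using an atomic load (the Table type is invented for the example):

 type Table struct{ servers []string } // read-mostly shared data

 var Data *Table

 // writer: build a complete new copy, then publish it with one atomic store
 func publish(newData *Table) {
 	atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&Data)), unsafe.Pointer(newData))
 }

 // reader: snapshot the pointer once, then use it without any locks
 func snapshot() *Table {
 	return (*Table)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&Data))))
 }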

GC (garbage collector)

It drank a fair share of my blood :(. Imagine the situation: you run a load test and everything is fine. You let live traffic in and everything is fine too. And then bam: everything turns bad or very, very bad; old requests hang, new ones keep arriving (several thousand per second), you have to restart the application, after which everything is sluggish again because the cache has to warm up from scratch, but at least it works, and after a while things return to normal. I measured the execution time of each request-processing stage and saw that periodically the execution time of all stages jumps to three seconds or more, even for stages that take no locks, touch neither the database nor files, do purely local computation, and normally fit into microseconds. It became clear that the source of the problem was not some external factor but the platform itself, or rather the garbage collector.

It is good that in Go you can see GC statistics via runtime/debug.ReadGCStats(); there is something to marvel at. In my case, on the least loaded server the GC worked in the following mode:
0.06
0.30
2.00
0.06
0.30
2.00
...
The order of magnitude held steady, though the numbers themselves varied a little. These are the application pause durations during GC, most recent at the top. Pausing all work for 2 seconds, how do you like that? I am afraid to even imagine what was happening on the busiest servers, but I did not touch them further, so as not to create needless downtime.
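The numbers above come from something along these lines (a minimal sketch; Pause[0] holds the most recent pause):

 var s debug.GCStats
 debug.ReadGCStats(&s)
 log.Printf("GC runs: %d, recent pauses: %v", s.NumGC, s.Pause)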

The solution is to run GC() more often, and for reliability it is better to do it from the program itself. You can do it simply periodically, but I got slightly fancier and made a request counter, plus a forced GC() run after major purges of stale data. As a result, GC() started running every ten to twenty seconds instead of every few minutes, but each pass stays stably around 0.1s: quite another matter! That is a twentyfold reduction in pause time, whichever way you look at it. If the GC gives you similar trouble, forcing it to run more frequently is worth a try.
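The simple periodic variant looks like this (my actual trigger was a request counter plus forced runs after cache purges; the 15-second interval here is an assumption):

 go func() {
 	for {
 		time.Sleep(15 * time.Second)
 		runtime.GC() // frequent forced collections keep each pause short
 	}
 }()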

maps

Another thing that upset me (coming from Perl) is maps. In essence, working with a Go map looks as if the following functions had been declared:
 valueType, ok := map_fetch(keyType)
 map_store(keyType, valueType)
 map_delete(keyType)
And that is all. A map element cannot be modified in place; you cannot take its address (since the map can grow and its elements may move in memory), so for a map of structures this does not compile:
 type T struct {
 	cnt int
 }

 m := make(map[int]T)
 m[0] = T{}
 m[0].cnt++ // ./main.go:9: cannot assign to m[0].cnt
That is, you cannot take m[0] and increment its cnt field in place.

You have to either use a map of pointers:

 m := make(map[int]*T)
 m[0] = new(T)
 m[0].cnt++

or copy the value through a temporary variable:

 m := make(map[int]T)
 tmp := m[0]
 tmp.cnt++
 m[0] = tmp
The second way is slower: the data is copied twice (there and back).

It would be convenient if, in addition to the implicit map_store, there were an operation along the lines of

 *valueType = map_allocate(keyType)

returning a pointer to the element.

Such a map_allocate could create the element when it is missing and return its address. Yes, a pointer into a map is dangerous, since elements can move in memory when the map grows, but for an immediate read-modify-write of a struct field it would suffice.

For now you have to make do with the workarounds above. In principle one could dig into the map internals via unsafe, but that ties the code to a particular runtime implementation and is too fragile to rely on.

Perhaps there are good reasons for this behavior; who knows what the runtime does with map elements under the hood? Still, for maps of structures it makes perfectly ordinary in-place updates needlessly verbose and slow.

Summary

Overall, I liked the language. Yes, it has its shortcomings, and I have described the ones that bit me. But despite them, Go proved fast and convenient enough for this kind of high-load backend, and the rewrite paid off with a more than threefold performance margin. I will keep using Go for performance-critical services.

Previously my toolbox consisted of C (under FreeBSD), plus Perl and shell scripting (for the small stuff). Python, Ruby and JS I know but do not love; tastes differ. Now Go has taken its place in that toolbox.

Source: https://habr.com/ru/post/229169/

