
Go: multithreading and parallelism

I love Go: I love praising it (and occasionally even evangelizing it), and I love articles about it. I read the article "Go: Two years in production", then the comments. One thing became clear: Habr is full of optimists who want to believe the best.

By default, Go runs on a single OS thread, multiplexing goroutines with its own scheduler and asynchronous calls; the programmer still gets the feel of concurrency and parallelism. In this mode the channels are very fast. But if you tell Go to use 2 or more OS threads, it starts taking locks, and channel performance can drop. I do not want to limit my use of channels; moreover, most third-party libraries use channels at every opportunity. So it is often more effective to run Go on one thread, as it does by default.

channel01.go
    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        numcpu := runtime.NumCPU()
        fmt.Println("NumCPU", numcpu)
        //runtime.GOMAXPROCS(numcpu)
        runtime.GOMAXPROCS(1)
        ch1 := make(chan int)
        ch2 := make(chan float64)
        go func() {
            for i := 0; i < 1000000; i++ {
                ch1 <- i
            }
            ch1 <- -1
            ch2 <- 0.0
        }()
        go func() {
            total := 0.0
            for {
                t1 := time.Now().UnixNano()
                for i := 0; i < 100000; i++ {
                    m := <-ch1
                    if m == -1 {
                        ch2 <- total
                    }
                }
                t2 := time.Now().UnixNano()
                dt := float64(t2-t1) / 1000000.0
                total += dt
                fmt.Println(dt)
            }
        }()
        fmt.Println("Total:", <-ch2, <-ch2)
    }


    users-iMac:channel user$ go run channel01.go
    NumCPU 4
    23.901
    24.189
    23.957
    24.072
    24.001
    23.807
    24.039
    23.854
    23.798
    24.1
    Total: 239.718 0

Now let's enable all the cores by swapping which line is commented out.

  runtime.GOMAXPROCS(numcpu) //runtime.GOMAXPROCS(1) 


    users-iMac:channel user$ go run channel01.go
    NumCPU 4
    543.092
    534.985
    535.799
    533.039
    538.806
    533.315
    536.501
    533.261
    537.73
    532.585
    Total: 5359.113 0


20 times slower? What's the catch? The channel is unbuffered: its default capacity is 0.

  ch1 := make(chan int) 


Let's set the buffer to 100.

  ch1 := make(chan int, 100) 


result 1 thread
    users-iMac:channel user$ go run channel01.go
    NumCPU 4
    9.704
    9.618
    9.178
    9.84
    9.869
    9.461
    9.802
    9.743
    9.877
    9.756
    Total: 0 96.848


result 4 threads
    users-iMac:channel user$ go run channel01.go
    NumCPU 4
    17.046
    17.046
    16.71
    16.315
    16.542
    16.643
    17.69
    16.387
    17.162
    15.232
    Total: 0 166.77300000000002


Only twice as slow now, but a buffer cannot always be used.

Example: a channel of channels


    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        numcpu := runtime.NumCPU()
        fmt.Println("NumCPU", numcpu)
        //runtime.GOMAXPROCS(numcpu)
        runtime.GOMAXPROCS(1)
        ch1 := make(chan chan int, 100)
        ch2 := make(chan float64, 1)
        go func() {
            t1 := time.Now().UnixNano()
            for i := 0; i < 1000000; i++ {
                ch := make(chan int, 100)
                ch1 <- ch
                <-ch
            }
            t2 := time.Now().UnixNano()
            dt := float64(t2-t1) / 1000000.0
            fmt.Println(dt)
            ch2 <- 0.0
        }()
        go func() {
            for i := 0; i < 1000000; i++ {
                ch := <-ch1
                ch <- i
            }
            ch2 <- 0.0
        }()
        <-ch2
        <-ch2
    }


result 1 thread
    users-iMac:channel user$ go run channel03.go
    NumCPU 4
    1041.489

result 4 threads
    users-iMac:channel user$ go run channel03.go
    NumCPU 4
    11170.616

So if you have 8 cores and are writing a server in Go, don't rely entirely on Go to parallelize your program. It may be better to run 8 single-threaded processes behind a load balancer, which can also be written in Go. In our production we had a server that started handling 10% fewer requests after moving from a single-core machine to a 4-core one.

What do these numbers mean? Our task was to handle 3000 requests per second within a single context (for example, giving each request consecutive numbers: 1, 2, 3, 4, 5... possibly something a bit more complex), and at that rate the bottleneck is primarily the channels. Adding threads and cores does not increase performance as much as one would like; on modern hardware, 3000 such requests per second is something of a ceiling for Go.

Night Update: How not to optimize



The comments on "Go: Two years in production" prompted me to write this article, and the comments here have surpassed those on the first.

The user Cybergrind suggested the following optimization, which 8 other Habr users have already upvoted. I do not know whether they read the code or simply act on intuition, but I will walk through it, so the article becomes more complete and informative.
Here is the code:

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        numcpu := runtime.NumCPU()
        fmt.Println("NumCPU", numcpu)
        //runtime.GOMAXPROCS(numcpu)
        runtime.GOMAXPROCS(1)
        ch3 := make(chan int)
        ch1 := make(chan int, 1000000)
        ch2 := make(chan float64)
        go func() {
            for i := 0; i < 1000000; i++ {
                ch1 <- i
            }
            ch3 <- 1
            ch1 <- -1
            ch2 <- 0.0
        }()
        go func() {
            fmt.Println("TT", <-ch3)
            total := 0.0
            for {
                t1 := time.Now().UnixNano()
                for i := 0; i < 100000; i++ {
                    m := <-ch1
                    if m == -1 {
                        ch2 <- total
                    }
                }
                t2 := time.Now().UnixNano()
                dt := float64(t2-t1) / 1000000.0
                total += dt
                fmt.Println(dt)
            }
        }()
        fmt.Println("Total:", <-ch2, <-ch2)
    }


What is the essence of this optimization?

1. Channel ch3 was added. It blocks the second goroutine until the first goroutine has finished.
2. While the first goroutine is filling ch1, the second does not read from it, so ch1 would block the sender; its buffer is therefore increased to the required 1,000,000.

That is, the code is no longer concurrent: it runs serially, and the channel is used as an array. Naturally, such code cannot use a second core, and in its context one cannot speak of a "perfect N-times speedup".

Most importantly, such code only works with an amount of data known in advance; it cannot run continuously and process a live stream of information indefinitely.

Update 2: Tests for Go 1.1.2



test number one with buffer 1 (channel01.go)

  ch1 := make(chan int, 1)


1 thread
    go run channel01.go
    NumCPU 4
    66.0038
    66.0038
    67.0038
    66.0038
    67.0038
    66.0038
    65.0037
    67.0038
    67.0039
    76.0043
    Total: 0 673.0385000000001


4 threads
    go run channel01.go
    NumCPU 4
    116.0066
    186.0106
    112.0064
    117.0067
    175.01
    115.0066
    114.0065
    148.0084
    133.0076
    153.0088
    Total: 0 1369.0782

Conclusion: much better. It's hard to imagine why one would set a buffer of 1, but perhaps there is a use for it.

test number one with buffer 100 (channel01.go)

  ch1 := make(chan int, 100)


1 thread
    go run channel01.go
    NumCPU 4
    16.0009
    17.001
    16.0009
    16.0009
    16.0009
    16.0009
    17.001
    16.0009
    17.001
    16.0009
    Total: 0 163.00930000000002


4 threads
    go run channel01.go
    NumCPU 4
    66.0038
    66.0038
    67.0038
    66.0038
    67.0038
    66.0038
    65.0037
    67.0038
    67.0039
    76.0043
    Total: 0 673.0385000000001

Conclusion: twice as slow as under Go 1.0.2.

test number two (channel03.go)

1 thread
    go run channel03.go
    NumCPU 4
    1568.0897


4 threads
    go run channel03.go
    NumCPU 4
    12119.6932


About the same as under Go 1.0.2, slightly better in relative terms: 1:8 instead of 1:10.

Source: https://habr.com/ru/post/195574/

