
Multithreaded Go programming

Here was the problem: we have a compiler for our own programming language, a BASIC dialect, which translates it into C source code.

Unfortunately, for historical reasons, we never had a proper regression test for this compiler. But now, on the basis of the source code of a business application written in this BASIC, we decided to build a full test.

The plan is this: take some current version of the compiler, against which there are no open customer complaints, as the reference. Compile a decent number of sources with it and save the result; then, every time the compiler is changed, run all these sources through it again and check that exactly the same output is generated. This will not protect against errors in general, but at least it gives confidence that the existing business code still compiles correctly.
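Purely for illustration, the comparison step of such a check might look like this in modern Go (a hypothetical sketch; the "ref/" and "data/" directory names are my assumptions, not the project's actual layout):

package main

import (
	"bytes"
	"fmt"
	"os"
)

// check compares the compiler's fresh output for one file against the
// output saved from the reference compiler, byte for byte.
func check(name string) error {
	ref, err := os.ReadFile("ref/" + name) // saved reference output (assumed path)
	if err != nil {
		return err
	}
	out, err := os.ReadFile("data/" + name) // output of the current compiler (assumed path)
	if err != nil {
		return err
	}
	if !bytes.Equal(ref, out) {
		return fmt.Errorf("%s: output differs from reference", name)
	}
	return nil
}

func main() {
	if err := check("example.c"); err != nil { // hypothetical file name
		fmt.Println("regression:", err)
		os.Exit(1)
	}
}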
An easy task. There is just one "but": the number of sources to be used as the reference is about 15 thousand files, slightly less than a gigabyte in total (for convenience they are packed into a single TAR). Such a run can take very long, and there is a natural desire to make the test as fast as possible on a multiprocessor machine, because the task parallelizes perfectly.

One option is to write a Makefile and run it with GNU Make's "-j" flag. But a specialized multi-threaded program can achieve better performance.


So, the idea is obvious: instead of compiling the files sequentially, compile them in parallel threads. But since there are many files (~15 thousand), it is inefficient simply to launch that many threads at once. It is more reasonable to have a pool of threads whose size is determined, for example, by the number of processors (say, multiplied by 2). The pool assigns the next task to a free thread, and if all threads are busy, it blocks until one becomes free.

This way we keep N threads busy, ensuring optimal CPU utilization without wasting time on unnecessary context switches and on constantly creating and destroying threads.
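As an aside, this pool pattern fits in a dozen lines of modern Go. Here is a minimal sketch with placeholder job names (illustration only; the actual program used in the test follows below, in the pre-Go 1 dialect of the time):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	jobs := make(chan string) // unbuffered: a send blocks until a worker is free
	var wg sync.WaitGroup

	// Size the pool by the number of processors, times 2.
	for i := 0; i < runtime.NumCPU()*2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs { // loop until the channel is closed
				fmt.Println("processing", job)
			}
		}()
	}

	for _, job := range []string{"a.bas", "b.bas", "c.bas"} { // placeholder names
		jobs <- job // blocks while all workers are busy
	}
	close(jobs) // signal "no more work"
	wg.Wait()   // wait for the pool to drain
}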

At first I decided to write the whole thing in C++ with pthreads. After several hours of dancing around functors, mutexes, semaphores and condition variables, nothing really worked out for me. And then I remembered Go. Believe it or not, after an hour of work I had the first version ready, including little things like handling the TAR, the command line and launching the external process.

So: this program takes a TAR with source codes, unpacks it, and runs each file through the compiler.

I will say right away that the point of this write-up is to demonstrate one thing, and no more than that: how easy and convenient it is to write multi-threaded imperative programs in Go.

The main concept used in this program is channels. Channels can synchronously transfer data and functions between goroutines (Go's lightweight threads).
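A minimal illustration of a channel in action (modern Go syntax, hypothetical message):

package main

import "fmt"

func main() {
	ch := make(chan string) // unbuffered channel of strings

	go func() {
		ch <- "hello from a goroutine" // the send blocks until someone receives
	}()

	fmt.Println(<-ch) // the receive blocks until someone sends
}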

Now you can simply read the source. The most interesting part is seeing how the compile() function is called from several threads without any changes.

package main

import (
	"archive/tar"
	"container/vector"
	"exec"
	"flag"
	"fmt"
	"io"
	"os"
	"strings"
)

// Flags: the number of parallel jobs and the compiler name.
var jobs *int = flag.Int("jobs", 0, "number of concurrent jobs")
var compiler *string = flag.String("cc", "bcom", "compiler name")

func main() {
	flag.Parse()
	os.Args = flag.Args()
	args := os.Args
	ar := args[0]
	r, err := os.Open(ar, os.O_RDONLY, 0666)
	if err != nil {
		fmt.Printf("unable to open TAR %s\n", ar)
		os.Exit(1)
	}
	// defer works like "finally {}": the call is guaranteed
	// to run when the function exits.
	defer r.Close()

	// Unpack the TAR.
	fmt.Printf("- extracting %s\n", ar)
	// Create the archive reader.
	tr := tar.NewReader(r)
	tests := new(vector.StringVector)
	// Collect the list of files that will be fed
	// to the compiler.
	for {
		// Get the next file header from the archive.
		hdr, _ := tr.Next()
		if hdr == nil {
			break
		}
		name := &hdr.Name
		// Everything that is not a service file goes to compilation.
		if !strings.HasPrefix(*name, "HDR_") {
			tests.Push(*name)
		}
		// Create the output file.
		w, err := os.Open("data/"+*name, os.O_CREAT|os.O_RDWR, 0666)
		if err != nil {
			fmt.Printf("unable to create %s\n", *name)
			os.Exit(1)
		}
		// Copy the file contents from the archive.
		io.Copy(w, tr)
		w.Close()
	}

	fmt.Printf("- compiling...\n")
	*compiler, _ = exec.LookPath(*compiler)
	fmt.Printf("- compiler %s\n", *compiler)

	if *jobs == 0 {
		// Call compile() sequentially, one file at a time.
		fmt.Printf("- running sequentially\n")
		for i := 0; i < tests.Len(); i++ {
			compile(tests.At(i))
		}
	} else {
		// Call compile() from parallel goroutines.
		fmt.Printf("- running %d concurrent job(s)\n", *jobs)
		// Task channel: the file names to compile are sent through it,
		// and the runner goroutines take them from it. The channel is
		// buffered, so a send blocks only when all runners are busy.
		tasks := make(chan string, *jobs)
		// Completion channel for the runners. Each runner signals on it
		// when it has finished, so main() can wait for all of them.
		done := make(chan bool)
		// Start the runners.
		for i := 0; i < *jobs; i++ {
			go runner(tasks, done)
		}
		// Feed the file names into the task channel. If all the runners
		// are busy, the send blocks until one of them frees up.
		for i := 0; i < tests.Len(); i++ {
			tasks <- tests.At(i)
		}
		// Send each runner an empty name (the "stop" marker)
		// and wait for all of them to confirm completion.
		for i := 0; i < *jobs; i++ {
			tasks <- ""
			<-done
		}
	}
}

// Runner goroutine.
func runner(tasks chan string, done chan bool) {
	// Endless loop.
	for {
		// Take the next task; this blocks if the channel is empty.
		name := <-tasks
		// An empty name is the signal to stop.
		if len(name) == 0 {
			break
		}
		// Compile.
		compile(name)
	}
	// Signal that we are done.
	done <- true
}

func compile(name string) {
	// Run the external compiler.
	c, err := exec.Run(*compiler, []string{*compiler, name}, os.Environ(),
		"./data", exec.DevNull, exec.PassThrough, exec.PassThrough)
	if err != nil {
		fmt.Printf("unable to compile %s (%s)\n", name, err.String())
		os.Exit(1)
	}
	c.Wait(0)
}
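Note that the exec and container/vector packages above belong to the pre-Go 1 standard library and no longer exist in that form. Purely as orientation for modern readers, a sketch of compile() against today's os/exec might look like this (my sketch, not part of the original test):

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// compilerPath would be resolved once via exec.LookPath, as in the original.
var compilerPath = "bcom"

func compile(name string) {
	cmd := exec.Command(compilerPath, name)
	cmd.Dir = "./data"     // run in the data directory, as in the original
	cmd.Stdout = os.Stdout // pass the compiler's output through
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Printf("unable to compile %s (%v)\n", name, err)
		os.Exit(1)
	}
}

func main() {
	compile("example.bas") // hypothetical file name
}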

Makefile:

target = tar_extractor

all:
	6g $(target).go
	6l -o $(target) $(target).6

I ran all this under 64-bit Linux on an eight-processor blade. During testing I was alone on the machine, so the results of different runs are comparable. The file "huge.tar" contains ~15 thousand sources and is about one gigabyte in size.

This is what the CPU load looks like when the machine is doing nothing (all processors are almost 100% idle):

Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.3%sy, 0.0%ni, 99.3%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Run in sequential mode (-jobs 0):

  make && time -p ./tar_extractor -jobs 0 huge.tar 

Run time:

real 213.81
user 187.32
sys 61.33

Almost all the processors are 70-80% idle (all snapshots were taken during the compilation stage):

Cpu0 : 11.9%us, 4.3%sy, 0.0%ni, 82.5%id, 1.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 :  9.6%us, 2.7%sy, 0.0%ni, 87.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 :  4.3%us, 1.3%sy, 0.0%ni, 92.7%id, 1.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 16.0%us, 6.0%sy, 0.0%ni, 78.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 12.6%us, 4.3%sy, 0.0%ni, 82.7%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 11.6%us, 3.3%sy, 0.0%ni, 85.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 :  4.7%us, 1.3%sy, 0.0%ni, 94.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 16.6%us, 6.3%sy, 0.0%ni, 77.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

The total processor load of the process is 2.7%:

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
15054 tester 18  0 41420 4980 1068 S  2.7  0.1 0:02.96 tar_extractor

Now we start with the thread pool, but with only one worker thread (-jobs 1).

Time:

real 217.87
user 191.42
sys 62.53

Processors:

Cpu0 :  5.7%us, 1.7%sy, 0.0%ni, 92.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 13.3%us, 5.3%sy, 0.0%ni, 81.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 :  7.0%us, 2.7%sy, 0.0%ni, 89.3%id, 0.7%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 15.3%us, 5.7%sy, 0.0%ni, 77.7%id, 1.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 :  6.0%us, 2.0%sy, 0.0%ni, 92.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 14.3%us, 7.3%sy, 0.0%ni, 78.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 :  7.0%us, 2.3%sy, 0.0%ni, 90.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 15.3%us, 6.6%sy, 0.0%ni, 78.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Clearly the picture is the same, since we are effectively still driving a single thread.

And now we turn on the thread pool (-jobs 32):

  make && time -p ./tar_extractor -jobs 32 huge.tar 

The run time dropped almost sixfold (213.81 s down to 38.38 s):

real 38.38
user 195.55
sys 69.92

The total processor load (during the compilation stage) increased to 23%:

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
17488 tester 16  0 45900 9732 1076 S 23.6  0.1 0:06.40 tar_extractor

It can be seen that all processors are really busy:

Cpu0 : 56.3%us, 26.3%sy, 0.0%ni, 17.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 55.5%us, 27.9%sy, 0.0%ni, 15.6%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 56.1%us, 25.9%sy, 0.0%ni, 15.0%id, 0.7%wa, 0.3%hi, 2.0%si, 0.0%st
Cpu3 : 58.1%us, 26.2%sy, 0.0%ni, 15.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 57.2%us, 25.8%sy, 0.0%ni, 17.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 56.8%us, 26.2%sy, 0.0%ni, 16.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 59.0%us, 26.3%sy, 0.0%ni, 13.0%id, 1.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 56.5%us, 27.2%sy, 0.0%ni, 16.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

This testing is intended solely to show how even the simplest parallel code speeds everything up severalfold, and also to demonstrate how simple and relatively safe it is to program parallel computations in Go.

Let's go further and compare the efficiency of goroutines with native threads, for example in C++.

I admit I am no great fighter with Boost and modern C++, but the C++ solution came out quite elegant.

It was interesting to compare the performance of threads in the two languages in terms of the cost of creating them and assigning them work. As I understand it, this is a battle between pthreads and the goroutine system, where goroutines are not operating system threads. As the documentation says:

"Goroutines are multiplexed onto multiple OS threads so if one should block, such as while waiting for I/O, others continue to run. Their design hides many of the complexities of thread creation and management."

I took the latest Boost (1.46.1) and ran the experiment on the same eight-processor machine.

The program has to do a lot of identical work (in effect, just call a function). The tasks are multiplexed across several parallel threads. The function itself is elementary and fast; I hope this focuses the test on the threading subsystem rather than on the payload.

So, the Go program:

package main

import (
	"flag"
	"fmt"
)

var jobs *int = flag.Int("jobs", 8, "number of concurrent jobs")
var n *int = flag.Int("tasks", 1000000, "number of tasks")

func main() {
	flag.Parse()
	fmt.Printf("- running %d concurrent job(s)\n", *jobs)
	fmt.Printf("- running %d tasks\n", *n)
	tasks := make(chan int, *jobs)
	done := make(chan bool)
	for i := 0; i < *jobs; i++ {
		go runner(tasks, done)
	}
	for i := 1; i <= *n; i++ {
		tasks <- i
	}
	for i := 0; i < *jobs; i++ {
		tasks <- 0
		<-done
	}
}

func runner(tasks chan int, done chan bool) {
	for {
		if arg := <-tasks; arg == 0 {
			break
		}
		worker()
	}
	done <- true
}

func worker() int {
	return 0
}
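A side note on one design choice here: the program stops the runners by sending a sentinel value (0 here, the empty string in the extractor above). In modern Go the idiomatic way is to close the channel and range over it; a sketch of the same benchmark with that shutdown scheme:

package main

import "fmt"

func worker() {}

func runner(tasks chan int, done chan bool) {
	for range tasks { // the loop ends when tasks is closed and drained
		worker()
	}
	done <- true
}

func main() {
	const jobs = 8
	tasks := make(chan int, jobs)
	done := make(chan bool)
	for i := 0; i < jobs; i++ {
		go runner(tasks, done)
	}
	for i := 1; i <= 1000000; i++ {
		tasks <- i
	}
	close(tasks) // no sentinel values needed
	for i := 0; i < jobs; i++ {
		<-done
	}
	fmt.Println("done")
}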

A Makefile to run it over a series of task counts:

target = go_threading

all: build

build:
	6g $(target).go
	6l -o $(target) $(target).6

run:
	(time -p ./$(target) -tasks=$(args) \
	  1>/dev/null) 2>&1 | head -1 | awk '{ print $$2 }'

n = \
  10000 \
  100000 \
  1000000 \
  10000000 \
  100000000

test:
	@for i in $(n); do \
	  echo "`printf '% 10d' $$i`" `$(MAKE) args=$$i run`; \
	done

The C++ program:

#include <iostream>
#include <cstdlib>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <queue>
#include <string>
#include <sstream>

class thread_pool {
	typedef boost::function0<void> worker;

	boost::thread_group threads_;
	std::queue<worker> queue_;
	boost::mutex mutex_;
	boost::condition_variable cv_;
	bool done_;

public:
	thread_pool() : done_(false) {
		// One thread per hardware core.
		for (int i = 0; i < boost::thread::hardware_concurrency(); ++i)
			threads_.create_thread(boost::bind(&thread_pool::run, this));
	}

	void join() {
		threads_.join_all();
	}

	void run() {
		while (true) {
			worker job;
			{
				boost::mutex::scoped_lock lock(mutex_);
				// Sleep until there is work or the pool is finished.
				while (queue_.empty() && !done_)
					cv_.wait(lock);
				if (queue_.empty() && done_)
					return;
				job = queue_.front();
				queue_.pop();
			}
			execute(job);
		}
	}

	void execute(const worker& job) {
		job();
	}

	void add(const worker& job) {
		boost::mutex::scoped_lock lock(mutex_);
		queue_.push(job);
		cv_.notify_one();
	}

	void finish() {
		boost::mutex::scoped_lock lock(mutex_);
		done_ = true;
		cv_.notify_all();
	}
};

void task() {
	volatile int r = 0; // volatile keeps the call from being optimized away
}

int main(int argc, char* argv[]) {
	thread_pool pool;

	int n = argc > 1 ? std::atoi(argv[1]) : 10000;
	int threads = boost::thread::hardware_concurrency();

	std::cout << "- executing " << threads << " concurrent job(s)" << std::endl;
	std::cout << "- running " << n << " tasks" << std::endl;

	for (int i = 0; i < n; ++i) {
		pool.add(task);
	}
	pool.finish();
	pool.join();

	return 0;
}

Makefile:

BOOST = ~/opt/boost-1.46.1
target = boost_threading

build:
	g++ -O2 -I $(BOOST) -o $(target) \
	  -lpthread \
	  -lboost_thread \
	  -L $(BOOST)/stage/lib \
	  $(target).cpp

run:
	(time -p LD_LIBRARY_PATH=$(BOOST)/stage/lib ./$(target) $(args) \
	  1>/dev/null) 2>&1 | head -1 | awk '{ print $$2 }'

n = \
  10000 \
  100000 \
  1000000 \
  10000000 \
  100000000

test:
	@for i in $(n); do \
	  echo "`printf '% 10d' $$i`" `$(MAKE) args=$$i run`; \
	done

In both languages the number of threads equals the number of processors: 8. The number of tasks run through these eight threads varies.

Running the C++ program:

make && make -s test
g++ -O2 -I ~/opt/boost-1.46.1 -o boost_threading \
  -lpthread \
  -lboost_thread \
  -L ~/opt/boost-1.46.1/stage/lib \
  boost_threading.cpp
(time -p LD_LIBRARY_PATH=~/opt/boost-1.46.1/stage/lib ./boost_threading \
  1>/dev/null) 2>&1 | head -1 | awk '{ print $2 }'
     10000 0.03
    100000 0.35
   1000000 3.43
  10000000 29.57
 100000000 327.37

Go now:

make && make -s test
6g go_threading.go
6l -o go_threading go_threading.6
     10000 0.00
    100000 0.03
   1000000 0.35
  10000000 3.72
 100000000 38.27

The difference is obvious.

By the way, if you set the GOMAXPROCS=8 environment variable, thereby telling Go to use all 8 processors (by default it uses only one, regardless of how many physical processors there are), the speed of the Go program drops wildly and its time becomes almost equal to that of the C++ program. It turns out that the lightweight goroutines, multiplexed onto a single native thread, work faster here.
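For reference, the setting can also be changed from code via the runtime package, as in this sketch (note that in modern Go the default is the number of CPUs, not one, so the experiment above reflects the pre-Go 1 behavior):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) only queries the current setting;
	// GOMAXPROCS(n) for n > 0 changes it and returns the previous value.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
	runtime.GOMAXPROCS(8) // equivalent to launching with GOMAXPROCS=8
}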

Perhaps I am comparing apples and oranges and the results are simply not valid. I would be very grateful for a hint on how to measure this properly.


Source: https://habr.com/ru/post/117842/

