
Work-stealing scheduler in Go

The job of the Go scheduler is to distribute runnable goroutines across OS threads, which may execute on one or more processors. In multithreaded computing, two scheduling paradigms have emerged: work sharing and work stealing.



With work stealing, thread migration happens less often than with work sharing. While all processors are busy, no threads migrate; migration is considered only when a processor becomes idle.


Starting with version 1.1, the Go scheduler follows the work-stealing scheme; it was written by Dmitry Vyukov. This article explains in detail how a work-stealing scheduler is designed and how it works in Go.


Scheduling Basics


The Go scheduler follows the M:N scheme and can use multiple processors. At any time, M goroutines must be distributed across N OS threads, which run on at most GOMAXPROCS processors. The Go scheduler uses the following terminology for goroutines, threads, and processors:

G — goroutine
M — OS thread (machine)
P — processor



Next, there are two run queues: a global one and a local one specific to each P. Each M must be assigned to a P. Ps may have no M if they are blocked or waiting for a system call to finish. At any time, there are at most GOMAXPROCS processors, P. At any time, only one M can run per P. The scheduler can create more Ms if required.
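To make this layout concrete, here is a toy model of the queue arrangement described above. The names (sched, globrunq, runq) are illustrative, not the runtime's actual structures:

```go
package main

import "fmt"

// g stands in for a goroutine.
type g struct{ id int }

// p holds a local run queue, specific to this processor.
type p struct {
	runq []*g
}

// sched models the global state: one global run queue shared by
// all Ps, plus at most GOMAXPROCS processors.
type sched struct {
	globrunq []*g
	ps       []*p
}

func newSched(gomaxprocs int) *sched {
	s := &sched{ps: make([]*p, gomaxprocs)}
	for i := range s.ps {
		s.ps[i] = &p{}
	}
	return s
}

func main() {
	s := newSched(4) // at most GOMAXPROCS Ps exist
	// A newly runnable G lands in the current P's local queue.
	s.ps[0].runq = append(s.ps[0].runq, &g{id: 1})
	fmt.Println(len(s.ps), len(s.ps[0].runq)) // 4 1
}
```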



Each scheduling round consists of finding a goroutine that is ready to run and executing it. In each round, the search proceeds in the following order:


runtime.schedule() {
    // only 1/61 of the time, check the global runnable queue for a G.
    // if not found, check the local queue.
    // if not found,
    //     try to steal from other Ps.
    //     if not, check the global runnable queue.
    //     if not found, poll network.
}

Once a ready-to-run G is found, it is executed until it blocks.


Note: It might seem that the global queue has an advantage over the local one, but checking the global queue periodically is also critical to prevent an M from executing only goroutines from its local queue.


"Theft" (Stealing)


When a new G is created or an existing G becomes runnable, it is pushed onto the current P's local queue of runnable goroutines. When P finishes executing a G, it tries to pop a G from its local queue. If the queue is empty, P picks another processor (P) at random and tries to steal half of the goroutines from its queue.
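The steal-half policy can be sketched on a plain slice-based queue. This is a toy model; the real runtime uses a fixed-size, lock-free ring buffer per P:

```go
package main

import "fmt"

// runQueue is a toy stand-in for a P's local queue of runnable goroutines.
type runQueue struct {
	gs []int // goroutine IDs, head of the queue at index 0
}

// stealHalf moves half of the victim's queued work onto the thief's
// queue, mirroring the scheduler's steal-half policy (toy model,
// not runtime code). It returns how many goroutines were stolen.
func (thief *runQueue) stealHalf(victim *runQueue) int {
	n := len(victim.gs) / 2
	if n == 0 {
		return 0
	}
	// Grab from the head of the victim's queue, oldest work first.
	stolen := victim.gs[:n]
	victim.gs = victim.gs[n:]
	thief.gs = append(thief.gs, stolen...)
	return n
}

func main() {
	p1 := &runQueue{gs: []int{1, 2, 3, 4, 5, 6}}
	p2 := &runQueue{} // P2's queue is empty, so it steals
	p2.stealHalf(p1)
	fmt.Println(len(p1.gs), len(p2.gs)) // 3 3
}
```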



In the example above, P2 cannot find any runnable goroutines. So it randomly picks another processor (P1) and steals three goroutines from its queue. P2 can now run them, and the work is distributed more evenly across the processors.


Spinning threads


The scheduler always wants to hand out as many runnable goroutines as possible to Ms in order to use all the processors, but at the same time it must be able to park overly greedy threads to save CPU resources and power. And the scheduler must also scale for tasks that genuinely require a lot of processing power and higher throughput.


Constant preemption is both expensive and problematic for high-performance programs where performance is most critical. Goroutines should not constantly bounce between OS threads, because that increases latency. On top of that, whenever system calls are invoked, the thread has to be constantly blocked and unblocked. This is expensive and adds a lot of overhead.


To reduce this bouncing of goroutines back and forth, the Go scheduler implements so-called spinning threads. Spinning threads use slightly more CPU, but they reduce thread preemption. A thread is considered spinning if:

- an M with a P assignment is looking for a runnable goroutine;
- an M without a P assignment is looking for available Ps.



At any time, there are at most GOMAXPROCS spinning Ms. When a spinning thread finds work, it leaves the spinning state.


An idle thread with a P assignment does not block if there are idle Ms without P assignments. When a new goroutine is created or an M is blocked, the scheduler checks that there is at least one spinning M. This ensures that every runnable goroutine can be run if possible, and avoids unnecessary blocking and unblocking of Ms.
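The spin-before-park idea can be sketched with a toy work queue whose consumer retries a bounded number of times before parking on a condition variable, and whose producer skips the expensive wakeup when a spinner will pick the work up anyway. All names and the spin budget here are illustrative, not the runtime's:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// workQueue is a toy queue whose consumers spin briefly before parking.
type workQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	items    []int
	spinning int32 // number of currently spinning consumers
}

func newWorkQueue() *workQueue {
	q := &workQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *workQueue) push(v int) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.mu.Unlock()
	// Wake a parked consumer only if nobody is spinning: a spinner
	// will find the item itself, without an expensive unpark.
	if atomic.LoadInt32(&q.spinning) == 0 {
		q.cond.Signal()
	}
}

// tryPop is the non-blocking fast path used while spinning.
func (q *workQueue) tryPop() (int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return 0, false
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v, true
}

// get spins up to spinBudget times looking for work, then parks.
func (q *workQueue) get() int {
	const spinBudget = 128
	atomic.AddInt32(&q.spinning, 1)
	for i := 0; i < spinBudget; i++ {
		if v, ok := q.tryPop(); ok {
			atomic.AddInt32(&q.spinning, -1)
			return v
		}
		runtime.Gosched() // yield instead of blocking immediately
	}
	atomic.AddInt32(&q.spinning, -1)
	// Spin budget exhausted: park until a producer signals.
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v
}

func main() {
	q := newWorkQueue()
	go q.push(42)
	fmt.Println(q.get()) // 42
}
```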


Conclusions


The Go scheduler does a lot to avoid excessive thread preemption, handing work to underutilized processors by stealing, and it also implements spinning threads to avoid frequent transitions between blocked and unblocked states.


Scheduling events can be traced with the execution tracer. You can get to the bottom of everything happening inside the scheduler, especially if you suspect that in your case the CPU is not being used efficiently.





If you have any suggestions or comments, write @rakyll .



Source: https://habr.com/ru/post/333654/

