Gearman: a framework for task distribution, an introduction

In this article, I would like to look at one of the less common ways to optimize an application: using the Gearman project to distribute tasks. Gearman is a framework for building such systems. The article is introductory rather than a step-by-step tutorial, although it contains enough practical information.

In any project that is reasonably complex in functionality and load, the question of optimization and scalability arises sooner or later. There are many approaches to this set of problems, ranging from simply increasing the computing power of the whole system or of its most heavily loaded parts, to complex solutions built on specialized software and hardware. What unites these solutions is not only the goal (faster, higher, stronger) but also the approach: finding and identifying weak points. Often the bottleneck is a resource-intensive task such as image processing, encryption, archiving, heavy database queries, or processing and returning a large amount of data.

Gearman is an open source project developed by the people at Danga Interactive. The name is simply an anagram of the word "manager", and the role of a manager is a good short description of what the application does: it manages, controls, and distributes various tasks. Gearman was originally implemented in Perl, but over time it was rewritten in C using the libevent library, which is required by the main part, the task server. Installation on any *nix system is straightforward, and most Linux distributions include the gearman package in their standard repositories.
Here it is worth explaining why the word "framework" appears in the title of this article. Although Gearman itself is relatively simple to use, an effective solution built on it requires fairly serious development. It is not a ready-made solution.

An application built on Gearman works with three main components:

the task server is the central part of Gearman: it receives tasks from clients and results from workers, and it sends tasks to workers and results to clients.
the worker is the part where the actual functionality is implemented. It accepts a task from the server and tries to perform it.
the client is also implemented separately. It creates a task, sends it to the server and, in some cases (more on them later), gets the result.

The client and the worker use the Gearman API to call and implement the necessary functions. A diagram illustrating this architecture can be found on the main project site.

Consider a classic example of how such a system works: a user has uploaded a photo to the site and wants to show it to all his friends. Several sizes of this photo have to be prepared quickly for use throughout the site. Up to a certain point, all the image generation can be done in the main application code, but then the Habr effect kicks in and the whole system collapses under the influx of visitors.

Now let us see how this task can be solved with Gearman. When we receive a photo, we create a task for the Gearman server that contains the necessary information about the photo (in principle, even the photo itself can be included in the task, but that seems expensive to me, so it is better to pass its address) and submit it for execution. The task server selects a free worker in round-robin fashion (as you have probably guessed, there can be several workers) and sends it the task. The worker receives all the necessary information, processes it as required, and reports that the task completed successfully, returning the result of the work if the task was synchronous. Yes, there are two types of tasks: synchronous, where the client waits for the submitted task to finish and for its result, and asynchronous, where the client only initiates the task.

Add to this the fact that the parts of a Gearman application communicate over TCP/IP, so each part (server, client, worker) can live on a separate machine. There can be several servers and several clients. Client and worker implementations are independent of each other (the languages with an API for Gearman include Perl, PHP, Python, Java, C, MySQL UDF, and others). The project is under active development: new versions with bug fixes and improvements are released every month.
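
The photo flow described above can be sketched with a toy, in-memory model. This is only an illustration of the idea, not the Gearman API: the names ToyTaskServer, make_worker, submit and submit_background, as well as the photo address, are made up for the example, and the "background" task simply runs inline instead of being queued.

```python
import itertools
import uuid


class ToyTaskServer:
    """In-memory stand-in for a task server (hypothetical, not the Gearman API)."""

    def __init__(self, workers):
        # Hand tasks to the workers in turn (round robin).
        self._workers = itertools.cycle(workers)
        self.background_results = {}

    def submit(self, name, workload):
        """Synchronous task: the caller waits and receives the result."""
        worker = next(self._workers)
        return worker(name, workload)

    def submit_background(self, name, workload):
        """Asynchronous task: the caller only gets a job handle back."""
        handle = uuid.uuid4().hex
        worker = next(self._workers)
        # A real server would queue the task; the toy version runs it inline.
        self.background_results[handle] = worker(name, workload)
        return handle


def make_worker(worker_id):
    """Create a worker function that pretends to process a photo."""
    def worker(name, workload):
        return f"{worker_id} did {name} on {workload}"
    return worker


server = ToyTaskServer([make_worker("worker-1"), make_worker("worker-2")])
photo = "/uploads/photo-123.jpg"  # hypothetical address, not the image bytes

print(server.submit("resize", photo))
handle = server.submit_background("resize", photo)
print("background job handle:", handle)
```

The workload here is the photo's address rather than the photo itself, which matches the recommendation in the text.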

As a result, we get a rather rosy picture in which the range of tasks is limited only by the imagination and experience of the developer. For example, here are a few common tasks that have been solved successfully with Gearman:

the photo processing task described above, which need not be limited to generating different sizes
asynchronous cache invalidation / update, without cron jobs and without delays on the client side
archiving large amounts of data
working with key => value data, with encryption performed on the worker side

Now for some practical information. The main unit of work in Gearman is the task, which consists of a type (synchronous or asynchronous), a name (for example, resize), and parameters, called the workload. Each task is identified by its name / parameters pair. That is, if the server receives two identical tasks with the same parameters, only the first is executed and the server drops the second. This nuance leads to many surprises, so always ask yourself whether a particular task must be executed without fail. If it must, add a unique value to the parameters.
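
The effect of identifying a task by its name / parameters pair can be shown with a small toy model. It is a sketch under the assumption stated in the text (identical pairs are dropped) and does not use the Gearman API. DedupQueue and with_unique_suffix are hypothetical names introduced for the example.

```python
import uuid


class DedupQueue:
    """Toy queue that drops a task whose (name, workload) pair was seen before."""

    def __init__(self):
        self._seen = set()
        self.accepted = []

    def submit(self, name, workload):
        key = (name, workload)
        if key in self._seen:
            return False  # identical task: dropped
        self._seen.add(key)
        self.accepted.append(key)
        return True


def with_unique_suffix(workload):
    """Append a random token so that repeated submissions never collide."""
    return f"{workload}#{uuid.uuid4().hex}"


tasks = DedupQueue()
print(tasks.submit("resize", "/uploads/photo-123.jpg"))  # True
print(tasks.submit("resize", "/uploads/photo-123.jpg"))  # False
print(tasks.submit("resize", with_unique_suffix("/uploads/photo-123.jpg")))  # True
```

A worker that receives such a workload would have to strip the suffix again before using the address, so the separator should be one that cannot occur in the real data.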

The server is a daemon process. It has many configuration parameters: along with the user, port, and interface settings that are standard for such applications, there are several specific options. The number of attempts to execute a task is set with -j, --job-retries=. To use multiple threads, there is -t, --threads=. The server can persist the task queue in external storage so that the whole process can be restored after a restart without losing data. MySQL / Drizzle, memcached, PostgreSQL, or SQLite can serve as the storage.

Even more interesting is the option of using third-party protocols for client -> task server communication. The only such protocol implemented so far is HTTP. With this option, the Gearman server can be configured to accept tasks on a specific port via HTTP requests (the requested URI is translated to the task name, the HTTP body to the workload, and the HTTP headers to the task type), so the client implementation is reduced to a simple HTTP client on the calling side. This imposes no restrictions on the type or size of the task. With HTTP and Gearman, building efficiently balanced REST services becomes easier than ever. Tasks for Gearman can be created from virtually any programming language, since wrappers for the API calls exist for most environments.
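
To illustrate the URI / body / headers mapping described above, here is a minimal sketch that only builds an HTTP request with the Python standard library and does not send it. The host, port, the use of POST, and the X-Gearman-Background header name are assumptions made for the example and should be checked against the documentation of the gearmand version in use. build_task_request is a hypothetical helper.

```python
import urllib.request


def build_task_request(base_url, task_name, workload, background=False):
    """Build (but do not send) an HTTP request that describes one task."""
    headers = {"Content-Type": "application/octet-stream"}
    if background:
        # Assumed header name for an asynchronous task; verify before use.
        headers["X-Gearman-Background"] = "true"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{task_name}",  # URI -> task name
        data=workload,                              # body -> workload
        headers=headers,                            # headers -> task type
        method="POST",
    )


request = build_task_request(
    "http://localhost:8080",  # hypothetical host and port
    "resize",
    b"/uploads/photo-123.jpg",
    background=True,
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request).read()
```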

The work itself is carried out by the so-called worker. Here the choice of implementation tools is also very wide, ranging from the usual PHP, Python, and Java to user-defined functions in MySQL. The only requirement is support for the Gearman library, which may mean compiling additional modules. A worker is a background process: after starting, it registers all the tasks it can perform and then stays in memory, waiting for tasks from the server. A worker can stay connected to any number of servers.
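
The "register, then wait for tasks" life cycle described above can be sketched as follows. This is a toy model built on the standard library queue, not a real Gearman worker. ToyWorker, its methods, and the None sentinel used to stop the loop are assumptions made for the example.

```python
import queue


class ToyWorker:
    """Long-lived worker that only handles tasks it has registered."""

    def __init__(self):
        self._functions = {}

    def register(self, name):
        """Decorator that registers a handler under a task name."""
        def decorator(func):
            self._functions[name] = func
            return func
        return decorator

    def abilities(self):
        """Names of the tasks this worker can perform."""
        return sorted(self._functions)

    def run(self, jobs):
        """Process jobs until a None sentinel arrives; return the results."""
        results = []
        while True:
            job = jobs.get()  # blocks, like a worker waiting for the server
            if job is None:
                break
            name, workload = job
            handler = self._functions.get(name)
            if handler is None:
                results.append((name, "unsupported task"))
            else:
                results.append((name, handler(workload)))
        return results


worker = ToyWorker()


@worker.register("reverse")
def reverse(workload):
    return workload[::-1]


@worker.register("upper")
def upper(workload):
    return workload.upper()


jobs = queue.Queue()
for job in [("reverse", "gearman"), ("upper", "gearman"), ("resize", "x"), None]:
    jobs.put(job)

print(worker.abilities())  # ['reverse', 'upper']
print(worker.run(jobs))
```

Because the handlers live in the memory of the running process, adding or changing one means restarting the process.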

Now for the shortcomings. There are several of them, of varying seriousness. Since a worker is a long-lived process, any change to its functionality requires restarting the worker itself, and the same applies to adding or removing tasks that the worker can perform. That is, to add a new function or change an existing one, you have to restart every worker process involved in that task. When workers are scattered across different servers in different numbers, this is not trivial. And not every language is well suited to writing efficient daemon processes.

The server distributes tasks using a single algorithm: round robin. Yes, tasks can have one of two priority levels, high or low, but no finer control over execution is available. Moreover, it is impossible to know in advance which worker will perform a task, nor can a task be directed to a specific node, which greatly complicates debugging.
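
How coarse a two-level priority scheme is can be seen in a small sketch. This is a toy queue built on heapq, assuming only what the text states (two levels, with no finer control). It does not show how the Gearman server is implemented, and ToyPriorityQueue, HIGH, and LOW are hypothetical names.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # the smaller number is served first


class ToyPriorityQueue:
    """Two-level task queue: every HIGH task goes before any LOW task."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a level

    def push(self, name, workload, priority=LOW):
        heapq.heappush(self._heap, (priority, next(self._counter), name, workload))

    def pop(self):
        _priority, _seq, name, workload = heapq.heappop(self._heap)
        return name, workload

    def __len__(self):
        return len(self._heap)


tasks = ToyPriorityQueue()
tasks.push("archive", "logs-2011-06")
tasks.push("resize", "/uploads/photo-123.jpg", priority=HIGH)
tasks.push("archive", "logs-2011-07")

while tasks:
    print(tasks.pop())
```

The high-priority resize task is served first, and the two archive tasks keep their submission order. Nothing in this model lets the client choose which worker gets a task.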

The main shortcoming follows from those described above: the lack of any proper tools for managing the resulting system. Everything has to be done by hand, keeping the structure of the Gearman application in mind at all times.

Of course, there are solutions to all the problems described, and I will try to cover their implementation in the next article.

Now a few links to sources of information and other interesting things about Gearman:

main project site: gearman.org
project page on Launchpad: launchpad.net/gearmand
Google discussion group: groups.google.com/group/gearman/topics
examples of use in PHP: http://highload.com.ua/
service for launching Gearman servers: gearmanhq.com

Source: https://habr.com/ru/post/123451/
