Prologue
There is often a need to manage tasks on a set of compute nodes. Even if a task can be automated with a script, there is still the need to start that script across the cluster, monitor its execution, and stop it. The task can be anything: fetching a file via wget, creating a dump of a local database, running a load test, archiving old files, and so on.
There are many systems for automated task management in a cluster. These include centralized batch task schedulers for distributed computing systems, such as HTCondor, TORQUE, Slurm, and others. Centralized schedulers usually consist of one or more master processes, which manage a set of worker processes running on each of the cluster hosts, where the tasks are executed. Such systems have taken root in the fields of high-throughput computing (HTC) and high-performance computing (HPC), but are rarely used for day-to-day server administration and in relatively small clusters, largely because of the difficulty of installing, using, and supporting them.
General description of the Prun scheduler
Prun provides control over the distributed execution of batch tasks on UNIX-like operating systems. Prun can be classified as a centralized scheduler, but with a simplified interface for describing and managing task execution. The prun master provides a task queue, scheduling based on task priorities and computing resources (CPU, RAM), a fault-tolerance mechanism for running tasks, and scheduled execution of tasks (similar to cron). The master process (pmaster) runs on one of the machines in the cluster.
The master is controlled entirely through the administration utility (prun). An instance of the worker must be running on each of the hosts. A worker consists of two processes: the worker process itself (pworker), which accepts commands from the master, responds to periodic heartbeat messages, and keeps the status of completed tasks; and the prexec process, which always runs as an unprivileged user and launches tasks (using fork + exec to spawn a process) while monitoring their execution status.
Creating and running the "Hello, world!" task
To run a task using prun, you need to create a script file in one of the supported languages (shell, python, ruby, js, java) and a .job file describing this task. Here are examples of the simplest shell script hello_world.sh and the hello_world.job file describing the task.
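A minimal sketch of what hello_world.sh could look like; the exact mechanism by which the worker passes task parameters to the script is an assumption here (environment variables named taskId, numTasks, and jobId, consistent with the sample output shown below):

#!/bin/sh
# Minimal hello_world.sh sketch. It is assumed that the worker exports
# the task parameters taskId, numTasks and jobId as environment variables.
echo "Hello, shell! taskId=$taskId numTasks=$numTasks jobId=$jobId"

And the hello_world.job file describing the task: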
{ "script" : "/home/nobody/jobs/hello_world.sh", "language" : "shell", "send_script" : true, "priority" : 4, "job_timeout" : 120, "queue_timeout" : 60, "task_timeout" : 15, "max_failed_nodes" : 10, "num_execution" : -1, "exec_unit_type" : "host" "max_cluster_instances" : -1, "max_worker_instances" : 1, "exclusive" : false, "no_reschedule" : false }
A job file is a set of JSON key-value pairs. Here is a description of some of the task properties:
- script - the path to the script file. If the file is located in the master's file system, then when "send_script" is set to true, the script file will be sent to each of the workers for execution.
- language - the language of the script. Specifies which interpreter (python, ruby, js, java, shell) to use to execute the script.
- priority - the integer priority value of the task. The smaller the value, the higher the priority of the task.
- job_timeout - timeout for executing the task on the whole cluster, from the moment of the first run on one of the workers until the last scheduled run.
- queue_timeout - the maximum allowable waiting time in the task queue.
- task_timeout - timeout value for one running task on the worker.
- max_failed_nodes - the maximum number of task instances that may fail with an error and/or worker machines that may become unavailable during the execution of this task. When this limit is reached, the master stops the execution of all instances of this task on the cluster.
- num_execution - the number of scheduled runs of task instances. If the value is negative and exec_unit_type == host, then the number of instances will be equal to the number of available workers in the cluster (see the sketch after this list).
- max_cluster_instances - the limit on the number of task instances running simultaneously on the entire cluster.
- max_worker_instances - the limit on the number of task instances running simultaneously on one worker machine.
- exclusive - if set to true, then while a task is executing on a worker, no other tasks will be scheduled to run on that worker at the same time.
There are other properties in the task description: a crontab-like schedule, a unique task name, black and white lists of workers, and others.
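For instance, here is a sketch of a job that schedules exactly five runs, with at most one instance per worker at a time; it only rearranges the properties documented above:

{
    "script" : "/home/nobody/jobs/hello_world.sh",
    "language" : "shell",
    "send_script" : true,
    "priority" : 4,
    "job_timeout" : 120,
    "queue_timeout" : 60,
    "task_timeout" : 15,
    "max_failed_nodes" : 10,
    "num_execution" : 5,
    "max_worker_instances" : 1,
    "exclusive" : false,
    "no_reschedule" : false
}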
In prun terminology, a job is conditionally divided into a number of tasks, where the number of tasks determines the number of runs of the job's script in the cluster. Each task is uniquely identified by the pair (job_id, task_id): each job has its own unique job_id, and task_id lies in the half-open interval [0, num_execution). The latter value can be used to determine which part of the job a given task should perform, similar to the rank of a process in MPI terms.
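To illustrate, here is a sketch of a script that uses its task number to process only its own share of the input; the input path and the taskId/numTasks/jobId environment variables are assumptions, as in the earlier example:

#!/bin/sh
# Hypothetical sharding sketch: each task handles every numTasks-th line
# of a common input file, selected by its own taskId, much like work is
# partitioned by MPI rank.
awk -v id="$taskId" -v n="$numTasks" 'NR % n == id' /home/nobody/jobs/input.txt |
while read -r line; do
    echo "task $taskId of job $jobId processing: $line"
done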
The task can be launched (more precisely, enqueued) with the command
./prun -c "run /home/nobody/jobs/hello_world.job"
If the worker was started in a terminal rather than in daemon mode, then the task run will print the following to standard output:
Hello, shell! taskId=0 numTasks=1 jobId=0
Master configuration
Each worker belongs to exactly one group of workers. The groups file (/etc/pmaster/groups) contains a list of worker groups, where each group name must match the name of a file containing the list of addresses of the hosts on which the workers run.
For example, if there are 8 machines that need to be split into two groups, you should create two files (my_group1 and my_group2) and write the host names (or IP addresses) into them, then write the two lines my_group1 and my_group2 to the groups file.
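A sketch of what these files might contain (all host names and addresses here are made up for illustration):

/etc/pmaster/groups:
my_group1
my_group2

my_group1:
host1.example.com
host2.example.com
192.168.0.11
192.168.0.12

my_group2:
host5.example.com
host6.example.com
192.168.0.15
192.168.0.16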
You can also create and delete groups, and add or remove workers from a group, at any time after the master is started.
The master and worker configs are located in the master.cfg (/etc/pmaster/master.cfg) and worker.cfg (/etc/pworker/worker.cfg) files, respectively.
Example of creating a cron task
The process of creating a task with scheduled execution does not differ from the "Hello, world!" example described above, except that you must add the name and cron properties. The task name must be unique, since it serves as the identifier of the cron task. The schedule format corresponds to the crontab format (man 5 crontab). For example, for the task to start every Sunday at 3 AM, the value of the cron key must be "0 3 * * 0".
An example of a task with a schedule can be found in jobs/cron.job in the project repository. From the description of that task, it follows that it will be run on every cluster worker exactly once every minute.
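As a rough sketch (not a verbatim copy of the repository file), such a job description might combine the properties shown earlier with the name and cron keys; "* * * * *" means "every minute" in crontab notation:

{
    "script" : "/home/nobody/jobs/hello_world.sh",
    "language" : "shell",
    "send_script" : true,
    "priority" : 4,
    "task_timeout" : 15,
    "max_failed_nodes" : 10,
    "num_execution" : -1,
    "exec_unit_type" : "host",
    "name" : "my_cron_job",
    "cron" : "* * * * *"
}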
An example of creating a meta-task represented by a DAG of dependencies between tasks
Sometimes some task A must be performed only after one or several other tasks (say, B and C) have completed. Such dependencies form a directed acyclic graph. In prun terminology, such tasks are called meta-tasks or task groups, and the dependency graph between tasks is described in a .meta file as an adjacency list. An example of a meta-task that should be executed every minute on a schedule can be found in jobs/cron.meta in the project repository.
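The exact .meta syntax should be checked against jobs/cron.meta in the repository; purely as an illustration of the adjacency-list idea, a dependency "A runs after B and C" could be written along these lines (the key name and the structure here are hypothetical):

{
    "graph" : [
        ["b.job", "a.job"],
        ["c.job", "a.job"]
    ]
}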
Task management
In addition to adding a task using the run command, there is the following set of commands:
- stop <job_id> - stop the task with a unique identifier job_id.
- stop <job_name> - stop the task by its unique name.
- stopg <group_id> - stop the task group (meta task).
- stopall - stop all tasks in the cluster and clear the queues of waiting tasks.
- add [<host_name> <group_name>] * - adds a worker to the group of workers.
- delete <host_name> - delete a worker with the forced termination of all tasks running on it.
- addg <path_to_group_file> - add a group of workers.
- deleteg <group_name> - deletes a group of workers.
- info <job_id> - displays execution statistics for a specific task.
- stat - general statistics for the entire cluster.
- ls - brief statistics for each worker.
- cron - shows scheduled cron tasks.
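For example, following the same invocation style as the run command above (the job identifier and the group file path here are illustrative):

./prun -c "info 0"
./prun -c "stop 0"
./prun -c "addg /etc/pmaster/my_group3"
./prun -c "stat"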
Fault tolerance
The master periodically sends heartbeat messages to all workers. If a worker fails to respond to these messages a certain number of times, it is considered unavailable, and all tasks running on it are rescheduled onto other available workers. Similarly, tasks that return an error code or crash are rescheduled onto other available workers.
The master can save its state, including running and queued tasks, to an external database. Currently the state can be saved in one of the following databases: LevelDB, Cassandra, Elliptics. When the master is restarted, it restores all queued tasks and the tasks that were not fully completed at the time of the previous stop, and also stops all previously started tasks on the workers.
Conclusion
The main requirements during the development of prun were high performance, a minimal number of dependencies on external libraries, convenience for daily use, and reliability of all components. Prun is written in C++, and its only dependency is the Boost library.
It is worth mentioning the use of prun in real-world tasks. Prun is currently used to organize load testing on a small cluster of 20 hosts. The load-testing task consists of deploying the load test application, configuring it, and running tests on a schedule simultaneously across the entire cluster.
The largest cluster on which the scheduler has been tested consisted of 140 hosts in AWS, with hosts of different capacities (several large instances, the rest micro).
In the future, it is planned to collect real-time statistics on worker load across many parameters in order to optimize task scheduling by the master, since at the moment only two statistics are used: the ratio of the number of running tasks to the number of cores/processors, and the total RAM. There are many directions for the further development of the scheduler.
As a reminder, the project described in this article is open source and is available on GitHub:
github.com/abudnik/prun