Process tracking and error handling, part 1

0 Preamble

It is pleasant, you must agree, when everything in the household is under control and in order, with each thing in its place and serving its purpose. Today we will look at how to keep order among a huge number of Erlang processes. Basic concepts of Erlang processes are covered in this post.

1 You follow me - I follow you

Anyone who has had any contact with Erlang has heard the phrase: "Let the process crash, and let another process do something about it or deal with the problem." You must admit that it is bad when something breaks, and doubly bad if we do not find out about it for a long time. Your cat broke its bowl of milk and hid this terrible fact from you - bad! Let the bowl watch the cat, and the cat watch the bowl. The author asks readers to forgive such a crude comparison. So, down to business.

Connecting processes so that they can monitor each other's state is one of the basic concepts of Erlang. In a complex, well-designed system no process should “hang in the air”. All processes should be built into a supervision tree whose leaves are worker processes and whose internal nodes are supervisors that watch over the workers [2] (see the OTP (Open Telecom Platform) design principles). It is also possible to link two workers directly to each other.

Picture 1
If we stay below the level of abstraction that OTP provides, Erlang offers two mechanisms for connecting processes:

Links - bidirectional connections between two processes.
Monitors - unidirectional connections from an observing process to the observed one.

1.1 Links

The following functions are used to create and remove links between processes:

erlang:link/1 - creates a link between the calling process and another process;
erlang:spawn_link/1,2,3,4 (proc_lib provides counterparts, proc_lib:spawn_link/1,2,3,4) - creates a new process and links it to the calling process;
erlang:unlink/1 - removes the link between the calling process and the process given as the argument;
pool:pspawn_link/3 - creates a new process on one of the nodes in the pool and links it to the calling process.
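As a small illustration of the first and third functions, here is a minimal sketch. The module name link_api_demo and the function run/0 are made up for this example; only link/1, unlink/1 and process_info/2 come from the platform.

```erlang
-module(link_api_demo).
-export([run/0]).

%% Hypothetical demo module (not part of the article's code):
%% links the caller to an idle process, then removes the link again.
run() ->
    %% Start an unlinked process that simply waits for a stop message.
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% Create the bidirectional link from the calling process.
    true = link(Pid),
    {links, Linked} = process_info(self(), links),
    IsLinked = lists:member(Pid, Linked),
    %% Remove the link; the other process keeps running.
    true = unlink(Pid),
    {links, After} = process_info(self(), links),
    StillLinked = lists:member(Pid, After),
    Pid ! stop,
    {IsLinked, StillLinked}.
```

Calling link_api_demo:run() should return {true,false}: the pid appears in the caller's link list after link/1 and is gone after unlink/1.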

What does a bidirectional link between processes give us? Links define the path along which errors propagate. When one process dies, the linked process finds out and, in a number of cases that we will consider below, terminates as well, sending a signal on to every other process linked to it. This mechanism lets us handle errors remotely: the handler can be a separate process (a supervisor) to which all these errors “flow” along the links, and that handler process can even live on another node. All these goodies come almost for free, since everything is already implemented in the platform; we only need to build our mega-super-fault-tolerant distributed system correctly.
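The propagation path can be sketched with a chain of three processes. This is only an illustration under assumed names (the module chain_demo and the exit reason boom are invented here): a handler that traps exits is linked to B, and B is linked to C, which crashes.

```erlang
-module(chain_demo).
-export([run/0]).

%% Hypothetical demo module: Handler <-> B <-> C, where C crashes.
run() ->
    Parent = self(),
    spawn(fun() ->
        %% The handler traps exits, so it survives and can report.
        process_flag(trap_exit, true),
        B = spawn_link(fun() ->
            %% C is linked to B and exits abnormally after 100 ms.
            spawn_link(fun() -> receive after 100 -> exit(boom) end end),
            %% B does not trap exits; it just waits.
            receive never -> ok end
        end),
        receive
            {'EXIT', B, Why} -> Parent ! {handler_saw, Why}
        end
    end),
    receive
        {handler_saw, Reason} -> Reason
    after 2000 ->
        timeout
    end.
```

Here chain_demo:run() should return boom: C's exit reason kills B, which does not trap exits, and then reaches the handler as an ordinary message with the same reason.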

Figure 2

When a process crashes (see Figure 2), an exit signal is sent to all linked processes. The signal carries the pid of the process that died and the reason it died. A process that traps exit signals receives it as a message of the form {'EXIT', Pid, Reason}.

Two values of Reason have a predefined meaning:

normal - this reason is set when the process has done all the work we gave it, i.e. it simply reached the end of the function it was started with. In this case the processes linked to it do not terminate.
kill - an untrappable exit signal, sent with exit(Pid, kill), that always kills the target process, even a system process; it is used to force failed processes to terminate.
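The difference between trappable reasons and kill can be checked with a short sketch. The module name reason_demo and the reason some_error are assumptions made for this example; exit/2, process_flag/2 and is_process_alive/1 are standard BIFs.

```erlang
-module(reason_demo).
-export([run/0]).

%% Hypothetical demo module: sends exit signals to a process that traps exits.
run() ->
    Parent = self(),
    Victim = spawn(fun() ->
        process_flag(trap_exit, true),
        %% Tell the parent the flag is set before any signal is sent.
        Parent ! ready,
        idle()
    end),
    receive ready -> ok end,
    %% Both signals are turned into messages because Victim traps exits.
    exit(Victim, normal),
    exit(Victim, some_error),
    timer:sleep(100),
    AliveAfterTrappable = is_process_alive(Victim),
    %% kill cannot be trapped, so Victim terminates regardless of the flag.
    exit(Victim, kill),
    timer:sleep(100),
    AliveAfterKill = is_process_alive(Victim),
    {AliveAfterTrappable, AliveAfterKill}.

%% Consume and ignore every message, including trapped 'EXIT' tuples.
idle() ->
    receive _ -> idle() end.
```

reason_demo:run() should return {true,false}: the trapping process survives normal and some_error but not kill.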

For a process to be able to trap exit signals, it must be made a system process by setting the trap_exit flag with the call process_flag(trap_exit, true).

So, enough theory, let's try everything in practice. Open your favorite editor and create a small module. First let's test the normal termination of a process. To keep the experiment simple, one of the two processes will be the shell.

-module(links_test).
-export([start_n/1, loop_n/1]).

start_n(Sysproc) ->
    %% test normal reason
    process_flag(trap_exit, Sysproc),
    io:format("Shell Pid: ~p~n", [self()]),
    Pid = spawn_link(links_test, loop_n, [self()]),
    io:format("Process started with Pid: ~p~n", [Pid]).

loop_n(Shell) ->
    %% loop for test normal reason
    receive
    after 5000 ->
        Shell ! timeout
    end.

The module defines two functions. The first, start_n, creates a new process and links it to the calling process (in our case the shell); it takes a boolean parameter that decides whether the caller becomes a system process. The second, loop_n, is the body of the new process; its argument is the Pid of the calling process (the shell). Five seconds after it starts, the process sends a timeout message to the shell. Let's compile the module and start the process.

(emacs@aleksio-mobile)2> links_test:start_n(false).
Shell Pid: <0.36.0>
Process started with Pid: <0.43.0>
ok
(emacs@aleksio-mobile)3> flush().
Shell got timeout
ok
(emacs@aleksio-mobile)4>

We call links_test:start_n with the parameter false, so the shell is not a system process and cannot trap exit signals. The process is created successfully. Because loop_n does not call itself recursively, the function simply runs to the end and the process terminates. We then call flush() to empty the shell's mailbox and see the message from our process: “Shell got timeout”. We see no exit signals, because the flag for trapping them has not been set. Now let's make the shell a system process.

(emacs@aleksio-mobile)5> links_test:start_n(true).
Shell Pid: <0.36.0>
Process started with Pid: <0.51.0>
ok
(emacs@aleksio-mobile)6> flush().
Shell got timeout
Shell got {'EXIT',<0.51.0>,normal}
ok
(emacs@aleksio-mobile)8>

This time, in addition to the timeout message, the shell received a message about the normal termination of our process: {'EXIT', <0.51.0>, normal}. This handy mechanism saves code whenever we need to know that a process has finished its work: there is no need to send an “I have done everything” message by hand.

Now let's make the process exit with a reason other than normal. Change the module code as in the listing below.

-module(links_test).
-export([start_n/1, loop_n/1]).

start_n(Sysproc) ->
    %% test abnormal reason
    process_flag(trap_exit, Sysproc),
    io:format("Shell Pid: ~p~n", [self()]),
    Pid = spawn_link(links_test, loop_n, [self()]),
    io:format("Process started with Pid: ~p~n", [Pid]).

loop_n(Shell) ->
    %% loop for test abnormal reason
    receive
    after 5000 ->
        Shell ! timeout,
        1 / 0
    end.

We are being harsh and divide by zero. The compiler will naturally warn us that this is wrong, but we simply ignore the warning.

(emacs@aleksio-mobile)33> links_test:start_n(false).
Shell Pid: <0.117.0>
Process started with Pid: <0.120.0>
ok
(emacs@aleksio-mobile)34> ** exception error: bad argument in an arithmetic expression
     in function links_test:loop_n/1
(emacs@aleksio-mobile)34>
=ERROR REPORT==== 25-Feb-2011::16:22:48 ===
Error in process <0.120.0> on node 'emacs@aleksio-mobile' with exit value: {badarith,[{links_test,loop_n,1}]}

(emacs@aleksio-mobile)34> flush().
ok
(emacs@aleksio-mobile)35> self().
<0.122.0>
(emacs@aleksio-mobile)36>

Note that the shell's Pid is <0.117.0>. After 5 seconds an error appears, explaining that we were wrong after all. We look at the shell's message queue, and it is empty. Where is our timeout message? We run self() and see that the shell's Pid is now <0.122.0>. This means that our failed process sent the shell an exit signal with the reason {badarith, [{links_test, loop_n, 1}]}. Since the shell is not a system process in this example, it crashed, taking its mailbox and the timeout message with it, and was restarted by a supervisor (which we may look at in later articles). Now let's turn on the flag for trapping exit signals.

(emacs@aleksio-mobile)40> links_test:start_n(true).
Shell Pid: <0.132.0>
Process started with Pid: <0.139.0>
ok
(emacs@aleksio-mobile)41>
=ERROR REPORT==== 25-Feb-2011::16:34:19 ===
Error in process <0.139.0> on node 'emacs@aleksio-mobile' with exit value: {badarith,[{links_test,loop_n,1}]}

(emacs@aleksio-mobile)41> flush().
Shell got timeout
Shell got {'EXIT',<0.139.0>,{badarith,[{links_test,loop_n,1}]}}
ok
(emacs@aleksio-mobile)42> self().
<0.132.0>
(emacs@aleksio-mobile)43>

The results need little comment: the shell kept its Pid and received both the timeout message and the 'EXIT' message with the failure reason.

We have analyzed four cases:

trap_exit flag	Exit reason	What the linked process that is still alive does
true	normal	Receives the message {'EXIT', Pid, normal} in its mailbox
false	normal	Continues its work
true	Anything other than normal and kill	Receives the message {'EXIT', Pid, Reason} in its mailbox
false	Anything other than normal and kill	Dies and sends an exit signal to all of its links (i.e. the error propagates)
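The four rows can be reproduced without involving the shell. The sketch below is an assumption-laden illustration, not the author's code: the module link_cases, the reason boom and the result atoms still_running and died are invented for the example, and the timeouts are arbitrary.

```erlang
-module(link_cases).
-export([run/0, outcome/2]).

%% Hypothetical demo module: one observer process per table row.
run() ->
    [{Trap, Reason, outcome(Trap, Reason)}
     || Trap <- [true, false], Reason <- [normal, boom]].

%% Start an observer with the given trap_exit flag, link it to a child
%% that exits with Reason, and report what happened to the observer.
outcome(Trap, Reason) ->
    Parent = self(),
    Observer = spawn(fun() ->
        process_flag(trap_exit, Trap),
        Child = spawn_link(fun() -> exit(Reason) end),
        receive
            {'EXIT', Child, R} ->
                Parent ! {self(), {got_message, R}}
        after 200 ->
            %% No message and no crash: the observer just carried on.
            Parent ! {self(), still_running}
        end
    end),
    receive
        {Observer, Result} -> Result
    after 1000 ->
        %% The observer never reported, so the exit signal killed it.
        died
    end.
```

With these assumptions, link_cases:run() should return [{true,normal,{got_message,normal}}, {true,boom,{got_message,boom}}, {false,normal,still_running}, {false,boom,died}], matching the table row by row.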

Conclusion

In the following articles (part 2, part 3) we will look at the mechanism of monitors and fire kill signals at processes. I would like to hear from Habr readers: which Erlang topics would you most like to read about?

Bibliography

1. The excellent Erlang online documentation.
2. OTP Design Principles.
3. Erlang Programming, by Francesco Cesarini and Simon Thompson.
4. Programming Erlang: Software for a Concurrent World, by Joe Armstrong.

Source: https://habr.com/ru/post/114620/
