
Emergency resuscitation of epmd

(The problem looks extremely exotic, but as a look at "how things are arranged inside" it is quite instructive.)

Say you have an application written in Erlang running (well, let's say, ejabberd). It has been running for a long time and running well, but one day you try to run its control script (ejabberdctl, respectively), and it answers "nodedown" or something equally terrible, as if nobody is there. Meanwhile the application itself happily serves all client requests and has no idea that it is down. On a hunch, you run epmd -names and - oh, horror! - get an empty list.

Erlang programs address each other in node@host notation; physically, each node (read: an operating system process) opens a random high port for this. The job of the epmd service is to tie logical addressing by name to physical addressing by port number. It is a kind of DNS analogue, with the difference that without the epmd registry an Erlang cluster falls apart into a handful of separate deaf-mute nodes, which is exactly what has just happened for some mysterious reason. You can, of course, start looking for the culprit, but first it would be nice to get the system back on its feet.
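
The name-to-port registry can also be inspected from Erlang itself rather than with epmd -names. Below is a minimal sketch, assuming the standard net_adm:names/0 call, which asks the local epmd for its list of {Name, Port} pairs; the module name epmd_check and the function is_registered are made up for this illustration.

```erlang
-module(epmd_check).
-export([is_registered/1, is_registered/2]).

%% Ask the local epmd for its registry and look for one node name.
%% Name is the short name as a string, e.g. "ejabberd".
is_registered(Name) ->
    is_registered(Name, net_adm:names()).

%% Pure helper: takes the result of net_adm:names/0 as an argument.
%% net_adm:names/0 returns {ok, [{Name, Port}]} or {error, Reason}.
is_registered(Name, {ok, Names}) ->
    lists:keymember(Name, 1, Names);
is_registered(_Name, {error, _Reason}) ->
    %% epmd is not reachable at all, so nothing is registered.
    false.
```

In the situation described above, epmd_check:is_registered("ejabberd") would return false even though the application is alive and serving clients.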
What can be done in this situation? You could simply restart the application by force, but on the one hand clients will be dropped, and on the other it would be a pity to lose such a beautiful uptime... Now, if only the registry could be restored on a live system, eh?

Never fear! Crumbs of information dug up on the Internet say that epmd supposedly needs just one correct packet thrown at it; so, the thinking goes, we quickly knock together a C program... Well, there is no need to go that low-level: Erlang itself does a fine job with sockets. But we would like something even simpler. Ask and you shall receive: we come across the erl_epmd module and see in it - glory to open source! - the magic function register_node/2, which takes a node name and a port number.
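
For the curious, that "one correct packet" is the ALIVE2_REQ message of the Erlang distribution protocol. What follows is only a sketch of how it could be assembled by hand, based on the layout in the protocol documentation; the module name epmd_packet and the function alive2_req are invented for this example, and the version fields are set to 5 as an assumption that should suit most releases. The rest of the article does not need it, because erl_epmd:register_node/2 builds and sends this message for us.

```erlang
-module(epmd_packet).
-export([alive2_req/2]).

%% Build an ALIVE2_REQ registration request for epmd.
%% Name is the short node name as a string, Port is the node's
%% distribution port. The result is the full request, including
%% the two-byte length prefix that every epmd request carries.
alive2_req(Name, Port) when is_list(Name), is_integer(Port) ->
    NameBin = list_to_binary(Name),
    Body = <<120,                            % tag: ALIVE2_REQ
             Port:16,                        % distribution port
             77,                             % node type: normal node
             0,                              % protocol: TCP/IPv4
             5:16,                           % highest version (assumed)
             5:16,                           % lowest version (assumed)
             (byte_size(NameBin)):16,        % length of the node name
             NameBin/binary,
             0:16>>,                         % no extra data
    <<(byte_size(Body)):16, Body/binary>>.
```

Sending this binary to epmd over gen_tcp (port 4369 by default) and keeping the socket open would register the name, which is what erl_epmd does internally.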

Using netstat, we find out which port our application is listening on - let it be 23456 - and start the emulator in a terminal:

$ erl

The node must be anonymous, because we are about to register a name by hand, and this (see the code of the same erl_epmd) can be done only once per launch. But since the node is anonymous, the erl_epmd gen_server is not running yet, so we first have to start it manually:

> erl_epmd:start().

Now we call our magic function:

> erl_epmd:register_node( ejabberd, 23456 ).

Now we can talk to the lost node again. To check, we run epmd -names once more and rejoice...

Wait, stop, no rejoicing yet. The thing is, although epmd accepts a registration of any free name for any port, it also remembers which connection the registration request came from, and as soon as that connection drops, the name is released again. So now, for the system to work correctly, we have to leave an unneeded temporary emulator running. Messy? Messy. We should somehow get inside the application and call register_node from there...

Elementary, Watson (once you already know how). We start a second emulator, this time with a name:

$ erl -sname repair

... and ping the node with the application:

repair@hostname> net_adm:ping( ejabberd@hostname ).

(If you got pong, everything is going according to plan. If you got pang, close this now-useless article and start running around the room screaming.)
And now for the trick. It relies on the fact that a connection between nodes, once established (which is what we have just done), stays open, and the epmd registry is no longer needed to talk over it. So we now go back to the first emulator and kill it:

> halt().

What have we just done? Among other things, we closed the connection to epmd, and it released the name ejabberd for re-registration. (Look at epmd -names again if you do not believe it.)
Now, over our open connection, left hanging under the big top without a safety net, we can use the excellent rpc module to reach into the application node and brilliantly finish our simultaneous exhibition across three processes:

repair@hostname> rpc:call( ejabberd@hostname, erl_epmd, register_node, [ejabberd, 23456] ).
repair@hostname> halt().
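
The last two steps (ping, then the remote register_node call) can also be wrapped in one function that reports what went wrong instead of leaving you to read raw return values. This is just a sketch: the module name epmd_repair and the function reregister are made up here, and it assumes that erl_epmd:register_node/2 returns {ok, Creation} on success, as it does in the erl_epmd source.

```erlang
-module(epmd_repair).
-export([reregister/3]).

%% Node - full name of the application node, e.g. ejabberd@hostname
%% Name - short name to register in epmd, e.g. ejabberd
%% Port - distribution port of that node, found earlier with netstat
reregister(Node, Name, Port) ->
    case net_adm:ping(Node) of
        pong ->
            %% The connection is up, so epmd is not needed to reach
            %% the node: ask it to register itself again.
            case rpc:call(Node, erl_epmd, register_node, [Name, Port]) of
                {ok, Creation}   -> {ok, Creation};
                {badrpc, Reason} -> {error, {badrpc, Reason}};
                Other            -> {error, Other}
            end;
        pang ->
            %% No connection could be established.
            {error, unreachable}
    end.
```

Called from the repair node as epmd_repair:reregister(ejabberd@hostname, ejabberd, 23456), it must still run after the temporary emulator has been halted, otherwise the name is still taken.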

That is all. We close the terminals and get on with whatever we logged into the system for in the first place.

Questions, clarifications, corrections are welcome.

Source: https://habr.com/ru/post/49535/

