
Increasing system fault tolerance with Node.js

Three years ago I believed in the future of Node.js and started a campaign to move the most "problematic" services of our project to it. It worked out for us: the load dropped and stability grew. But we still stepped on some rakes along the way, and that is what I want to talk about.

This is not an exhaustive guide to action; I am just sharing my experience. If you are a Node.js pro, add your recommendations in the comments, and I will gladly reference them in the article.

1. Node.js is single-threaded


This may seem a bit unusual: if you have 4 cores, a running Node process will load only one of them. That is, to load all 4 cores you need to run 4 instances (copies) of Node.js. In front of these 4 instances you then need to put a balancer that distributes the load. Better yet, not just a balancer, but a proxy server (with balancing capability):


2. What to put in front of Node?


There are many options here.
Don't use Apache: it can do all of this too, but it is not very efficient with the resources it consumes.
We chose nginx .
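A minimal sketch of what this looks like in the nginx config; the ports and upstream name are illustrative, not from the article.

```nginx
# Balance across 4 Node.js instances, one per core (ports are illustrative)
upstream node_backend {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
    server 127.0.0.1:8004;
}

server {
    listen 80;
    location / {
        proxy_pass http://node_backend;
    }
}
```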

3. An instance that runs under load for a long time starts to "slow down"


This is due not only to the kernel lowering the priority of a process that has been running for too long; it is also partly garbage-collection issues in Node itself, and partly mistakes made in the JavaScript code. We have found nothing better than restarting instances periodically. The restart frequency (depending on various factors) ranges from once an hour to once a day.

To avoid suspending the service, you need to run 2 instances (copies) of it. While an instance is being restarted, you need to remove the load from it in the balancer.
We do this by editing the config and reloading nginx before and after restarting the instance.
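A sketch of what that config change can look like; marking an upstream server `down` and reloading nginx takes it out of rotation without dropping the service (ports are illustrative).

```nginx
upstream node_backend {
    server 127.0.0.1:1936 down;  # instance being restarted; remove "down"
                                 # and reload nginx again afterwards
    server 127.0.0.1:1937;
}
```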

4. Node under load needs a higher open-file limit (nofile, LimitNOFILE)


Many distributions ship very modest defaults. I recommend setting more than 16000 (I use 131070). This can be set with the command ulimit -n 131070 , or fixed permanently in /etc/security/limits.conf
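For the /etc/security/limits.conf route, a sketch; the user name is illustrative, use whichever user your Node services run as.

```
# /etc/security/limits.conf
# raise the open-file limit for the user running the Node.js services
node  soft  nofile  131070
node  hard  nofile  131070
```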

We describe the service as a systemd unit, where the limit is set via the LimitNOFILE directive, and it looks like this (file /usr/lib/systemd/system/nodejs1936.service):
```ini
[Unit]
Description=Nodejs instance 1936
After=syslog.target network.target

[Service]
PIDFile=/var/run/nodejs1936.pid
Environment=NODE_ENV=production
Environment=NODE_APP_INSTANCE=1936
WorkingDirectory=/var/www/myNodejsApp
# node_program supervisor
ExecStart=/usr/bin/supervisor --harmony -- /var/www/myNodejsApp/index.js -p 1936
User=node
Group=node
LimitNOFILE=131070
PrivateTmp=true

[Install]
WantedBy=multi-user.target
```


5. A Node instance's memory is limited, and the limit is not very large


Node.js sets a default limit on the maximum amount of memory each instance can "eat". Quoting the V8 FAQ:
Currently, it has a memory limit of 512MB on 32-bit systems, and 1.4GB on 64-bit systems.

But there is a way to change it with the --max-old-space-size flag; we specify the memory in MB. For example, to increase it to 4 GB we write --max-old-space-size=4096

You can also influence the stack size with the --stack-size flag, for example --stack-size=512

Increasing the memory limit can be useful for a periodically launched process designed to make maximum use of RAM while working with data: for example, a mass-mailing script, a log analyzer, and so on.

6. The code is built as one large interconnected monolith


Don't build "three in one" or "ten in one" services: it may work, but any abnormal situation will take down the entire project. Instead, divide everything into 3, 5, 10 independent services. The simpler and smaller a service, the more stable it is. A "crash" of one service results in the loss of part of the functionality, not the entire project.

Use decomposition: divide complex services into several simple ones, and let the services interact over REST. This becomes a technological foundation for the growth of your project: a "swollen" service can always easily be relocated to a more powerful server without changing the application architecture.

There is another side to the coin: by simplifying the services themselves, we complicate the connections between them. When an application has dozens or even hundreds of services, there can be hundreds or even thousands of links between them. This increases the resiliency of the system, but it also increases the complexity of development and deployment, as well as the entry threshold for a new developer joining the project.

7. Use software that restarts a crashed Node process


No matter how well the code is written, the moment will come when it crashes after an unforeseen error. To solve this standard problem, many CLI utilities have been written that monitor the Node process and restart it when necessary.


We use supervisor .

The same functionality can be organized using service settings in systemd (there was an article about this recently).
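A sketch of the systemd variant: adding a restart policy to the unit file from point 4. The values here are illustrative.

```ini
[Service]
# restart the process automatically if it exits abnormally
Restart=on-failure
RestartSec=5
```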

8. Build fresh Node releases yourself


Don't wait for Node to be updated in your distribution: that happens far too slowly. I recommend learning to build packages for your Linux distribution yourself, especially since there is nothing complicated about it.

We build fresh versions of Node as rpm packages for the Fedora distribution literally 1-3 days after a new release. After short tests, we move all production services to the new Node.

Don't forget to rebuild node_modules when changing the Node version (or rather, the major or minor version): problems can occur with modules whose code is partly written in C/C++ (native addons), which need to be recompiled against the new Node.

9. Do not be afraid to use ECMAScript 2015 (ES6)


We now have Node.js 5.0.0 installed on the production servers; a year ago it was 0.11.6.

Back in 4.2.2, to append the elements of array arr2 to the end of array arr1, you had to write this:
```javascript
var arr1 = [0, 1, 2];
var arr2 = [3, 4, 5];
// Append all items from arr2 onto arr1
Array.prototype.push.apply(arr1, arr2);
```

In 5.0.0 we can do it like this:
```javascript
var arr1 = [0, 1, 2];
var arr2 = [3, 4, 5];
arr1.push(...arr2);
```

This begs the question: "How does this affect stability?" The future has already arrived: generators, classes, promises... these are what make the code more understandable and predictable, and the clearer the code, the easier it is to notice an anomaly in it and eliminate it. Using ECMAScript 2015 (ES6) helped us completely get rid of callback hell and significantly reduce the amount of code.
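A small sketch of what "getting rid of callback hell" looks like in practice: the callback pyramid flattens into a promise chain. The function names and data here are illustrative.

```javascript
// Stands in for an async DB lookup that would otherwise take a callback.
function getUser(id) {
  return Promise.resolve({ id: id, name: 'user' + id });
}

// Composes asynchronous steps as a flat chain instead of nested callbacks.
function getUserName(id) {
  return getUser(id).then(function (user) { return user.name; });
}
```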

10. We test (well, at least part of the code)


Don't rush to throw everything within reach at me. Yes, I am also not a fan of testing at the initial stage of development, but we are already at the product-stabilization stage once we are reading articles like this, aren't we? :)

I don't test everything. Usually I write simple tests that give me confidence that everything is OK with "external" services and with the critical functions of the service itself. For example: write, edit, and delete a key in memcached or redis, and similarly with MySQL, MongoDB, or some external service.

Often there is a temptation to say: "I already monitor MySQL with Zabbix, why should I check anything from the service itself?" More than once in my practice the service seemed OK and MySQL was OK, but there was a problem with the connection between the service and MySQL: an admin accidentally added a not-quite-correct rule to iptables; the cable between the server and the switch went down or got snagged and now works poorly, producing errors during transmission over the network; an overly "smart" kernel on the MySQL host decided the service was producing a SYN flood and started dropping packets; and so on.

By and large, you can test without a framework: simply return some value (0 or 1, or some number) to your Zabbix, and when it receives a value outside the allowed range, it sends you an SMS about the problem.
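A sketch of such a framework-free check; the in-memory fakeStore below stands in for memcached/redis and is purely illustrative, as are the function names.

```javascript
// Write, read back, and delete a key; return 0 (OK) or 1 (problem),
// so a monitor can alert on the value or exit code.
function checkKeyValueStore(store) {
  try {
    store.set('healthcheck', '1');
    const ok = store.get('healthcheck') === '1';
    store.del('healthcheck');
    return ok ? 0 : 1;
  } catch (e) {
    return 1; // any error means the external service is not usable
  }
}

// in-memory stand-in for a real memcached/redis client
const fakeStore = (() => {
  const m = new Map();
  return {
    set: (k, v) => m.set(k, v),
    get: (k) => m.get(k),
    del: (k) => m.delete(k)
  };
})();
```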

If there is a need for a framework, then look at these:


This is not everything I wanted to say, but it is already enough for one article. In the next installment we will look at kernel tuning for Node's needs, virtualization pitfalls, and how to properly cache nginx responses from Node.js instances.

Source: https://habr.com/ru/post/270391/

