Imgur photoservice engineer Jacob Greenleaf (Jacob Greenleaf) published in the blog on Medium
material , in which he outlined several tips on creating highly accessible code for fault-tolerant systems. We decided to look at the expert opinion.
/
photo EpicantusJacob has been working at Imgur for over two years now. His experience allows us to highlight a number of basic approaches that help create fault-tolerant systems. One of these approaches is to understand the adequate framework of the system. For example, the use of an infinite lock or still add timeout when connecting to the service over the network.
')
So, in the Imgur service, there is a cron task, which is called “the killer of long queries” - it scans active salaries in MySQL, which are executed on behalf of user requests, and checks how long they are executed, killing those that exceed the specified threshold.
Sometimes a simple analysis of the development process itself, for example, using the Value Stream Map, which helped to
visualize the real state of affairs as part of the SoundCloud development process. Weaknesses can be in the elementary overloading of third-party services and the simple waiting by one department of completion of work on the task of another department of the company.
Of course, there are bugs that pop up only from time to time, so you can increase the availability of systems by adding the restart function. However, here you need to be careful not to arrange DDoS on your systems with your own hands. Here you can simply follow the first advice and use a limited number of attempts to restart and make the system wait more and more time with each attempt to restart. This will help distribute the load.
Erlang is often used to create software with high availability requirements. It has a control pattern (supervisors): each task that the program performs is structured in such a way as to work under the supervision of the monitoring process (supervisor). If the supervisor detects an unexpected completion of a task, it restarts it (as in the previous rule) from a known error-free state.
By the way, earlier we previously
talked about what you can learn from the Whatsapp team. Just with the beginning of the era of messengers, Erlang opened a second wind. But Erlang has its drawbacks: the language is known for a relatively small number of developers; products at Erlang are not so easy to integrate with existing infrastructure.
It often helps to override the basic settings at the operating system level, as
happened in the case of Pinterest, which revealed the influence of the Linux kernel version on the optimization of the equipment. As part of this analysis, more than 60 different test configurations were tested.
The DevOps community is crazy about CoreOS and LXC-style lightweight containers. However, when Jacob Greenleaf tried to work with him, it turned out that the core component (FleetD) contained a
bug affecting the work with the scheduler , which could have caused the system to stop working. Greenleaf was able to deal with the problem after many hours of debugging, which ended at two in the morning. New technologies can often contain unknown errors and reasons for failure.
In addition, newer projects have a common feature - they are often too raw to work in production. For example, Golang has
no official debugger , and an alternative open debugger
appeared just a few months ago. Also, the Go monitoring and tracing tools are no match for Java JMX and Erlang.
PS A little more about the work of our virtual infrastructure provider: