Netflix's system administrators found an interesting way to improve the service's architecture and reduce the impact of technical problems on end users.
The company wrote and launched an internal service called "Chaos Monkey" (not to be confused with House!), which randomly kills AWS instances or processes on the servers that run the service. Strangely enough, this approach does no harm. On the contrary, it helps engineers improve the quality of service and increase uptime, killing several birds with one monkey: Netflix systems undergo a round-the-clock test which ensures that:
Every node in the system is redundant.
The failure of a single server or process causes no problems in delivering the service, not even minor ones such as errors or debug messages on the site.
System administrators know exactly what happens when each server crashes and how it affects the entire system.
System administrators have extensive experience solving server problems, and almost every problem already has a documented solution.
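The idea can be sketched as a small in-memory simulation. This is not Netflix's code and it makes no AWS calls: the `Instance` and `Service` classes and the `unleash_monkey` and `run_drill` functions are hypothetical names invented for illustration. The sketch assumes a service stays available as long as at least one of its replicas is alive, kills one random live instance, and reports which services went down as a result.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Instance:
    """Hypothetical stand-in for one server or process."""
    name: str
    alive: bool = True


@dataclass
class Service:
    """Toy service: available while at least one replica is alive (an assumption)."""
    name: str
    replicas: List[Instance] = field(default_factory=list)

    def is_available(self) -> bool:
        return any(replica.alive for replica in self.replicas)


def unleash_monkey(services: List[Service], rng: random.Random) -> Optional[Instance]:
    """Kill one randomly chosen live instance; return it, or None if none are left."""
    candidates = [r for s in services for r in s.replicas if r.alive]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.alive = False
    return victim


def run_drill(services: List[Service], rng: random.Random) -> Tuple[Optional[Instance], List[str]]:
    """Run one kill and list the services that became unavailable because of it."""
    victim = unleash_monkey(services, rng)
    outages = [s.name for s in services if not s.is_available()]
    return victim, outages


if __name__ == "__main__":
    rng = random.Random()
    services = [
        Service("api", [Instance("api-1"), Instance("api-2")]),
        Service("billing", [Instance("billing-1")]),  # deliberately not redundant
    ]
    victim, outages = run_drill(services, rng)
    print("killed:", victim.name if victim else None, "| outages:", outages)
```

In this toy model, a kill that lands on the single-replica service is reported as an outage right away, while a kill in the two-replica service goes unnoticed. That is the kind of gap a continuous random-kill test is meant to surface before a real failure does.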
This original (if not paradoxical) approach has saved the company a huge amount of money and time. What do Habr's gurus make of it?