Robert Johnson, Facebook's director of engineering, issued a formal apology for having to disconnect Facebook from the Internet for 2.5 hours. There was simply no other way out.
Not only was the web interface unavailable; the API stopped working, and the Like buttons on 350,000 sites across the web went dark. An outage of Facebook is a big deal: according to Johnson, this was the worst downtime in the four and a half years of the site's existence.
The failure was caused by the misbehavior of an automated system that verifies configuration values: it began doing far more harm than good.
The system works by looking for invalid configuration values in the cache and replacing them with fresh values from the database (the persistent store). The trouble started when a change made to the database itself was one the automated system considered invalid: mass "error correction" kicked in everywhere, and the database cluster was swamped with hundreds of thousands of queries per second.
Even after the value in the persistent store was fixed, the flood of queries did not stop, because the automated system had already deleted the "wrong" values from the cache. And since the overloaded servers could not keep up with all the queries, new "invalid" values kept appearing in the cache, which in turn generated yet more queries. The process fed on itself: "We had entered a feedback loop," says Robert Johnson.
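The failure mode described above can be sketched as a toy simulation (a hypothetical model for illustration, not Facebook's actual code): every client that finds an "invalid" cached value deletes it and queries the database, but while the database itself holds a value the clients consider bad, each repair immediately re-creates the problem.

```python
# Toy model of the feedback loop (illustrative; not Facebook's code).
# While the persistent store holds a value that clients consider
# invalid, every cache "repair" refetches the bad value, so the
# next client repeats the whole cycle.

def is_valid(value):
    # Assumed validity rule for this sketch.
    return value >= 0

def simulate(db_value, clients, rounds):
    """Count database queries per round of client lookups."""
    cache = {}
    queries_per_round = []
    for _ in range(rounds):
        queries = 0
        for _ in range(clients):
            value = cache.get("setting")
            if value is None or not is_valid(value):
                cache.pop("setting", None)   # drop the "wrong" cached value
                queries += 1                 # ...and hit the database
                cache["setting"] = db_value  # refetch (possibly still bad)
        queries_per_round.append(queries)
    return queries_per_round

# With a bad value in the store, every client hits the DB every round:
print(simulate(db_value=-1, clients=1000, rounds=3))  # [1000, 1000, 1000]
# With a good value, only the first cache miss reaches the DB:
print(simulate(db_value=1, clients=1000, rounds=3))   # [1, 0, 0]
```

The key point the sketch makes visible: fixing the stored value stops the loop, but only if clients can still read it; under overload, the queries themselves prevent recovery.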
To break the cycle of queries, engineers had to resort to a painful measure: blocking all traffic to the database cluster, which meant disconnecting Facebook from the Internet entirely. Instead of the site, users saw a "DNS error".
Once the databases had recovered and the root cause of the failure had been eliminated, users were gradually let back in.
For now, the automated configuration-check system is switched off entirely while the engineers rethink its design to rule out such feedback loops.
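One common way to defuse this kind of loop (a general technique offered here as an assumption, not Facebook's announced fix) is to rate-limit the repair path, so that a burst of "corrections" can never overwhelm the backing store; a repair that cannot proceed simply leaves the stale cache entry in place.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch).

    Allows at most `rate` repair queries per second on average,
    with bursts up to `capacity`. A repair that cannot obtain a
    token is skipped instead of hammering the database."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, then
        # spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
# 100 back-to-back repair attempts: only roughly the burst
# capacity (about 10) gets through; the rest are dropped.
allowed = sum(bucket.allow() for _ in range(100))
print(allowed)
```

With such a limiter in front of the persistent store, a mass "error correction" degrades into a trickle of queries rather than a self-amplifying flood.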