Stack Overflow Outage: What Happened

On the night of Thursday to Friday, Stack Overflow was unavailable. Service was restored after some time, but on Friday morning the site went down again.

Stack Overflow staff say the failure was caused by several problems that appeared at the same time: the system could not cope with a growing number of connections. The site is now working normally.

Below we look at the causes of the failure.

/ photo by Hamza Butt CC

Stack Overflow runs on 9 web servers, each handling 200 to 500 requests per second. According to Nick Craver, Architecture Lead for the Stack Overflow platform, on Friday the problem affected two servers, ny-web01 and ny-web04, which began to bombard the Stack Overflow database with a huge number of requests. This exhausted the IIS thread pool and increased the time requests spent waiting on the database.

It turned out that new requests were waiting for a thread from the pool while preventing older requests from finishing. The result was a deadlock. According to Nick, limiting traffic would in theory have solved the problem, but that did not happen because of an error in how the load balancer worked.
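The starvation pattern described above can be reproduced in miniature. The sketch below is only an illustration using Python's `ThreadPoolExecutor`, not IIS or the Stack Overflow code: the names `simulate`, `handle_request` and `finish_query` are hypothetical, and the timeout exists solely so that the demonstration ends instead of hanging. Each request holds a pool thread while waiting for follow-up work that also needs a pool thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def simulate(pool_size: int, requests: int, wait_seconds: float = 0.5) -> list[str]:
    """Run `requests` handlers on a pool of `pool_size` threads.

    Each handler queues a follow-up task on the SAME pool and blocks until
    that task finishes, so it holds one thread while waiting for another.
    """
    gate = threading.Event()  # released once every request has been queued

    def finish_query() -> str:
        # Stand-in for the work that completes a request (hypothetical).
        return "ok"

    def handle_request(pool: ThreadPoolExecutor) -> str:
        gate.wait()  # make sure all requests are queued before any follow-up
        follow_up = pool.submit(finish_query)
        try:
            # Blocks a pool thread until another pool thread runs the follow-up.
            return follow_up.result(timeout=wait_seconds)
        except TimeoutError:
            # No thread became free in time: the request was starved.
            return "starved"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        pending = [pool.submit(handle_request, pool) for _ in range(requests)]
        gate.set()
        return [future.result() for future in pending]


if __name__ == "__main__":
    # As many concurrent requests as threads: follow-up work cannot start.
    print(simulate(pool_size=2, requests=2))
    # One spare thread is enough for the follow-up work to complete.
    print(simulate(pool_size=3, requests=2))
```

With as many blocked requests as threads, no thread is left to run the follow-up work, so at least one request gives up on the timeout. Without the timeout the requests would wait on each other indefinitely, which is the deadlock the article describes.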

Problem with load balancer

Ideally, HAProxy would have taken the two problem servers out of rotation automatically, before any administrator intervention was needed. But the Stack Overflow ASP.NET application redirected from the home page to /error, so HAProxy received a 302 response code, which it interpreted as success. As a result, it made no attempt to take the servers out of rotation.

Nick Craver notes that the team already has a fix for this problem: HAProxy will be configured to expect only specific status codes, and the application will stop redirecting users away from the home page. Nick implemented this change some time ago, but it was never deployed to production. Its rollout is now scheduled for next week.
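The difference between the two health-check policies can be sketched in a few lines. This is a simplified model under stated assumptions, not the team's actual configuration, which the source does not show: the function names and the probe results are hypothetical, and the strict policy assumes that only 200 counts as healthy. In HAProxy itself, a rule of this kind is normally written with the `http-check expect` directive.

```python
from typing import Callable, Iterable


def lenient_check(status: int) -> bool:
    """Accept any 2xx or 3xx response, so a redirect counts as 'up'."""
    return 200 <= status < 400


def strict_check(status: int, expected: Iterable[int] = (200,)) -> bool:
    """Accept only the status codes explicitly listed as healthy."""
    return status in set(expected)


def servers_to_remove(statuses: dict[str, int],
                      check: Callable[[int], bool]) -> list[str]:
    """Return the servers whose health-check response fails `check`."""
    return sorted(name for name, status in statuses.items() if not check(status))


if __name__ == "__main__":
    # Hypothetical probe results: two servers redirect to the error page.
    probes = {"ny-web01": 302, "ny-web02": 200, "ny-web04": 302}
    print(servers_to_remove(probes, lenient_check))  # no server is flagged
    print(servers_to_remove(probes, strict_check))   # the redirecting servers are flagged
```

Under the lenient policy the 302 redirect to the error page passes the check, so the failing servers stay in rotation. Under the strict policy the same responses mark them for removal.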

Nick notes that the team has not yet been able to establish exactly what caused the increase in SQL queries (in the thread on SO he published a graph showing large bursts of activity). The SO team is investigating and plans to keep the platform's users informed.

Past outages

Stack Overflow has gone down before: there was an outage in 2014. That time, however, the cause was a massive DDoS attack on the network provider the platform works with, and the problem was resolved within an hour.

Disclaimer: as new information about the SO outage becomes available (analysis, IT infrastructure or information security solutions), we will update this article.

Source: https://habr.com/ru/post/349824/
